Buying Layer 2 Ethernet Services Second of two articles
This month's article is the second of two. The first article covered things to think about when buying Layer 3 MPLS VPN services. It can be found at http://www.netcraftsmen.net/welcher/papers/mplsvpn13.html. This article provides a similar discussion for Layer 2 services, which may or may not be MPLS-based. The word "VPN" seems a bit unnecessary here, so I dropped it. While MPLS can be used to deliver Frame Relay, ATM, and other Layer 2 services, those should resemble existing switch-based services. So the focus of this article will be on buying Layer 2 Ethernet services.
This article updates an article from January 2004, which can be found at http://www.netcraftsmen.net/welcher/papers/metroethOl.html. A lot of that article is still valid. In particular, it contains some useful acronyms, including a couple I invented. What has changed since then, as I see it, is a shift in Cisco's emphasis from optically-based service to Ethernet services based on MPLS transport, possibly with switched networks as Service Provider Points of Presence. Vendors other than Cisco are pushing whatever delivery approach matches what they have to sell (but that's not news). And Service Providers are starting to get more aggressive about Metro and National scale Ethernet Networks.
To avoid any puzzlement, if I have to use an acronym for Metro Ethernet Networks, it'll be METN, not MEN or ME. (The Metro Ethernet Forum uses "MEN.") Based on that, you can probably figure out NETN. I'll use METN-SP for Metro Ethernet Network Service Provider. But enough about acronyms!
Ethernet service architectures
I'm aware of three overall architectures that service providers can use to deliver Metro Ethernet services. (If you know of another, let me know!) They may use a combination of these approaches, either due to investment in hardware and learning about strengths and weaknesses, or for scalability. The three approaches are:
* Optical-Based METN
* (Ethernet) Switch-Based METN
* MPLS-Based METN
We'll look at these in turn. I'll mention vendors, where I know of them, and discuss pros/cons and things to be aware of for each of these architectures.
Optical-based METN is ultimately built on SONET or DWDM gear. The basic idea is, in effect, TDM circuits at approximately Ethernet speeds (e.g., OC-192 is about 10 Gbps). Note that I am assuming the edge device is not multiplexing multiple customers onto one shared circuit. That sort of service might, for example, be provided by combining edge switches with DWDM between switches, all in an optical edge device. I would classify such a service as a switch-based service; see below.
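As a rough check on the "approximately Ethernet speeds" remark: SONET OC-n line rates are multiples of the 51.84 Mbps OC-1 rate. The short Python sketch below just does that arithmetic. The function name is made up for illustration, and the figures are raw line rates, not the usable payload after SONET overhead.

```python
# SONET OC-1 line rate in Mbps; OC-n runs at n times this rate.
OC1_MBPS = 51.84


def oc_line_rate_mbps(n: int) -> float:
    """Return the raw line rate of an OC-n circuit in Mbps (hypothetical helper)."""
    if n < 1:
        raise ValueError("OC level must be a positive integer")
    return n * OC1_MBPS


# Compare common OC levels with the nearest Ethernet speed.
for level, ethernet_mbps in ((3, 100), (48, 1000), (192, 10000)):
    rate = oc_line_rate_mbps(level)
    print(f"OC-{level}: {rate:.2f} Mbps (nearest Ethernet speed: {ethernet_mbps} Mbps)")
```

The match is only approximate: OC-192 comes in slightly under 10 Gbps, and the lower levels do not line up neatly with 100 Mbps or 1 Gbps Ethernet.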
The good news about this approach is that it is fairly simple and static. It's your pipe, dedicated to you, so you don't have to worry about sharing bandwidth. QoS is simpler. Furthermore, provisioning is fairly simple, and may resemble the "classical" SP model. So your circuit will probably be OK unless somebody messes up while provisioning somebody else's new circuit. The simplicity may also pay off in terms of stability, Mean Time to Repair when there's a problem, and so on.
Another bit of good news is that each customer has a "dedicated pipe." So if somebody has a bad day and suffers a spanning tree loop, the excess traffic should not affect other users of the optically-based service.
On the downside, because the provider cannot oversubscribe trunks, this approach is arguably a bit wasteful of bandwidth. That may end up costing you more. In addition, DWDM gear isn't cheap. At the very least, you're getting an access port tied directly to DWDM gear, whereas with other approaches, the DWDM gear is either on the SP side of an edge device, or not present. If DWDM or SONET gear is needed at your premises to drive the connection to the SP, then your cost can be considerably higher. Increasing speeds may require a costly hardware swap-out.
Also on the downside, optical provisioning software can be a bit ugly right now. There is motion towards GMPLS for control, but that whole area is immature.
Ethernet switches can be used to deliver METN services. I include in this category optical devices where the edge access device provides Ethernet switching. One advantage of this approach is that lower-cost 1 and 10 Gbps ports can be used to interconnect the switches (unless the edge device has a DWDM or other optical interface on its provider side). Long-haul GBIC or SFP modules can be used for the customer-facing side of a provider offering, reducing cost.
Small-scale METN-SPs using switches might provide one or several VLANs per customer. The concern here is that they are sharing Spanning Tree Protocol (STP) domains with their customers. That seems highly unstable and somewhat unscalable.
More sophisticated METN-SPs will probably be using 802.1Q tunneling, also called QinQ. See the Cisco technical article link below for details of how QinQ operates. The basic idea is that an extra 802.1Q tag is applied to your frames. This "outer tag" is then used by the SP switches to switch traffic between your sites, that is, between your access ports. When the provider runs out of VLANs to use (4096 customer VLANs?), they can then start using a separate set of Ethernet switches ("red cloud", "blue cloud", etc.), re-using VLAN numbers in each separate set of switches.
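To make the "extra tag" idea concrete, here is a minimal Python sketch that builds the header of a double-tagged frame. It assumes the outer tag uses the IEEE 802.1ad TPID value 0x88A8; some equipment uses 0x8100 for the outer tag as well, so treat that choice as an assumption. The function names, MAC addresses, and VLAN numbers are made up for illustration. The sketch also shows where the VLAN limit comes from: the VLAN ID field is 12 bits, giving 4096 values, of which 0 and 4095 are reserved.

```python
import struct

TPID_8021Q = 0x8100    # standard 802.1Q tag, used here for the customer (inner) tag
TPID_8021AD = 0x88A8   # 802.1ad service tag, assumed here for the provider (outer) tag

VLAN_ID_BITS = 12
USABLE_VLAN_IDS = 2 ** VLAN_ID_BITS - 2  # IDs 0 and 4095 are reserved


def vlan_tag(tpid: int, vid: int, pcp: int = 0, dei: int = 0) -> bytes:
    """Pack one 4-byte VLAN tag: 16-bit TPID, then 3-bit PCP, 1-bit DEI, 12-bit VID."""
    if not 0 <= vid < 2 ** VLAN_ID_BITS:
        raise ValueError("VLAN ID must fit in 12 bits")
    if not 0 <= pcp <= 7 or dei not in (0, 1):
        raise ValueError("PCP must be 0-7 and DEI must be 0 or 1")
    tci = (pcp << 13) | (dei << 12) | vid
    return struct.pack("!HH", tpid, tci)


def qinq_header(dst: bytes, src: bytes, outer_vid: int, inner_vid: int,
                ethertype: int = 0x0800, outer_tpid: int = TPID_8021AD) -> bytes:
    """Build destination MAC, source MAC, outer tag, inner tag, and EtherType."""
    if len(dst) != 6 or len(src) != 6:
        raise ValueError("MAC addresses must be 6 bytes")
    return (dst + src
            + vlan_tag(outer_tpid, outer_vid)   # tag added by the SP edge switch
            + vlan_tag(TPID_8021Q, inner_vid)   # customer's own tag, carried untouched
            + struct.pack("!H", ethertype))


# Hypothetical example: customer VLAN 10 carried inside provider VLAN 100.
header = qinq_header(bytes.fromhex("00005e000101"), bytes.fromhex("00005e000102"),
                     outer_vid=100, inner_vid=10)
print(header.hex())
print("usable VLAN IDs per switched cloud:", USABLE_VLAN_IDS)
```

Note that the customer MAC addresses sit in front of both tags, unchanged. The SP switches select the customer by the outer VLAN ID but still forward on those MAC addresses.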
Think about how this behaves. Your MAC addresses are still visible, and are used to switch traffic between sites. That can result in flooding of unknown unicasts. IP multicast still has a multicast MAC address for destination (a sharp observation made to me by Keith Boblits) and floods to all your egress ports, as do broadcasts.
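The flooding behavior can be illustrated with a toy model of a transparent learning switch, sketched in Python below. It is a simplification under stated assumptions: one SP switch, one customer's outer VLAN, no aging of learned addresses, and no IGMP snooping. The class name, port numbers, and MAC addresses are made up for illustration.

```python
def is_group_address(mac: str) -> bool:
    """True for multicast and broadcast MACs (low-order bit of the first octet set)."""
    return bool(int(mac.split(":")[0], 16) & 1)


class LearningSwitch:
    """Toy model of MAC learning on one customer's ports within one outer VLAN."""

    def __init__(self, ports):
        self.ports = set(ports)
        self.mac_table = {}  # learned source MAC -> port it was seen on

    def receive(self, in_port: int, src: str, dst: str) -> list:
        """Learn the source MAC and return the sorted list of egress ports."""
        self.mac_table[src] = in_port
        if is_group_address(dst) or dst not in self.mac_table:
            # Broadcast, multicast, and unknown unicast all flood to the
            # customer's other access ports.
            return sorted(self.ports - {in_port})
        out_port = self.mac_table[dst]
        return [] if out_port == in_port else [out_port]


sw = LearningSwitch(ports=[1, 2, 3])
host_a, host_b = "00:00:5e:00:01:0a", "00:00:5e:00:01:0b"

print(sw.receive(1, host_a, host_b))               # unknown unicast: floods
print(sw.receive(2, host_b, host_a))               # host_a already learned
print(sw.receive(1, host_a, host_b))               # host_b now learned
print(sw.receive(1, host_a, "01:00:5e:01:02:03"))  # IP multicast MAC: floods
print(sw.receive(1, host_a, "ff:ff:ff:ff:ff:ff"))  # broadcast: floods
```

In this model only known unicast traffic is delivered to a single port. Everything else reaches every other site on the service, which is why the number and behavior of the devices you attach matters to the provider.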
If the METN-SP requires you to use routers and not switches for the device connecting to this, then the number of unknown unicasts and broadcasts should be greatly reduced - a good thing. Edge routers would also greatly reduce the number of MAC addresses the edge SP switches would have to learn. Practically speaking, how would one enforce such a rule? Legal conditions may also prevent the SP from trying to enforce a "routers only" rule.
One design I've had in the back of my head is for a moderately-sized school district, manufacturer, retail chain, etc. Putting switches in schools, all connected by METN to a HQ L3 switch, would provide a relatively inexpensive and simple to manage network. In fact, that model works well for shops with many sites, where the data network is viewed as a cost item rather than as strategic value-add. So this approach might be a managed service model that METN-SPs would find attractive. The central L3 switch would mitigate flooding. The downside would be the greater number of MACs needing to be learned by the SP POP edge switch. I personally think stubby routing in such an environment would be simple enough that I'd use it instead.
As far as pros and cons go, on the plus side this scheme is also relatively simple for the SP to deploy. It does have some scalability issues as the number of customers goes up, but partitioning (red/blue) and mixing in MPLS L2 transport may help with that.
This approach does have one significant negative aspect. The SP would be using Spanning Tree Protocol (STP) throughout their delivery network. We have been minimizing use of STP in our designs, due to concerns about convergence time and stability. Minimizing STP also limits the size of "failure domains."
Rapid STP greatly mitigates concerns about convergence. It also increases general stability of STP. With traditional STP, we generally saw random instability set in when the Spanning Tree domain included 20 to 30 switches. We had a couple of early designs where we recommended against that approach, the customer insisted or had already built it, and we observed instability.
This was apparently due partly to STP domain diameter (number of switch hops), and partly to the increasing probability of ripple effects whenever a BPDU was delayed or lost. Faster switch CPUs have also helped with this. Since our designs now generally have at most access layer STP in them, there just seems to be little point to having larger scale STP (or potential failure) domains. We haven't been experimenting live with customer networks to test the stability of RSTP. So perhaps our rule of thumb (20 to 30 switches - likely instability) could be relaxed some.
Thus, I'm not going to say that a 10-20-30 switch local METN-SP switched network will be unstable. I do, however, view larger scale RSTP as a business risk until more experience is gained. The potential "failure domain" remains the entire switched network.
Furthermore, we have had enough experience with troubleshooting customers' STP loops that I can unequivocally state that diagnosing and fixing such a loop can be excruciating, and can take considerable time.