NPA Awards Categories

During the last seven years, even after the technology market bust of 2001, NetApp has experienced 20-30% annual growth. Over these seven years, NetApp has grown as a company from a small startup, with a startup mentality, into a global enterprise that requires maturity in all parts of the organization, including its network.

When I arrived in 2005, NetApp's global network had grown to support nearly 100 sites and 5,000 users, not to mention partners, vendors, and customers. However, the corporate network was built on outdated hardware with no software standards, expensive WAN circuits, no configuration standards or templates, and a single OSPF routing domain that could not scale any further. The company had quickly grown from a startup to an enterprise. Now the network also needed to mature.

The first task was to lay the groundwork for network standards and network design templates that could be used to build a new network. Under my leadership, our team worked for nearly a year on a team-based, tested, and completely documented network architecture. As with a house, nothing in a network should be built before you know how to build it. The Network Architecture defines the network via standards and templates. Standards drive consistency, simplify deployment, and ease operational support. Templates speed deployment and enforce network design homogeneity.

Furthermore, nothing is a standard until it is agreed to and written down. These are two key aspects that network teams often miss. First, you must agree to the standards as a team, working to flush out problems and differences of opinion and to drive acceptance. Second, everything that is agreed to as a standard must be written down in an architecture document. Simply saying our standard is "like the site in NYC" is not a standard; it is an implementation of the standard.

Finally, standards must be agreed to as a team, not dictated by one or a few individuals. This certainly does not mean every person on a network team should be involved in setting standards. However, several key engineers and architects should meet regularly as a board to develop and review the architecture. NetApp has accomplished this via the Network Architecture Review Board (nARB), which I chair. The nARB meets weekly to discuss pressing architecture topics. Proposals are made, action items are assigned, and, ultimately, new standards or changes to existing standards are approved and included in the Network Architecture. The Revision Process governs how changes are incorporated into the Architecture.

While we worked on the new network architecture, we concurrently began the procurement process for a new global network based on MPLS IP VPNs. I was the project lead with overall responsibility to deliver the new global network. I also designed the global routing for the new network, based on an incredibly scalable and elegant BGP and OSPF design.

We began by defining requirements for the new global network including performance, SLAs, costs, and technologies. We then developed an RFI which led to a written RFP. That RFP process ultimately led to the selection of a new global MPLS IP VPN provider for NetApp.

Over the last three years, MPLS VPN services provided by carriers (RFC 2547bis) have exploded into enterprise networks. Since MPLS carriers provide a layer-3 environment for VPN customers, a routing protocol is needed between carriers and customers to exchange routing information. While many carriers will support OSPF, RIP (gasp!), or EIGRP on links to customers, none do it enthusiastically. Some require special engineering approval. Since their MPLS backbones are based on BGP, they want enterprise customers to use BGP.

However, on traditional global enterprise networks, BGP was used to interconnect regional IGP autonomous systems. eBGP was used on the few large, expensive, international circuits that connected the regions. Inside the region an IGP - either OSPF or EIGRP - was used for in-country routing. Sometimes there was iBGP, but there were often synchronization problems that led to routing loops. Often eBGP was just redistributed into the IGP, but that was incredibly messy. These were the aftereffects of hastily done network conversions. It was easier to redistribute than to build a proper iBGP network, supported by the IGP, inside each region.

In 2005, NetApp's global network routing architecture was a single OSPF autonomous system. Without documented standards, implementations of area border routers, network advertisement, and route summarization were very different from site to site. This limited OSPF's scalability and long-term usage in the network. At the same time, NetApp was conducting an RFP to replace the legacy WAN with MPLS. The NetApp Networking Team knew BGP had to be used, but how it would integrate into the legacy OSPF network was an issue.

Considering what NetApp's engineers had experienced in traditional enterprise networks, our legacy OSPF setup, and the carriers' requirement to use BGP, the NetApp Networking Team conducted a week-long Network Engineering Conference in the summer of 2006 to build a highly scalable and proper BGP design to support the forthcoming MPLS conversion. At first the team toyed with routing protocol redistribution. Then we delved into iBGP with an IGP inside each region. We were held back by the belief that since OSPF, at the time, was our core routing protocol, we had to continue using it in the same fashion. However, as we worked through the design, it became clear the best path was to embrace BGP as our core routing protocol and relegate OSPF to a site routing protocol. When we made that decision, our new routing design began to flourish.

The basic tenet of the design is that each site in the network - no matter the size - is in its own BGP AS using private BGP AS numbers. That provides 1,024 AS numbers, which is more than enough for most networks (should a network have more than 1,024 sites, MPLS carriers can apply AS override so that remote sites can reuse AS numbers). With each site in its own AS, the WAN links at each site - be they MPLS, private-line, or GRE tunnel - would run eBGP. BGP thus became the core WAN routing protocol. This met the MPLS carriers' requirements and made NetApp's WAN routing much simpler. We now had a protocol with the scalability to handle thousands of routes and with enough protocol features (filter-lists, route attributes, communities, etc.) to implement routing policy (something OSPF lacks).
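
As a rough illustration of the one-AS-per-site idea, the following Python sketch hands out a unique private AS number to each site. It is only a sketch under stated assumptions: the site names and the function name are hypothetical, and the pool is assumed to be the 16-bit private block 64512 through 65535, which is where a count of 1,024 comes from (the last value, 65535, is reserved under RFC 6996, so a real plan would likely stop at 65534).

```python
# Hypothetical sketch: assign one private BGP AS number per site.
# Assumption: the pool is the 16-bit private block 64512-65535 (1,024 values).
# Note: RFC 6996 reserves 65535, so a production plan would likely avoid it.
PRIVATE_AS_FIRST = 64512
PRIVATE_AS_LAST = 65535


def assign_site_as_numbers(site_names):
    """Map each distinct site name to a unique private AS number."""
    sites = sorted(set(site_names))  # sorted so the plan is reproducible
    pool_size = PRIVATE_AS_LAST - PRIVATE_AS_FIRST + 1
    if len(sites) > pool_size:
        # Beyond this point AS numbers would have to be reused across sites.
        raise ValueError(f"{len(sites)} sites exceed the {pool_size} private AS numbers")
    return {site: PRIVATE_AS_FIRST + i for i, site in enumerate(sites)}


# Site names below are illustrative only.
plan = assign_site_as_numbers(["sunnyvale", "nyc", "amsterdam"])
for site, asn in plan.items():
    print(f"{site}: AS {asn}")
```

Keeping the assignment in one place like this is one way to make sure the AS plan itself is written down rather than implied by existing configurations.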

Next we developed one of the best parts, and a key differentiator, of the BGP design. At each site in the network all traffic flows through the "core" (a pair of high-end, chassis-based routers). We used this rule to design BGP, placing the core routers at the center of BGP at each site. These core routers serve as an iBGP route-reflector cluster that peers via iBGP with the WAN routers. Using a route-reflector cluster avoided the need for an iBGP full mesh. The core routers originate all of the site's BGP routes and advertise those routes to the WAN routers via iBGP. The WAN routers then advertise those routes to eBGP peers over the WAN (MPLS carriers, other sites, etc.). Filtering policy is applied at the edge on the WAN routers. In the other direction, the core routers learn routes for external sites via iBGP from the WAN routers (which have already learned the routes via eBGP). Thus, the core routers know all routes in the entire network. BGP easily scales to handle these global routes, unlike OSPF, which does not handle thousands of routes well. This sets up a very elegant and fast BGP design. Failover occurs within 5 seconds when a WAN link goes down, and the design can scale quickly.
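
The route flow at a single site can be sketched as a toy model in Python. This is not a BGP implementation: it performs no best-path selection or filtering, the router names and prefixes are hypothetical, and it only traces which router learns which prefix from where under the design described above.

```python
# Toy model of route propagation at one site (not a real BGP implementation).
# Router names and prefixes are hypothetical.

def simulate_site(local_prefixes, remote_by_wan):
    """Return {router: {prefix: how the router learned it}} for one site.

    local_prefixes: prefixes originated by the core routers.
    remote_by_wan:  {wan_router: [prefixes it learns from eBGP peers]}.
    """
    cores = ["core1", "core2"]
    wans = sorted(remote_by_wan)
    tables = {router: {} for router in cores + wans}

    # 1. The core routers originate the site's own prefixes.
    for core in cores:
        for prefix in local_prefixes:
            tables[core][prefix] = "originated"

    # 2. The cores advertise those prefixes to the WAN routers via iBGP.
    for wan in wans:
        for prefix in local_prefixes:
            tables[wan][prefix] = "iBGP from core"

    # 3. Each WAN router learns remote prefixes from its eBGP peers.
    for wan, prefixes in remote_by_wan.items():
        for prefix in prefixes:
            tables[wan][prefix] = "eBGP"

    # 4. The cores learn the remote prefixes via iBGP from the WAN routers.
    for wan, prefixes in remote_by_wan.items():
        for core in cores:
            for prefix in prefixes:
                tables[core].setdefault(prefix, f"iBGP from {wan}")

    # 5. Acting as route reflectors, the cores pass each WAN router's routes
    #    to the other WAN routers, so no iBGP full mesh is needed.
    for wan, prefixes in remote_by_wan.items():
        for other in wans:
            if other != wan:
                for prefix in prefixes:
                    tables[other].setdefault(prefix, f"reflected from {wan}")

    return tables


tables = simulate_site(
    local_prefixes=["10.1.0.0/16"],
    remote_by_wan={"wan1": ["10.2.0.0/16"], "wan2": ["10.3.0.0/16"]},
)
for router, routes in tables.items():
    print(router, routes)
```

In this model both core routers end up holding the local prefix and every remote prefix, which is the property the design relies on.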

OSPF is still used, but only for local LAN routing and to allow iBGP sessions to establish between loopback interfaces. The core routers, since all packets flow through them, advertise a default route (0.0.0.0/0) into OSPF. A packet from a host that enters the LAN access layer and is destined for a remote site follows the default route to the core. Since, as mentioned, the core routers know all the routes in the network, they make the proper next-hop decision to the appropriate WAN router based on iBGP. By using a default route to bring traffic to the core routers, no routing protocol redistribution is necessary. This makes OSPF and BGP very stable.
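
The forwarding behavior described here comes down to longest-prefix matching at two points. The sketch below, using Python's standard `ipaddress` module, assumes hypothetical route tables and router names: the access layer holds only its connected subnet plus the OSPF default route, while the core holds the specific BGP routes.

```python
import ipaddress

# Hypothetical route tables: {prefix: next hop}.
# The access layer knows only its connected subnet and the OSPF default route.
access_table = {"10.1.20.0/24": "connected", "0.0.0.0/0": "core1"}
# The core knows the specific routes learned through BGP.
core_table = {"10.1.0.0/16": "local-lan", "10.2.0.0/16": "wan1", "10.3.0.0/16": "wan2"}


def lookup(table, destination):
    """Longest-prefix match: next hop of the most specific matching route."""
    addr = ipaddress.ip_address(destination)
    matches = [
        (ipaddress.ip_network(prefix), hop)
        for prefix, hop in table.items()
        if addr in ipaddress.ip_network(prefix)
    ]
    if not matches:
        return None
    return max(matches, key=lambda match: match[0].prefixlen)[1]


def forward(destination):
    """Trace the next hops a packet takes from the access layer."""
    path = [lookup(access_table, destination)]
    if path[0] == "core1":
        # Traffic that followed the default route is resolved at the core.
        path.append(lookup(core_table, destination))
    return path


print(forward("10.2.5.9"))   # remote site: default route to core, then a WAN router
print(forward("10.1.20.7"))  # local subnet: delivered without leaving the access layer
```

Because the access layer never needs the remote prefixes, nothing has to be redistributed from BGP into OSPF in this model.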

This forms the principal routing protocol design of our network now. The design has proven very resilient, scalable, and flexible. It has stabilized our routing tables and removed OSPF's inefficiencies. However, flexibility, provided by BGP, is the key attribute that allowed us to do more with our network.

We have extended this design to include dynamic Internet access and a separate lab network to facilitate software development. All of this work cut overall telecom costs by 18% with a 6-month ROI. Scalability has been greatly enhanced, and availability is near four nines (99.99%).