Scaling Network Processor Performance to 40 Gbps

P. Sagmeister, G. Dittmann, A. Herkersdorf, D. Webb
Phone: +41-1-724 8912, e-mail: psa@zurich.ibm.com
IBM Research, Zurich Research Laboratory
Säumerstr. 4/Postfach, CH-8803 Rüschlikon, Switzerland

An exponential increase in traffic volume and the deployment of differentiated services have put substantial stress on today's network processing equipment. With the speed of optical fibers growing much faster than the speed and memory bandwidth of network processors, a large performance gap has opened up between the two. A substantial step towards closing this gap is the scaling of network processor performance.

The presented Load Balancer approach scales network processor performance by distributing the load of a high-speed link in real time onto several concurrently operating, independent network processors. This is done in a flow-preserving manner: all packets of a flow go to the same network processor, so connection state information remains locally valid within the individual network processors and inter-processor communication is avoided. With this Load Balancer concept a number of network processors, e.g. 4, 8 or 16, are able to service a single high-speed link with up to 40 Gbps. Currently this concept is being implemented in a standard 0.18 µm CMOS process for full OC-768 line speed.

The Load Balancer establishes associations between traffic flows and network processors. The flow to which an individual packet belongs is identified by information extracted from the packet header. This flow ID is compressed into an index of fixed length using a static hash function. Hashing a particular flow ID always results in the same index. The hash index identifies a specific entry in a lookup table, which delivers the number of an output queue connected to the target network processor. This two-stage association approach is superior to hashing directly to a network processor number because it makes flexible, traffic-dependent reassociations possible.
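The two-stage association described above can be illustrated with a minimal software sketch. It is not the hardware design: the choice of a 5-tuple as flow ID, CRC-32 as the static hash, the 12-bit index width, and the round-robin initial table contents are all assumptions made for illustration, as are the names flow_id, hash_index, lookup_table and select_queue.

```python
import zlib

TABLE_BITS = 12                 # assumed width of the hash index
TABLE_SIZE = 1 << TABLE_BITS    # number of lookup table entries
NUM_PROCESSORS = 4              # the text mentions e.g. 4, 8 or 16


def flow_id(src_ip, dst_ip, src_port, dst_port, proto):
    """Build a flow ID from header fields (5-tuple is an assumption)."""
    return f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()


def hash_index(fid: bytes) -> int:
    """Stage 1: static hash, so one flow ID always gives the same index.

    CRC-32 stands in for whatever hash the hardware uses.
    """
    return zlib.crc32(fid) & (TABLE_SIZE - 1)


# Stage 2: lookup table mapping each hash index to an output queue number.
# The initial round-robin fill is hypothetical.
lookup_table = [i % NUM_PROCESSORS for i in range(TABLE_SIZE)]


def select_queue(fid: bytes) -> int:
    """Return the output queue (network processor) for a flow."""
    return lookup_table[hash_index(fid)]
```

Because the hash is fixed, rewriting a single lookup table entry redirects every flow that maps to that index without touching the hash function, which is what makes reassociation possible in this scheme.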
A good opportunity to revise an association is the end of the lifetime of a flow, which is detected by means of a time stamp stored in the lookup table. New flows can be directed to the currently least loaded network processor without disturbing flow preservation. This makes it possible to achieve a more uniform distribution of the traffic across network processors, which is directly observed at the fill level of the output queues.

Queuing of packets between the Load Balancer and the network processors is necessary to compensate for speed differences between the high-speed link and the connected network processors. A key requirement for an OC-768 Load Balancer implementation is to avoid an 80 Gbps off-chip memory interface. Therefore, to enable a single-chip hardware implementation, the overall size of the queues must be bounded without packet loss even under worst-case conditions.

Performance validation with an extensive test suite of extrapolated real traces resembling full 40 Gbps link load has shown only two cases in which the concept described so far cannot meet the buffer size requirements: (a) imbalances in flow allocation and (b) a single flow exceeding the total capacity of its assigned network processor (excessive flow). The first situation can occur if the traffic characteristics change such that the flows assigned to a network processor exceed its capacity. In that case, flows are reassigned to the least loaded network processor queue. In the case of an excessive flow, a reassignment would only shift the problem from one network processor to another. However, spraying the packets of an excessive flow instantaneously to the least loaded queue guarantees that conforming flows are not affected.

In conclusion, the Load Balancer is a generic, vendor-independent concept to scale the performance of existing network processor solutions. Due to the parallel processor structure, it is inherently fault-tolerant in that the failure of a single network processor does not result in a total loss of the high-speed link.
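The reassociation at the end of a flow's lifetime and the spraying of an excessive flow, as described above, might be sketched as follows. This is a simplified model under stated assumptions: the idle timeout value, the byte-count fill levels, and the caller-supplied excessive flag are hypothetical, since the text does not specify how flow expiry thresholds or excessive flows are determined. The names Entry, Balancer and dispatch are likewise illustrative.

```python
from dataclasses import dataclass

FLOW_TIMEOUT = 1.0  # assumed idle time (seconds) after which a flow has ended


@dataclass
class Entry:
    queue: int                        # output queue currently associated
    last_seen: float = float("-inf")  # time stamp of the last packet


class Balancer:
    def __init__(self, num_queues: int, table_size: int):
        self.fill = [0] * num_queues  # fill level per output queue (bytes)
        self.table = [Entry(i % num_queues) for i in range(table_size)]

    def least_loaded(self) -> int:
        """Index of the queue with the lowest fill level."""
        return min(range(len(self.fill)), key=self.fill.__getitem__)

    def dispatch(self, index, length, now, excessive=False) -> int:
        """Choose the output queue for one packet of the given hash index."""
        entry = self.table[index]
        if excessive:
            # Spray: send each packet to the least loaded queue and leave
            # the table entry unchanged.
            queue = self.least_loaded()
        else:
            if now - entry.last_seen > FLOW_TIMEOUT:
                # Previous flow has expired: safe to revise the association.
                entry.queue = self.least_loaded()
            entry.last_seen = now
            queue = entry.queue
        self.fill[queue] += length
        return queue
```

In this sketch, packets arriving within the timeout keep their queue, so flow preservation holds for conforming flows; only an expired entry or a flow flagged as excessive is steered by the current fill levels. Draining of the queues by the network processors is omitted.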