The bandwidth challenge brought by AI
According to public data, the parameter count grew from 110 million in GPT-1 to 175 billion in GPT-3, and GPT-4 reportedly has about 1.8 trillion parameters. According to market research firm TrendForce, the GPT-3.5 large model needs about 20,000 GPUs to process its training data, calculated on the basis of the processing power of the NVIDIA A100.
ChatGPT and the wave of large AI models that followed have improved retrieval and work efficiency in the AIGC (AI-generated content) industry. They have also raised the demand for high-speed storage and data exchange on top of GPU computing, because large models need large amounts of data and computing power for training and operation, and they generate large amounts of new data during use.
The capacity of switching chips doubles every two years
Broadcom has long represented the most advanced solutions in switching chips. Tomahawk is its series of high-bandwidth switch platform chips, and the Tomahawk 5 is a 51.2T switching chip built on a 5-nanometer process, aimed mainly at the merchant switch and router chip market for hyperscale enterprises and cloud service providers.
The Tomahawk 5 can drive a total bandwidth of 51.2Tb/s, configured as 64 ports running at 800Gb/s, 128 ports running at 400Gb/s, or 256 ports running at 200Gb/s. Hyperscale enterprises and cloud service providers prefer switches with 128 ports running at 400Gb/s each: 64 ports connect down to the servers in the rack, and 64 ports connect up to the spine layer of the network fabric. An AI server cluster can use 256 ports running at 200Gb/s, which goes a long way toward meeting the data exchange requirements of AI computing.
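The port counts above follow directly from dividing the total switching capacity by the per-port speed. The short Python sketch below reproduces that arithmetic and the 64-down/64-up split; the function names are illustrative and are not part of any vendor tool.

```python
# Total switching capacity of the chip in Gb/s (51.2 Tb/s, as stated in the text).
TOTAL_CAPACITY_GBPS = 51_200


def port_count(port_speed_gbps: int, capacity_gbps: int = TOTAL_CAPACITY_GBPS) -> int:
    """Number of ports the chip can drive at a given per-port speed."""
    if port_speed_gbps <= 0 or capacity_gbps % port_speed_gbps != 0:
        raise ValueError("port speed must be positive and divide the capacity evenly")
    return capacity_gbps // port_speed_gbps


def leaf_split(ports: int) -> tuple[int, int]:
    """Split ports evenly into (downlinks to servers, uplinks to the spine)."""
    down = ports // 2
    return down, ports - down


for speed in (800, 400, 200):
    print(f"{speed} Gb/s per port: {port_count(speed)} ports")

down, up = leaf_split(port_count(400))
print(f"400G leaf switch: {down} ports down, {up} ports up")
```

An even split gives the same bandwidth toward the servers as toward the spine, so the leaf layer is not oversubscribed.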
These high-capacity switching chips generally come in large FCBGA packages, with a package size greater than 60cm² and more than 8,000 LGA pins.
Interconnect schemes for 800G switches as optics advance and copper retreats
As the per-lane rate of high-speed signals rises from 50G to 100G to 200G, the loss of the transmission channel rises from less than 10dB to more than 20dB. Once the signal rate exceeds 200G, interconnection on the PCB becomes more complex, and almost all signal trace lengths will exceed the loss budget of a 1m DAC cable link.
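The loss-budget argument can be made concrete with a rough reach calculation: subtract the fixed losses from the budget, then divide what remains by the loss per unit length of the medium. In the Python sketch below, every dB figure is a placeholder chosen for illustration, not a measurement or a vendor specification, and the function name is hypothetical.

```python
def max_reach_inches(budget_db: float, fixed_loss_db: float, loss_db_per_inch: float) -> float:
    """Longest run that still fits inside the channel loss budget."""
    if loss_db_per_inch <= 0:
        raise ValueError("loss per inch must be positive")
    remaining_db = budget_db - fixed_loss_db
    return max(0.0, remaining_db / loss_db_per_inch)


# Placeholder values, assumed only for illustration.
BUDGET_DB = 20.0       # total end-to-end loss the link can tolerate
FIXED_LOSS_DB = 8.0    # package, connectors, and vias
LOSS_DB_PER_INCH = {
    "PCB trace": 1.5,       # assumed loss of a board trace
    "flyover cable": 0.25,  # assumed loss of a twinax flyover cable
}

for medium, loss in LOSS_DB_PER_INCH.items():
    reach = max_reach_inches(BUDGET_DB, FIXED_LOSS_DB, loss)
    print(f"{medium}: up to {reach:.1f} inches")
```

Whatever the real figures are for a given design, the relationship is the same: a medium with lower loss per inch stretches the same budget over a longer run, which is the case for moving high-speed lanes off the PCB.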
Celestica demonstrated an 800G switch based on the Broadcom Tomahawk 5 switching chip at the OCP Summit 2023. It uses flyover copper cables to ease routing and heat dissipation problems.
Flyover copper cable technology addresses the main design and technical challenges of 800G switches:
Easier routing to the pluggable optical modules (shorter cable length and better signal quality): because 800G switches are based on 112Gbps lanes and dense switch trace routing, it is difficult to keep signal noise low when routing on the PCB. The main options are a co-packaged optics (CPO) solution or the easier-to-implement, lower-cost flyover copper cable solution;
Power consumption and heat dissipation: the 800G switching chip and the optical modules need larger heat sinks for cooling. Because flyover copper cables are used, a large amount of space is left in between, which makes it easier to install large heat sinks.
Co-packaged optics (CPO) is the other alternative. As 800G switches enter production and lane rates reach 224Gbps, CPO will become increasingly popular.
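To see why the lane rate matters, it helps to count how many electrical lanes a port or a whole chip needs at each SerDes generation. The Python sketch below assumes the common convention that a 112G-class lane carries 100Gb/s of payload and a 224G-class lane carries 200Gb/s, with the remainder going to coding overhead; the mapping and the function name are illustrative assumptions rather than figures from the text.

```python
# Nominal payload per lane in Gb/s for each SerDes generation.
# Assumption: 112G-class lanes carry 100G and 224G-class lanes carry 200G.
PAYLOAD_PER_LANE_GBPS = {"112G": 100, "224G": 200}


def lanes_needed(bandwidth_gbps: int, serdes: str) -> int:
    """Lanes required to carry the given bandwidth, rounded up."""
    per_lane = PAYLOAD_PER_LANE_GBPS[serdes]
    return -(-bandwidth_gbps // per_lane)  # integer ceiling division


for serdes in PAYLOAD_PER_LANE_GBPS:
    per_port = lanes_needed(800, serdes)
    per_chip = lanes_needed(51_200, serdes)
    print(f"{serdes} SerDes: {per_port} lanes per 800G port, {per_chip} lanes per 51.2T chip")
```

Doubling the lane rate halves the number of lanes to route, but each remaining lane is harder to carry across a lossy board.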
Evolution of optical module integration
In current servers, optical modules convert between electrical and optical signals so that data can be transmitted to switches or other servers.
In co-packaged optics (CPO), functional devices from the optical module, such as optical receivers, amplifiers, and DSPs, are integrated onto a PCB carrier board. A CPO solution can reportedly reduce power consumption by 30% and cost by 40%, while increasing data exchange density in a smaller volume.
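As a quick illustration of what those percentages mean in practice, the Python sketch below applies the quoted 30% power and 40% cost reductions to a baseline. The baseline figures are hypothetical placeholders, not data for any real switch.

```python
# Reductions quoted in the text for a CPO solution.
POWER_REDUCTION_PCT = 30
COST_REDUCTION_PCT = 40


def after_reduction(baseline: float, reduction_pct: int) -> float:
    """Value remaining after cutting the baseline by a percentage."""
    return baseline * (100 - reduction_pct) / 100


# Hypothetical baseline for a switch with pluggable optical modules.
baseline_power_w = 1000.0
baseline_cost = 50_000.0

print(f"power: {baseline_power_w:.0f} W -> {after_reduction(baseline_power_w, POWER_REDUCTION_PCT):.0f} W")
print(f"cost:  {baseline_cost:.0f} -> {after_reduction(baseline_cost, COST_REDUCTION_PCT):.0f}")
```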
Ruijie Networks 25.6T silicon photonics CPO switch
Ruijie Networks 51.2T silicon photonics NPO switch