NVLink

NVLink
Manufacturer	Nvidia
Type	Multi-GPU and CPU technology
Predecessor	Scalable Link Interface

High speed chip interconnect

NVLink is a wire-based, serial, multi-lane, near-range communications link developed by Nvidia. Unlike PCI Express, a device can have multiple NVLinks, and devices communicate over a mesh network rather than through a central hub. The protocol was first announced in March 2014 and uses a proprietary high-speed signaling interconnect (NVHS).^[1]

Performance

The table of interconnect metrics at the end of this article gives a basic comparison based on standard specifications.

The following table compares the relevant bus parameters of real-world semiconductors that all offer NVLink as one of their options:

Semiconductor	Board/bus delivery variant	Interconnect	Transmission rate (per lane)	Lanes per sub-link (out + in)	Sub-link data rate (per direction)	Sub-link or unit count	Total data rate per direction (out + in)	Total lanes (out + in)	Total data rate (out and in combined)
Nvidia GP100	P100 SXM,^[9] P100 PCI-E^[10]	PCIe 3.0	8 GT/s	16 + 16 Ⓑ	128 Gbit/s = 16 GB/s	1	16 + 16 GB/s^[11]	32 Ⓒ	32 GB/s
Nvidia GV100	V100 SXM2,^[12] V100 PCI-E^[13]	PCIe 3.0	8 GT/s	16 + 16 Ⓑ	128 Gbit/s = 16 GB/s	1	16 + 16 GB/s	32 Ⓒ	32 GB/s
Nvidia TU104	GeForce RTX 2080, Quadro RTX 5000	PCIe 3.0	8 GT/s	16 + 16 Ⓑ	128 Gbit/s = 16 GB/s	1	16 + 16 GB/s	32 Ⓒ	32 GB/s
Nvidia TU102	GeForce RTX 2080 Ti, Quadro RTX 6000/8000	PCIe 3.0	8 GT/s	16 + 16 Ⓑ	128 Gbit/s = 16 GB/s	1	16 + 16 GB/s	32 Ⓒ	32 GB/s
Nvidia Xavier^[14]	(generic)	PCIe 4.0 Ⓓ; 2 units: x8 (dual); 1 unit: x4 (dual); 3 units: x1^[15]^[16]	16 GT/s	8 + 8 Ⓑ; 4 + 4 Ⓑ; 1 + 1	128 Gbit/s = 16 GB/s; 64 Gbit/s = 8 GB/s; 16 Gbit/s = 2 GB/s	Ⓓ 2; 1; 3	Ⓓ 32 + 32 GB/s; 8 + 8 GB/s; 6 + 6 GB/s	40 Ⓑ	80 GB/s
IBM Power9^[17]	(generic)	PCIe 4.0	16 GT/s	16 + 16 Ⓑ	256 Gbit/s = 32 GB/s	3	96 + 96 GB/s	96	192 GB/s
Nvidia GA100^[18]^[19], Nvidia GA102^[20]	Ampere A100 (SXM4 & PCIe)^[21]	PCIe 4.0	16 GT/s	16 + 16 Ⓑ	256 Gbit/s = 32 GB/s	1	32 + 32 GB/s	32 Ⓒ	64 GB/s
Nvidia GP100	P100 SXM (not available with P100 PCI-E)^[22]	NVLink 1.0	20 GT/s	8 + 8 Ⓐ	160 Gbit/s = 20 GB/s	4	80 + 80 GB/s	64	160 GB/s
Nvidia Xavier	(generic)	NVLink 1.0^[14]	20 GT/s^[14]	8 + 8 Ⓐ	160 Gbit/s = 20 GB/s^[23]
IBM Power8+	(generic)	NVLink 1.0	20 GT/s	8 + 8 Ⓐ	160 Gbit/s = 20 GB/s	4	80 + 80 GB/s	64	160 GB/s
Nvidia GV100	V100 SXM2^[24] (not available with V100 PCI-E)	NVLink 2.0	25 GT/s	8 + 8 Ⓐ	200 Gbit/s = 25 GB/s	6^[25]	150 + 150 GB/s	96	300 GB/s
IBM Power9^[26]	(generic)	NVLink 2.0 (BlueLink ports)	25 GT/s	8 + 8 Ⓐ	200 Gbit/s = 25 GB/s	6	150 + 150 GB/s	96	300 GB/s
NVSwitch for Volta^[27]	(generic) (fully connected 18x18 switch)	NVLink 2.0	25 GT/s	8 + 8 Ⓐ	200 Gbit/s = 25 GB/s	2 * 8 + 2 = 18	450 + 450 GB/s	288	900 GB/s
Nvidia TU104	GeForce RTX 2080, Quadro RTX 5000^[28]	NVLink 2.0	25 GT/s	8 + 8 Ⓐ	200 Gbit/s = 25 GB/s	1	25 + 25 GB/s	16	50 GB/s
Nvidia TU102	GeForce RTX 2080 Ti, Quadro RTX 6000/8000^[28]	NVLink 2.0	25 GT/s	8 + 8 Ⓐ	200 Gbit/s = 25 GB/s	2	50 + 50 GB/s	32	100 GB/s
Nvidia GA100^[18]^[19]	Ampere A100 (SXM4 & PCIe^[21])	NVLink 3.0	50 GT/s	4 + 4 Ⓐ	200 Gbit/s = 25 GB/s	12^[29]	300 + 300 GB/s	96	600 GB/s
Nvidia GA102^[20]	GeForce RTX 3090, Quadro RTX A6000	NVLink 3.0	28.125 GT/s	4 + 4 Ⓐ	112.5 Gbit/s = 14.0625 GB/s	4	56.25 + 56.25 GB/s	16	112.5 GB/s
NVSwitch for Ampere^[30]	(generic) (fully connected 18x18 switch)	NVLink 3.0	50 GT/s	8 + 8 Ⓐ	400 Gbit/s = 50 GB/s	2 * 8 + 2 = 18	900 + 900 GB/s	288	1800 GB/s
NVSwitch for Hopper^[30]	(fully connected 64 port switch)	NVLink 4.0	106.25 GT/s	9 + 9 Ⓐ	450 Gbit/s	18	900 GB/s	128	7200 GB/s
Nvidia Grace CPU^[31]	Nvidia GH200 Superchip	PCIe-5 (4x, 16x) @ 512 GB/s
Nvidia Grace CPU^[32]	Nvidia GH200 Superchip	NVLink-C2C @ 900 GB/s
Nvidia Hopper GPU^[33]	Nvidia GH200 Superchip	NVLink-C2C @ 900 GB/s
Nvidia Hopper GPU^[34]	Nvidia GH200 Superchip	NVLink 4 (18x) @ 900 GB/s

Note: The data rate columns are rounded; they are approximated by the raw transmission rate. See the paragraph on real-world performance below.

Ⓐ: sample value; NVLink sub-link bundling should be possible

Ⓑ: sample value; other fractions for the PCIe lane usage should be possible

Ⓒ: each PCIe lane transfers data over one differential pair per direction; an x16 link therefore counts 16 lanes out plus 16 lanes in

Ⓓ: various limitations of finally possible combinations might apply due to chip pin muxing and board design

dual: interface unit can either be configured as a root hub or an end point

generic: bare semiconductor without any board design specific restrictions applied
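The arithmetic behind the NVLink rows of the table can be sketched as follows. This is a minimal illustration, assuming the raw per-lane rate is used without any line-code or protocol overhead (as the note above states); the function names are hypothetical and not part of any Nvidia tool.

```python
def sublink_rate_gbyte_s(rate_gt_s: float, lanes_per_direction: int) -> float:
    """Raw sub-link data rate per direction in GB/s (1 transfer = 1 bit per lane)."""
    return rate_gt_s * lanes_per_direction / 8.0


def link_totals(rate_gt_s: float, lanes_per_direction: int, sublinks: int) -> dict:
    """Aggregate figures for a chip with `sublinks` identical sub-links."""
    per_direction = sublink_rate_gbyte_s(rate_gt_s, lanes_per_direction) * sublinks
    return {
        "per_direction_gbyte_s": per_direction,      # e.g. "150 + 150 GB/s"
        "combined_gbyte_s": 2 * per_direction,       # out and in added together
        "total_lanes": 2 * lanes_per_direction * sublinks,
    }


# Rows taken from the table: (per-lane rate in GT/s, lanes per direction, sub-links)
examples = {
    "GP100, NVLink 1.0": (20, 8, 4),
    "GV100, NVLink 2.0": (25, 8, 6),
    "GA100, NVLink 3.0": (50, 4, 12),
}

for name, args in examples.items():
    print(name, link_totals(*args))
```

For the three rows chosen here the computed values agree with the table (160, 300 and 600 GB/s combined).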

Real-world performance can be estimated by applying the various encapsulation overheads as well as the usage rate. These overheads come from several sources:

128b/130b line code (see e.g. PCI Express data transmission for versions 3.0 and higher)
Link control characters
Transaction header
Buffering capabilities (depends on device)
DMA usage on computer side (depends on other software, usually negligible on benchmarks)

These limitations usually reduce the usable data rate to between 90 and 95% of the transfer rate. NVLink benchmarks show an achievable transfer rate of about 35.3 GB/s (host to device) for a 40 GB/s NVLink connection (2 sub-links uplink) towards a P100 GPU in a system driven by a set of IBM Power8 CPUs.^[35]
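The benchmark figure can be put next to the 90 to 95% rule of thumb with a small calculation. This is only a sketch; the helper name is hypothetical, and because the result is a ratio it does not depend on the unit as long as both values use the same one.

```python
def utilisation(measured_rate: float, nominal_rate: float) -> float:
    """Fraction of the nominal transfer rate that was actually achieved."""
    return measured_rate / nominal_rate


# Figures quoted above: 35.3 measured on a nominal 40 link (same unit for both).
ratio = utilisation(35.3, 40.0)
within_rule_of_thumb = 0.90 <= ratio <= 0.95

print(f"achieved fraction of nominal rate: {ratio:.4f}")
print(f"inside the 90-95% band: {within_rule_of_thumb}")
```

The measured host-to-device value lands slightly below the quoted band, which is plausible given that buffering and DMA behaviour add to the pure encoding overhead.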

Usage with plug-in boards

A small number of high-end gaming and professional graphics boards expose extra connectors for joining them into an NVLink group. For these boards a similar number of slightly varying, relatively compact, PCB-based interconnection plugs exists. Typically only boards of the same type will mate together due to their physical and logical design. Some setups require two identical plugs to achieve the full data rate. The typical plug is currently U-shaped, with a fine-pitch edge connector on each of the two end strokes of the shape, facing away from the viewer. The width of the plug determines how far apart the plug-in cards must be seated on the main board of the host system, so card placement is commonly dictated by the matching plug (known plug widths are 3 to 5 slots and also depend on board type).^[36]^[37] Because of its structural design and appearance, the interconnect is often referred to as Scalable Link Interface (SLI), a name dating from 2004, even though the modern NVLink-based design is technically quite different and has different features at its basic levels. Reported real-world devices are:^[38]

Quadro GP100 (a pair of cards will make use of up to 2 bridges;^[39] the setup realizes either 2 or 4 NVLink connections with up to 160 GB/s^[40] - this might resemble NVLink 1.0 with 20 GT/s)
Quadro GV100 (a pair of cards will need up to 2 bridges and realize up to 200 GB/s^[36] - this might resemble NVLink 2.0 with 25 GT/s and 4 links)
GeForce RTX 2080 based on TU104 (with single bridge "GeForce RTX NVLink-Bridge"^[41])
GeForce RTX 2080 Ti based on TU102 (with single bridge "GeForce RTX NVLink-Bridge"^[37])
Quadro RTX 5000^[42] based on TU104^[43] (with single bridge "NVLink" up to 50 GB/s^[44] - this might resemble NVLink 2.0 with 25 GT/s and 1 link)
Quadro RTX 6000^[42] based on TU102^[43] (with single bridge "NVLink HB" up to 100 GB/s^[44] - this might resemble NVLink 2.0 with 25 GT/s and 2 links)
Quadro RTX 8000^[42] based on TU102^[45] (with single bridge "NVLink HB" up to 100 GB/s^[44] - this might resemble NVLink 2.0 with 25 GT/s and 2 links)

History

On 5 April 2016, Nvidia announced that NVLink would be implemented in the Pascal-microarchitecture-based GP100 GPU, as used in, for example, Nvidia Tesla P100 products.^[48] With the introduction of the DGX-1 high-performance computer base it became possible to have up to eight P100 modules in a single rack system connected to up to two host CPUs. The carrier board (...) allows for a dedicated board for routing the NVLink connections – each P100 requires 800 pins, 400 for PCIe + power, and another 400 for the NVLinks, adding up to nearly 1600 board traces for NVLinks alone (...).^[49] Each CPU has a direct connection to four P100 units via PCIe, and each P100 has one NVLink to each of the three other P100s in the same CPU group plus one more NVLink to one P100 in the other CPU group. Each NVLink (link interface) offers 20 GB/s up and 20 GB/s down; with 4 links per GP100 GPU, the aggregate bandwidth is 80 GB/s up and another 80 GB/s down.^[50] NVLink supports routing, so that in the DGX-1 design every P100 can reach 4 of the other 7 P100s directly and the remaining 3 with only one hop. According to depictions in Nvidia's blog posts from 2014, NVLink allows bundling of individual links for increased point-to-point performance, so that, for example, a design with two P100s and all links established between the two units would allow the full NVLink bandwidth of 80 GB/s between them.^[51]
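The reachability claim for the eight-GPU DGX-1 layout can be checked with a small graph model. This is a sketch under stated assumptions: GPUs 0 to 3 and 4 to 7 form the two CPU groups, and the single cross-group link of GPU i is assumed to go to GPU i + 4 (the text only says "one P100 in the other CPU group"). The function names are hypothetical.

```python
from collections import deque


def build_dgx1_p100_links() -> dict:
    """Adjacency sets for 8 GPUs, each with 4 NVLinks as described in the text."""
    links = {gpu: set() for gpu in range(8)}
    for group in (range(0, 4), range(4, 8)):
        for a in group:
            for b in group:
                if a != b:
                    links[a].add(b)      # 3 links inside the own CPU group
    for a in range(4):
        links[a].add(a + 4)              # assumed pairing for the cross-group link
        links[a + 4].add(a)
    return links


def link_distances(links: dict, start: int) -> dict:
    """Breadth-first search: number of NVLinks traversed to reach each GPU."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in links[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist


links = build_dgx1_p100_links()
for gpu in range(8):
    dist = link_distances(links, gpu)
    direct = sorted(g for g, d in dist.items() if d == 1)
    routed = sorted(g for g, d in dist.items() if d == 2)
    print(f"GPU {gpu}: direct {direct}, via one intermediate GPU {routed}")
```

Under these assumptions every GPU uses exactly four links, reaches four peers directly, and reaches the remaining three through a single intermediate GPU.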

At GTC 2017, Nvidia presented its Volta generation of GPUs and indicated the integration of a revised version 2.0 of NVLink that would allow total I/O data rates of 300 GB/s for a single chip of this design. Nvidia further announced the option of pre-ordering, with delivery promised for Q3 2017, the DGX-1 and DGX-Station high-performance computers, which would be equipped with V100 GPU modules and have NVLink 2.0 realized in either a networked fashion (two groups of four V100 modules with inter-group connectivity) or a fully interconnected fashion (one group of four V100 modules).

In 2017-2018, IBM and Nvidia delivered the Summit and Sierra supercomputers for the US Department of Energy^[52] which combine IBM's POWER9 family of CPUs and Nvidia's Volta architecture, using NVLink 2.0 for the CPU-GPU and GPU-GPU interconnects and InfiniBand EDR for the system interconnects.^[53]

In 2020, Nvidia announced that from January 1, 2021 it would no longer add new SLI driver profiles for the RTX 2000 series and older.^[54]

This article uses material from the Wikipedia article NVLink, and is written by contributors. Text is available under a CC BY-SA 4.0 International License; additional terms may apply. Images, videos and audio are available under their respective licenses.

References

[1] Nvidia NVLINK 2.0 arrives in IBM servers next year by Jon Worrel on fudzilla.com on August 24, 2016
[2] "NVIDIA DGX-1 With Tesla V100 System Architecture" (PDF).
[3] "What Is NVLink?". Nvidia. 2014-11-14.
[4] Smith, Ryan (May 14, 2020). "NVIDIA Ampere Unleashed: NVIDIA Announces New GPU Architecture, A100 GPU, and Accelerator". AnandTech.
[5] Jacobs, Blair (2022-03-23). "Nvidia reveals next-gen Hopper GPU architecture". Club386. Retrieved 2022-05-04.
[6] "PCIe - PCI Express (1.1 / 2.0 / 3.0 / 4.0 / 5.0)". www.elektronik-kompendium.de.
[7] Alcorn, Paul (17 January 2019). "PCIe 5.0 Is Ready For Prime Time". Tom's Hardware.
[8] "NVLink-Network Switch - NVIDIA's Switch Chip for High Communication-Bandwidth SuperPODs" (PDF). HotChips 34. 23 August 2022.
[9] heise online. "NVIDIA Tesla P100 [SXM2], 16GB HBM2 (NVTP100-SXM) | heise online Preisvergleich / Deutschland". geizhals.de.
[10] heise online (14 August 2023). "PNY Tesla P100 [PCIe], 16GB HBM2 (TCSP100M-16GB-PB/NVTP100-16) ab € 4990,00 (2020) | heise online Preisvergleich / Deutschland". geizhals.de.
[11] NVLink Takes GPU Acceleration To The Next Level by Timothy Prickett Morgan at nextplatform.com on May 4, 2016
[12] "NVIDIA Tesla V100 SXM2 16 GB Specs". TechPowerUp. 14 August 2023.
[13] heise online (14 August 2023). "PNY Quadro GV100, 32GB HBM2, 4x DP (VCQGV100-PB) ab € 10199,00 (2020) | heise online Preisvergleich / Deutschland". geizhals.de.
[14] Tegra Xavier - Nvidia at wikichip.org
[15] JETSON AGX XAVIER PLATFORM ADAPTATION AND BRING-UP GUIDE "Tegra194 PCIe Controller Features" on page 14; stored at arrow.com
[16] How to enable PCIe x2 slot with Xavier? on devtalk.nvidia.com
[17] POWER9 Webinar presentation by IBM for Power Systems VUG by Jeff Stuecheli on January 26, 2017
[18] Morgan, Timothy Prickett (May 14, 2020). "Nvidia Unifies AI Compute With "Ampere" GPU". The Next Platform.
[19] "Data sheet" (PDF). www.nvidia.com. Retrieved 2020-09-15.
[20] "NVIDIA ampere GA102 GPU Architecture Whitepaper" (PDF). nvidia.com. Retrieved 2 May 2023.
[21] "Tensor Core GPU" (PDF). nvidia.com. Retrieved 2 May 2023.
[22] All aboard the PCIe bus for Nvidia's Tesla P100 supercomputer grunt by Chris Williams at theregister.co.uk on June 20, 2016
[23] Hicok, Gary (November 13, 2018). "NVIDIA Xavier Achieves Milestone for Safe Self-Driving | NVIDIA Blog". The Official NVIDIA Blog.
[24] heise online (22 June 2017). "Nvidia Tesla V100: PCIe-Steckkarte mit Volta-Grafikchip und 16 GByte Speicher angekündigt". heise online.
[25] GV100 Blockdiagramm in "GTC17: NVIDIA präsentiert die nächste GPU-Architektur Volta - Tesla V100 mit 5.120 Shadereinheiten und 16 GB HBM2" by Andreas Schilling on hardwareluxx.de on May 10, 2017
[26] NVIDIA Volta GV100 GPU Chip For Summit Supercomputer Twice as Fast as Pascal P100 – Speculated To Hit 9.5 TFLOPs FP64 Compute by Hassan Mujtaba at wccftech.com on December 20, 2016
[27] "Technical overview" (PDF). images.nvidia.com. Retrieved 2020-09-15.
[28] Angelini, Chris (14 September 2018). "Nvidia's Turing Architecture Explored: Inside the GeForce RTX 2080". Tom's Hardware. p. 7. Retrieved 28 February 2019. TU102 and TU104 are Nvidia's first desktop GPUs rocking the NVLink interconnect rather than a Multiple Input/Output (MIO) interface for SLI support. The former makes two x8 links available, while the latter is limited to one. Each link facilitates up to 50 GB/s of bidirectional bandwidth. So, GeForce RTX 2080 Ti is capable of up to 100 GB/s between cards and RTX 2080 can do half of that.
[29] Schilling, Andreas (22 June 2020). "A100 PCIe: NVIDIA GA100-GPU kommt auch als PCI-Express-Variante". Hardwareluxx. Retrieved 2 May 2023.
[30] "NVLINK AND NVSWITCH". www.nvidia.com. Retrieved 2021-02-07.
[31] https://www.hpcwire.com/2024/02/22/a-big-memory-nvidia-gh200-next-to-your-desk-closer-than-you-think/
[32] https://www.hpcwire.com/2024/02/22/a-big-memory-nvidia-gh200-next-to-your-desk-closer-than-you-think/
[33] https://www.hpcwire.com/2024/02/22/a-big-memory-nvidia-gh200-next-to-your-desk-closer-than-you-think/
[34] https://www.hpcwire.com/2024/02/22/a-big-memory-nvidia-gh200-next-to-your-desk-closer-than-you-think/
[35] Comparing NVLink vs PCI-E with NVIDIA Tesla P100 GPUs on OpenPOWER Servers by Eliot Eshelman on microway.com on January 26, 2017
[36] "NVIDIA Quadro NVLink Grafikprozessor-Zusammenschaltung in Hochgeschwindigkeit". NVIDIA.
[37] "Grafik neu erfunden: NVIDIA GeForce RTX 2080 Ti-Grafikkarte". NVIDIA.
[38] "NVLink on NVIDIA GeForce RTX 2080 & 2080 Ti in Windows 10". Puget Systems. 5 October 2018.
[39] [dead link]
[40] Schilling, Andreas (5 February 2017). "NVIDIA präsentiert Quadro GP100 mit GP100-GPU und 16 GB HBM2". Hardwareluxx.
[41] "NVIDIA GeForce RTX 2080 Founders Edition Graphics Card". NVIDIA.
[42] "NVIDIA Quadro Graphics Cards for Professional Design Workstations". NVIDIA.
[43] "NVIDIA Quadro RTX 6000 und RTX 5000 Ready für Pre-Order". October 1, 2018.
[44] "NVLink | pny.com". www.pny.com.
[45] "NVIDIA Quadro RTX 8000 Specs". TechPowerUp. 14 August 2023.
[46] "NvLink Methods". docs.nvidia.com.
[47] "NVIDIA Collective Communications Library (NCCL)". NVIDIA Developer. May 10, 2017.
[48] "Inside Pascal: NVIDIA's Newest Computing Platform". 2016-04-05.
[49] Anandtech.com
[50] NVIDIA Unveils the DGX-1 HPC Server: 8 Teslas, 3U, Q2 2016 by anandtech.com on April, 2016
[51] How NVLink Will Enable Faster, Easier Multi-GPU Computing by Mark Harris on November 14, 2014
[52] "Whitepaper: Summit and Sierra Supercomputers" (PDF). 2014-11-01.
[53] "Nvidia Volta, IBM POWER9 Land Contracts For New US Government Supercomputers". AnandTech. 2014-11-17.
[54] "RIP: Nvidia slams the final nail in SLI's coffin, no new profiles after 2020". PC World. 2020-09-18.


Basic metrics comparison based on standard specifications:

Interconnect	Transfer rate	Line code	Effective payload rate per lane per direction	Max total lane length (PCIe: incl. 5" for PCBs)	Realized in design
PCIe 1.x	2.5 GT/s	8b/10b	~0.25 GB/s	20" = ~51 cm
PCIe 2.x	5 GT/s	8b/10b	~0.5 GB/s	20" = ~51 cm
PCIe 3.x	8 GT/s	128b/130b	~1 GB/s	20" = ~51 cm^[6]	Pascal, Volta, Turing
PCIe 4.0	16 GT/s	128b/130b	~2 GB/s	8−12" = ~20−30 cm^[6]	Volta on Xavier (8x, 4x, 1x), Ampere, Power 9
PCIe 5.0	32 GT/s^[7]	128b/130b	~4 GB/s		Hopper
PCIe 6.0	64 GT/s	1b/1b	~8 GB/s		Blackwell
NVLink 1.0	20 Gbit/s		~2.5 GB/s		Pascal, Power 8+
NVLink 2.0	25 Gbit/s		~3.125 GB/s		Volta, NVSwitch for Volta, Power 9
NVLink 3.0	50 Gbit/s		~6.25 GB/s		Ampere, NVSwitch for Ampere
NVLink 4.0 (also as C2C, chip-to-chip)	100 Gbit/s^[8]		~12.5 GB/s		Hopper, Nvidia Grace Datacenter/Server CPU, NVSwitch for Hopper
NVLink 5.0 (also as C2C, chip-to-chip)	200 Gbit/s				Blackwell, Nvidia Grace Datacenter/Server CPU, NVSwitch for Blackwell
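The PCIe payload column of this table follows from the transfer rate and the line code. The sketch below shows that calculation under the assumption that only the line-code overhead is applied (no link control characters or transaction headers); the dictionary and function names are hypothetical.

```python
from fractions import Fraction

# Payload bits per transmitted bit for the line codes named in the table.
LINE_CODE_EFFICIENCY = {
    "8b/10b": Fraction(8, 10),
    "128b/130b": Fraction(128, 130),
}


def payload_gbyte_s(rate_gt_s: float, line_code: str) -> float:
    """Payload rate per lane and direction in GB/s after line-code overhead."""
    return float(rate_gt_s * LINE_CODE_EFFICIENCY[line_code] / 8)


pcie_generations = [
    ("PCIe 1.x", 2.5, "8b/10b"),
    ("PCIe 2.x", 5, "8b/10b"),
    ("PCIe 3.x", 8, "128b/130b"),
    ("PCIe 4.0", 16, "128b/130b"),
    ("PCIe 5.0", 32, "128b/130b"),
]

for name, rate, code in pcie_generations:
    print(f"{name}: {payload_gbyte_s(rate, code):.3f} GB/s per lane per direction")
```

The results are slightly below the rounded table values for the 128b/130b generations (for example about 0.985 GB/s rather than 1 GB/s for PCIe 3.x), which is why the table marks them as approximate.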
