
How large can a cloud instance scale up its interconnected accelerators?

Kanta Yamaoka — March 8th, 2024

DISCLAIMERS: This post contains my personal learning notes and does not represent the views of my current or past employers. The content is written with care, but if you find mistakes, please let me know via my LinkedIn.

Introduction
With the emergence of scaling laws and the resultant "large" language models, the demand for cloud computing is increasing. However, as a learner, I have always had a question: "how large" can we go with the current off-the-shelf public cloud? Concretely, this post aims to explore two aspects:

  • (Q1): For large-scale ML training on the off-the-shelf cloud, how large can an instance/cluster become, in terms of the maximum number of GPUs?
  • (Q2): In the off-the-shelf cloud, among different cloud providers and products, which options are available for small-to-medium workloads up to large-scale training? I take AWS as an example.

Maximum accelerators interconnected per Cloud instance

For the latest large-scale ML training, NVIDIA H100 GPUs and Google's TPUs are increasingly popular because they offer fast training/inference times thanks to low-precision computation. Hence, looking at cloud products with those accelerators gives us a practical overview for comparison. In summary, cloud providers offer instances/clusters of similarly competitive scale, and that scale is already huge due to today's large-scale ML training. In some cases, they can scale up to tens of thousands of interconnected accelerators:

  • Two technology companies, InflectionAI and CoreWeave, operate a cluster of approximately 4,000 H100s interconnected with an InfiniBand network [1].
  • AWS offers EC2 P5 UltraClusters, which can interconnect up to 20,000 H100 GPUs [2], and EC2 P4d UltraClusters, which can interconnect more than 4,000 A100 GPUs [3].
  • At Google Cloud, more than 50,000 TPUs have been used for LLM training [4], and they offer the product “Cloud TPU Multislice Training.”
  • Google Cloud also offers A3 (H100) instances, which can scale up to tens of thousands of GPUs [5].
    • A3 is similar to EC2 P5 UltraClusters, although the maximum number of interconnected GPUs per cluster is unclear for A3.

The number of GPUs can scale well, but the cost isn't negligible
AWS and Google Cloud provide roughly tens of thousands of interconnected accelerators per cluster. To me, this is a lot. However, using interconnected GPU instances at that scale seems challenging from a cost point of view, in addition to the engineering difficulties. For example, a technical report estimates that training GPT-3 might cost $4.6M on a cloud instance [6]. While the estimate is based on an older language model (GPT-3) and an older GPU model (V100), you can easily imagine how costly such training runs could be today. Unfortunately, today's post does not compare cloud providers or on-prem setups from a cost perspective, which I leave as future work.
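To make that intuition concrete, here is a back-of-envelope calculation. All numbers below (GPU count, hourly price, run length) are hypothetical placeholders I made up for illustration, not quotes from any provider:

```python
# Back-of-envelope cost of a large training run.
# All numbers are hypothetical placeholders, not real provider quotes.
num_gpus = 4_096           # e.g., a 4K-GPU cluster like those mentioned above
price_per_gpu_hour = 4.0   # assumed on-demand USD price per GPU-hour
training_days = 14         # assumed wall-clock duration of the run

total_gpu_hours = num_gpus * training_days * 24
total_cost_usd = total_gpu_hours * price_per_gpu_hour

print(f"GPU-hours: {total_gpu_hours:,}")          # 1,376,256
print(f"Estimated cost: ${total_cost_usd:,.0f}")  # $5,505,024
```

Even with these made-up numbers, a multi-week run on a few thousand GPUs lands in the millions of dollars, which matches the order of magnitude of the GPT-3 estimate above.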

GPU instances/clusters in AWS EC2

To take an example from one of the most common cloud providers, let's look at AWS. There are two options, which I'll discuss in the following sections:

  • (a) Using P/G instances for small to medium-scale workloads
  • (b) Using UltraClusters to work on large-scale distributed learning

(a) Using P/G instances for small to medium-scale workloads
AWS provides “P instances” and “G instances” for GPU-accelerated computing. P instances are for general-purpose GPU (GPGPU) ML computing; G instances are for graphics-processing purposes. In a P3 instance (V100), for example, you can get 1-16 GPUs per instance. In this case, multiple GPUs are interconnected via the NVLink network, enabling high GPU-to-GPU bandwidth. In contrast, in a G4 instance, GPUs are connected via PCIe, resulting in lower bandwidth for GPU-to-GPU communication, but cheaper than P3 [7]. For the latest H100 GPUs, P5 instances are available, which interconnect 8 H100 GPUs per instance [10].
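After launching one of these instances, you can sanity-check how many GPUs you got and how they are wired. A minimal sketch, assuming PyTorch and the NVIDIA driver are installed on the instance:

```python
import subprocess

import torch

# Count the GPUs visible on this single instance (e.g., 8 on p5.48xlarge).
num_gpus = torch.cuda.device_count()
print(f"Visible GPUs: {num_gpus}")
for i in range(num_gpus):
    print(f"  cuda:{i} -> {torch.cuda.get_device_name(i)}")

# Print the GPU-to-GPU interconnect matrix: "NV#" entries indicate NVLink
# paths, while "PIX"/"PHB"/"SYS" entries indicate PCIe/system paths.
topo = subprocess.run(["nvidia-smi", "topo", "-m"],
                      capture_output=True, text=True)
print(topo.stdout)
```

On an NVLink-connected P instance you should see “NV#” between GPU pairs, while on a PCIe-only G instance you'll see PCIe-level entries instead.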

(b) Using UltraClusters to work on large-scale distributed learning

The P and G instances from the previous section are for training on a single node rather than across multiple GPU nodes. If you want to use more than 16 GPUs, those instances are not the right choice. Instead, AWS offers options for distributed training that are different from standalone EC2 P/G instances [8].

AWS offers clusters with far more GPUs than a single GPU instance can offer. According to NVIDIA, the GPU vendor, NVLink with NVSwitch can in theory connect up to 256 GPUs [9]. However, AWS offers more interconnected GPUs per cluster through “EC2 P4d UltraClusters” and “EC2 P5 instances on UltraClusters.” The former can interconnect more than 4,000 A100 GPUs according to AWS, and the latter can interconnect up to 20,000 H100 GPUs [10].
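Whichever cluster product you pick, the training-code side usually looks similar: each node runs one process per GPU, and the processes join a collective-communication group. Below is a minimal sketch using PyTorch's NCCL backend; it assumes a launcher such as torchrun has set the usual RANK/LOCAL_RANK/WORLD_SIZE environment variables on every node, and the script name used later is my own placeholder:

```python
import os

import torch
import torch.distributed as dist

# Assumes a launcher (e.g., torchrun) exported RANK, LOCAL_RANK, and
# WORLD_SIZE for every process on every node of the cluster.
dist.init_process_group(backend="nccl")  # NCCL runs over NVLink/EFA/InfiniBand

local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# Toy all-reduce: every GPU contributes a 1 and gets back the global sum,
# which should equal the total number of GPUs in the job.
x = torch.ones(1, device=f"cuda:{local_rank}")
dist.all_reduce(x)
print(f"rank {dist.get_rank()}/{dist.get_world_size()} sees sum = {x.item()}")

dist.destroy_process_group()
```

You would launch this with something like `torchrun --nnodes=<N> --nproc_per_node=8 allreduce_check.py` on each node; at UltraCluster scale, the hard part is less this code and more the scheduling, networking, and fault tolerance around it.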

This is the end of my post.
I hope you've got an idea of the scale of today's off-the-shelf cloud. This time, I didn't dive deeper into Google Cloud, as I'm already familiar with that platform. See you soon, cheers!

Kanta