Social media giant Meta has announced that it has designed and built the AI Research SuperCluster (RSC), one of the fastest AI supercomputers running today, and that it will be the fastest AI supercomputer in the world when it is fully built out in mid-2022.
According to a statement by the tech company, RSC will help Meta’s AI researchers build new and better AI models that can learn from trillions of examples; work across hundreds of different languages; seamlessly analyze text, images, and video together; develop new augmented reality tools; and much more.
Meta (parent company of Facebook) also said that researchers have already started using RSC to train large models in computer vision, NLP, and speech recognition, with the aim of one day training models with trillions of parameters. In the statement, Meta says:
“We hope RSC will help us build entirely new AI systems that can, for example, power real-time voice translations to large groups of people, each speaking a different language, so they can seamlessly collaborate on a research project or play an AR game together. Ultimately, the work done with RSC will pave the way toward building technologies for the next major computing platform — the metaverse, where AI-driven applications and products will play an important role.”
Features of the RSC
AI supercomputers are built by combining multiple GPUs into compute nodes, which are then connected by a high-performance network fabric to allow fast communication between those GPUs. RSC comprises a total of 760 NVIDIA DGX A100 systems as its compute nodes, for a total of 6,080 GPUs — with each A100 GPU being more powerful than the V100 GPUs used in Meta's previous research cluster.
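The node-and-GPU figures above are easy to cross-check. As a back-of-the-envelope sketch (the eight-GPUs-per-node figure is the DGX A100's published configuration, not something stated in this article):

```python
# Sanity check of RSC's first-phase GPU count, assuming each NVIDIA
# DGX A100 node contains 8 A100 GPUs (the DGX A100's standard config).
GPUS_PER_DGX_A100 = 8
num_nodes = 760

total_gpus = num_nodes * GPUS_PER_DGX_A100
print(total_gpus)  # 6080, matching the figure quoted above
```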
The GPUs communicate via an NVIDIA Quantum 200 Gb/s InfiniBand two-level Clos fabric that has no oversubscription. RSC’s storage tier has 175 petabytes of Pure Storage FlashArray, 46 petabytes of cache storage in Penguin Computing Altus systems, and 10 petabytes of Pure Storage FlashBlade.
Compared with Meta's previous research infrastructure, RSC runs computer vision workflows up to 20 times faster, runs the NVIDIA Collective Communication Library (NCCL) more than nine times faster, and trains large-scale NLP models three times faster. That means a model with tens of billions of parameters can finish training in three weeks, compared with nine weeks before.
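The training-time claim is simple arithmetic on the stated speedup; a quick illustrative check:

```python
# Illustrative arithmetic only: how the quoted 3x NLP training speedup
# shortens a multi-week training run.
baseline_weeks = 9
speedup = 3.0

new_weeks = baseline_weeks / speedup
print(new_weeks)  # 3.0 weeks, as the article states
```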
Built during the pandemic
According to Meta, RSC was built following a period of work with a number of long-time partners, all of whom also helped design the first generation of its AI infrastructure in 2017. These include Penguin Computing, Pure Storage, and NVIDIA — and the work was done mostly remotely, during the pandemic.
RSC began as a completely remote project that the team took from a simple shared document to a functioning cluster in about a year and a half. COVID-19 and industry-wide wafer supply constraints also brought supply chain issues that made it difficult to get everything from chips to components like optics and GPUs.
To build this cluster efficiently, Meta had to design it from scratch, creating many entirely new Meta-specific conventions and rethinking previous ones, including new rules around its data centre designs. These span cooling, power, rack layout, cabling, and networking (including a completely new control plane).
Meta also included a fascinating feature, the AI Research Store (AIRStore). AIRStore introduces a new data preparation phase that preprocesses the data set to be used for training. Once this preparation is performed, the prepared data set can be reused for multiple training runs until it expires. AIRStore also optimizes data transfers so that cross-region traffic on Meta's inter-datacenter backbone is minimized.
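AIRStore's internals are not public, but the prepare-once, reuse-until-expiry pattern it describes can be sketched in a few lines. Everything below — the class, method names, and TTL scheme — is an illustrative assumption, not Meta's actual implementation:

```python
# Hypothetical sketch of "prepare once, reuse until expiry":
# expensive preprocessing runs a single time, and the cached result
# serves multiple training runs until its time-to-live elapses.
import time

class PreparedDatasetCache:
    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self._cache = {}  # dataset_id -> (prepared_data, prepared_at)

    def get(self, dataset_id, prepare_fn):
        entry = self._cache.get(dataset_id)
        if entry is not None:
            prepared, prepared_at = entry
            if time.time() - prepared_at < self.ttl:
                return prepared          # reuse across training runs
            del self._cache[dataset_id]  # expired: must re-prepare
        prepared = prepare_fn()          # expensive preprocessing, run once
        self._cache[dataset_id] = (prepared, time.time())
        return prepared

cache = PreparedDatasetCache(ttl_seconds=3600)
ds = cache.get("speech_corpus_v1", lambda: ["preprocessed", "examples"])
```

A second `get` with the same dataset ID returns the cached result without re-running the (here trivial) preparation function.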
Data Safety with RSC
RSC has been designed from the ground up with privacy and security in mind so that Meta’s researchers can safely train models using encrypted user-generated data that is not decrypted until right before training. For example, RSC is isolated from the larger internet, with no direct inbound or outbound connections, and traffic can flow only from Meta’s production data centres.
To meet privacy and security requirements, Meta says that the entire data path from storage systems to the GPUs is end-to-end encrypted, and that it has the tools and processes in place to verify these requirements are met at all times.
Before data is imported to RSC, it must go through a privacy review process to confirm it has been correctly anonymized. The data is then encrypted before it can be used to train AI models and decryption keys are deleted regularly to ensure older data is not still accessible.
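Deleting decryption keys to make old ciphertext unrecoverable is a well-known pattern (sometimes called crypto-shredding). The toy sketch below illustrates the idea using a one-time pad so it needs only the standard library — a real system like RSC's would use a vetted cipher such as AES-GCM, and nothing here reflects Meta's actual implementation:

```python
# Illustrative only: "delete the key, lose the data". A random key the
# same length as the plaintext (a one-time pad) encrypts via XOR; once
# the key is destroyed, the ciphertext alone reveals nothing.
import os

def encrypt(plaintext: bytes):
    key = os.urandom(len(plaintext))  # fresh random key, same length
    ciphertext = bytes(p ^ k for p, k in zip(plaintext, key))
    return ciphertext, key

def decrypt(ciphertext: bytes, key: bytes) -> bytes:
    return bytes(c ^ k for c, k in zip(ciphertext, key))

data = b"anonymized training example"
ciphertext, key = encrypt(data)
assert decrypt(ciphertext, key) == data
key = None  # "deleting" the key makes the ciphertext unrecoverable
```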
Still in the works…
RSC is up and running today, but its development is ongoing. Meta believes that, once complete, it will be the fastest AI supercomputer in the world, performing at nearly 5 exaflops of mixed-precision compute.
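The ~5 exaflops figure can be roughly reconstructed from published hardware specs. Two assumptions not stated in this article: Meta's announcement targeted roughly 16,000 A100 GPUs for the completed cluster, and an A100 peaks at about 312 teraflops of dense FP16/BF16 compute:

```python
# Rough sanity check of the "nearly 5 exaflops" claim, under the
# assumptions noted above (16,000 GPUs at ~312 TFLOPS each).
gpus = 16_000
tflops_per_gpu = 312  # A100 dense FP16/BF16 peak, per NVIDIA's specs

total_exaflops = gpus * tflops_per_gpu / 1_000_000  # 1 EF = 1e6 TF
print(round(total_exaflops, 2))  # ~4.99, i.e. "nearly 5 exaflops"
```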