
Cerebras Condor Galaxy 1 AI Supercomputer



Condor Galaxy 1 (CG-1) is a 4 exaFLOP, 54-million-core, 64-node AI supercomputer unveiled in July 2023 by Cerebras Systems and G42, a leading UAE-based AI and cloud company. It is the first in a series of nine supercomputers to be built and operated through a strategic partnership between Cerebras and G42. Upon completion in 2024, the nine interconnected supercomputers will deliver 36 exaFLOPS of AI compute, making the network one of the most powerful cloud AI supercomputers in the world.

What is CG-1 and what can it do?

CG-1 is a supercomputer designed for training large generative models, such as language, image, and video models. Generative models are AI models that learn from existing data to create new content, such as text, images, or video. They are useful for applications such as natural language processing, computer vision, speech synthesis, and content creation.

CG-1 is powered by Cerebras Wafer-Scale Engines (WSEs), the largest AI processor chips in the world. Each WSE is built from a single silicon wafer containing more than 2.6 trillion transistors and 850,000 AI-optimized cores. A single WSE delivers roughly 62.5 petaFLOPS of AI compute (CG-1's 4 exaFLOPS spread across its 64 systems), with far more on-chip cores than any GPU.

CG-1 consists of 64 CS-2 systems, each containing one WSE. The CS-2 systems are connected by Cerebras Wafer-Scale Cluster (WSC) technology, which enables them to operate as a single logical accelerator. The WSC technology decouples memory from compute, allowing CG-1 to deploy 82 terabytes of parameter memory for AI models, compared with the tens of gigabytes available on a single GPU. The WSC technology also provides 388 terabytes per second of internal bandwidth, which is essential for moving data across the supercomputer.
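To put the 82 TB of parameter memory in perspective, here is a quick back-of-the-envelope sketch (my own arithmetic, not a Cerebras figure) of how many model weights that holds at standard precisions:

```python
# How many model parameters fit in 82 TB of parameter memory?
# Note: training also needs memory for gradients and optimizer state,
# so real capacity for trainable models is lower than this upper bound.

PARAM_MEMORY_TB = 82
BYTES_PER_TB = 10**12

def max_params(bytes_per_param: int) -> float:
    """Upper bound on parameters that fit in 82 TB at a given precision."""
    return PARAM_MEMORY_TB * BYTES_PER_TB / bytes_per_param

# fp16 weights: 2 bytes per parameter
print(f"fp16: {max_params(2) / 1e12:.1f} trillion parameters")  # 41.0 trillion
# fp32 weights: 4 bytes per parameter
print(f"fp32: {max_params(4) / 1e12:.1f} trillion parameters")  # 20.5 trillion
```

Even with optimizer state included, this is why trillion-parameter models fit comfortably, while a single GPU's tens of gigabytes hold only a few billion fp16 weights.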

CG-1 uses a novel technique called weight streaming to train large models on wafer-scale clusters using just data parallelism. Weight streaming exploits the large-scale compute and memory features of the hardware and distributes work by streaming the model one layer at a time in a purely data-parallel fashion. This simplifies the software stack and eliminates the pipeline- and model-parallel schemes required to train large models on GPUs.
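The core idea can be sketched in a few lines of NumPy. This is my own toy illustration of the concept, not Cerebras code: weights live in an external parameter store, each layer is streamed to the workers in turn, and every worker applies the same weights to its own shard of the batch.

```python
import numpy as np

# Toy sketch of weight streaming: weights live off-accelerator; each layer
# is streamed in, applied to every data-parallel shard, then the next layer
# is streamed. No pipeline or model parallelism is involved.

rng = np.random.default_rng(0)

# "External" parameter store: one weight matrix per layer.
layer_weights = [rng.standard_normal((8, 8)) for _ in range(3)]

# Data-parallel shards of the input batch (one per worker).
shards = np.split(rng.standard_normal((4, 8)), 2)

for W in layer_weights:  # stream layers one at a time
    # Every worker applies the same streamed weights to its own shard.
    shards = [np.maximum(s @ W, 0.0) for s in shards]  # linear + ReLU

output = np.vstack(shards)
print(output.shape)  # (4, 8)
```

The gradient pass works the same way in reverse: layers are streamed back through the workers, and weight updates happen in the external store, so no worker ever holds the whole model.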

According to Cerebras, CG-1 can train large generative models faster and more efficiently than comparable GPU clusters. For example, the company states that CG-1 can train a model with 100 billion parameters in about 10 days, a job that would take far longer on a conventional GPU cluster. CG-1 is also designed to train models with up to one trillion parameters, a scale that is impractical on most GPU clusters.
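The 10-day figure is roughly consistent with the common FLOPs ≈ 6 × parameters × tokens approximation. The token count and utilization below are my own assumptions for the sanity check, not figures from Cerebras:

```python
# Rough sanity check of the 100B-parameter training-time claim using the
# common approximation: training FLOPs ~= 6 * parameters * tokens.

params = 100e9        # 100B-parameter model (from the claim)
tokens = 2e12         # assumed training set of 2 trillion tokens
flops_needed = 6 * params * tokens      # ~1.2e24 FLOPs

machine_flops = 4e18  # CG-1: 4 exaFLOPS peak
utilization = 0.4     # assumed sustained fraction of peak

seconds = flops_needed / (machine_flops * utilization)
print(f"{seconds / 86400:.0f} days")  # ~9 days
```

With these assumptions the job lands just under 10 days, so the claim is plausible for a dataset of that size; a larger token budget or lower utilization would lengthen it proportionally.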

Why is CG-1 important and who is using it?

CG-1 is important because it represents a breakthrough in AI hardware and software that enables unprecedented scale and speed for generative model training. CG-1 opens up new possibilities for AI research and innovation that were previously impossible or impractical.

CG-1 is used by G42 for its internal projects and initiatives, such as developing new generative models for various domains and languages. G42 also offers access to CG-1 to its customers and partners who want to leverage its capabilities for their own AI applications and solutions.

Additionally, CG-1 is part of the Cerebras Cloud, which allows customers to use Cerebras systems without procuring and managing hardware. The Cerebras Cloud provides an easy-to-use interface and API for users to submit their model training jobs to CG-1 or other Cerebras supercomputers. The Cerebras Cloud also provides tools and support for users to optimize their models for Cerebras hardware.

How does CG-1 compare to other AI supercomputers?

CG-1 is among the largest and most powerful AI supercomputers in the world for generative model training, with leading figures for compute power, memory capacity, bandwidth, and efficiency.

The following table compares CG-1 to some of the other notable AI supercomputers in the world:

| Name | Location | Provider | Compute Power | Memory Capacity | Bandwidth | Efficiency |
| --- | --- | --- | --- | --- | --- | --- |
| Condor Galaxy 1 (CG-1) | Santa Clara, CA | Cerebras/G42 | 4 exaFLOPS | 82 TB | 388 TB/s | ~8 GFLOPS/W |
| Andromeda | Santa Clara, CA | Cerebras | 1 exaFLOP | 20.5 TB | 97 TB/s | ~8 GFLOPS/W |
| Selene | Reno, NV | Nvidia | 0.63 exaFLOP | 1.6 TB | 14.7 TB/s | ~6 GFLOPS/W |
| Perlmutter | Berkeley, CA | NERSC | 0.35 exaFLOP | 1.3 TB | 9.6 TB/s | ~3 GFLOPS/W |
| JUWELS Booster Module | Jülich, Germany | Forschungszentrum Jülich | 0.23 exaFLOP | 0.7 TB | 5.2 TB/s | ~2 GFLOPS/W |

As the table shows, CG-1 has more than six times the compute power, more than 50 times the memory capacity, and more than 26 times the bandwidth of Selene, the largest non-Cerebras system listed. CG-1 also delivers more AI performance per watt than the other systems.
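Those multiples follow directly from the table; recomputing them:

```python
# Recompute the CG-1 vs. Selene multiples from the comparison table.

cg1    = {"compute": 4.0,  "memory": 82.0, "bandwidth": 388.0}
selene = {"compute": 0.63, "memory": 1.6,  "bandwidth": 14.7}

for metric in cg1:
    ratio = cg1[metric] / selene[metric]
    print(f"{metric}: {ratio:.1f}x")
# compute: 6.3x, memory: 51.2x, bandwidth: 26.4x
```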

What are the future plans for CG-1 and the Condor Galaxy network?

CG-1 is the first of nine supercomputers that Cerebras and G42 plan to build and operate through their strategic partnership. The next two supercomputers, CG-2 and CG-3, will be deployed in Austin, TX and Asheville, NC in early 2024. The remaining six supercomputers will be deployed in various locations around the world by the end of 2024.

The nine supercomputers will be interconnected to form the Condor Galaxy network, with a total capacity of 36 exaFLOPS of AI compute. The network will enable users to train models across multiple supercomputers and scale up to trillions of parameters.

The Condor Galaxy network will also serve as a platform for Cerebras and G42 to collaborate on developing new generative models and applications for various domains and languages. Cerebras and G42 will also share their models and results with the AI community through open source initiatives and publications.
