Traditional HA (High Availability) design best practice requires a minimum of 3x physical servers in a cluster, but Project Cerberus isn’t built to showcase high availability. The cluster has been designed specifically to help democratise AI, ML and HPC workloads and to provide a fit-for-purpose platform that can handle the most intense workloads for a fraction of the cost of traditional platforms.
Optimisation is at the heart of everything we do at CodeZero, and Project Cerberus epitomises this. Traditional platforms for AI, ML and HPC workloads will set you back between £750k and £5m, and that covers the hardware alone. The power draw of typical deployments is excessive, which adds further cost. On top of this outlay, businesses will need a software stack, Data Scientists, Data Engineers and Data Analysts. When you add up all of the elements of a typical deployment, the total cost of ownership (TCO) can become extreme.
Cerberus has been designed to reduce cost overheads and democratise AI, ML and HPC for the masses; after all, the benefits of these next-generation workloads should be available to everyone, not just enterprise businesses. The servers are built on Dell PowerEdge R750s (2U); each node consists of Intel Xeon Platinum processors (2x 8352Y), 256GB of memory, 20TB of NVMe SSD and 2x NVIDIA A40 GPUs. Depending on discounts, each server node costs around £30k, so for 3x nodes the cost is circa £90k.
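As a quick sanity check, the per-node figures above multiply out to the cluster totals below. This is a minimal sketch in Python; the numbers are taken straight from the spec, so adjust them to match your own quotes and discounts.

```python
# Back-of-the-envelope aggregation of the Cerberus cluster spec described above.
# Per-node figures are taken directly from the text; tweak to match your quotes.

NODE_SPEC = {
    "memory_gb": 256,     # 256GB RAM per R750 node
    "nvme_tb": 20,        # 20TB NVMe SSD per node
    "gpus": 2,            # 2x NVIDIA A40 per node
    "cost_gbp": 30_000,   # circa £30k per node, discount dependent
}

NODE_COUNT = 3

def cluster_totals(spec: dict, nodes: int) -> dict:
    """Multiply the per-node spec out to cluster-wide totals."""
    return {key: value * nodes for key, value in spec.items()}

if __name__ == "__main__":
    totals = cluster_totals(NODE_SPEC, NODE_COUNT)
    print(f"{NODE_COUNT} nodes -> {totals['memory_gb']}GB RAM, "
          f"{totals['nvme_tb']}TB NVMe, {totals['gpus']}x A40 GPUs, "
          f"~£{totals['cost_gbp']:,}")
```

Running this prints 768GB of RAM, 60TB of NVMe, 6x A40 GPUs and roughly £90,000 for the 3x-node cluster, which lines up with the circa £90k figure above.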
There are two different enterprise software stacks that can be deployed. The first is a modest stack that carries a small cost and allows a business to run AI, ML and HPC workloads; it consists of Ubuntu Linux virtual machines, Docker containers, and Prometheus and Grafana for monitoring. The limitation of this stack is that there is no ability to create vGPUs, so each virtual machine relies on GPU passthrough (a single virtual machine is pinned to a single physical GPU).
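To illustrate how the first stack runs a GPU workload, the sketch below uses the Docker SDK for Python to start a CUDA container inside one of the passthrough VMs. It assumes the NVIDIA driver and NVIDIA Container Toolkit are already installed in that VM, and the image tag is purely illustrative.

```python
# Minimal sketch: launching a GPU-enabled container from the Ubuntu VM using the
# Docker SDK for Python. Assumes the NVIDIA driver and NVIDIA Container Toolkit
# are installed in the VM that has the GPU passed through to it.
import docker

client = docker.from_env()

# count=-1 requests every GPU visible to the VM; with passthrough that is the
# single physical A40 pinned to this virtual machine.
output = client.containers.run(
    image="nvidia/cuda:12.2.0-base-ubuntu22.04",  # illustrative image tag
    command="nvidia-smi",
    device_requests=[
        docker.types.DeviceRequest(count=-1, capabilities=[["gpu"]])
    ],
    remove=True,
)
print(output.decode())
```

If the passthrough is working, the `nvidia-smi` output shows the single A40 that the VM owns; Prometheus and Grafana then sit alongside to monitor the host and its containers.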
The second stack provides enterprise-class functionality but becomes far more expensive. We use VMware vSphere for enterprise-grade hypervisor functionality and the ability to create vGPUs (carving a physical GPU into smaller virtualised partitions), VMware vSAN for Software Defined Storage with file services, VMware Tanzu for container and Kubernetes management, VMware Horizon for VDI, VMware Aria Operations for monitoring, and the NVIDIA AI Enterprise suite.
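To make the vGPU idea concrete, the sketch below shows how a 48GB A40 divides into partitions of different sizes. The profile names follow NVIDIA's Q-series vGPU naming and are listed purely as examples; the NVIDIA vGPU documentation is the authoritative source for supported profiles and limits.

```python
# Illustrative arithmetic for carving a 48GB NVIDIA A40 into vGPU partitions.
# Profile names mirror NVIDIA's Q-series vGPU naming (e.g. A40-12Q = 12GB of
# frame buffer per vGPU); check the NVIDIA vGPU documentation for the
# authoritative list and per-profile limits.

A40_FRAME_BUFFER_GB = 48

# Frame buffer per vGPU for a few example profiles (GB).
EXAMPLE_PROFILES = {"A40-48Q": 48, "A40-24Q": 24, "A40-12Q": 12, "A40-8Q": 8}

for profile, fb_gb in EXAMPLE_PROFILES.items():
    # All vGPUs on a single physical GPU share one profile, so the count is
    # simply total frame buffer divided by the profile's frame buffer.
    print(f"{profile}: up to {A40_FRAME_BUFFER_GB // fb_gb} vGPUs per A40")
```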
3x servers in the cluster provide us with an opportunity to flex our muscles and deploy our reverse-centralised microservices concept: 1x node can provide GPU-based model training, 1x node can provide inference (post model training) and 1x node can provide VDIs for the Data Scientists and Data Engineers. The beauty of Cerberus is that the server cluster has been designed natively for immersion cooling and forms a strategic collaboration between GRC, Castrol ON (part of bp Castrol) and one of the largest data centre colocation providers in the world, Evoque-Cyxtera.
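Picking up the one-role-per-node split described above, the minimal sketch below shows how it could be expressed on the Tanzu-managed Kubernetes cluster from the second stack by labelling each node with its role. The node names and label key are hypothetical, and the snippet assumes the official kubernetes Python client and a valid kubeconfig.

```python
# Sketch of the one-role-per-node split on a Tanzu-managed Kubernetes cluster.
# Node names and the label key are hypothetical; adjust them to your inventory.
# Requires the `kubernetes` Python client and a working kubeconfig.
from kubernetes import client, config

# Hypothetical mapping of Cerberus nodes to their dedicated roles.
NODE_ROLES = {
    "cerberus-node-1": "training",   # GPU-based model training
    "cerberus-node-2": "inference",  # serving trained models
    "cerberus-node-3": "vdi",        # VDI for Data Scientists and Engineers
}

def label_nodes() -> None:
    """Label each node so workloads can target it with a nodeSelector."""
    config.load_kube_config()
    v1 = client.CoreV1Api()
    for node, role in NODE_ROLES.items():
        v1.patch_node(node, {"metadata": {"labels": {"cerberus.io/role": role}}})
        print(f"labelled {node} with cerberus.io/role={role}")

if __name__ == "__main__":
    label_nodes()
```

A training deployment would then carry `nodeSelector: {cerberus.io/role: training}` in its pod spec, with inference and VDI workloads pinned to their own nodes in the same way.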