GPT in a Box - Hybrid Cloud Immersion


Now that we have GPT in a Box running on both our private cloud immersion cluster and AWS, we can start on the most interesting part of this project. We have already trained our Meta Llama 3 model on our dataset, and the model is stored in our Hugging Face repo. We can kick off on the private cloud immersion cluster, launch GPT in a Box and ask our chatbot some questions. To make things easier for this demo we chose a structured tabular dataset, which allows us to ask specific questions and expect an accurate answer, depending on how much training we have done on the model.
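As a rough illustration, a question-and-answer round trip against the fine-tuned model pulled from our Hugging Face repo looks something like the sketch below. The repo id and sample question are placeholders, not the actual names used in this demo.

    from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

    # Hypothetical repo id - substitute the real fine-tuned model repo.
    model_id = "our-org/llama3-structured-qa"

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

    # Build a simple text-generation pipeline and ask one question drawn from the tabular dataset.
    chat = pipeline("text-generation", model=model, tokenizer=tokenizer)
    question = "What was the total order value for customer 1001?"  # illustrative question
    print(chat(question, max_new_tokens=64)[0]["generated_text"])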

To speed things up we have cut the number of training steps considerably, but the model still completes one epoch. We can select questions from our structured dataset, ask our chatbot a few of them and expect good, coherent answers. To keep things simple we ask our chatbot two questions selected at random from the beginning of the dataset. We get great responses, and we get them very quickly, which makes for a great user experience. Now it's time to flip over to our AWS cluster. Everything is already in place for us, so we can run our commands from the jump box.
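For reference, the shortened training run mentioned above can be sketched roughly as follows; the dataset file, output path and hyperparameters are illustrative rather than the exact values we used.

    from datasets import load_dataset
    from transformers import TrainingArguments

    # Load the structured tabular dataset and take a small slice so the demo run stays short.
    dataset = load_dataset("csv", data_files="structured_dataset.csv")["train"]
    demo_split = dataset.select(range(2000))  # fewer examples means far fewer optimiser steps

    # One full epoch over the reduced split, rather than many epochs over the full dataset.
    training_args = TrainingArguments(
        output_dir="./llama3-demo",
        num_train_epochs=1,
        per_device_train_batch_size=2,
        gradient_accumulation_steps=8,
        learning_rate=2e-4,
        logging_steps=10,
    )

    # demo_split and training_args would then be handed to the fine-tuning Trainer run.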

We have made some slight improvements to the EKS cluster by adding a second worker node, which of course adds more cost. Because of our vCPU limits in AWS, and because GPT in a Box only requires one worker node with a GPU, we opted for a large C-family instance with high vCPU and memory for the additional node. We will ask our AWS chatbot the same questions and see whether we get similar answers. It will also be interesting to see whether the response times match; we would expect them to differ, because we are running legacy GPUs at one end and more modern GPUs at the other. We have chosen an EU West region for our AWS VPC, and our immersion cluster is hosted not too far away in Reading. Clearly our chatbot is having to think very hard about the answer, or it could be what we have just mentioned in terms of GPU variety. There are some limitations to this demo: an instance like the P5, with 8x H100 GPUs connected over an NVLink backbone, would be overkill here.
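To compare the two deployments we simply time the same question against each endpoint. The snippet below is a hypothetical sketch: the endpoint URLs and payload shape are assumptions for illustration, not GPT in a Box's actual API.

    import time
    import requests

    # Hypothetical inference endpoints for the two clusters.
    ENDPOINTS = {
        "immersion cluster (A40)": "https://gptbox.onprem.example/api/generate",
        "aws eks (T4)": "https://gptbox.aws.example/api/generate",
    }

    payload = {"prompt": "What was the total order value for customer 1001?", "max_new_tokens": 64}

    for name, url in ENDPOINTS.items():
        start = time.perf_counter()
        response = requests.post(url, json=payload, timeout=120)
        elapsed = time.perf_counter() - start
        print(f"{name}: {elapsed:.2f}s -> {response.json().get('text', '')[:80]}")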

The P5 instance in AWS would be epic for an LLM R&D and production environment and would be absolutely rapid. As expected, response times differ. The reason the responses are slower in AWS is that we are using a T4 GPU with 16 GB of memory, whereas on the private cloud cluster we are running much quicker A40s. The A40s are more modern, built on the Ampere architecture, which makes them ideal for virtualisation; they have more memory, and that memory is a later generation. The good news, however, is that our T4 GPU can handle the load and deliver good answers. The answers differ slightly, which is not surprising given the different GPUs at either end and the fact that we cut model training time to speed things up.

So for anyone who has never seen what it looks like to train a machine learning model, get ready. We have deliberately sped this portion up because, depending on the number of steps, number of epochs, codebase, scripts and datasets, training the Meta Llama 3 model could take anywhere between 24 hours and 44 days on a single A40. Something to be aware of is that with GPU passthrough you can only pin a single workload to a single GPU; while that workload, container or VM gets 100% of the GPU's power, it is still a limitation. To supercharge this, we could use NVIDIA AI Enterprise to vGPU the other five physical A40s in the server cluster and map them to our workload.

Another thing to be aware of is the Python codebase. Not all data scientists have decades of coding experience like our staff, and a particular codebase will only ever use the amount of GPU power it has been written to use. We often take code from the public domain and review it to see just how much of the GPU is actually being utilised. We also run Prometheus to track the GPU's performance metrics and push the time-series data into Grafana for visualisation. Once we have run a codebase and benchmarked the model training, we can look at a variety of optimisations to improve resource utilisation and time to value. We have now fine-tuned our model by retraining it on a larger portion of the dataset on our immersion cluster, and we can push the updated, fine-tuned model back to our inference edge.
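When we review a public codebase, a quick way to see how hard it is actually driving the GPU is to poll utilisation while the training run is in flight, alongside the Prometheus and Grafana stack. The sketch below uses pynvml (NVIDIA's NVML bindings) and is illustrative rather than the exact tooling we run.

    import time
    import pynvml

    # Poll the first GPU - the one pinned to our training workload via passthrough.
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)

    try:
        while True:
            util = pynvml.nvmlDeviceGetUtilizationRates(handle)
            mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
            print(f"GPU util: {util.gpu}% | "
                  f"memory: {mem.used / 1e9:.1f} / {mem.total / 1e9:.1f} GB")
            time.sleep(5)
    finally:
        pynvml.nvmlShutdown()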

AWS acts as the edge deployment, and depending on where this chatbot needs to be deployed, AWS can use its own native CDN for a consistent edge experience. A customer can improve on this by using an SD-WAN overlay to prioritise and secure this traffic to the user, and if you are feeling really expansive you can use something like F5 Distributed Cloud Services (formerly Volterra) to deploy an application delivery network. So it's time to ask our chatbot some new questions that it previously couldn't answer accurately. We experience the same response time as before and, as you can see, our chatbot is performing very well; if we ask another question, we get an accurate answer. If we wanted to improve this deployment we could look at using Move by Nutanix to migrate the model to AWS, or we could deploy Kubeflow, which would let us run MLOps pipelines on Kubernetes and create a very slick one-click pipeline delivery tool to supercharge chatbot deployments.
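To give a sense of what that Kubeflow-based pipeline could look like, here is a minimal sketch using the Kubeflow Pipelines SDK (kfp v2). The component bodies, container image, dataset URI and model repo id are placeholders, not a working MLOps pipeline.

    from kfp import compiler, dsl

    @dsl.component(base_image="python:3.11")
    def fine_tune(dataset_uri: str) -> str:
        # Placeholder: this step would launch the Llama 3 fine-tuning job on the GPU node.
        print(f"Fine-tuning on {dataset_uri}")
        return "our-org/llama3-structured-qa"  # hypothetical model repo id

    @dsl.component(base_image="python:3.11")
    def deploy(model_id: str):
        # Placeholder: this step would roll the updated model out to the inference edge.
        print(f"Deploying {model_id}")

    @dsl.pipeline(name="chatbot-finetune-and-deploy")
    def chatbot_pipeline(dataset_uri: str = "s3://demo-bucket/structured_dataset.csv"):
        train_task = fine_tune(dataset_uri=dataset_uri)
        deploy(model_id=train_task.output)

    if __name__ == "__main__":
        # Compile to a YAML spec that can be submitted as a single, repeatable pipeline run.
        compiler.Compiler().compile(chatbot_pipeline, "chatbot_pipeline.yaml")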
