Optimizing AI Implementation Costs with Automat-it
As organizations embrace artificial intelligence (AI) and machine learning (ML), they are leveraging these technologies to enhance processes and products. AI applications span various domains, including video analytics, market predictions, fraud detection, and natural language processing. These applications rely on models that analyze data efficiently, often with remarkable accuracy and low latency, but inference demands substantial computational resources, primarily GPUs. This makes balancing performance and cost vital, especially when deploying models at scale. Automat-it, an AWS Premier Tier Partner, specializes in helping startups and scaleups manage these challenges through cloud DevOps, MLOps, and FinOps services.
One of Automat-it’s clients, a company developing AI models for video intelligence solutions, faced this exact challenge. The collaboration focused on achieving scalability and performance while optimizing costs. The platform required highly accurate, low-latency models, and without careful optimization its GPU costs escalated quickly. In this post, Claudiu Bota, Oleg Yurchenko, and Vladyslav Melnyk of Automat-it explain how they helped this client achieve significant cost savings while maintaining AI model performance by carefully tuning architecture, algorithm selection, and infrastructure management.

Customer Challenge
The client specialized in developing AI models for video intelligence using YOLOv8 and the Ultralytics library. An end-to-end YOLOv8 deployment consists of three stages (a minimal sketch follows the list):
- Preprocessing: Prepares raw video frames through resizing, normalization, and format conversion.
- Inference: The YOLOv8 model generates predictions by detecting and classifying objects in the curated video frames.
- Postprocessing: Predictions are refined using techniques such as non-maximum suppression (NMS), filtering, and output formatting.
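The customer's code is not included in the post; the following is a minimal sketch of the three stages using the public Ultralytics API. The weights file, input size, and confidence/IoU thresholds are illustrative assumptions (the YOLOv8n weights match the model used in the tests described later).

```python
# Minimal sketch of the three YOLOv8 stages with the Ultralytics API.
# Thresholds, input size, and file names are illustrative assumptions.
import cv2
from ultralytics import YOLO

model = YOLO("yolov8n.pt")  # YOLOv8 nano weights

# Preprocessing: read a raw frame and resize it toward the model's input resolution
# (Ultralytics also applies its own letterboxing, normalization, and format conversion).
frame = cv2.imread("frame.jpg")
frame = cv2.resize(frame, (640, 640))

# Inference: run the model on the prepared frame.
results = model(frame, conf=0.25, iou=0.45)  # iou drives the NMS step below

# Postprocessing: non-maximum suppression has been applied; filter and format detections.
for box in results[0].boxes:
    cls_id = int(box.cls)
    score = float(box.conf)
    x1, y1, x2, y2 = box.xyxy[0].tolist()
    print(f"{model.names[cls_id]}: conf={score:.2f}, box=({x1:.0f}, {y1:.0f}, {x2:.0f}, {y2:.0f})")
```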
Initially, each model ran on a dedicated GPU at runtime, requiring separate GPU instances per customer. This setup led to underutilized resources and high operational costs. The primary objective, therefore, was to optimize GPU utilization, reduce overall costs, and maintain minimal data processing time. Specifically, the goal was to limit AWS infrastructure costs to $30 per camera per month while keeping the total processing time (preprocessing, inference, and postprocessing) under 500 milliseconds. Achieving these cost savings without compromising model performance—particularly maintaining low inference latency—was essential to providing the desired level of service for each customer.
Initial Approach: Client-Server Architecture
The initial approach used a client-server architecture, splitting the YOLOv8 deployment into two components:
- Client component: Running on CPU instances, handling preprocessing and postprocessing.
- Server component: Running on GPU instances, dedicated to inference and responding to client requests.
This functionality was implemented using a custom gRPC wrapper to provide efficient communication. The aim was to reduce costs by using GPUs only for inference. Additionally, the team assumed that client-server communication latency would have a minimal impact.
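The custom wrapper itself is not shown in the post; below is a hypothetical sketch of the client side of such a split, assuming proto-generated modules (inference_pb2, inference_pb2_grpc) that expose a single Predict RPC. The helper functions and the endpoint address are placeholders, not the customer's actual implementation.

```python
# Hypothetical client-side sketch: preprocessing and postprocessing stay on the CPU
# instance, and only the inference request crosses the network to the GPU server.
import grpc
import inference_pb2        # assumed: generated messages (Frame, Predictions)
import inference_pb2_grpc   # assumed: generated stub (InferenceStub) with a Predict RPC

def preprocess(frame_bytes: bytes) -> bytes:
    # CPU side: resizing, normalization, format conversion (details omitted)
    return frame_bytes

def postprocess(raw_predictions: bytes) -> list:
    # CPU side: non-maximum suppression, filtering, output formatting (details omitted)
    return []

def detect(frame_bytes: bytes) -> list:
    prepared = preprocess(frame_bytes)
    with grpc.insecure_channel("gpu-server:50051") as channel:    # placeholder GPU endpoint
        stub = inference_pb2_grpc.InferenceStub(channel)
        reply = stub.Predict(inference_pb2.Frame(data=prepared))  # only inference runs on the GPU
    return postprocess(reply.predictions)
```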
Performance tests were conducted with these baseline parameters:
- Inference was performed on g4dn.xlarge GPU-based instances.
- The customer used the YOLOv8n model with Ultralytics version 8.2.71.
The results were evaluated against the following key performance indicators (KPIs); a minimal timing sketch follows the list:
- Preprocessing time
- Inference time
- Postprocessing time
- Network communication time
- Total time
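The post does not show how these timings were collected. The sketch below illustrates one way to record per-stage latency in milliseconds; the stage callables are placeholders for the real preprocessing, inference (local or remote), and postprocessing steps, and network time is folded into the inference measurement here.

```python
# Minimal per-stage timing sketch; stage functions are placeholders.
import time
from typing import Any, Callable

def timed(stage: Callable[..., Any], *args: Any) -> tuple[Any, float]:
    """Run one stage and return its output plus the elapsed time in milliseconds."""
    start = time.perf_counter()
    out = stage(*args)
    return out, (time.perf_counter() - start) * 1000.0

def measure_pipeline(frame: Any, preprocess: Callable, infer: Callable,
                     postprocess: Callable) -> dict[str, float]:
    prepared, t_pre = timed(preprocess, frame)
    raw, t_inf = timed(infer, prepared)   # in the client-server setup this includes network time
    _, t_post = timed(postprocess, raw)
    return {
        "preprocessing_ms": t_pre,
        "inference_ms": t_inf,
        "postprocessing_ms": t_post,
        "total_ms": t_pre + t_inf + t_post,
    }
```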
The findings were as follows:
The GPU-based instance completed inference in 7.9 ms, but network communication overhead increased the total processing time. While the total processing time remained acceptable, dedicating a GPU instance to each model drove costs to $353.03 per camera per month, far exceeding the $30 target.
Finding a Better Solution: GPU Time-Slicing
While the initial results were promising, costs were still too high. The custom gRPC wrapper also lacked an automatic scaling mechanism and required ongoing maintenance. To address these challenges, the team moved away from the client-server approach and implemented GPU time-slicing, which involves dividing GPU access into discrete time intervals. This approach allows AI models to share a single GPU, each using a virtual GPU during its assigned slice, similar to CPU time-slicing between processes.
This approach was inspired by several AWS blog posts (see the references at the end of this post). GPU time-slicing was implemented in the Amazon EKS cluster using the NVIDIA Kubernetes device plugin, which made it possible to rely on Kubernetes's native scaling mechanisms, simplifying scaling and reducing operational overhead. In this configuration, the GPU instance was set to expose 60 time-sliced virtual GPUs. The subsequent tests were evaluated against the same performance KPIs as the initial approach.
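The exact manifests are not included in the post. As a rough illustration, the NVIDIA device plugin accepts a time-slicing configuration like the one embedded below; here it is wrapped in a ConfigMap created with the official Kubernetes Python client. The ConfigMap name, namespace, and data key are assumptions, and the device plugin (or GPU Operator) still has to be pointed at this config, which is omitted here.

```python
# Sketch: publish an NVIDIA device plugin time-slicing config as a ConfigMap.
# The replica count of 60 mirrors the setup described above; names are assumptions.
from kubernetes import client, config

TIME_SLICING_CONFIG = """\
version: v1
sharing:
  timeSlicing:
    resources:
      - name: nvidia.com/gpu
        replicas: 60   # expose each physical GPU as 60 time-sliced virtual GPUs
"""

def apply_time_slicing_config() -> None:
    config.load_kube_config()  # use load_incluster_config() when running inside the cluster
    core = client.CoreV1Api()
    configmap = client.V1ConfigMap(
        metadata=client.V1ObjectMeta(name="nvidia-time-slicing-config"),  # assumed name
        data={"any": TIME_SLICING_CONFIG},                                # assumed data key
    )
    core.create_namespaced_config_map(namespace="kube-system", body=configmap)

if __name__ == "__main__":
    apply_time_slicing_config()
```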
The following sections describe the stages of testing.
Stage 1: Single Pod Performance
In this stage, one pod ran on a g4dn.xlarge GPU-based instance. The pod ran all three phases of the YOLOv8 deployment and processed video frames from a single camera. The results were as follows.

The team achieved an inference time of 7.8 ms and a total processing time of 10.8 ms, which met the project's requirements. GPU memory usage for a single pod was 247 MiB, and GPU processor utilization was 12%. Since a 16 GiB GPU provides 16,384 MiB of memory, roughly 66 pods of that size could fit in principle (16,384 ÷ 247 ≈ 66), so approximately 60 processes (or pods) was a realistic target with some headroom.
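With time-slicing enabled, each detector pod simply requests one nvidia.com/gpu resource and the scheduler packs the pods onto the shared GPU. The sketch below is hypothetical: it uses a plain Deployment with identical replicas and a placeholder image, whereas in the real setup each pod served a different camera stream.

```python
# Hypothetical sketch: a Deployment whose pods each request one time-sliced GPU.
# Image name, labels, and namespace are placeholders, not the customer's manifests.
from kubernetes import client, config

def deploy_detectors(replicas: int) -> None:
    config.load_kube_config()
    container = client.V1Container(
        name="yolo-detector",
        image="example.registry/yolo-detector:latest",  # placeholder image
        resources=client.V1ResourceRequirements(
            limits={"nvidia.com/gpu": "1"}  # one time-sliced virtual GPU per pod
        ),
    )
    template = client.V1PodTemplateSpec(
        metadata=client.V1ObjectMeta(labels={"app": "yolo-detector"}),
        spec=client.V1PodSpec(containers=[container]),
    )
    spec = client.V1DeploymentSpec(
        replicas=replicas,
        selector=client.V1LabelSelector(match_labels={"app": "yolo-detector"}),
        template=template,
    )
    deployment = client.V1Deployment(
        metadata=client.V1ObjectMeta(name="yolo-detector"),
        spec=spec,
    )
    client.AppsV1Api().create_namespaced_deployment(namespace="default", body=deployment)

if __name__ == "__main__":
    deploy_detectors(replicas=20)  # Stage 2 below scales to 20 pods on one GPU
```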
Stage 2: Multiple Pods on a Single GPU
In this stage, 20 pods ran on a g4dn.2xlarge GPU-based instance (the instance type was changed from g4dn.xlarge to g4dn.2xlarge because the smaller instance's CPU was overloaded). The results were as follows.

GPU memory usage reached 7,244 MiB, and GPU processor utilization peaked at 95–99%. Together, the 20 pods used roughly half of the GPU's 16 GiB of memory and fully saturated the GPU processor, which increased processing times. This outcome was anticipated and deemed acceptable, and the next objective was to determine the maximum number of pods the GPU could support.
Stage 3: Maximizing GPU Utilization
The goal was to run 60 pods on the g4dn.2xlarge instance to maximize GPU memory utilization, but data processing and loading overloaded the instance's CPU. The team therefore switched to an instance type that still had a single GPU but offered more vCPUs. The results were as follows.

GPU memory usage was 14,780 MiB, and GPU processor utilization stayed at 99–100%. Despite these adjustments, GPU out-of-memory errors prevented scheduling all 60 pods, and the team ultimately accommodated 54 pods, the maximum number of AI models that fit on a single GPU. In this scenario, the GPU-related inference cost was $27.81 per camera per month, roughly a twelvefold reduction from the initial $353.03. This met the customer's target of $30 per camera per month while maintaining acceptable performance.
Conclusion
Automat-it helped a customer using YOLOv8-based AI models achieve a twelvefold cost reduction while maintaining performance. GPU time-slicing allowed the maximum number of AI models to run efficiently on a single GPU, significantly reducing costs. Furthermore, the method requires minimal maintenance and only minor changes to the model code, which improves scalability and ease of use.
References
To learn more, refer to the following resources:
- GPU sharing on Amazon EKS with NVIDIA time-slicing and accelerated EC2 instances
- Delivering video content with fractional GPUs in containers on Amazon EKS
- Time-Slicing GPUs in Kubernetes
About the Authors
- Claudiu Bota is a Senior Solutions Architect at Automat-it, helping customers across the entire EMEA region migrate to AWS and optimize their workloads. He specializes in containers, serverless technologies, and microservices, focusing on building scalable and efficient cloud solutions.
- Oleg Yurchenko is the DevOps Director at Automat-it, where he spearheads the company’s expertise in DevOps best practices and solutions. His focus areas include containers, Kubernetes, serverless, Infrastructure as Code, and CI/CD.
- Vladyslav Melnyk is a Senior MLOps Engineer at Automat-it. He is a seasoned Deep Learning enthusiast with a passion for Artificial Intelligence, taking care of AI products through their lifecycle, from experimentation to production. With over 9 years of experience in AI within AWS environments, he is also a big fan of leveraging cool open-source tools.