    Optimizing AI Implementation Costs: A Case Study with Automat-it

    By techgeekwire · March 7, 2025

    As organizations embrace artificial intelligence (AI) and machine learning (ML) to enhance their processes and products, AI applications now span domains including video analytics, market prediction, fraud detection, and natural language processing. These applications depend on models that deliver high accuracy and low latency, but inference requires substantial computational resources, mainly GPUs. Balancing performance against cost is therefore vital, especially when deploying models at scale. Automat-it, an AWS Premier Tier Partner, specializes in helping startups and scaleups manage these challenges through cloud DevOps, MLOps, and FinOps services.

    One of Automat-it’s clients, a company developing AI models for video intelligence solutions, faced exactly this challenge. The collaboration focused on achieving scalability and performance while optimizing costs. The client’s platform required highly accurate, low-latency models, so costs escalated quickly without careful optimization. In this post, Claudiu Bota, Oleg Yurchenko, and Vladyslav Melnyk of Automat-it explain how they helped the client achieve significant cost savings while maintaining AI model performance by carefully tuning the architecture, algorithm selection, and infrastructure management.

    Customer Challenge

    The client specialized in developing AI models for video intelligence using YOLOv8 and the Ultralytics library. An end-to-end YOLOv8 deployment consists of three stages (a minimal sketch follows the list):

    • Preprocessing: Prepares raw video frames through resizing, normalization, and format conversion.
    • Inference: The YOLOv8 model generates predictions by detecting and classifying objects in the curated video frames.
    • Postprocessing: Predictions are refined using techniques such as non-maximum suppression (NMS), filtering, and output formatting.
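
    These stages map directly onto the Ultralytics API, as the minimal sketch below shows. The weights file matches the YOLOv8n model used in the tests, while the frame path is an illustrative stand-in for a decoded camera frame:

        # Minimal sketch of the three-stage YOLOv8 pipeline with Ultralytics.
        # "frame.jpg" stands in for a decoded frame from a camera stream.
        import cv2
        from ultralytics import YOLO

        model = YOLO("yolov8n.pt")  # YOLOv8n, the model size used in the tests
        frame = cv2.imread("frame.jpg")

        # predict() runs all three stages internally: preprocessing (resize,
        # normalize, format conversion), inference, and postprocessing (NMS,
        # filtering, output formatting).
        results = model.predict(frame, imgsz=640, conf=0.25)
        for box in results[0].boxes:
            print(box.cls, box.conf, box.xyxy)  # class id, confidence, coordinates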

    Initially, each model ran on a dedicated GPU at runtime, requiring separate GPU instances per customer. This setup led to underutilized resources and high operational costs. The primary objective, therefore, was to optimize GPU utilization, reduce overall costs, and maintain minimal data processing time. Specifically, the goal was to limit AWS infrastructure costs to $30 per camera per month while keeping the total processing time (preprocessing, inference, and postprocessing) under 500 milliseconds. Achieving these cost savings without compromising model performance—particularly maintaining low inference latency—was essential to providing the desired level of service for each customer.

    Initial Approach: Client-Server Architecture

    The initial approach used a client-server architecture, splitting the YOLOv8 deployment into two components:

    • Client component: Running on CPU instances, handling preprocessing and postprocessing.
    • Server component: Running on GPU instances, dedicated to inference and responding to client requests.

    This functionality was implemented using a custom gRPC wrapper to provide efficient communication. The aim was to reduce costs by using GPUs only for inference. Additionally, the team assumed that client-server communication latency would have a minimal impact.
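
    The client’s wrapper itself is not public, so the following is only a hedged sketch of what the server side of such a split might look like. The service and message names (InferenceServicer, PredictReply) and the generated inference_pb2 modules are invented for illustration, not the client’s actual code:

        # Hypothetical server-side inference handler for the client-server split.
        # inference_pb2 / inference_pb2_grpc would be generated from an
        # illustrative inference.proto; they are assumptions, not real modules.
        from concurrent import futures

        import grpc
        import numpy as np
        from ultralytics import YOLO

        import inference_pb2
        import inference_pb2_grpc

        class InferenceServicer(inference_pb2_grpc.InferenceServicer):
            def __init__(self):
                self.model = YOLO("yolov8n.pt")  # loaded once, stays on the GPU

            def Predict(self, request, context):
                # The CPU-side client sends an already-preprocessed frame as raw
                # bytes plus its shape; only GPU-bound inference happens here.
                frame = np.frombuffer(request.data, dtype=np.uint8).reshape(
                    request.height, request.width, 3
                )
                results = self.model.predict(frame, verbose=False)
                boxes = results[0].boxes.xyxy.cpu().numpy().tobytes()
                return inference_pb2.PredictReply(boxes=boxes)

        def serve():
            server = grpc.server(futures.ThreadPoolExecutor(max_workers=4))
            inference_pb2_grpc.add_InferenceServicer_to_server(
                InferenceServicer(), server
            )
            server.add_insecure_port("[::]:50051")
            server.start()
            server.wait_for_termination()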

    Performance tests were conducted with these baseline parameters:

    • Inference was performed on g4dn.xlarge GPU-based instances.
    • The models were YOLOv8n, running on Ultralytics version 8.2.71.

    The results were evaluated based on these key performance indicators (KPIs), with a short measurement sketch after the list:

    • Preprocessing time
    • Inference time
    • Postprocessing time
    • Network communication time
    • Total time
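
    For the first three KPIs, the Ultralytics results object already reports per-stage timings, so a measurement harness can be as simple as the sketch below. Network communication time, in the client-server setup, would be measured separately around the gRPC call:

        # Reading per-stage timings from Ultralytics; results[0].speed is a dict
        # with "preprocess", "inference", and "postprocess" keys, in milliseconds.
        from ultralytics import YOLO

        model = YOLO("yolov8n.pt")
        results = model.predict("frame.jpg", imgsz=640)  # illustrative input
        print(results[0].speed)  # per-stage times; their sum approximates total time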

    The findings were as follows: the GPU-based instance completed inference in 7.9 ms, but network communication overhead increased the total processing time. Although the total processing time was acceptable, dedicating a GPU instance to each model led to unacceptable costs of $353.03 per camera per month, far exceeding the budget.

    Finding a Better Solution: GPU Time-Slicing

    While the initial results were promising, costs were still too high. The custom gRPC wrapper also lacked an automatic scaling mechanism and required ongoing maintenance. To address these challenges, the team moved away from the client-server approach and implemented GPU time-slicing, which involves dividing GPU access into discrete time intervals. This approach allows AI models to share a single GPU, each using a virtual GPU during its assigned slice, similar to CPU time-slicing between processes.

    This approach was inspired by several AWS blog posts (see the References below). GPU time-slicing was implemented in the EKS cluster using the NVIDIA Kubernetes device plugin, which let the team rely on Kubernetes’s native scaling mechanisms, simplifying scaling and reducing operational overhead. In this configuration, the GPU instance was set to split into 60 time-sliced virtual GPUs, and the subsequent tests were evaluated against the same performance KPIs as the baseline.
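
    As a minimal sketch, the device plugin’s time-slicing behavior is driven by a small configuration of the form below. Applying it from Python with the official Kubernetes client is illustrative only, and the ConfigMap name, namespace, and "any" key are assumptions; in practice the device plugin’s Helm chart is pointed at this config:

        # Sketch: create the NVIDIA device plugin time-slicing config, which
        # advertises one physical GPU as 60 nvidia.com/gpu replicas.
        import textwrap

        from kubernetes import client, config

        TIME_SLICING_CONFIG = textwrap.dedent("""\
            version: v1
            sharing:
              timeSlicing:
                resources:
                  - name: nvidia.com/gpu
                    replicas: 60
        """)

        config.load_kube_config()
        client.CoreV1Api().create_namespaced_config_map(
            namespace="nvidia-device-plugin",  # assumed namespace
            body=client.V1ConfigMap(
                metadata=client.V1ObjectMeta(name="time-slicing-config"),
                data={"any": TIME_SLICING_CONFIG},  # key name is an assumption
            ),
        )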

    The following sections describe the stages of testing.

    Stage 1: Single Pod Performance

    In this stage, one pod ran on a g4dn.xlarge GPU-based instance. The pod ran all three phases of the YOLOv8 deployment and processed video frames from a single camera. The results are shown below.

    [Figure: performance results for a single ML pod]

    The team achieved an inference time of 7.8 ms and a total processing time of 10.8 ms, which aligned with the project’s requirements. GPU memory usage for a single pod was 247 MiB, and GPU processor utilization was 12%. The memory usage indicated that approximately 60 processes (or pods) could run on a 16 GiB GPU (see the estimate below).
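
    As a quick sanity check of that estimate:

        # Rough capacity estimate from the Stage 1 measurements.
        GPU_MEMORY_MIB = 16 * 1024  # 16 GiB of GPU memory on g4dn instances
        POD_MEMORY_MIB = 247        # observed GPU memory per pod

        print(GPU_MEMORY_MIB // POD_MEMORY_MIB)  # 66 in theory; ~60 with headroom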

    Stage 2: Multiple Pods on a Single GPU

    In this stage, 20 pods ran on a g4dn.2xlarge GPU-based instance (the instance type was changed from g4dn.xlarge to g4dn.2xlarge due to CPU overload). The results are shown below.

    [Figure: performance results for 20 ML pods]

    GPU memory usage reached 7,244 MiB, with GPU processor utilization peaking between 95% and 99%. The 20 pods used just under half of the GPU’s 16 GiB of memory but fully consumed the GPU processor, leading to increased processing times. This outcome was anticipated and deemed acceptable, and the next objective was to determine the maximum number of pods the GPU could support.
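
    Scheduling pods onto the time-sliced GPU requires no model-code changes: each pod simply requests one nvidia.com/gpu, which now maps to a time-sliced replica rather than a whole device. The following sketch creates such a deployment with the official Kubernetes Python client; the image name and labels are placeholders:

        # Hedged sketch: 20 pods sharing one time-sliced GPU. Each pod requests
        # one nvidia.com/gpu, i.e. one of the 60 advertised virtual GPUs.
        from kubernetes import client, config

        config.load_kube_config()

        container = client.V1Container(
            name="yolov8-worker",
            image="example.com/yolov8-pipeline:latest",  # hypothetical image
            resources=client.V1ResourceRequirements(
                limits={"nvidia.com/gpu": "1"}  # one time-sliced virtual GPU
            ),
        )
        deployment = client.V1Deployment(
            metadata=client.V1ObjectMeta(name="yolov8-workers"),
            spec=client.V1DeploymentSpec(
                replicas=20,  # one pod per camera at this stage
                selector=client.V1LabelSelector(match_labels={"app": "yolov8"}),
                template=client.V1PodTemplateSpec(
                    metadata=client.V1ObjectMeta(labels={"app": "yolov8"}),
                    spec=client.V1PodSpec(containers=[container]),
                ),
            ),
        )
        client.AppsV1Api().create_namespaced_deployment(
            namespace="default", body=deployment
        )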

    Stage 3: Maximizing GPU Utilization

    The goal was to run 60 pods on a g4dn.2xlarge GPU-based instance to maximize the GPU’s memory utilization, but data processing and loading overloaded the instance’s CPU. The team then switched to an instance type that had one GPU but offered more CPUs. The results are shown below.

    [Figure: performance results for 54 ML pods]

    GPU memory usage was 14,780 MiB, and GPU processor utilization was 99–100%. Despite these adjustments, GPU out-of-memory errors prevented scheduling all 60 pods; ultimately, the team could accommodate 54 pods, the maximum number of AI models that fit on a single GPU. In this scenario, the GPU-related inference cost was $27.81 per camera per month, roughly a twelvefold reduction compared with the initial approach, meeting the customer’s cost requirement while maintaining acceptable performance.
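
    As a back-of-the-envelope check, the per-camera figure follows from dividing one instance’s monthly cost across the 54 pods. The post does not state the final instance type or pricing model, so the hourly price below is purely illustrative:

        # Hedged cost sketch; HOURLY_PRICE is illustrative, not the real rate.
        HOURLY_PRICE = 2.056   # assumed $/hour for a single-GPU instance
        HOURS_PER_MONTH = 730  # average hours in a month
        PODS_PER_GPU = 54      # models sharing the GPU in the final test

        per_camera = HOURLY_PRICE * HOURS_PER_MONTH / PODS_PER_GPU
        print(f"${per_camera:.2f} per camera per month")  # ≈ $27.79 here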

    Conclusion

    Automat-it helped its customer achieve a roughly twelvefold cost reduction for YOLOv8-based AI models while maintaining performance. GPU time-slicing allowed the maximum number of AI models to share a single GPU efficiently, significantly reducing costs. The method also requires minimal maintenance and minimal changes to model code, improving scalability and ease of use.

    References

    To learn more, refer to the following resources:

    • GPU sharing on Amazon EKS with NVIDIA time-slicing and accelerated EC2 instances (AWS blog)
    • Delivering video content with fractional GPUs in containers on Amazon EKS (AWS blog)
    • Time-Slicing GPUs in Kubernetes (community documentation)

    About the Authors

    • Claudiu Bota is a Senior Solutions Architect at Automat-it, helping customers across the entire EMEA region migrate to AWS and optimize their workloads. He specializes in containers, serverless technologies, and microservices, focusing on building scalable and efficient cloud solutions.
    • Oleg Yurchenko is the DevOps Director at Automat-it, where he spearheads the company’s expertise in DevOps best practices and solutions. His focus areas include containers, Kubernetes, serverless, Infrastructure as Code, and CI/CD.
    • Vladyslav Melnyk is a Senior MLOps Engineer at Automat-it. A seasoned deep learning practitioner with a passion for artificial intelligence, he guides AI products through their lifecycle, from experimentation to production. With over 9 years of AI experience in AWS environments, he is also a big fan of leveraging open-source tools.