How small to mid-sized AI product companies can save $300,000+ in Cloud GPU Costs yearly

Cloud GPU Cost Savings with Kubernetes and Spot Instances
Introduction
This case study examines how small to mid-sized AI SaaS companies can reduce their GPU computing costs by up to ~80% by migrating their AI workloads to Kubernetes in the cloud (e.g. AWS EKS, Azure AKS, Google GKE) and leveraging GPU spot instances [3]. The migration can lead to annual savings of approximately $300,000 while maintaining the same computational performance and reliability.
In today's AI-driven world, the cost of GPU computing is a major factor in the overall product cost of many SaaS companies. They typically use managed cloud services (e.g. AWS SageMaker, Azure ML, Google AI Platform) to deploy their ML models. A SaaS company whose product provides AI-based features can be assumed to require at least 50-100 GPU units daily to handle peak load. While managed services reduce operational complexity, their cost can become a growth bottleneck, especially for small to mid-sized companies experiencing rapid growth.
Challenge
Many companies face unsustainable GPU computing costs that threaten their business viability:
  • Rapid growth can push cloud GPU infrastructure costs beyond $20,000/month while revenue grows more slowly or stagnates
  • Managed services are easy to set up but often lead to poor GPU resource utilization (e.g. 20% on average) due to inefficient resource allocation
  • A major trade-off of managed services is slow cold-start times, which make it hard to handle demand peaks and spiky traffic patterns efficiently
These challenges result in poor user experience, over-provisioning of resources, and increased costs. To overcome them, DevLocus recommends migrating to Kubernetes in the cloud and leveraging GPU spot instances.
Proposed Solution
After a thorough analysis of the AI market, DevLocus concluded that the best solution for small to mid-sized AI SaaS companies is to migrate their AI/ML workloads to Kubernetes (K8s) in the cloud and leverage GPU spot instances. Typical spot instance prices are around 80% lower than on-demand prices (e.g. an NC4as T4 v3 costs $0.53/h on-demand and $0.10/h spot in Azure US East [1][2]), but spot instances can be interrupted at any time.
Case analysis:
The following case shows how costs develop over time. An AI product company handles 100,000 AI workload requests per month with an expected growth of 10% per month. The managed service costs $0.20 per query. The company considers moving to Azure Kubernetes Service (AKS) and using spot NC4as T4 v3 instances with an 80/20 spot/on-demand split. A single GPU processes one request in about 5 minutes.
Let's visualize the yearly cost considering the options:
  • Using managed services (e.g. AWS Bedrock, Azure AI services, Google AI Platform):
    0.2 USD/query
  • Using K8s in the cloud with on-demand GPU instances with a self-hosted solution:
    0.53 USD/GPUh → ~0.05 USD/query
  • Using K8s in the cloud with an 80/20 spot/on-demand GPU split (80% spot, 20% on-demand) with a self-hosted solution:
    blended ~0.19 USD/GPUh (80% × 0.10 USD + 20% × 0.53 USD) → ~0.02 USD/query
Total yearly cost comparison:
  • Managed Service: $427,685.68 (baseline cost)
  • On-Demand GPU: $113,336.70 (73.5% savings, $314,348.97)
  • Spot + On-Demand: $39,774.77 (90.7% savings, $387,910.91)

The comparison shows that the managed service is by far the most expensive option: every query is billed at the full per-query rate, so the cost compounds directly with the company's growth.
In comparison, using K8s with on-demand GPU instances is already a good solution, cutting the yearly cost by approximately 73.5%.
Using K8s with an 80/20 spot/on-demand GPU split is the best option: the yearly cost is approximately 91% lower than with the managed service, resulting in the biggest savings and the lowest cost.
Note: These figures exclude the cost of developing the self-hosted service and setting up the infrastructure.
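For transparency, the totals above can be reproduced with a short calculation. The sketch below uses the stated inputs (100,000 requests in month one, 10% monthly growth, $0.20/query for the managed service, $0.53/GPUh on-demand, $0.10/GPUh spot, 80/20 split); the effective GPU time of 1/10 GPU-hour per query (roughly six minutes including scheduling overhead) is an assumption chosen so the output matches the yearly figures shown.

```python
# Minimal sketch of the yearly cost model behind the figures above.
# Stated inputs: 100,000 requests in month 1, 10% monthly growth,
# $0.20/query managed, $0.53/GPUh on-demand, $0.10/GPUh spot, 80/20 split.
# ASSUMPTION: effective GPU time of 1/10 GPU-hour per query (about six
# minutes including scheduling overhead), chosen to match the totals shown.
REQUESTS_MONTH_1 = 100_000
GROWTH = 1.10
MANAGED_PER_QUERY = 0.20
ON_DEMAND_PER_GPUH = 0.53
SPOT_PER_GPUH = 0.10
GPU_HOURS_PER_QUERY = 1 / 10

yearly_requests = sum(REQUESTS_MONTH_1 * GROWTH**m for m in range(12))

managed = yearly_requests * MANAGED_PER_QUERY
on_demand = yearly_requests * ON_DEMAND_PER_GPUH * GPU_HOURS_PER_QUERY
blended_gpuh = 0.8 * SPOT_PER_GPUH + 0.2 * ON_DEMAND_PER_GPUH
spot_split = yearly_requests * blended_gpuh * GPU_HOURS_PER_QUERY

for name, cost in [("Managed service", managed),
                   ("On-demand GPU", on_demand),
                   ("Spot + on-demand", spot_split)]:
    print(f"{name:<17} ${cost:>12,.2f}  ({1 - cost / managed:6.1%} savings)")
```

Running this prints the three yearly totals and savings percentages used in the comparison above.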
Proposed Framework:
This migration process can be summarized in the following framework:
1. Adopt Kubernetes in the cloud
    • Develop self-hosted AI solution as a replacement for the managed service
    • Orchestrate automated GPU workload processing with autoscaling
    • Implement cloud monitoring dashboards to track costs and performance
2. Utilize GPU Spot Instances
    • Use an 80/20 spot/on-demand GPU split for optimal costs
    • Implement resilient spot interruption handling to minimize the impact of interruptions
    • Monitor workload resource usage in real time and identify parallelization opportunities
    • Iterate on the self-hosted solution, improving its performance based on monitoring feedback
3. Workload Optimization and Scheduling
    • Use priority-based smart autoscaling to optimize costs and performance
    • Implement scale-to-zero during idle periods to avoid unnecessary costs (see the sketch after this list)
    • Create an efficient pod distribution to optimize resource utilization in multi-stage pipelines
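To illustrate points 2 and 3, here is a minimal sketch of a queue-driven autoscaler that scales a GPU worker Deployment down to zero when the request queue is empty, using the official Kubernetes Python client. The Deployment name, namespace, and the queue_depth() helper are hypothetical placeholders; in practice this logic can also be delegated to tools such as KEDA together with the cluster autoscaler.

```python
# Minimal sketch of a queue-driven autoscaler with scale-to-zero.
# Assumptions: a "gpu-worker" Deployment in namespace "ai" and a queue_depth()
# helper (stubbed here) that returns the number of pending inference requests.
import time

from kubernetes import client, config

REQUESTS_PER_REPLICA = 12   # ~5 min per request => ~12 requests/hour per GPU
MAX_REPLICAS = 50

def queue_depth() -> int:
    # Hypothetical stub: replace with the real queue length (Redis, SQS, ...).
    return 0

def desired_replicas(pending: int) -> int:
    if pending == 0:
        return 0                                     # scale to zero when idle
    return min(MAX_REPLICAS, -(-pending // REQUESTS_PER_REPLICA))  # ceil division

def main() -> None:
    config.load_incluster_config()                   # use load_kube_config() locally
    apps = client.AppsV1Api()
    while True:
        replicas = desired_replicas(queue_depth())
        apps.patch_namespaced_deployment_scale(
            name="gpu-worker",
            namespace="ai",
            body={"spec": {"replicas": replicas}},
        )
        time.sleep(30)

if __name__ == "__main__":
    main()
```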

Suggested Implementation Process

The migration of existing AI workloads can be progressively rolled out in phases. Depending on the complexity of the AI workloads and the maturity of the development team, the migration can be completed in 3-6 months. Here is a suggested implementation process:

Phase 1: Assessment and Planning

  • Profiling the existing AI workloads (see the sketch after this list)
  • Identifying pipeline bottlenecks and inefficiencies
  • Rewriting code for better computational efficiency and containerizing it for Kubernetes
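As a starting point for the profiling step, the sketch below times a handful of representative requests and records peak GPU memory, which yields the GPU-hours-per-query figure the cost model depends on. It assumes a PyTorch model running on a CUDA device; run_inference() and SAMPLES are hypothetical placeholders.

```python
# Minimal sketch for profiling existing AI workloads: time representative
# requests and record peak GPU memory. Assumes a PyTorch model on a CUDA
# device; run_inference() and SAMPLES are hypothetical placeholders.
import statistics
import time

import torch

SAMPLES = ["example input 1", "example input 2", "example input 3"]

def run_inference(sample: str) -> None:
    # Hypothetical stand-in for the real model call.
    time.sleep(0.1)

torch.cuda.reset_peak_memory_stats()
latencies = []
for sample in SAMPLES:
    start = time.perf_counter()
    run_inference(sample)
    torch.cuda.synchronize()              # wait for the GPU work to finish
    latencies.append(time.perf_counter() - start)

gpu_hours_per_query = statistics.mean(latencies) / 3600
print(f"median latency     : {statistics.median(latencies):.2f} s")
print(f"GPU-hours per query: {gpu_hours_per_query:.5f}")
print(f"peak GPU memory    : {torch.cuda.max_memory_allocated() / 2**30:.2f} GiB")
```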

Phase 2: Infrastructure Setup

  • Designing and deploying the Kubernetes cluster at the chosen cloud provider (Compute, Networking, Storage, Access, and Autoscaling)
  • Automating NVIDIA driver setup and resource allocation for the GPU worker instances
  • Configuring node groups with mixed instance policies
  • Implementing spot interruption handling procedures
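Spot interruption handling largely comes down to reacting to the node drain: when the cloud provider reclaims a spot node, Kubernetes evicts the pods and each worker receives SIGTERM before the termination grace period expires. Below is a minimal sketch of such a handler for a GPU worker; fetch_next_job(), run_inference(), and checkpoint_and_requeue() are hypothetical placeholders for the real queue and model integration.

```python
# Minimal sketch of graceful spot-interruption handling in a GPU worker pod.
# On node reclaim, Kubernetes drains the node and the pod receives SIGTERM;
# the worker then finishes the job in flight, takes no new work, and exits.
import signal
import sys
import time

shutting_down = False

def handle_sigterm(signum, frame):
    """Mark the worker as draining; the main loop reacts to the flag."""
    global shutting_down
    shutting_down = True

signal.signal(signal.SIGTERM, handle_sigterm)

# --- hypothetical placeholders for the real queue / model integration ---
def fetch_next_job():
    return {"id": "demo"}

def run_inference(job):
    time.sleep(5)                        # stands in for the actual GPU workload

def checkpoint_and_requeue(job):
    print(f"checkpointing and requeueing job {job['id']}")
# -------------------------------------------------------------------------

while not shutting_down:
    job = fetch_next_job()
    try:
        run_inference(job)
    except BaseException:
        checkpoint_and_requeue(job)      # job failed mid-run: hand the work back
        raise

print("SIGTERM received: worker drained, exiting cleanly")
sys.exit(0)
```

For long-running jobs that may exceed the pod's termination grace period, periodic checkpointing inside the job itself is the natural extension of this pattern.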

Phase 3: Workload Migration

  • Implementing user request flow adapters in the existing infrastructure to handle the transition between the old and new setup
  • Configuring load balancing across the old and new setup
  • Rolling out the new setup to a subset of the users and addressing performance issues
  • Creating performance dashboards to monitor the cluster and workload performance
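One way to feed those dashboards is to have each GPU node export utilization metrics for the monitoring stack to scrape. The sketch below uses the NVML Python bindings (pynvml) and prometheus_client; the metric names and port are arbitrary example choices, and in practice NVIDIA's DCGM exporter covers the same ground out of the box.

```python
# Minimal sketch of a GPU-utilization exporter feeding the dashboards.
# Assumes the pynvml (nvidia-ml-py) and prometheus_client packages; metric
# names and the port (9400) are arbitrary example choices.
import time

import pynvml
from prometheus_client import Gauge, start_http_server

GPU_UTIL = Gauge("gpu_utilization_percent", "GPU core utilization", ["gpu"])
GPU_MEM_ACT = Gauge("gpu_memory_activity_percent", "GPU memory activity", ["gpu"])

def main() -> None:
    pynvml.nvmlInit()
    start_http_server(9400)                          # Prometheus scrapes this port
    while True:
        for i in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)
            util = pynvml.nvmlDeviceGetUtilizationRates(handle)
            GPU_UTIL.labels(gpu=str(i)).set(util.gpu)
            GPU_MEM_ACT.labels(gpu=str(i)).set(util.memory)
        time.sleep(15)

if __name__ == "__main__":
    main()
```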

Phase 4: Optimization and Testing

  • Observing the new system's performance in real-time to analyze possible misconfigurations
  • Rolling out updates to the system to improve performance and reliability
  • Fine-tuning auto-scaling policies to meet cost and performance goals
  • Stress-testing system performance and interruption handling with extrapolated future usage scenarios
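For the stress tests, extrapolating future usage can be as simple as replaying the projected peak request rate against the new endpoint and checking error rates and autoscaling behavior. Below is a minimal sketch; the endpoint URL, payload, and current peak rate are hypothetical placeholders, and dedicated tools such as Locust or k6 are a natural next step.

```python
# Minimal sketch of a stress test against the new inference endpoint using an
# extrapolated future request rate (the 10% monthly growth from the cost case).
# ENDPOINT, the payload, and CURRENT_PEAK_RPS are hypothetical placeholders.
import concurrent.futures

import requests

ENDPOINT = "https://inference.example.com/v1/predict"
CURRENT_PEAK_RPS = 5
GROWTH = 1.10
MONTHS_AHEAD = 12

def fire_request(i: int) -> int:
    resp = requests.post(ENDPOINT, json={"input": f"synthetic-{i}"}, timeout=60)
    return resp.status_code

def main() -> None:
    target_rps = int(CURRENT_PEAK_RPS * GROWTH**MONTHS_AHEAD)
    print(f"simulating a one-second burst of ~{target_rps} requests")
    with concurrent.futures.ThreadPoolExecutor(max_workers=target_rps) as pool:
        codes = list(pool.map(fire_request, range(target_rps)))
    ok = sum(code == 200 for code in codes)
    print(f"{ok}/{len(codes)} requests succeeded")

if __name__ == "__main__":
    main()
```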

Phase 5: Full Deployment and Monitoring

  • Documenting best practices and cluster setup for future reference
  • Gradually transitioning all production traffic to the new setup

🚀 Ready to reduce your AI service costs?

Schedule a consultation and get a detailed assessment.

Results

The migration of the AI workloads to the new infrastructure is expected to result in the following benefits:

Cost Savings

  • 80% reduction in GPU computing costs on a monthly basis
  • Annual savings of up to 90% of the managed service costs (self-hosted K8s solution with an 80/20 spot split)
  • Additional savings from improved resource utilization

Performance Improvements

  • Average GPU utilization increased from 20% to 80% (comparing a typical managed-service deployment with an optimized K8s setup)
  • Up to 35% increase in total computational capacity within the same budget
  • Average job completion speed improved by up to 25% due to better resource allocation
An example of such improvements is Omni, which reduced its costs by 70% while speeding up render times and scaling seamlessly to up to 1,000 GPU instances [4].

Key Learnings

  • Spot instance diversification is critical: Using a mix of on-demand and spot instances significantly reduces cost while maintaining performance.
  • Graceful interruption handling pays off: Investing in a graceful interruption handling mechanism on the application side reduces the risk of data loss and improves system reliability, even when using spot instances.
  • Right-sizing matters: Carefully matching instance types to workload requirements improves both cost efficiency and performance.
  • Monitoring drives optimization: Real-time visibility of cluster costs, utilization, and performance allows for continuous improvement of the infrastructure.

Conclusion

Migrating from a managed AI service (e.g. AWS SageMaker, Azure ML, Google AI Platform) to a Kubernetes cluster in the cloud is an operationally demanding task. However, it is the best way for organizations to significantly reduce GPU computing costs while maintaining or improving performance. The combination of containerization, orchestration, spot instances, and workload optimization creates a flexible and cost-effective infrastructure well suited to AI/ML workloads. By carefully planning the migration and implementing robust handling of spot-instance interruptions, DevLocus can turn a cost challenge into a competitive advantage, allowing the client to allocate more resources to core product development rather than infrastructure.

References
