Scale AI application in production: Build a fault-tolerant AI gateway with SnapSoft


By Gergely Szlobodnyik, Head of AI & ML – SnapSoft
By Lu Zou, Sr. WW Partner Solutions Architect – AWS

Production AI applications face unique scaling challenges. When you build generative AI applications, you integrate multiple AI models from various providers. Each provider sets model-specific quotas and regional endpoints. Under unpredictable workloads, your architecture must handle thousands of concurrent requests without service disruptions.

This post introduces an AI gateway solution. The AI gateway creates a resilient, fault-tolerant architecture by routing requests across multiple providers, accounts, and AWS Regions. You can scale confidently without hitting quota limits or experiencing provider outages.

You'll learn the architecture at multiple layers: the routing mechanism, the hub-and-spoke design, and the spoke-level application components.

Single-provider architectures create critical risks. Most organizations start with one AI provider and endpoint. This approach works for prototypes but fails at scale: a provider outage takes down the entire service, quota limits throttle traffic under load, and a single regional endpoint becomes a bottleneck for production-scale traffic.

These challenges require a new architectural approach. The AI gateway addresses each challenge through multi-provider routing, automatic failover, and distributed quota management.

The AI gateway uses a multistep routing mechanism. The gateway selects the best model provider, AI model, account, and regional endpoint for each request. This dynamic routing means your requests reach available endpoints, even when quotas are exceeded or endpoints fail.

Figure 1 shows the routing mechanism. The gateway routes requests through four selection steps:

1. Select the model provider.
2. Select the AI model.
3. Select the AWS account.
4. Select the regional endpoint.

Figure 1: AI gateway routing mechanism
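The selection cascade can be sketched as a series of filters over a routing table. This is an illustrative sketch, not the SnapSoft implementation; the provider names, account IDs, and quota fields below are hypothetical.

```python
import random

# Hypothetical routing table: each entry carries the four attributes the
# gateway selects on (provider, model, account, region), plus the health
# and quota state that determines availability.
ROUTING_TABLE = [
    {"provider": "bedrock", "model": "model-a", "account": "acct-1",
     "region": "us-east-1", "healthy": True, "quota_left": 50},
    {"provider": "bedrock", "model": "model-a", "account": "acct-2",
     "region": "us-west-2", "healthy": True, "quota_left": 0},
    {"provider": "third-party", "model": "model-b", "account": "acct-3",
     "region": "eu-west-1", "healthy": False, "quota_left": 80},
]

def select_endpoint(table, model):
    """Filter to healthy endpoints that serve the requested model and still
    have quota, then pick one at random to spread load across accounts."""
    candidates = [e for e in table
                  if e["model"] == model and e["healthy"] and e["quota_left"] > 0]
    if not candidates:
        raise RuntimeError(f"no available endpoint for model {model!r}")
    return random.choice(candidates)
```

In this toy table, a request for `model-a` can only land on `acct-1`: `acct-2` has exhausted its quota and the third-party endpoint is unhealthy, so both are filtered out before the random choice.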

The AI gateway uses a hub-and-spoke architecture. Each hub connects to a specific model provider: Amazon Bedrock, third-party APIs, or on-premises models. All hubs share a single entry point and DNS routing mechanism.

Amazon API Gateway serves as the single entry point. You can configure authentication, authorization, and rate limits through usage plans. An AWS Lambda function acts as the gateway proxy and routes requests. The function uses provisioned concurrency for high availability and connects to a virtual private cloud (VPC) for secure network access. The function can also filter content before forwarding requests.
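A minimal sketch of the proxy Lambda's responsibilities might look like the following. The blocked-term list and the injected `forward` callable are hypothetical stand-ins for the real content filter and routing layer, not SnapSoft's code.

```python
import json

BLOCKED_TERMS = {"ssn", "credit card"}  # illustrative content filter only

def handler(event, context=None, forward=None):
    """Minimal proxy handler: parse the request, apply a content filter,
    then hand off to the routing layer. `forward` is injected so the
    downstream routing can be swapped or stubbed in tests."""
    body = json.loads(event.get("body") or "{}")
    prompt = body.get("prompt", "")
    if any(term in prompt.lower() for term in BLOCKED_TERMS):
        return {"statusCode": 400,
                "body": json.dumps({"error": "blocked by content filter"})}
    result = forward(body)  # resolve a healthy endpoint and forward
    return {"statusCode": 200, "body": json.dumps(result)}
```

Injecting the forwarder keeps the filtering logic independently testable; in a real deployment it would resolve an endpoint through the hosted zone and issue the downstream call.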

A private hosted zone in Amazon Route 53 enables dynamic routing. The zone contains weighted alias records that point to Network Load Balancer endpoints for each provider hub. The Lambda function queries the Amazon Route 53 Resolver for available endpoints. The resolver returns only healthy Network Load Balancer IP addresses. Health checks run every 10 seconds to provide fast failover. The weighted alias records distribute requests uniformly across all healthy hubs, with a low time to live (TTL) of 10 seconds for fast DNS convergence.
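The behavior of weighted records combined with health checks can be modeled in a few lines. This sketch only simulates the resolution semantics; the record names, IPs, and weights are made up for illustration.

```python
import random

def resolve(records):
    """Mimic weighted-record resolution with health checks: drop records
    whose targets fail health checks, then choose among the survivors in
    proportion to their weights."""
    healthy = [r for r in records if r["healthy"]]
    if not healthy:
        raise RuntimeError("no healthy hub endpoints")
    weights = [r["weight"] for r in healthy]
    return random.choices(healthy, weights=weights, k=1)[0]

# Hypothetical hub records: one healthy, one failing its health check.
records = [
    {"name": "bedrock-hub", "ip": "10.0.1.10", "weight": 1, "healthy": True},
    {"name": "thirdparty-hub", "ip": "10.0.2.10", "weight": 1, "healthy": False},
]
```

Because the unhealthy hub is removed before the weighted choice, every resolution in this example lands on `bedrock-hub`; with all hubs healthy and equal weights, traffic would spread uniformly.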

Each provider has a dedicated hub-and-spoke module. The hub performs load balancing, health checking, and automatic failover. Spokes represent independent AWS accounts with separate quota limits. Each hub uses a Network Load Balancer that routes traffic to spoke accounts through AWS Transit Gateway. The Network Load Balancer performs health checks on spoke endpoints using the /healthz endpoint.

Each spoke contains a Network Load Balancer with static IP addresses and an AWS Fargate service. The Fargate service hosts the containerized application that forwards requests to AI endpoints. The service auto scales based on incoming request volume. The application provides two endpoints: /healthz for health checks and /inference for forwarding requests. The application includes backoff strategies, Regional failover, and model failover capabilities.
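The backoff-and-failover behavior described above can be sketched as a retry loop over an ordered list of endpoints. This is a simplified illustration under assumed semantics (retry each endpoint with exponential backoff, then fail over to the next Region or model), not the containerized application's actual code.

```python
import time

def invoke_with_failover(endpoints, request, max_retries=3, base_delay=0.01):
    """Try each endpoint in priority order. On failure, back off
    exponentially between retries; when retries are exhausted, fail over
    to the next endpoint (next Region or fallback model)."""
    last_err = None
    for endpoint in endpoints:
        for attempt in range(max_retries):
            try:
                return endpoint(request)
            except Exception as err:
                last_err = err
                time.sleep(base_delay * (2 ** attempt))  # exponential backoff
    raise RuntimeError("all endpoints exhausted") from last_err
```

A transiently throttled endpoint succeeds on a later retry without triggering failover, while a hard-down endpoint is skipped after its retry budget so the request still completes against the next candidate.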

The architecture supports three provider types: Amazon Bedrock, third-party model APIs, and on-premises models.

Figure 2: AI gateway architecture

Although this diagram focuses on core components, a production deployment should include additional security controls. Implement authentication using Amazon Cognito at the API gateway, enforce TLS encryption across all communications, and apply rate limiting with AWS WAF to prevent abuse. Store third-party API keys in AWS Secrets Manager. For AI-specific security controls such as content filtering and prompt injection prevention, implement Amazon Bedrock Guardrails. Finally, enable Amazon CloudWatch and AWS CloudTrail for monitoring and incident detection.

GDE-MIT builds an EdTech platform that helps thousands of students learn through AI-powered chat. To use the platform, schools upload documents to create knowledge bases and students ask questions about subjects, lectures, and books. The platform ran on Microsoft Azure with a single OpenAI endpoint. This architecture created critical risks. When OpenAI experienced outages, the entire service went down. The single endpoint couldn’t handle production-scale traffic, and quota limits caused frequent throttling. GDE-MIT needed a resilient, scalable solution before expanding to more schools.

SnapSoft assessed the architecture and identified critical gaps by mapping GDE-MIT’s IT landscape and analyzing the data and AI architecture. The assessment identified inefficiencies in business continuity, fault tolerance, latency, and cost. SnapSoft recommended migrating to AWS and implementing the AI gateway solution to address these gaps.

The AI gateway eliminated single points of failure. The gateway routes requests across Amazon Bedrock, third-party APIs, and on-premises models. When one provider fails or reaches quota limits, traffic automatically fails over to available endpoints. The solution distributes load across multiple AWS accounts with independent quotas, so the platform no longer suffers the outages and throttling of the single-endpoint architecture.

The AI gateway allows GDE-MIT to scale confidently while maintaining continuous availability and controlling costs.

The AI gateway solves the critical challenges of production AI applications. The multi-provider architecture eliminates single points of failure by routing requests across providers, accounts, and Regions. When quota limits are reached or endpoints fail, automatic failover occurs in seconds. Health checks run continually to detect and route around unhealthy endpoints. You can combine Amazon Bedrock's low-latency backbone network with third-party APIs and cost-efficient on-premises models.

Key benefits:

- No single point of failure across providers, accounts, and Regions
- Automatic failover in seconds when quotas are exhausted or endpoints fail
- Continual health checks that route around unhealthy endpoints
- Freedom to mix Amazon Bedrock, third-party APIs, and on-premises models

You can scale to serve millions of users without manual quota management or service disruptions. The architecture delivers the high availability and resilience your production applications require. To learn more about implementing the AI gateway architecture described in this post, contact the SnapSoft team. Their experts can help you assess your current AI architecture and design a fault-tolerant gateway tailored to your production requirements.

New to AWS? Become an AWS Partner to build, market, grow and scale your business.


Connect with SnapSoft

SnapSoft is an AWS Premier Tier Partner and AWS Competency Partner that excels in cloud migrations, generative AI, DevOps, and application development. SnapSoft enables startups, SMBs, and enterprises to seamlessly deploy AI, transition to AWS, and improve DevOps speed while driving innovation, reducing costs, and enhancing security.

Contact SnapSoft | Partner Overview | AWS Marketplace

Originally published on AWS.