7. Challenges in managing GenAI APIs
• Track token usage across multiple applications
• Ensure a single app doesn’t consume the whole TPM (tokens-per-minute) quota
• Secure API keys across multiple applications
• Distribute load across multiple endpoints
• Ensure committed capacity in PTUs is exhausted before falling back to a pay-as-you-go (PAYG) instance
8. Provisioned Throughput Units (PTU)
• Lets you specify the amount of throughput required in a model deployment
• Granted to subscription as quota
• Quota is specific to a region and defines the maximum number of PTUs that can be assigned to deployments in that subscription and region
• PTUs provide
• Predictable performance
• Allocated processing capacity
• Cost savings
Understanding costs associated with provisioned throughput units (PTU)
9. Token Metrics Emitting
• Sends token metrics to Application Insights
• Provides an overview of Azure OpenAI model utilization across multiple applications or API consumers
GenAI Gateway Capabilities in Azure API Management
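API Management ships an `azure-openai-emit-token-metric` policy for this. A minimal inbound sketch (the metric namespace and dimension choices are illustrative):

```xml
<policies>
    <inbound>
        <base />
        <!-- Send prompt, completion, and total token counts to the
             Application Insights instance attached to the APIM service -->
        <azure-openai-emit-token-metric namespace="openai-usage">
            <dimension name="Subscription ID" value="@(context.Subscription.Id)" />
            <dimension name="API ID" value="@(context.Api.Id)" />
        </azure-openai-emit-token-metric>
    </inbound>
</policies>
```

The dimensions let you slice token consumption per consumer or per API in Application Insights.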
10. Token Rate Limiting
• Manage and enforce limits per API consumer based on their token usage
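This maps to the `azure-openai-token-limit` policy. A sketch keyed on the APIM subscription, with an illustrative 5,000 TPM cap:

```xml
<policies>
    <inbound>
        <base />
        <!-- Reject requests once a subscription exceeds its
             tokens-per-minute allowance (limit value is illustrative) -->
        <azure-openai-token-limit
            counter-key="@(context.Subscription.Id)"
            tokens-per-minute="5000"
            estimate-prompt-tokens="true"
            remaining-tokens-header-name="x-remaining-tokens" />
    </inbound>
</policies>
```

Using `context.Subscription.Id` as the counter key gives each consuming app its own quota, which addresses the "single app consuming the whole TPM quota" challenge directly.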
11. Load Balancer and Circuit Breaker
• Helps spread load across multiple Azure OpenAI endpoints
• Round-robin, weighted, or priority-based load distribution strategies
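Load balancing is configured on APIM backends rather than in policy: a backend of type `Pool` groups several Azure OpenAI backends, each with a priority and weight. A Bicep sketch with illustrative names (`apimService`, `backendEastUs`, `backendWestUs` are assumed to be defined elsewhere):

```bicep
// Pool backend spreading traffic across two Azure OpenAI backends.
// Equal priority + equal weight approximates round-robin; different
// priorities give primary/fallback behavior (e.g. PTU before PAYG).
resource backendPool 'Microsoft.ApiManagement/service/backends@2023-09-01-preview' = {
  name: 'openai-backend-pool'
  parent: apimService
  properties: {
    type: 'Pool'
    pool: {
      services: [
        { id: backendEastUs.id, priority: 1, weight: 50 }
        { id: backendWestUs.id, priority: 1, weight: 50 }
      ]
    }
  }
}
```

Circuit-breaker rules are set on the individual member backends, so a throttled endpoint is taken out of rotation until it recovers.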
12. Semantic Caching
• Optimize token usage by leveraging semantic caching
• Stores completions for semantically similar prompts
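This uses the `azure-openai-semantic-cache-lookup` and `azure-openai-semantic-cache-store` policy pair. A sketch, assuming an embeddings deployment is registered as the backend `embeddings-backend` (threshold and duration values are illustrative):

```xml
<policies>
    <inbound>
        <base />
        <!-- Serve a cached completion when a prior prompt is close enough
             in embedding space (lower score-threshold = stricter match) -->
        <azure-openai-semantic-cache-lookup
            score-threshold="0.05"
            embeddings-backend-id="embeddings-backend"
            embeddings-backend-auth="system-assigned" />
    </inbound>
    <outbound>
        <base />
        <!-- Cache the completion for 60 seconds -->
        <azure-openai-semantic-cache-store duration="60" />
    </outbound>
</policies>
```

Repeated or near-duplicate prompts are then answered from the cache without spending completion tokens.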
13. Summary
• Track token usage across multiple applications
• Emit Token Metric policy
• Ensure a single app doesn’t consume the whole TPM quota
• Token Limit policy
• Secure API keys across multiple applications
• Subscription keys
• Distribute load across multiple endpoints
• Backend pool load balancing and circuit breaker
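The capabilities above compose in a single API policy. A sketch combining them (backend id and limit values are illustrative):

```xml
<policies>
    <inbound>
        <base />
        <!-- Per-subscription token quota -->
        <azure-openai-token-limit counter-key="@(context.Subscription.Id)"
            tokens-per-minute="5000" estimate-prompt-tokens="true" />
        <!-- Usage telemetry to Application Insights -->
        <azure-openai-emit-token-metric namespace="openai-usage" />
        <!-- Route to the load-balanced backend pool -->
        <set-backend-service backend-id="openai-backend-pool" />
    </inbound>
</policies>
```

Subscription keys on the API handle the remaining challenge: each application authenticates to the gateway with its own key, so the Azure OpenAI API keys never leave API Management.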