Well-Architected Framework (WAF) on Azure
Designed for resilient, efficient cloud workloads
Hey everyone, the Azure OpenAI Service offers a powerful platform for embedding advanced AI capabilities into your applications. Understanding the various deployment types and their implications for resiliency, availability, and performance is therefore crucial. In this blog post, we dive deep into these topics and include Terraform code examples for each approach.
When working with generative models on Azure OpenAI, it is very important to keep in mind that these models are stateless: each API call is independent, and the model retains no session state or conversation history.
Take maintaining a conversation as an example: with every call your application makes to the API, it must send the entire conversation history, so the request payload grows with each turn.
When you create an Azure OpenAI resource, it is bound to a single Azure region, such as East US or West Europe:
resource "azurerm_resource_group" "rg" {
name = "ai-rg"
location = "eastus2"
}
resource "azurerm_cognitive_account" "ai_ca" {
name = "ai-ca"
location = azurerm_resource_group.rg.location
resource_group_name = azurerm_resource_group.rg.name
kind = "OpenAI"
sku_name = "S0"
}
Each region hosts capacity pools running specific models and versions. A capacity pool is a group of GPU-backed instances serving the inference workloads.
Azure OpenAI provides Responsible AI layers to ensure compliance and prevent abuse.
When you deploy a model, you have choices about where your inference requests are processed.
resource "azurerm_cognitive_deployment" "ai_cd" {
name = "ai-opeinai-cd"
cognitive_account_id = azurerm_cognitive_account.ai_ca.id
model {
format = "OpenAI"
name = "gpt-4o-realtime-preview"
version = "2024-10-01"
}
sku {
name = "GlobalStandard"
}
}
With the Global Standard deployment type, Azure OpenAI uses intelligent routing to process each request wherever capacity is available, optimizing for performance.
Data Zones let you balance global availability against data residency requirements: requests stay within a defined geographic boundary, such as the EU or the United States, while still being routed to any capacity inside that boundary.
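Deploying into a Data Zone is, again, just a matter of the SKU. The sketch below assumes a model and version (gpt-4o, 2024-08-06) that are available in your data zone; adapt them to your case:
resource "azurerm_cognitive_deployment" "ai_cd_dz" {
  name                 = "ai-openai-cd-dz"
  cognitive_account_id = azurerm_cognitive_account.ai_ca.id

  model {
    format  = "OpenAI"
    name    = "gpt-4o" # example model; pick one offered in your data zone
    version = "2024-08-06"
  }

  # DataZoneStandard keeps request processing inside a geographic boundary
  sku {
    name = "DataZoneStandard"
  }
}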
Even with global or data zone deployments, your Azure OpenAI resource itself remains a single point of failure within its region.
With Azure API Management (APIM) in front, you can expose a single inference endpoint to all your applications and fail over or load-balance across multiple Azure OpenAI resources behind it.
resource "azurerm_virtual_network" "vnet" {
name = "ai-opeinai-vnet"
location = azurerm_resource_group.rg.location
resource_group_name = azurerm_resource_group.rg.name
address_space = var.vnet_address_space
depends_on = [azurerm_resource_group.rg]
}
resource "azurerm_subnet" "apim_subnet" {
name = "apim-subnet"
resource_group_name = azurerm_resource_group.rg.name
virtual_network_name = azurerm_virtual_network.vnet.name
address_prefixes = [var.apim_subnet_prefix]
}
resource "azurerm_subnet" "cognitive_services_subnet" {
name = "cogsvc-subnet"
resource_group_name = azurerm_resource_group.rg.name
virtual_network_name = azurerm_virtual_network.vnet.name
address_prefixes = [var.cognitive_services_subnet_prefix]
}
resource "azurerm_public_ip" "apim_public_ip" {
name = "apim-public-ip"
location = azurerm_resource_group.rg.location
resource_group_name = azurerm_resource_group.rg.name
allocation_method = "Static"
sku = "Standard"
domain_name_label = "apim-oaigw"
}
resource "azurerm_api_management" "apim" {
name = "apim"
location = azurerm_resource_group.rg.location
resource_group_name = azurerm_resource_group.rg.name
publisher_name = "your-name"
publisher_email = "your-email"
sku_name = "Premium_1"
public_ip_address_id = azurerm_public_ip.apim_public_ip.id
identity {
type = "SystemAssigned"
}
virtual_network_configuration {
subnet_id = azurerm_subnet.apim_subnet.id
}
virtual_network_type = "External"
lifecycle {
ignore_changes = [
hostname_configuration
]
}
depends_on = [azurerm_subnet.apim_subnet]
}
resource "azurerm_api_management_api" "api" {
name = "openai-api"
resource_group_name = azurerm_resource_group.rg.name
api_management_name = azurerm_api_management.apim.name
revision = "1"
display_name = "OpenAI API"
path = ""
protocols = ["https"]
subscription_required = false
depends_on = [azurerm_api_management.apim]
import {
content_format = "openapi-link"
content_value = "link"
}
}
resource "azurerm_api_management_named_value" "tenant_id" {
name = "tenant"
resource_group_name = azurerm_resource_group.rg.name
api_management_name = azurerm_api_management.apim.name
display_name = "tenant"
value = data.azurerm_subscription.current.tenant_id
}
resource "azurerm_api_management_api_policy" "policy" {
api_name = azurerm_api_management_api.api.name
api_management_name = azurerm_api_management.apim.name
depends_on = [azurerm_api_management_api.api, azurerm_api_management_named_value.tenant_id, azurerm_api_management_logger.logger]
resource_group_name = azurerm_resource_group.rg.name
xml_content = <<XML
<policies>
</policies>
XML
}
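To let APIM forward traffic to the Azure OpenAI resource, you can register its endpoint as a backend. This is a minimal sketch; the backend name and the appended openai path segment are assumptions you may need to adapt:
resource "azurerm_api_management_backend" "openai_backend" {
  name                = "openai-backend"
  resource_group_name = azurerm_resource_group.rg.name
  api_management_name = azurerm_api_management.apim.name
  protocol            = "http"
  # the cognitive account exposes its base endpoint URL as an attribute
  url                 = "${azurerm_cognitive_account.ai_ca.endpoint}openai"
}
A set-backend-service policy in the (so far empty) policy document above would then direct incoming calls to this backend.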
Prompt caching in Azure OpenAI lets repeated prompts reuse already processed prompt prefixes, cutting both latency and cost. When a prompt exceeds 1,024 tokens and its beginning matches a recent request, the cached portion does not have to be fully reprocessed, and cached input tokens are billed at a discounted rate.
Provisioned throughput units (PTUs) let you reserve dedicated model processing capacity and thereby achieve consistent, predictable performance.
If you plan to use PTUs over longer timeframes, you can additionally purchase Azure reservations to reduce their cost.
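As a sketch, a provisioned deployment differs only in its SKU block, where capacity is the number of PTUs. The value of 100 and the model below are example values; minimum PTU counts vary per model:
resource "azurerm_cognitive_deployment" "ai_cd_ptu" {
  name                 = "ai-openai-cd-ptu"
  cognitive_account_id = azurerm_cognitive_account.ai_ca.id

  model {
    format  = "OpenAI"
    name    = "gpt-4o" # example model
    version = "2024-08-06"
  }

  # capacity is measured in provisioned throughput units (PTUs)
  sku {
    name     = "ProvisionedManaged"
    capacity = 100
  }
}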
The Batch deployment type is ideal for inference jobs that aren't time-sensitive: you submit requests as a batch job, they are processed asynchronously within a 24-hour target window, and you pay significantly less than with Global Standard.
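A batch deployment again only changes the SKU; requests are then submitted as batch jobs through the Batch API instead of synchronous calls (model name and version are example values):
resource "azurerm_cognitive_deployment" "ai_cd_batch" {
  name                 = "ai-openai-cd-batch"
  cognitive_account_id = azurerm_cognitive_account.ai_ca.id

  model {
    format  = "OpenAI"
    name    = "gpt-4o" # example model
    version = "2024-08-06"
  }

  # GlobalBatch processes asynchronous jobs at a discount
  sku {
    name = "GlobalBatch"
  }
}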
Understanding the deployment options in Azure OpenAI is essential for achieving optimal performance, ensuring compliance, and keeping your costs under control. Here is a quick rundown: