AI Gateway: LLM Provider Failover Procedure

This runbook describes how to quickly fail over from one LLM provider to another in response to an incident (e.g. an outage or severe degradation of a specific LLM provider).

The AI Gateway selects which model to use for each feature based on the default_models configuration in unit_primitives.yml. This configuration is baked into the deployed image, so changing it normally requires a new MR and a full deployment cycle (~30–60 minutes).

However, the AIGW_MODEL_SELECTION__DEFAULT_MODELS environment variable (introduced in ai-assist!5498) allows overriding the default models at runtime via Vault, without a code change. Combined with the ability to trigger a fast redeployment by retrying a deploy job, this enables a failover in approximately 2–5 minutes.

Follow the Runway secrets management documentation to log into Vault and update the AIGW_MODEL_SELECTION__DEFAULT_MODELS secret.

Note: If you do not already have Vault access, request it through an IT access request; access is provisioned through Okta.

Use the links below to navigate directly to the relevant Vault secret store for each service and environment:

Service | Staging | Production
AIGW    | Staging | Production
DWS     | Staging | Production

The value must be a JSON object mapping feature_setting names to a list of model identifiers.

Format:

AIGW_MODEL_SELECTION__DEFAULT_MODELS='{"<feature_setting>": ["<model_identifier>", ...], ...}'

Example — failing over duo_chat and code_generations from Vertex AI to AWS Bedrock:

AIGW_MODEL_SELECTION__DEFAULT_MODELS='{"duo_chat": ["claude_sonnet_4_6_bedrock"], "code_generations": ["claude_sonnet_4_6_bedrock"]}'
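It can help to check the value locally before saving it in Vault. The following is a minimal sketch, not part of the AI Gateway codebase: the helper name `validate_override` is hypothetical, and rejecting empty lists is an assumption rather than a documented rule. It only checks the shape described above; it cannot tell whether a feature setting or model identifier actually exists.

```python
import json


def validate_override(raw: str) -> dict[str, list[str]]:
    """Parse an override value and check that it has the documented shape.

    Hypothetical helper: checks structure only, not whether the names exist.
    """
    value = json.loads(raw)  # raises json.JSONDecodeError on invalid JSON
    if not isinstance(value, dict):
        raise ValueError("override must be a JSON object")
    for feature_setting, models in value.items():
        # Assumption: an empty model list is never intended.
        if not isinstance(models, list) or not models:
            raise ValueError(f"{feature_setting!r} must map to a non-empty list")
        if not all(isinstance(m, str) and m for m in models):
            raise ValueError(f"{feature_setting!r} must list non-empty strings")
    return value


override = validate_override(
    '{"duo_chat": ["claude_sonnet_4_6_bedrock"], '
    '"code_generations": ["claude_sonnet_4_6_bedrock"]}'
)
print(sorted(override))  # ['code_generations', 'duo_chat']
```

Pass in only the JSON itself, without the surrounding single quotes used in the environment variable assignment.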

Note: Some feature_setting entries have multiple default_models. In that case, set the override to all of those models except the ones whose provider is experiencing the incident. For example, if the configuration in unit_primitives.yml is:

# ...
- feature_setting: "duo_chat"
  default_models:
    - "claude_sonnet_4_6_vertex"
    - "claude_sonnet_4_6_bedrock"
    - "claude_sonnet_4_6"
# ...

… and Bedrock is having an incident, you can set the override to:

AIGW_MODEL_SELECTION__DEFAULT_MODELS='{"duo_chat": ["claude_sonnet_4_6_vertex", "claude_sonnet_4_6"]}'
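When several feature settings are affected, the same filtering can be scripted. This is a rough sketch under stated assumptions: `build_override` is a hypothetical helper, `defaults` is copied by hand from unit_primitives.yml, and `affected` is the set of model identifiers you have decided to take out of rotation.

```python
import json


def build_override(defaults: dict[str, list[str]], affected: set[str]) -> str:
    """Return override JSON for feature settings that use an affected model."""
    override = {}
    for feature_setting, models in defaults.items():
        remaining = [m for m in models if m not in affected]
        if len(remaining) == len(models):
            continue  # untouched by the incident: leave it out of the override
        if not remaining:
            # No default survives; a replacement model must be chosen manually.
            raise ValueError(f"{feature_setting!r} has no unaffected default model")
        override[feature_setting] = remaining
    return json.dumps(override)


# Hand-copied from the unit_primitives.yml excerpt above.
defaults = {
    "duo_chat": [
        "claude_sonnet_4_6_vertex",
        "claude_sonnet_4_6_bedrock",
        "claude_sonnet_4_6",
    ],
}
print(build_override(defaults, affected={"claude_sonnet_4_6_bedrock"}))
# {"duo_chat": ["claude_sonnet_4_6_vertex", "claude_sonnet_4_6"]}
```

The sketch raises an error rather than emitting an empty list, because a feature setting with no unaffected default needs an explicit replacement, as in the Vertex AI to AWS Bedrock example.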

Important: Only include the feature settings you want to override. Omitting a feature setting means it continues to use the default from unit_primitives.yml.
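The per-key fallback described here can be pictured as a dictionary lookup with a default. This only illustrates the documented behaviour and is not the AI Gateway's actual implementation; the function name and the default values shown are assumptions made for the example.

```python
def effective_models(defaults, override, feature_setting):
    """Return the override for a feature setting if present, else its default."""
    return override.get(feature_setting, defaults[feature_setting])


# Illustrative defaults; the real ones live in unit_primitives.yml.
defaults = {
    "duo_chat": ["claude_sonnet_4_6_vertex"],
    "code_generations": ["claude_sonnet_4_6_vertex"],
}
override = {"duo_chat": ["claude_sonnet_4_6_bedrock"]}

print(effective_models(defaults, override, "duo_chat"))
# ['claude_sonnet_4_6_bedrock']
print(effective_models(defaults, override, "code_generations"))
# ['claude_sonnet_4_6_vertex']
```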

Track the Vault change via a production change request for traceability.

  1. Enable expedited Runway deployments.
  2. Go to the AI Gateway security mirror environments page.
  3. Find the latest deployment pipeline (the one currently serving production).
    • Critical: You must use the latest pipeline. Retrying an older pipeline will deploy an older image, which could revert previously deployed features or security fixes.
    • Verify this is the latest by checking the commit SHA against the commits page.
  4. Click on the pipeline link to open it.
  5. Find the Deploy [production] job (or the relevant environment’s deploy job) and click Retry.
  6. The redeployment will pick up the updated Vault secret and apply the new model selection configuration.

Step 3: Restore the original configuration

Once the provider incident is resolved:

  1. Disable expedited deployments.
  2. Remove from AIGW_MODEL_SELECTION__DEFAULT_MODELS the keys you added during the failover. Keep any overrides that existed beforehand.
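Removing only the failover keys can be sketched as follows. The helper `remove_failover_keys` and the key `some_other_feature` are hypothetical and stand in for whatever overrides existed before the incident.

```python
import json


def remove_failover_keys(current_raw: str, added: set[str]) -> str:
    """Drop the keys added for the failover and keep every other override."""
    current = json.loads(current_raw)
    kept = {k: v for k, v in current.items() if k not in added}
    return json.dumps(kept)


current_raw = (
    '{"some_other_feature": ["claude_sonnet_4_6"], '
    '"duo_chat": ["claude_sonnet_4_6_bedrock"], '
    '"code_generations": ["claude_sonnet_4_6_bedrock"]}'
)
print(remove_failover_keys(current_raw, added={"duo_chat", "code_generations"}))
# {"some_other_feature": ["claude_sonnet_4_6"]}
```

If the failover changed the value of a key that already had an override, restore that key's earlier value instead of removing it.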

A fast redeployment is usually not needed at this stage; the change takes effect with the next regular deployment pipeline.