Infra · advanced

Canary Releases

Gradually roll out changes to a small subset of users before full deployment to detect issues with minimal blast radius.

Also known as: canary deployment, canary rollout, progressive delivery, incremental rollout

Description

Canary releases involve deploying a new application version to a small percentage of production traffic (the 'canary' group) while the majority continues to be served by the current stable version. If the canary version performs well according to predefined metrics (error rates, latency percentiles, business KPIs), traffic is progressively shifted until the new version handles 100% of traffic. If issues are detected, traffic is automatically routed back to the stable version.
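The control loop described above can be sketched in a few lines. This is a minimal illustration, not a real controller: `canary_error_rate` and `set_canary_weight` are hypothetical stand-ins for whatever your monitoring system and load balancer actually expose.

```python
# Hypothetical stand-ins for a metrics query and a traffic-shifting API.
def canary_error_rate(weight: int) -> float:
    """Pretend to query monitoring for the canary's error rate (fraction)."""
    return 0.001  # a healthy canary in this sketch

def set_canary_weight(weight: int) -> None:
    print(f"routing {weight}% of traffic to canary")

ERROR_BUDGET = 0.005      # roll back if canary error rate exceeds 0.5%
STEPS = [5, 25, 50, 100]  # linear progression, as in the text

def run_canary() -> str:
    for weight in STEPS:
        set_canary_weight(weight)
        if canary_error_rate(weight) > ERROR_BUDGET:
            set_canary_weight(0)  # shift all traffic back to stable
            return "rolled-back"
    return "promoted"
```

Real controllers (Flagger, Argo Rollouts) follow the same shape: shift a step, analyze, then either advance or revert.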

Traffic splitting can be implemented at various layers: load balancer weighted target groups, service mesh traffic routing (Istio, Linkerd), Kubernetes ingress annotations, or CDN-level routing. The percentage can follow a linear progression (5% -> 25% -> 50% -> 100%) or an exponential one (1% -> 5% -> 20% -> 50% -> 100%), with automated analysis at each step. Tools like Flagger, Argo Rollouts, and AWS AppConfig provide automated canary analysis with customizable metrics and rollback thresholds.
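Whichever layer does the splitting, a common building block is deterministic, sticky assignment: hash a stable identifier so each user consistently lands on the same version at a given percentage. A minimal sketch:

```python
import hashlib

def in_canary(user_id: str, percent: int) -> bool:
    """Deterministically assign a user to the canary cohort.

    Hashing the user id keeps routing sticky: the same user always
    sees the same version while the percentage is unchanged.
    """
    digest = hashlib.sha256(user_id.encode()).digest()
    bucket = int.from_bytes(digest[:2], "big") % 100  # bucket in 0..99
    return bucket < percent
```

At `percent=5`, roughly 5% of users fall into the canary bucket, and raising the percentage only adds users, it never reshuffles existing assignments.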

Effective canary releases require robust observability to compare canary vs. baseline metrics. Key comparison metrics include HTTP error rates, p50/p95/p99 latency, CPU and memory utilization, and application-specific business metrics (conversion rates, API success rates). Statistical significance testing ensures that observed differences are real and not due to normal variance. Canary releases are complementary to feature flags -- canaries control which version of the code runs, while feature flags control which features within that code are enabled.
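The statistical check mentioned above can be as simple as a two-proportion z-test on error counts, here with only the standard library (one of several reasonable tests; tools like Kayenta use more elaborate analysis):

```python
from math import erfc, sqrt

def two_proportion_z(err_canary: int, n_canary: int,
                     err_base: int, n_base: int) -> float:
    """Two-sided p-value for 'canary and baseline error rates differ'."""
    p1, p2 = err_canary / n_canary, err_base / n_base
    pooled = (err_canary + err_base) / (n_canary + n_base)
    se = sqrt(pooled * (1 - pooled) * (1 / n_canary + 1 / n_base))
    if se == 0:
        return 1.0  # no errors anywhere: no evidence of a difference
    z = (p1 - p2) / se
    return erfc(abs(z) / sqrt(2))  # two-sided p-value from the normal CDF
```

For example, 30 errors out of 1,000 canary requests vs. 100 out of 10,000 baseline requests yields a tiny p-value (a real regression), while 10 out of 1,000 vs. 100 out of 10,000 is identical rates and is not significant.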

Prompt Snippet

Configure canary deployments using Argo Rollouts with a progressive traffic shift strategy: 5% for 5 minutes, 20% for 5 minutes, 50% for 10 minutes, then 100%. Define AnalysisTemplate resources that query Prometheus for success rate (>99.5% threshold) and p99 latency (<500ms threshold) comparing canary vs. stable using the istio_request_duration_milliseconds_bucket metric. Set maxSurge=1 and configure automatic rollback if any analysis step fails. Emit deployment events to Datadog for correlation with business metrics dashboards.

Tags

deployment, canary, progressive-delivery, observability, release