Infra · advanced

Application Monitoring (APM)

Monitor application performance, trace requests across services, and identify bottlenecks using APM instrumentation.

Also known as: APM, application performance monitoring, application monitoring, performance monitoring, tracing

Description

Application Performance Monitoring (APM) provides deep visibility into application behavior by automatically instrumenting code to track request traces, database queries, external HTTP calls, and resource utilization. APM tools capture the full lifecycle of each request as a distributed trace, breaking it into spans that represent individual operations (route handler, database query, cache lookup, external API call) with precise timing information.
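To make the trace and span model concrete, here is a minimal sketch in TypeScript. The `Span` shape, the function names, and the sample timings are hypothetical and chosen for illustration; real APM SDKs expose richer span types. It assumes a span's direct children run one after another without overlapping.

```typescript
// Hypothetical span shape for illustration only.
interface Span {
  id: string;
  parentId: string | null; // null marks the root span of the trace
  name: string;
  startMs: number; // offset from the start of the trace
  endMs: number;
}

// Total duration of a span, including time spent in its children.
function durationMs(span: Span): number {
  return span.endMs - span.startMs;
}

// Self time: the span's duration minus the time covered by its direct children.
// Assumes the children do not overlap each other.
function selfTimeMs(span: Span, all: Span[]): number {
  const childTime = all
    .filter((s) => s.parentId === span.id)
    .reduce((sum, s) => sum + durationMs(s), 0);
  return durationMs(span) - childTime;
}

// The span with the largest self time is a first candidate for the bottleneck.
// Expects a non-empty array; reduce() without an initial value throws on [].
function slowestBySelfTime(all: Span[]): Span {
  return all.reduce((worst, s) =>
    selfTimeMs(s, all) > selfTimeMs(worst, all) ? s : worst
  );
}

// One request broken into a route handler span and three child operations.
const trace: Span[] = [
  { id: "a", parentId: null, name: "GET /orders", startMs: 0, endMs: 120 },
  { id: "b", parentId: "a", name: "cache lookup", startMs: 5, endMs: 10 },
  { id: "c", parentId: "a", name: "db query", startMs: 12, endMs: 97 },
  { id: "d", parentId: "a", name: "external API call", startMs: 98, endMs: 115 },
];

const bottleneck = slowestBySelfTime(trace);
console.log(bottleneck.name, selfTimeMs(bottleneck, trace));
```

Ranking by self time rather than total duration keeps the root span from always looking like the slowest operation, since its duration includes every child.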

Most modern APM solutions support OpenTelemetry (OTel), a vendor-neutral standard that provides instrumentation APIs, SDKs, and auto-instrumentation libraries. OTel collects three signals: traces (distributed request paths), metrics (counters, gauges, and histograms for throughput, latency, and error rates), and logs (correlated with traces via trace context). This data is exported to backends such as Datadog, New Relic, Grafana Tempo, Jaeger, or AWS X-Ray for visualization and alerting.
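Trace context is what lets logs and spans from different services be joined to one request. In the W3C Trace Context format it travels as a `traceparent` header of the form `version-traceid-parentid-flags`. The sketch below parses that header by hand purely to show its layout; in practice the OTel SDK's propagators handle this. The function names and the log record shape are hypothetical, and the parser deliberately accepts only version `00`.

```typescript
// Fields of a W3C traceparent header: version-traceid-parentid-flags.
interface TraceParent {
  version: string;
  traceId: string; // 32 lowercase hex characters
  parentId: string; // 16 lowercase hex characters (the caller's span id)
  sampled: boolean; // lowest bit of the flags byte
}

const TRACEPARENT_RE =
  /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

// Returns null for anything that is not a valid version-00 traceparent.
function parseTraceParent(header: string): TraceParent | null {
  const m = TRACEPARENT_RE.exec(header.trim());
  if (m === null) return null;
  const [, version, traceId, parentId, flags] = m;
  if (version !== "00") return null; // simplification: only version 00
  if (/^0+$/.test(traceId) || /^0+$/.test(parentId)) return null; // all-zero ids are invalid
  return {
    version,
    traceId,
    parentId,
    sampled: (parseInt(flags, 16) & 1) === 1,
  };
}

// Hypothetical log record: carrying trace_id lets a backend join logs to traces.
function logRecord(message: string, ctx: TraceParent): Record<string, string> {
  return { message, trace_id: ctx.traceId };
}

const incoming = parseTraceParent(
  "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
);
if (incoming !== null) {
  console.log(logRecord("order created", incoming));
}
```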

Key APM metrics include request throughput (requests per second), error rate (percentage of failed requests), and latency percentiles (p50, p95, p99). The RED method (Rate, Errors, Duration) and USE method (Utilization, Saturation, Errors) provide frameworks for monitoring services and resources respectively. APM enables identifying slow database queries, N+1 query patterns, external service latency, memory leaks, and connection pool exhaustion. Alerting should be based on SLOs (Service Level Objectives) such as 'p99 latency under 500ms for 99.9% of 5-minute windows.'
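The RED figures above can be computed from raw request samples as follows. This is a minimal sketch: `RequestSample`, `summarize`, and the sample data are hypothetical, and the percentile uses the nearest-rank definition, whereas APM backends typically estimate percentiles from histograms.

```typescript
// Hypothetical per-request record for illustration.
interface RequestSample {
  durationMs: number;
  failed: boolean;
}

// Nearest-rank percentile: the smallest sample such that at least p% of
// all samples are less than or equal to it.
function percentile(values: number[], p: number): number {
  if (values.length === 0) throw new Error("no samples");
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p * sorted.length) / 100));
  return sorted[rank - 1];
}

interface RedSummary {
  ratePerSec: number; // Rate
  errorRate: number; // Errors, as a fraction of all requests
  p50: number; // Duration percentiles, in milliseconds
  p95: number;
  p99: number;
}

function summarize(samples: RequestSample[], windowSeconds: number): RedSummary {
  const durations = samples.map((s) => s.durationMs);
  const failures = samples.filter((s) => s.failed).length;
  return {
    ratePerSec: samples.length / windowSeconds,
    errorRate: failures / samples.length,
    p50: percentile(durations, 50),
    p95: percentile(durations, 95),
    p99: percentile(durations, 99),
  };
}

// Ten requests observed in a 5-second window; the slowest one also failed.
const samples: RequestSample[] = [12, 15, 18, 20, 22, 25, 30, 45, 120, 800].map(
  (durationMs) => ({ durationMs, failed: durationMs === 800 })
);

const red = summarize(samples, 5);
console.log(red);
```

With these samples the median is 22 ms while p99 is 800 ms, which is why latency SLOs are usually stated on high percentiles: a single slow request is invisible in the median.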

Prompt Snippet

Integrate OpenTelemetry for distributed tracing using @opentelemetry/sdk-node with auto-instrumentation for HTTP, Express, PostgreSQL (pg), and Redis (ioredis). Configure the OTLP exporter to send traces and metrics through an OTel Collector running as a sidecar, with traces stored in Grafana Tempo and metrics in Grafana Mimir. Apply tail-based sampling in the collector: keep 10% of normal traces and 100% of traces that contain an error or take longer than 1s. Tail-based sampling needs every span of a trace to reach the same collector instance, so route spans by trace ID to a shared collector tier if each sidecar would otherwise see only part of a trace. Propagate W3C trace context headers (traceparent, tracestate) across service boundaries. Define SLO-based alerts: error rate <0.1% and p99 latency <500ms evaluated over 5-minute windows, paging via PagerDuty when the error-budget burn rate exceeds 10x.
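The two numeric rules in this snippet, the sampling policy and the burn-rate alert, can be sketched as plain functions. This is an illustration of the decision logic only, not the OTel Collector's implementation: `TraceSummary`, `keepTrace`, `burnRate`, and `shouldPage` are hypothetical names, and hashing the low 32 bits of the trace ID is an assumed way to get a deterministic 10% sample.

```typescript
// Hypothetical summary of a completed trace, as a tail sampler would see it.
interface TraceSummary {
  traceId: string; // 32 hex characters
  hasError: boolean;
  durationMs: number;
}

const BASE_RATE = 0.1; // keep 10% of normal traces
const SLOW_MS = 1000; // always keep traces slower than 1s

// Tail-based decision: made once the whole trace is known.
function keepTrace(t: TraceSummary): boolean {
  if (t.hasError || t.durationMs > SLOW_MS) return true;
  // Deterministic sample: read the low 32 bits of the trace id as a number
  // in [0, 2^32) and keep the lowest BASE_RATE share of that range.
  // An unparsable id yields NaN, which compares false and is dropped.
  const bucket = parseInt(t.traceId.slice(-8), 16);
  return bucket < BASE_RATE * 2 ** 32;
}

// Burn rate: how many times faster than allowed the error budget is spent.
// errorBudget is the allowed error fraction, e.g. 0.001 for "error rate <0.1%".
function burnRate(observedErrorRate: number, errorBudget: number): number {
  return observedErrorRate / errorBudget;
}

const ERROR_BUDGET = 0.001;
const PAGE_THRESHOLD = 10; // page when the budget burns more than 10x too fast

function shouldPage(observedErrorRate: number): boolean {
  return burnRate(observedErrorRate, ERROR_BUDGET) > PAGE_THRESHOLD;
}

// A 1.2% error rate in the window burns the 0.1% budget about 12x too fast.
console.log(shouldPage(0.012)); // true
// A 0.5% error rate burns it about 5x too fast, below the paging threshold.
console.log(shouldPage(0.005)); // false
```

Deriving the keep/drop decision from the trace ID rather than a random number means every collector instance reaches the same answer for the same trace.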