Observability in distributed systems: practical monitoring, tracing and logging
Effective observability in distributed architectures requires an integrated strategy of metrics, tracing, and logging for efficient debugging and performance analysis.
Last updated: 3/8/2026
Executive summary
Observability is the ability to infer internal states of a system by observing only its external outputs. In monolithic architectures, debugging was relatively simple: you had logs in one place, metrics in another, and linear request traceability. In distributed systems, a single request passes through dozens of services, queues, databases, and caches — and understanding what happened when something fails becomes a distributed detective exercise.
For architects and tech leads, the decision is not "monitoring or not," but "what integrated strategy of metrics, tracing, and logging enables efficient debugging with acceptable P99 latency and MTTR (Mean Time To Recovery)." Observability silos (separating logs from metrics from traces) create data islands where insights get lost. Effective observability requires tight integration between the three layers: structured logging for debugging, metrics for alerting, and distributed tracing for flow correlation.
Observability pillars: Metrics, Logs, and Traces
Metrics: Real-time quantitative measures
Metrics are aggregated numerical measures that describe system behavior over time windows: requests per second, P95 latency, error rate, CPU utilization, memory usage.
Metric types:
- Counter: Monotonically increasing value (e.g. requests.total, errors.total)
- Gauge: Value that can go up and down (e.g. memory.current, connections.active)
- Histogram: Distribution of values in buckets (e.g. request_duration_seconds)
Practical implementation (Prometheus via prom-client):

```typescript
import { Counter, Histogram, Gauge } from 'prom-client';

const requestCounter = new Counter({
  name: 'http_requests_total',
  help: 'Total HTTP requests',
  labelNames: ['method', 'path', 'status'],
});

const requestDuration = new Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request duration',
  buckets: [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10],
});

const activeConnections = new Gauge({
  name: 'http_connections_active',
  help: 'Active HTTP connections',
});

// Application middleware (assumes an Express `app` is in scope):
// records request rate, duration, and in-flight connections
app.use((req, res, next) => {
  const start = Date.now();
  activeConnections.inc();
  res.on('finish', () => {
    const duration = (Date.now() - start) / 1000;
    requestCounter.inc({
      method: req.method,
      path: req.path,
      status: res.statusCode,
    });
    requestDuration.observe(duration);
    activeConnections.dec();
  });
  next();
});
```

When metrics solve:
- Real-time alerting (e.g. error rate >5% in 5 minutes)
- Performance trends (e.g. P99 latency increasing 20% per week)
- Capacity planning (e.g. memory utilization projected to exhaust in 30 days)
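The capacity-planning use above can be sketched as a linear projection of a gauge's trend. `daysUntilExhaustion` and its inputs are hypothetical names for illustration, not part of any library:

```typescript
// Hypothetical capacity-planning helper: linearly project memory growth
// (taken from a gauge's observed trend) to estimate days until exhaustion.
function daysUntilExhaustion(currentGb: number, limitGb: number, growthGbPerDay: number): number {
  if (growthGbPerDay <= 0) return Infinity; // no growth: no projected exhaustion
  return (limitGb - currentGb) / growthGbPerDay;
}
```

With 40 GB used of a 100 GB limit and 2 GB/day growth, the projection is 30 days, matching the example above.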
Metrics limits:
- Don't provide the context of what happened in a specific incident
- Aggregation loses individual detail (e.g. which specific request caused the spike?)
- Difficult to correlate with source code
Logs: Records of discrete events
Logs are textual records of specific events: request received, database query executed, error thrown. Structured logging (JSON) enables efficient querying and parsing.
Structured logging:
```typescript
import { randomUUID } from 'crypto';
import { createLogger, format, transports } from 'winston';

const logger = createLogger({
  format: format.combine(
    format.timestamp(),
    format.json()
  ),
  transports: [
    new transports.Console(),
    new transports.File({ filename: 'app.log' })
  ]
});

// Attach a correlation ID and log every incoming request
// (assumes an Express `app` is in scope)
app.use((req, res, next) => {
  const requestId = req.headers['x-request-id'] || randomUUID();
  req.requestId = requestId;
  logger.info('Request received', {
    requestId,
    method: req.method,
    path: req.path,
    userId: req.user?.id,
    userAgent: req.headers['user-agent']
  });
  next();
});

// Error handler: log with full context, respond generically
app.use((err, req, res, next) => {
  logger.error('Request failed', {
    requestId: req.requestId,
    error: err.message,
    stack: err.stack,
    path: req.path,
    userId: req.user?.id
  });
  res.status(500).json({ error: 'Internal server error' });
});
```

Appropriate log levels:
- ERROR: Errors impacting user experience requiring immediate investigation
- WARN: Abnormal conditions that don't impede execution (e.g. a successful retry)
- INFO: Normal operational events (e.g. request completed)
- DEBUG: Detailed information for troubleshooting (e.g. intermediate calculation values)
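As a minimal illustration of these levels, a retry outcome might be routed like this. `logRetry` and the `LevelLogger` shape are assumptions for the sketch, not a real API:

```typescript
// Minimal sketch: choose the level from the guidance above.
// A successful retry is abnormal but non-blocking (WARN);
// an exhausted retry budget impacts the user (ERROR).
interface LevelLogger {
  warn(msg: string, meta?: object): void;
  error(msg: string, meta?: object): void;
}

function logRetry(logger: LevelLogger, attempt: number, succeeded: boolean): 'warn' | 'error' {
  if (succeeded) {
    logger.warn('Retry succeeded', { attempt });
    return 'warn';
  }
  logger.error('Retry budget exhausted', { attempt });
  return 'error';
}
```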
When logs solve:
- Debugging specific incidents
- Audit and compliance
- Tracking execution flow in complex code
Log limits:
- Massive volume hinders querying at scale
- No natural correlation between distributed services
- Unstructured logging is useless in production
Distributed Tracing: Flow correlation between services
Distributed tracing enables tracking a complete request across multiple services, databases, and caches. Each hop is a span, and spans are connected into a trace by a shared trace ID.
Core concepts:
- Trace: Complete representation of a request through architecture
- Span: Individual unit of work (e.g. an HTTP request or a database query)
- Trace ID: Unique identifier connecting all spans
- Span ID: Unique identifier for each span
- Parent Span ID: Connects spans in hierarchy
Practical implementation (OpenTelemetry):
```typescript
import { trace, SpanStatusCode } from '@opentelemetry/api';

const tracer = trace.getTracer('my-service');

async function processOrder(orderId: string): Promise<void> {
  await tracer.startActiveSpan('processOrder', { attributes: { orderId } }, async (parentSpan) => {
    try {
      // Child span for the database query; the active context links it to the parent
      await tracer.startActiveSpan('queryOrders', async (dbSpan) => {
        try {
          // `db` is assumed to be the application's database client
          const order = await db.query('SELECT * FROM orders WHERE id = ?', [orderId]);
          dbSpan.setAttribute('orderId', orderId);
        } finally {
          dbSpan.end();
        }
      });

      // Child span for the external API call
      await tracer.startActiveSpan('callPaymentService', async (apiSpan) => {
        try {
          const response = await fetch(`https://payments.com/validate/${orderId}`);
          apiSpan.setAttribute('http.response.status_code', response.status);
        } finally {
          apiSpan.end();
        }
      });
    } catch (err) {
      parentSpan.recordException(err as Error);
      parentSpan.setStatus({ code: SpanStatusCode.ERROR });
      throw err;
    } finally {
      parentSpan.end();
    }
  });
}
```

When tracing solves:
- Debugging latency in distributed architectures (which service is the bottleneck?)
- Identifying cascade failures (where did failure start?)
- Understanding request flow in complex systems
Tracing limits:
- Sampling is mandatory at scale (tracing 100% of requests is expensive)
- Requires adoption across all services (partial tracing is of limited use)
- Visualization tools are complex and costly
Strategic integration: when to use each layer
Ideal use scenarios
Scenario 1: Error Rate Spike
- Metrics alert: the error rate derived from errors.total jumps from 1% to 10% in 5 minutes
- Logs detail: query logs show repeated database timeouts
- Traces correlate: traces reveal database timeouts in 95% of spans
Scenario 2: Gradually increasing P99 latency
- Metrics show: request_duration_seconds P99 increases from 200ms to 800ms over 2 weeks
- Traces identify: database query spans account for 70% of total latency
- Logs explain: database logs reveal missing indexes on a specific table
Scenario 3: Cascade failure in production
- Metrics alert: error rate and latency spikes simultaneously across multiple services
- Traces correlate: a single trace ID connects the failure in service A to the failure in service B
- Logs detail: service A logs show circuit breaker triggered
Observability silo anti-pattern
Implementing metrics in Prometheus, logs in Elasticsearch, and traces in Jaeger without integration. Each tool provides a partial view, and manual correlation is impossible at scale.
Solution: Unified tools (Grafana, Datadog, New Relic) or tight integration with OpenTelemetry for native correlation between metrics, logs, and traces.
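One hedged sketch of that native correlation: stamping each structured log record with the active trace context so any log line can be pivoted to its trace. The field names `trace_id`/`span_id` and the `correlate` helper are assumptions for illustration; backends differ:

```typescript
// Hypothetical correlation helper: merge the current span context into a
// structured log record so logs and traces share join keys.
interface SpanContext { traceId: string; spanId: string }

function correlate(
  record: Record<string, unknown>,
  ctx?: SpanContext,
): Record<string, unknown> {
  if (!ctx) return record; // no active span: emit the record unchanged
  return { ...record, trace_id: ctx.traceId, span_id: ctx.spanId };
}
```

In a real service, `ctx` would come from `trace.getSpan(context.active())?.spanContext()` in `@opentelemetry/api`.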
Trade-offs and operational complexity
Data volume vs. insight value
The problem: At scale, the volume of observability data can exceed the volume of business data. Logging 100KB per request at 10,000 requests/second generates 1GB/second of logs.
Sampling strategies:
- Metrics: Rarely sample (counters and gauges have minimal cost)
- Logs: Sample debug logs, keep error logs at 100%
- Traces: Sample at 1-10% for normal traffic, 100% for errors and slow requests
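The trace policy above can be sketched as a deterministic head-sampling decision. `keepTrace`, the 1-second slow-request threshold, and the 10% default ratio are assumptions for the sketch:

```typescript
// Sketch of the sampling policy: keep 100% of errors and slow requests,
// and a deterministic fraction of normal traffic keyed on the trace ID,
// so every service makes the same keep/drop decision for a given trace.
function keepTrace(traceId: string, isError: boolean, durationMs: number, ratio = 0.1): boolean {
  if (isError || durationMs > 1000) return true; // always keep errors and slow requests
  // Map the low 32 bits of the hex trace ID onto [0, 1) and compare to the ratio
  const bucket = parseInt(traceId.slice(-8), 16) / 0x100000000;
  return bucket < ratio;
}
```

Deterministic sampling by trace ID (rather than `Math.random()`) matters because a trace only stays useful if every service keeps or drops the same traces.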
Tool cost vs. debug value
SaaS tools (Datadog, New Relic, Splunk):
- High integration and ease of use
- Per-GB and per-host costs can be massive at scale
- Potential vendor lock-in
Self-hosted (Prometheus + Grafana + Loki + Tempo):
- Higher operational costs (maintenance, upgrades)
- Total control of data and privacy
- Reduced vendor lock-in
Context retention vs. query performance
Hot data vs. cold data:
- Hot (last 7 days): Indexed for fast querying, stored on fast SSD
- Cold (7-90 days): Compressed, stored on S3/GCS, slow querying
- Archive (90+ days): Only for audit and compliance, not debugging
Common anti-patterns
Anti-pattern: Console logging in production
Unstructured console logs with no timestamps and no correlation IDs. Such logs are useless for debugging at scale and impossible to parse in observability pipelines.
Anti-pattern: Tracing without context propagation
Implementing tracing in one service without propagating the tracing headers (trace ID, span ID) to downstream services. The trace cannot follow the complete request through the architecture.
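Propagation itself is small: the W3C Trace Context `traceparent` header carries the trace ID, span ID, and sampling flag to the next hop. A minimal build/parse sketch follows; real services should use the OpenTelemetry propagator rather than hand-rolling this:

```typescript
// Build and parse a W3C `traceparent` header: version-traceId-spanId-flags.
function buildTraceparent(traceId: string, spanId: string, sampled: boolean): string {
  return `00-${traceId}-${spanId}-${sampled ? '01' : '00'}`;
}

function parseTraceparent(header: string): { traceId: string; spanId: string; sampled: boolean } | null {
  const m = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(header);
  if (!m) return null; // malformed header: the service should start a new trace
  return { traceId: m[1], spanId: m[2], sampled: m[3] === '01' };
}
```

The downstream service parses the header, makes the parsed span ID the parent of its own spans, and forwards a new `traceparent` with the same trace ID on every outgoing call.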
Anti-pattern: Metrics without alerting
Collecting metrics en masse without defining pragmatic alerts. A beautiful dashboard with no alerts doesn't help during a production incident.
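A pragmatic alert like "error rate >5% in 5 minutes" reduces to a small check over cumulative counter samples. The `Sample` shape and thresholds here are assumptions for the sketch; in Prometheus this would be a PromQL alerting rule over `errors.total` and `requests.total`:

```typescript
// Sliding-window error-rate check over cumulative counter samples.
interface Sample { timestamp: number; total: number; errors: number }

function errorRate(samples: Sample[], windowMs: number, now: number): number {
  const inWindow = samples.filter((s) => now - s.timestamp <= windowMs);
  if (inWindow.length < 2) return 0; // not enough data to compute a rate
  const first = inWindow[0];
  const last = inWindow[inWindow.length - 1];
  const total = last.total - first.total;
  return total === 0 ? 0 : (last.errors - first.errors) / total;
}

function shouldAlert(samples: Sample[], now: number): boolean {
  return errorRate(samples, 5 * 60_000, now) > 0.05; // >5% over 5 minutes
}
```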
Anti-pattern: Inappropriate log levels
Using INFO for everything, or ERROR for normal conditions (e.g. a successful retry). Logs become noise, and the real debugging signal gets lost in a sea of irrelevant messages.
Observability maturity metrics
To evaluate a team's observability maturity:
- MTTR (Mean Time To Recovery): Average time to resolve incidents. High maturity: <15 minutes; Low maturity: >2 hours.
- Coverage rate: Percentage of services with integrated tracing. High maturity: >95%; Low maturity: <50%.
- Alert precision: Percentage of alerts corresponding to real incidents. High maturity: >90%; Low maturity: <50% (many false positives).
- Query latency: Time to execute complex debugging queries. High maturity: <5 seconds; Low maturity: >1 minute.
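Two of these maturity metrics are straightforward to compute from raw incident data. The `Incident` shape and helper names are assumptions for the sketch:

```typescript
// Compute MTTR (in minutes) and alert precision from incident records.
interface Incident { detectedAt: number; resolvedAt: number } // epoch millis

function mttrMinutes(incidents: Incident[]): number {
  if (incidents.length === 0) return 0;
  const totalMs = incidents.reduce((sum, i) => sum + (i.resolvedAt - i.detectedAt), 0);
  return totalMs / incidents.length / 60_000;
}

// Precision = fraction of fired alerts that corresponded to real incidents.
function alertPrecision(alertsFired: number, alertsMatchingRealIncidents: number): number {
  return alertsFired === 0 ? 1 : alertsMatchingRealIncidents / alertsFired;
}
```

Tracking these numbers per quarter makes the maturity thresholds above (e.g. MTTR under 15 minutes, precision over 90%) measurable rather than anecdotal.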
Implementation next steps
Phase 1: Foundation (Months 1-3)
- Implement structured logging across all services
- Add correlation ID to all requests
- Collect basic metrics (request rate, error rate, latency)
- Create initial health check dashboards
- Establish basic alerts (e.g. error rate >5%)
Phase 2: Distributed Tracing (Months 3-6)
- Implement OpenTelemetry SDK in all services
- Propagate tracing headers in service calls
- Visualize traces in tracing tool (Jaeger, Tempo, Datadog)
- Create latency dashboards per service
- Analyze traces to identify bottlenecks
Phase 3: Advanced Observability (Months 6-12)
- Implement intelligent sampling (100% for errors/slow requests)
- Create anomaly-based alerts (machine learning detection)
- Integrate logs, metrics, and traces in unified dashboard
- Automate runbooks based on incident patterns
- Implement SLOs (Service Level Objectives) and SLIs (Service Level Indicators)
Phase 4: Observability as Culture (Months 12+)
- Train team on debugging with observability
- Create incident response playbooks with integrated observability
- Implement blameless post-mortems with observability data
- Govern observability schema evolution
- Automate performance regression detection
Is your distributed architecture suffering from painful incident debugging, long MTTR, and a lack of performance visibility? Talk to Imperialis about observability to implement an integrated metrics, tracing, and logging strategy that reduces MTTR and improves reliability.