© 2026 LIBREUNI PROJECT

Observability & Architecture Fitness

Architecture needs feedback. Without feedback, diagrams become wishes and decisions become folklore. OpenTelemetry’s model of traces, metrics, and logs supplies runtime evidence; evolutionary-architecture practice adds design evidence through automated fitness functions that verify dependency rules, contract compatibility, latency budgets, security policies, cost thresholds, and resilience expectations.

The goal is not to monitor everything. The goal is to know whether the system is meeting the qualities it was designed to protect. If availability is a top quality, the architecture needs SLOs and error-budget signals. If modifiability is a top quality, the architecture needs dependency checks, cycle detection, module ownership, and lead-time tracking. If cost efficiency matters, the architecture needs unit economics and capacity signals.

Code
left to right direction
rectangle "Architecture Intent" as Intent
rectangle "Runtime Telemetry" as Telemetry
rectangle "Fitness Functions" as Fitness
rectangle "Decision Review" as Review
rectangle "Architecture Evolution" as Evolution

Intent --> Telemetry : what to observe
Intent --> Fitness : what to verify
Telemetry --> Review : evidence
Fitness --> Review : evidence
Review --> Evolution : change decisions
Evolution --> Intent : updated intent

Observability for Architecture

Observability is the ability to understand system behavior from emitted signals. For architecture, the most useful signals often show relationships: service dependency maps, trace waterfalls, queue depth, saturation, error-budget burn, cache hit rates, database wait events, contract errors, deployment correlations, and tenant-level behavior.

Logs explain events. Metrics quantify trends. Traces show causality across boundaries. Events capture domain facts. Profiles reveal resource use. None is sufficient alone. A trace may show that checkout is slow because payment is slow, while metrics show the error-budget impact, logs show provider rejection details, and domain events reveal how many orders are stuck.
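As a rough illustration of how these signals combine, the sketch below joins spans and log records on a shared trace identifier to explain a slow checkout. The dictionary shapes, the field names (`trace_id`, `parent`, `duration_ms`, `message`), and the 2000 ms threshold are assumptions made for this example; this is not an OpenTelemetry API.

```python
def explain_slow_traces(spans, logs, slow_ms=2000):
    """For each slow trace, name the slowest child span and attach its logs.

    `spans` and `logs` are plain dicts with hypothetical field names.
    """
    findings = []
    for trace_id in sorted({s["trace_id"] for s in spans}):
        trace_spans = [s for s in spans if s["trace_id"] == trace_id]
        roots = [s for s in trace_spans if s["parent"] is None]
        # Skip traces without a root span or whose root finished fast enough.
        if not roots or roots[0]["duration_ms"] < slow_ms:
            continue
        children = [s for s in trace_spans if s["parent"] is not None]
        slowest = max(children, key=lambda s: s["duration_ms"], default=None)
        findings.append({
            "trace_id": trace_id,
            "slowest_child": slowest["name"] if slowest else None,
            "logs": [r["message"] for r in logs if r["trace_id"] == trace_id],
        })
    return findings
```

The trace identifies where the time went, and the logs attached to the same trace supply the detail. Metrics and domain events would be joined in a similar way, by time window or by order identifier.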

Code
left to right direction
rectangle "User Journey\nCheckout" as Journey
rectangle "Trace" as Trace
rectangle "Metrics" as Metrics
rectangle "Logs" as Logs
rectangle "Domain Events" as Events
rectangle "Architectural Insight" as Insight

Journey --> Trace : causal path
Journey --> Metrics : latency and errors
Journey --> Logs : detailed context
Journey --> Events : business progress
Trace --> Insight
Metrics --> Insight
Logs --> Insight
Events --> Insight

SLOs and Error Budgets

Service-level objectives connect architecture to user experience. An SLO might state that, over thirty days, 99.9 percent of checkout attempts complete successfully within two seconds, excluding attempts rejected for invalid payment details. The exact wording matters because it defines which user experience counts as good and what the team will optimize.
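The SLO wording above maps fairly directly onto a computation. The following is a minimal sketch, assuming a hypothetical `CheckoutAttempt` record and reading "within two seconds" as at most 2.0 seconds; a real SLI would be derived from telemetry collected over the thirty-day window.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CheckoutAttempt:
    # Hypothetical record shape, for illustration only.
    succeeded: bool
    latency_seconds: float
    invalid_payment_details: bool = False


def checkout_sli(attempts, latency_threshold_seconds=2.0):
    """Fraction of eligible attempts that succeeded within the threshold."""
    # Attempts with invalid payment details are excluded by the SLO wording.
    eligible = [a for a in attempts if not a.invalid_payment_details]
    if not eligible:
        return None  # no eligible traffic, so the SLI is undefined
    good = sum(
        1 for a in eligible
        if a.succeeded and a.latency_seconds <= latency_threshold_seconds
    )
    return good / len(eligible)


def meets_slo(sli, target=0.999):
    """True when the measured SLI reaches the SLO target."""
    return sli is not None and sli >= target
```

Note that a successful but slow attempt counts against the SLI, while an excluded attempt affects neither the numerator nor the denominator.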

Error budgets create decision pressure. If a service is burning budget too quickly, reliability work becomes more important than feature release. If the service is comfortably within budget, the team may accept more change risk. This turns reliability from an abstract virtue into a management mechanism. The architecture should support measuring the SLO directly, not through proxies that hide user pain.
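The decision pressure can be sketched as a burn-rate calculation: the observed error rate divided by the error rate the SLO allows. The function names and the fast-burn threshold of 2.0 below are illustrative assumptions, not values prescribed by the text.

```python
def error_budget(slo_target, total_events):
    """Number of bad events the SLO tolerates for this volume of traffic."""
    return (1.0 - slo_target) * total_events


def burn_rate(bad_events, total_events, slo_target):
    """Observed error rate divided by the allowed error rate.

    A sustained value of 1.0 spends the budget exactly over the SLO window;
    higher values exhaust it early.
    """
    if total_events == 0:
        return 0.0
    allowed_error_rate = 1.0 - slo_target
    return (bad_events / total_events) / allowed_error_rate


def release_decision(rate, fast_burn_threshold=2.0):
    """Map a burn rate to a coarse decision (threshold is an assumption)."""
    if rate >= fast_burn_threshold:
        return "prioritize reliability work"
    return "continue releasing"


# 30 failed checkouts out of 10,000 against a 99.9 percent target.
rate = burn_rate(30, 10_000, 0.999)
print(round(rate, 2), release_decision(rate))
```

In practice the threshold and the measurement window are policy choices that the team agrees on in advance, so that the decision is not renegotiated during an incident.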

Code
left to right direction
actor "User" as User
rectangle "Critical Journey" as Journey
rectangle "SLI\nsuccess latency" as SLI
rectangle "SLO\n99.9 percent target" as SLO
rectangle "Error Budget" as Budget
rectangle "Release Decision" as Release
rectangle "Reliability Work" as Reliability

User --> Journey
Journey --> SLI
SLI --> SLO
SLO --> Budget
Budget --> Release : healthy
Budget --> Reliability : burning fast

Architecture Fitness Functions

A fitness function is an executable check that reports whether an architectural property still holds. It might fail the build if a domain module imports infrastructure code, if an API change breaks a consumer contract, if a Terraform change exposes a database publicly, if a service exceeds its latency budget in a performance test, or if a container image contains a critical vulnerability.
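As one possible shape for the dependency rule, the sketch below uses Python's standard `ast` module to find imports of a forbidden package in a piece of source code. The package names `shop.domain` and `shop.infrastructure` are hypothetical, and relative imports are not resolved in this simplified version.

```python
import ast


def forbidden_imports(source, forbidden_prefix):
    """Return the imported module names in `source` under `forbidden_prefix`."""
    found = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            # `node.module` is None for imports such as `from . import x`.
            names = [node.module] if node.module else []
        else:
            continue
        for name in names:
            # Match the package itself or any submodule, but not lookalike
            # names such as `shop.infrastructure_utils`.
            if name == forbidden_prefix or name.startswith(forbidden_prefix + "."):
                found.append(name)
    return sorted(found)


# Hypothetical domain module that violates the rule twice.
domain_source = (
    "import shop.infrastructure.db\n"
    "from shop.domain import order\n"
    "from shop.infrastructure import queue\n"
)
violations = forbidden_imports(domain_source, "shop.infrastructure")
print(violations)  # ['shop.infrastructure', 'shop.infrastructure.db']
```

A pipeline step could run this over every file in the domain package and fail the build when the returned list is non-empty.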

Fitness functions should be few, meaningful, and connected to decisions. Too many checks create noise. Too few checks let architecture decay. The best checks are those that prevent expensive drift: dependency direction, module boundaries, contract compatibility, security invariants, migration safety, and operational readiness.
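Contract compatibility, one of the checks named above, can be approximated by comparing what a consumer expects with what a provider currently offers. This is a deliberately small sketch: the flat field-to-type dictionaries and the field names are assumptions, and dedicated contract-testing tools cover far more cases.

```python
def breaking_changes(provider_fields, consumer_expectations):
    """List the ways the provider schema fails the consumer's expectations.

    Both arguments map field names to type names (hypothetical flat schemas).
    Extra provider fields are ignored because they do not break the consumer.
    """
    problems = []
    for field, expected_type in sorted(consumer_expectations.items()):
        if field not in provider_fields:
            problems.append(f"missing field: {field}")
        elif provider_fields[field] != expected_type:
            actual_type = provider_fields[field]
            problems.append(
                f"type changed: {field} ({expected_type} -> {actual_type})"
            )
    return problems


provider = {"order_id": "string", "total": "number", "currency": "string"}
consumer = {"order_id": "string", "total": "integer", "status": "string"}
print(breaking_changes(provider, consumer))
# ['missing field: status', 'type changed: total (integer -> number)']
```

An empty result means the change is safe for that consumer; anything else is a reason to fail the pipeline or open a conversation with the consuming team.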

Code
rectangle "Fitness Function Suite" as Suite {
rectangle "Dependency Rule Check" as Dep
rectangle "Contract Compatibility Test" as Contract
rectangle "Security Policy Check" as Security
rectangle "Performance Budget Test" as Perf
rectangle "Cost Threshold Alert" as Cost
}
rectangle "Pipeline" as Pipeline
rectangle "Production Telemetry" as Prod
rectangle "Architecture Review" as Review

Pipeline --> Dep
Pipeline --> Contract
Pipeline --> Security
Pipeline --> Perf
Prod --> Cost
Dep --> Review
Contract --> Review
Security --> Review
Perf --> Review
Cost --> Review

Socio-Technical Metrics

Architecture is socio-technical, so some fitness signals come from delivery and collaboration. Lead time, deployment frequency, change failure rate, time to restore, code ownership concentration, dependency wait time, review bottlenecks, and onboarding friction can expose architectural problems. If a simple feature requires five teams and three release windows, the architecture is communicating through delay.

These metrics should be interpreted carefully. They are signals, not weapons. DORA’s current guidance explicitly stresses application context and continuous improvement, so a high change failure rate should trigger diagnosis rather than blame. It may indicate brittle tests, unclear ownership, risky deployment, or excessive coupling. A long lead time may indicate compliance gates, unclear requirements, or architecture that forces cross-team coordination. Senior architects use the metrics to ask better questions, not to shame teams.
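Two of the delivery signals mentioned above, lead time and change failure rate, can be derived from deployment records. The sketch below assumes a hypothetical `Deployment` record with a commit time, a deploy time, and a failure flag; real definitions vary between teams and tools.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median


@dataclass(frozen=True)
class Deployment:
    # Hypothetical record shape, for illustration only.
    committed_at: datetime
    deployed_at: datetime
    caused_failure: bool


def median_lead_time_hours(deployments):
    """Median time from commit to deployment, in hours."""
    if not deployments:
        return None
    return median(
        (d.deployed_at - d.committed_at).total_seconds() / 3600
        for d in deployments
    )


def change_failure_rate(deployments):
    """Share of deployments that caused a failure in production."""
    if not deployments:
        return None
    failures = sum(1 for d in deployments if d.caused_failure)
    return failures / len(deployments)
```

The numbers only start the conversation. A rising median or failure rate is a prompt to look for the coupling, gating, or ownership issue behind it.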

Code
left to right direction
rectangle "Architecture Health" as Health
rectangle "Runtime Signals" as Runtime
rectangle "Delivery Signals" as Delivery
rectangle "Team Signals" as Team
rectangle "User Signals" as User

Runtime --> Health : latency, errors, saturation
Delivery --> Health : lead time, deploy frequency
Team --> Health : ownership, cognitive load
User --> Health : task success, complaints
Health --> Runtime : improvement hypotheses

Feedback Cadence

Feedback has cadence. Some checks run on every commit. Some run nightly. Some are reviewed weekly. Some appear during quarterly architecture review. The cadence should match the risk. A public database exposure should fail immediately. A cost trend may need weekly review. A domain boundary concern may need review when change coordination rises.
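One lightweight way to make cadence explicit is to record it next to each check. The check names below are hypothetical, and the nightly slot for the performance budget is an assumption; the other assignments echo the examples in the paragraph above.

```python
from enum import Enum


class Cadence(Enum):
    EVERY_COMMIT = "every commit"
    NIGHTLY = "nightly"
    WEEKLY = "weekly"
    QUARTERLY = "quarterly"


# Hypothetical registry: higher-risk checks get the shorter feedback loop.
CHECK_CADENCE = {
    "public_database_exposure": Cadence.EVERY_COMMIT,
    "dependency_direction": Cadence.EVERY_COMMIT,
    "performance_budget": Cadence.NIGHTLY,
    "cost_trend": Cadence.WEEKLY,
    "architecture_decision_review": Cadence.QUARTERLY,
}


def checks_for(cadence):
    """Names of the checks scheduled at the given cadence, sorted."""
    return sorted(name for name, c in CHECK_CADENCE.items() if c is cadence)
```

Signal-triggered reviews, such as a domain boundary concern raised by growing coordination cost, do not fit a fixed schedule and would need a separate trigger.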

Architecture review should be evidence-based. Instead of asking whether the system is “clean,” ask which decision assumptions are still true, which fitness functions are failing, which quality scenarios are at risk, and which options are closing. This makes architecture evolution a normal engineering practice rather than a special ceremony.

Practice

Choose three decisions from earlier modules and define one fitness function for each. At least one should be a pipeline check, one should be a production telemetry check, and one should be a delivery or organizational signal. State what action should happen when each check fails.
