Fabien Fouassier

Cross-Platform Observability at Marcel

How I designed and built a cross-platform observability system coordinating Flutter Crashlytics, GCP structured logging, and Sentry — unified through shared correlation identifiers and an explicit error ownership contract.

Observability · Flutter · Laravel · Architecture · May 01, 2026 · 9 min read

Cross-platform observability isn't about connecting tools — it's about making accountability explicit. Which system owns which errors, what identifiers cross the boundary, and how you reconstruct an incident when the failure happened on one platform and the symptom appeared on the other. This article covers the system I designed and built at Marcel to answer those questions.

Key takeaways

  • An error ownership contract prevents both double-reporting and silent gaps across platforms
  • Two correlation identifiers — per-request and per-device — are sufficient to link Crashlytics, GCP, and Sentry into a unified triage surface
  • Deterministic sampling by device prevents log storms from individual clients while preserving signal across the fleet

This is a companion to the Marcel stabilization case study. That post covers how I recovered a production Flutter app from structural instability. This one covers the observability system I designed and built in its wake — and the incident that forced the issue.


The Incident That Built the Argument

In early 2025, users were failing to complete onboarding on Marcel. The failure was in the book search step — searches were returning zero results. Nobody could reproduce it. Not me, not QA, not the client.

I investigated by reconstructing the full context: the specific onboarding step, the intermittency of the failure, my hypotheses about what could produce zero results without an error state. The root cause turned out to be a pagination bug: the infinite-scroll search was requesting the first page on every call, so every request was identical, the backend was rate-limiting the client with 429s, and the app was silently returning empty results. The bug had existed since November 2024.

The insight wasn't just the fix. It was what the investigation exposed: I had no way to find that 429 pattern in production. No correlation between what the mobile client experienced and what the backend logged. No alert when rate-limiting began. No way to tell whether users were hitting this at scale or in isolation.

I had been advocating for proper observability for months. This incident gave me the argument I needed. The client approved the work.


Starting Point

The backend had 15 log statements across the entire codebase. No request_id. No shared context between log lines. No external error tracking. The exception handler wrote 5xx errors to a rotating file — visible only if someone actively opened Cloud Logging and searched for them.

Performance thresholds were set at 3 seconds for requests and 1 second for queries. Both fired constantly under normal load, filling logs with noise and burying any signal that existed.

On mobile, Firebase Crashlytics was configured but operating independently. No correlation to backend events. No shared identifier. A crash report on Crashlytics and a 500 on the backend could relate to the same user action, and there was no mechanism to know.

The target I designed for: any unhandled backend exception creates a Sentry issue with full context within seconds; any incident can be fully reconstructed in under 60 seconds using a single identifier present in the error response, the GCP log entry, and the Crashlytics event.


The Architecture

Two correlation identifiers

The system coordinates around two shared identifiers propagated as HTTP headers.

X-Request-Id is the per-request correlation key. The backend generates a UUID on every request (or accepts one from upstream), injects it into all Monolog context for that request's lifetime, and echoes it in both the response header and JSON error bodies. On the mobile side, the Flutter LoggingInterceptor extracts it from the response — checking the header first, falling back to the JSON body field — and attaches it to Crashlytics custom keys and HTTP error log messages. On the backend, every captured Sentry exception is tagged with it. A Crashlytics crash report can be directly traced to a specific GCP log entry and a Sentry issue using this single value.

The mobile client also sends X-Request-Id proactively on outgoing requests. This means correlation is available for requests that the backend receives but fails to process — not just for requests that return a response. If a request reaches the backend and triggers a 500 before the response middleware runs, the request ID is still present in the log context.
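A minimal sketch of the mobile half of this flow as a Dio interceptor. Only the LoggingInterceptor name, the header, the body-field fallback, and the Crashlytics calls come from the system described here; the field and method names are illustrative.

```dart
import 'package:dio/dio.dart';
import 'package:firebase_crashlytics/firebase_crashlytics.dart';
import 'package:uuid/uuid.dart';

/// Sketch of the request-ID half of the LoggingInterceptor: send
/// X-Request-Id proactively, read it back on errors (header first,
/// JSON body as fallback), and attach it to Crashlytics.
class LoggingInterceptor extends Interceptor {
  static const _header = 'X-Request-Id';
  final _uuid = Uuid();

  @override
  void onRequest(RequestOptions options, RequestInterceptorHandler handler) {
    // Proactive: correlation exists even when the backend fails before
    // its response middleware runs.
    options.headers.putIfAbsent(_header, () => _uuid.v4());
    handler.next(options);
  }

  @override
  void onError(DioException err, ErrorInterceptorHandler handler) {
    final response = err.response;
    final id = response == null ? null : _extractRequestId(response);
    if (id != null) {
      // Custom key first, so a later crash report carries the value,
      // then the HTTP error log message itself.
      FirebaseCrashlytics.instance.setCustomKey('request_id', id);
      FirebaseCrashlytics.instance.log(
          'HTTP ${response?.statusCode} ${err.requestOptions.path} request_id=$id');
    }
    handler.next(err);
  }

  // Header first, JSON error body field as a fallback.
  String? _extractRequestId(Response response) {
    final fromHeader = response.headers.value(_header);
    if (fromHeader != null) return fromHeader;
    final data = response.data;
    if (data is Map && data['request_id'] is String) {
      return data['request_id'] as String;
    }
    return null;
  }
}
```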

X-Installation-Id is the per-device correlation key. A UUID generated on first launch, stored locally, and injected by the Flutter InstallationIdInterceptor on every outgoing request. The backend resolves it to an Installation record and injects it into shared log context. On Sentry, every captured exception is tagged with it. It is also set as a permanent Crashlytics custom key, making it possible to filter Crashlytics events by device and cross-reference with backend logs that use it as a primary sampling key.

This identifier serves a second function: it's the primary key for the deterministic sampler on the backend.
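A sketch of the installation-ID side, under the same caveat: the InstallationIdInterceptor name, the header, and the permanent Crashlytics key are from the design above, while the persistence details shown are illustrative.

```dart
import 'package:dio/dio.dart';
import 'package:firebase_crashlytics/firebase_crashlytics.dart';
import 'package:shared_preferences/shared_preferences.dart';
import 'package:uuid/uuid.dart';

/// Sketch: generate the installation ID once, persist it, and attach it
/// to every outgoing request and to Crashlytics.
class InstallationIdInterceptor extends Interceptor {
  InstallationIdInterceptor(this._installationId);

  final String _installationId;

  /// Loads the stored ID or creates one on first launch.
  static Future<InstallationIdInterceptor> create() async {
    final prefs = await SharedPreferences.getInstance();
    var id = prefs.getString('installation_id');
    if (id == null) {
      id = Uuid().v4();
      await prefs.setString('installation_id', id);
    }
    // Permanent Crashlytics key: lets crash reports be filtered by device
    // and cross-referenced with backend logs keyed on the same value.
    await FirebaseCrashlytics.instance.setCustomKey('installation_id', id);
    return InstallationIdInterceptor(id);
  }

  @override
  void onRequest(RequestOptions options, RequestInterceptorHandler handler) {
    options.headers['X-Installation-Id'] = _installationId;
    handler.next(options);
  }
}
```

Registered on the shared Dio client alongside the other interceptors, this puts the header on every outgoing request.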

The error ownership contract

The most important design decision in a cross-platform observability system isn't a tool choice — it's an explicit partition of accountability. Without it, systems either double-report events (alert volume doubles, noise wins) or leave gaps (each platform assumes the other is responsible).

I made the ownership explicit as a contract enforced mechanically on both sides:

  • 5xx server errors: backend writes a GCP log and captures in Sentry; mobile records a breadcrumb only, no Crashlytics event
  • 4xx client errors: backend writes a sampled structured log; mobile emits a LoggingInterceptor warning only
  • 429 rate limit: backend logs at 100% sampling; mobile records a breadcrumb only
  • Connectivity / timeout: not applicable on the backend; mobile records a full Crashlytics event, since mobile owns this class
  • Flutter / platform crash: not applicable on the backend; mobile reports to Crashlytics as fatal or non-fatal

The CrashlyticsInterceptor on mobile enforces the mobile side of this contract mechanically. A 5xx response triggers a breadcrumb log but no Crashlytics error event — the backend owns 5xx. A connectivity failure triggers a full CrashlyticsReporter.record() — the mobile client owns what the backend never saw. The ownership boundary is in code, not in documentation.
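A condensed sketch of that enforcement in a Dio error interceptor. The decision logic mirrors the contract above; the production CrashlyticsInterceptor classifies more cases and delegates to CrashlyticsReporter.record() rather than calling Crashlytics directly.

```dart
import 'package:dio/dio.dart';
import 'package:firebase_crashlytics/firebase_crashlytics.dart';

/// Sketch of the ownership contract in code: the backend owns 5xx,
/// the mobile client owns what the backend never saw.
class CrashlyticsInterceptor extends Interceptor {
  @override
  void onError(DioException err, ErrorInterceptorHandler handler) {
    final status = err.response?.statusCode;

    if (status != null && status >= 500) {
      // Backend owns 5xx: breadcrumb only, no Crashlytics error event.
      FirebaseCrashlytics.instance
          .log('5xx from ${err.requestOptions.path} (status=$status)');
    } else if (err.type == DioExceptionType.connectionTimeout ||
        err.type == DioExceptionType.receiveTimeout ||
        err.type == DioExceptionType.connectionError) {
      // Mobile owns connectivity failures the backend never observed.
      FirebaseCrashlytics.instance.recordError(
        err,
        err.stackTrace,
        reason: 'connectivity failure on ${err.requestOptions.path}',
        fatal: false,
      );
    }
    // 4xx stays with the LoggingInterceptor as a warning, per the contract.
    handler.next(err);
  }
}
```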

Deterministic sampling

Naive random sampling for noisy 4xx events creates an asymmetric debugging problem: whether a given error shows up in the logs comes down to chance rather than to anything about the event itself. A single misbehaving client hammering validation endpoints produces unpredictable log volume and makes it hard to distinguish a pattern from a coincidence.

The sampler uses CRC32 hashing over 100,000 buckets. The same installation, user, or request always hashes to the same bucket and always produces the same sampling decision. Sampling key priority is installation_id > user_id > request_id. Per-status sample rates are individually tunable via environment variables and set conservatively: 429 at 100%, 403 and 401 at 10%, 422 at 5%, 404 and 400 at 1%.
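The production sampler lives in the Laravel backend; the sketch below only restates its bucketing logic, in Dart to keep the examples in one language, with an inline CRC32 so it stands alone. The bucket count, key priority, and rates are the ones above (hard-coded here, environment-tunable in production); the function shape is illustrative.

```dart
/// Restatement of the deterministic sampler: the same key always hashes
/// to the same bucket, so the same client always gets the same sampling
/// decision for a given status code.
const int kBuckets = 100000;

// Per-status sample rates; tunable via environment variables in production.
const Map<int, double> kRates = {
  429: 1.00,
  403: 0.10,
  401: 0.10,
  422: 0.05,
  404: 0.01,
  400: 0.01,
};

bool shouldLog({
  required int status,
  String? installationId,
  String? userId,
  required String requestId,
}) {
  // Key priority: installation_id > user_id > request_id.
  final key = installationId ?? userId ?? requestId;
  // Unlisted statuses log at 100% (an assumption for this sketch).
  final rate = kRates[status] ?? 1.0;
  final bucket = _crc32(key.codeUnits) % kBuckets;
  return bucket < (rate * kBuckets);
}

// Bitwise CRC32 (IEEE), inlined so the sketch has no dependencies.
int _crc32(List<int> bytes) {
  var crc = 0xFFFFFFFF;
  for (final b in bytes) {
    crc ^= b & 0xFF;
    for (var i = 0; i < 8; i++) {
      crc = (crc & 1) != 0 ? (crc >> 1) ^ 0xEDB88320 : crc >> 1;
    }
  }
  return (crc ^ 0xFFFFFFFF) & 0xFFFFFFFF;
}
```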

The result is stable, predictable log volume per client. A device that consistently triggers validation failures generates a consistent, low-volume signal — easy to detect as a pattern, easy to filter, and not capable of flooding the log stream regardless of request rate.

Mobile crash classification and deduplication

The CrashlyticsReporter centralizes all mobile crash reporting with classification, context enrichment, and deduplication. Every error is classified before reporting — HTTP 4xx, HTTP 5xx, connectivity failures, out of memory, image decode failures, permission denials — with Crashlytics custom keys set per category. HTTP errors carry method, host, path, status, and request_id. This makes Crashlytics filterable by error shape rather than by raw exception type.
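A rough sketch of that classification step. The idea of mapping errors to a small set of shapes and setting per-category custom keys is the production behaviour; the category names and the classifier shown here are illustrative and cover only a subset of the cases listed above.

```dart
import 'package:dio/dio.dart';
import 'package:firebase_crashlytics/firebase_crashlytics.dart';

/// Illustrative error categories; the production reporter's set is richer.
enum ErrorCategory {
  http4xx,
  http5xx,
  connectivity,
  outOfMemory,
  imageDecode,
  permission,
  unknown,
}

ErrorCategory classify(Object error) {
  if (error is DioException) {
    final status = error.response?.statusCode;
    if (status != null && status >= 500) return ErrorCategory.http5xx;
    if (status != null && status >= 400) return ErrorCategory.http4xx;
    // No HTTP status: treat as connectivity for this sketch.
    return ErrorCategory.connectivity;
  }
  if (error is OutOfMemoryError) return ErrorCategory.outOfMemory;
  return ErrorCategory.unknown;
}

/// HTTP errors carry method, host, path, status, and request_id as keys,
/// so Crashlytics is filterable by error shape rather than exception type.
Future<void> tagHttpError(DioException err, String? requestId) async {
  final crashlytics = FirebaseCrashlytics.instance;
  await crashlytics.setCustomKey('error_category', classify(err).name);
  await crashlytics.setCustomKey('http_method', err.requestOptions.method);
  await crashlytics.setCustomKey('http_host', err.requestOptions.uri.host);
  await crashlytics.setCustomKey('http_path', err.requestOptions.path);
  await crashlytics.setCustomKey('http_status', err.response?.statusCode ?? 0);
  if (requestId != null) {
    await crashlytics.setCustomKey('request_id', requestId);
  }
}
```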

Deduplication suppresses repeated reports of the same error fingerprint within a 2-minute window, capped at 50 tracked fingerprints with LRU eviction. The fingerprint for HTTP errors combines method, host, path, and status code. This prevents a single broken screen from generating hundreds of Crashlytics events for the same underlying failure.
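A simplified version of the deduplication logic. The 2-minute window, the 50-fingerprint cap with LRU eviction, and the HTTP fingerprint fields are as described above; the class shape is illustrative.

```dart
import 'dart:collection';

/// Sketch of the CrashlyticsReporter deduplication: suppress repeats of
/// the same fingerprint for 2 minutes, track at most 50 fingerprints,
/// and evict the least recently used one when full.
class ErrorDeduplicator {
  static const _window = Duration(minutes: 2);
  static const _maxTracked = 50;

  // Insertion-ordered map, refreshed on each report, so the first key
  // is always the least recently used fingerprint.
  final _lastReported = LinkedHashMap<String, DateTime>();

  /// Returns true if this error should be reported to Crashlytics.
  bool shouldReport(String fingerprint, {DateTime? now}) {
    final at = now ?? DateTime.now();
    final previous = _lastReported[fingerprint];
    if (previous != null && at.difference(previous) < _window) {
      return false; // same fingerprint reported within the window
    }
    _lastReported.remove(fingerprint); // refresh LRU position
    _lastReported[fingerprint] = at;
    if (_lastReported.length > _maxTracked) {
      _lastReported.remove(_lastReported.keys.first); // LRU eviction
    }
    return true;
  }

  /// Fingerprint for HTTP errors: method, host, path, and status code.
  static String httpFingerprint(
          String method, String host, String path, int status) =>
      '$method $host$path $status';
}
```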

Backend structured logging

The backend writes to two sinks simultaneously: a human-readable rotating file for local debugging, and a JSON file consumed by the GCP Ops Agent for shipping to Cloud Logging. The GoogleOpsAgentFormatter maps Monolog fields to GCP schema keys and applies recursive sensitive-key redaction across 13 patterns — authorization headers, tokens, secrets, credentials — before serialization.
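The formatter itself is a Monolog formatter on the PHP side; the sketch below only restates the recursive redaction idea, in Dart for consistency with the other examples and with an abbreviated pattern list.

```dart
/// Illustrative subset of the sensitive-key patterns; the production
/// formatter checks 13 of them before serializing to JSON.
const _sensitivePatterns = [
  'authorization',
  'token',
  'secret',
  'password',
  'credential',
];

/// Recursively replaces values whose keys look sensitive, walking nested
/// maps and lists so redaction applies at any depth of the log context.
Object? redact(Object? value) {
  if (value is Map) {
    return value.map((key, v) {
      final k = key.toString().toLowerCase();
      final sensitive = _sensitivePatterns.any(k.contains);
      return MapEntry(key, sensitive ? '[REDACTED]' : redact(v));
    });
  }
  if (value is List) return value.map(redact).toList();
  return value;
}
```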

Request context is injected by middleware before the controller runs and extended after the response: request_id, installation_id, HTTP method, and path on entry; user_id, named route, status code, and duration on exit. Every log line within a request's lifetime carries the full context automatically.

Performance thresholds were raised from 3 seconds to 8 seconds for requests and from 1 second to 2 seconds for queries. Slow request warnings are rate-limited per route per 5 minutes via cache, preventing a single slow endpoint from generating repeated warnings that bury real errors.


What It Looks Like in Practice

Since deployment, the observability system has surfaced issues I would not have found otherwise — and more importantly, issues I could actually triage.

The workflow for any incident now starts with a request_id. If it surfaces in Crashlytics, it traces directly to a GCP log entry with full request context and to a Sentry issue with stack trace, breadcrumbs, and tags. If it surfaces in Sentry, it traces back to the mobile event that triggered it. The reconstruction that previously required active log-searching and guesswork now takes under a minute.

The 429 pattern that exposed the onboarding bug would have been visible immediately.


Honest Limitations

The system has known constraints worth naming.

The mobile logging stack emits to console only. In staging and production, the minimum log level is warning and error respectively, so most request-level logs are silent outside of development. There is no structured log aggregator on the Flutter side — Crashlytics captures crashes and non-fatal errors, but the detailed request-level logs visible in development don't ship anywhere in production. This is a deliberate trade-off: adding a mobile log aggregator adds cost and operational surface for a team that doesn't yet have the volume to justify it.

The CrashlyticsReporter deduplication window resets on app restart. A crash loop that triggers at startup — before the deduplication map is populated — could re-report the same error on each launch. This is an edge case worth watching as usage grows.


Tech Stack

  • Mobile: Flutter, Firebase Crashlytics, Dio interceptors
  • Backend: Laravel, GCP Cloud Logging (Ops Agent), Sentry
  • Coordination: X-Request-Id and X-Installation-Id HTTP headers, deterministic CRC32 sampler, explicit error ownership contract