The Monday Morning Bottleneck
Every Monday at 08:15, a senior analyst at a mid-market omnichannel retailer opened three spreadsheets, a SQL console, and a stale CRM export. The task was always the same: reconcile the weekend's customer data across the point-of-sale system, the e-commerce platform, the loyalty programme, the email marketing tool, the call-centre logs, and the wholesale portal. It consumed roughly four hours. If a loyalty card number was mistyped at checkout, or if an e-commerce session was never linked to an email, those records drifted into an ever-growing orphan pool. By the time the reconciliation finished, it was already midday, and the resulting CSV was already out of date.
This scenario is not unique. According to the 2024 State of Analytics Engineering survey by dbt Labs, 42% of analytics teams report that data-quality and integration issues consume more than half of their working hours. For omnichannel retailers, the problem is acute: every channel captures customer interactions using a different primary key, a different schema, and a different cadence. The consequence is not merely wasted analyst time — it is a fragmented understanding of the customer that directly harms personalisation, churn prediction, and revenue attribution.
This article examines how one retailer replaced its weekly manual reconciliation with a fully automated, 47-minute analytics pipeline. The solution combined Airbyte for change-data capture ingestion, BigQuery as the central warehouse, dbt for layered transformation with deterministic identity resolution, Airflow for orchestration, and Elementary for data-quality monitoring. Within six weeks of launch, the CRM team adopted the output mart as their single source of truth, and a downstream ML churn model improved precision by 23% after training on the unified 18-month history.
The Omnichannel Data Challenge
Customer 360: Definition and Business Value
A Customer 360 view is a unified, comprehensive representation of a customer that aggregates data from every touchpoint — in-store purchases, online browsing behaviour, email engagement, loyalty transactions, support interactions, and wholesale orders — into a single coherent profile. In practice, it means that when a customer service agent looks up an account, they see not just the last POS transaction but the full relationship: the abandoned cart from two weeks ago, the loyalty points balance, the last email open, and the net lifetime value across all channels.
The business value of a reliable Customer 360 is well documented. Retailers with strong cross-channel data integration report 15–20% higher customer retention rates and 10–15% improvements in marketing-ROI attribution, according to multiple industry benchmarks. However, achieving this view requires solving two interrelated problems: first, reliably ingesting data from heterogeneous source systems at near-real-time cadence; and second, resolving identity across those systems when no single identifier is common to all.
The Identifier Fragmentation Problem
In the case studied here, the six source systems keyed customers on markedly different identifiers, with no single identifier shared by every system:
| Source System | Primary Customer Key | Secondary Identifiers | Data Cadence |
| --- | --- | --- | --- |
| POS (in-store) | Email address | Phone, loyalty card | Log-based CDC, 45-min sync |
| E-commerce platform | UUID (anonymous) | Email (post-login) | Incremental API, 45-min sync |
| Loyalty programme | Card number (numeric) | Email, phone | 45-min Airbyte sync |
| Email marketing | Email address | Device ID | REST API, 45-min sync |
| Call centre CRM | Phone number | Email, case ID | REST API, 45-min sync |
| Wholesale portal | Account code | Billing email | Log-based CDC, 45-min sync |
The critical insight is that no single identifier is reliably captured in all six systems. Email is dependable in only four of the six; phone in three; loyalty card in two. Any join strategy must therefore operate across multiple candidate keys and handle one-to-many and many-to-many relationships. A customer who shops in-store with a loyalty card, later logs into e-commerce with email, and then calls the call centre from a new phone number may appear as three separate entities unless the identity-resolution logic is sufficiently sophisticated.
The Cost of Inaction
Before the project, the retailer's analytics team spent an estimated 60% of its time on data reconciliation and quality remediation rather than actual analysis. A Monday morning task — the weekly customer master reconcile — consumed four analyst-hours every week, yielding a static CSV that was stale within 24 hours. Because the data was never unified in a persistent warehouse, downstream consumers (the CRM team, the marketing analytics pod, and the data-science team training churn models) each maintained their own ad-hoc copies with slightly different join logic. This led to conflicting customer counts, duplicated marketing sends, and an ML model that was training on incomplete histories, resulting in lower-than-expected churn-prediction accuracy.
Architecture and Implementation
Airbyte CDC Ingestion into BigQuery
The ingestion layer was built on Airbyte, an open-source data integration platform that supports over 350 connectors. Airbyte was selected for three reasons: its native BigQuery destination, its support for Change Data Capture (CDC) via database log replication, and its self-hosted deployment option (important for data-residency requirements in retail).
Each of the six source systems was connected to Airbyte via its respective connector. For the POS system and the wholesale portal — which expose data through PostgreSQL read replicas — Airbyte used log-based CDC to capture inserts, updates, and deletes in near-real-time. For the e-commerce platform, an incremental API sync with a 45-minute polling interval sufficed. The email marketing tool and the call-centre CRM were ingested via their respective REST APIs on the same 45-minute schedule.
Airbyte wrote all extracted data into a dedicated dataset in BigQuery named raw. Each source system received its own set of tables prefixed by the source name (for example, raw.pos_transactions, raw.ecommerce_events, raw.loyalty_members). Critically, no transformation was applied at this stage — the raw layer preserved source schemas exactly as delivered, providing a full audit trail and enabling re-processing if downstream logic changed. The Airbyte sync schedule was configured to run every 45 minutes, providing a maximum ingestion lag of 45 minutes per source.
dbt Medallion Architecture: Staging, Intermediate, and Mart Layers
With raw data landing in BigQuery every 45 minutes, the transformation layer — built entirely in dbt Core — was responsible for turning heterogeneous source tables into analytics-ready marts. The project adopted the medallion architecture, a three-tier pattern that has become a de facto standard in modern analytics engineering.
The medallion architecture organises transformations into three progressively refined layers:
Staging Layer (Bronze to Silver)
Staging models perform minimal, deterministic transformations on the raw tables. Each staging model corresponds to exactly one raw table and applies standardisation, renaming, type casting, and basic null handling. For example, stg_pos_transactions normalised column names from the POS system's PascalCase convention to snake_case, cast transaction timestamps to BigQuery TIMESTAMP type, and filtered out voided transactions. Staging models never join across sources — they exist solely to create a clean, predictable interface to the raw data.
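To make the staging pattern concrete, here is a minimal sketch of what stg_pos_transactions might look like as a dbt model. It assumes a source named raw with a pos_transactions table declared in a sources.yml file, and the PascalCase column names are illustrative guesses rather than the POS system's actual schema:

```sql
-- models/staging/stg_pos_transactions.sql (illustrative sketch)
-- One staging model per raw table: rename, cast, and filter, but never join.

with source as (

    select * from {{ source('raw', 'pos_transactions') }}

),

renamed as (

    select
        -- rename the POS system's PascalCase columns to snake_case
        cast(TransactionId as string)           as transaction_id,
        lower(trim(CustomerEmail))              as customer_email,
        cast(LoyaltyCardNumber as string)       as loyalty_card_number,
        cast(TransactionTimestamp as timestamp) as transaction_at,
        cast(TotalAmount as numeric)            as total_amount,
        cast(IsVoided as bool)                  as is_voided
    from source

)

-- exclude voided transactions so downstream models only see valid sales
select * from renamed
where not is_voided
```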
Intermediate Layer (Silver to Gold)
The intermediate layer is where the most critical logic resides. The centrepiece of the intermediate layer was the identity-resolution model, int_identity_resolution, which took all staging tables and produced a canonical customer mapping. This model will be discussed in depth in the next section. Beyond identity resolution, intermediate models also included enrichment steps: for example, int_order_summary aggregated per-customer order metrics across POS and e-commerce, and int_engagement_scores computed a recency-frequency-monetary (RFM) score from all interaction data.
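The shape of such an enrichment model can be sketched briefly. This assumes a hypothetical upstream model, here called int_orders_unioned, that has already attached canonical_customer_id to each POS and e-commerce order via the identity-resolution output; only the name int_order_summary and the RFM intent come from the project as described:

```sql
-- models/intermediate/int_order_summary.sql (illustrative sketch)
-- Per-customer order metrics; the recency, frequency, and monetary
-- components feed the downstream RFM scoring model.

select
    canonical_customer_id,
    count(*)                       as order_count,        -- frequency
    sum(total_amount)              as lifetime_value,     -- monetary
    max(ordered_at)                as last_order_at,      -- recency
    date_diff(current_date(), date(max(ordered_at)), day)
                                   as days_since_last_order,
    countif(channel = 'pos')       as pos_order_count,
    countif(channel = 'ecommerce') as ecommerce_order_count
from {{ ref('int_orders_unioned') }}
group by canonical_customer_id
```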
Mart Layer (Gold)
Mart models are the final, consumer-facing layer. They are designed around business entities and use-case access patterns. The primary output was mart_customer_360, a wide, denormalised table containing one row per canonical customer with columns for lifetime value, channel preference, loyalty tier, last interaction date, churn risk score, and all resolved identifiers. Additional marts included mart_marketing_segments (for campaign targeting) and mart_product_affinity (for recommendation engine training).
A key architectural decision was to build marts as materialised views in BigQuery rather than physical tables, allowing downstream BI tools to query the latest data without requiring a separate refresh step. For the marts that fed the ML pipeline, physical tables refreshed on each 45-minute pipeline run were used to ensure consistent training snapshots.
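For context, the statement below shows the general pattern of materialising a mart as a BigQuery materialized view. The view name, source table, and columns are invented for illustration; the real mart_customer_360 is far wider and, depending on its joins, may need to remain a standard table or view under BigQuery's materialized-view restrictions:

```sql
-- Illustrative DDL only; object and column names are assumptions.
CREATE MATERIALIZED VIEW IF NOT EXISTS marts.mart_customer_daily_activity AS
SELECT
  canonical_customer_id,
  event_date,
  COUNT(*)          AS interaction_count,
  SUM(order_amount) AS daily_spend
FROM intermediate.int_customer_events
GROUP BY canonical_customer_id, event_date;
```

BigQuery keeps the materialized view fresh automatically as the base table changes, which is what removes the separate refresh step mentioned above.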
Deterministic Identity Resolution in dbt
Identity resolution — the process of linking records that refer to the same real-world customer across different source systems — is the hardest technical challenge in building a Customer 360. The retailer chose a deterministic approach rather than a probabilistic one, prioritising transparency, auditability, and regulatory compliance.
Deterministic vs. Probabilistic Matching
Identity resolution broadly falls into two categories. Deterministic matching uses exact or near-exact matches on shared identifiers such as email addresses, phone numbers, or loyalty card numbers. It achieves accuracy rates of 70–80% and is fully auditable: for any canonical customer ID, you can trace back to the exact rule that assigned it. Probabilistic matching, by contrast, uses statistical models and fuzzy-matching algorithms (Jaro–Winkler distance on names, device-fingerprint correlation, IP-address clustering) to infer identity when no shared identifier exists. It increases match rates to 85–95% but introduces false positives and is harder to explain to regulators.
For this retailer, deterministic matching was the right starting point. The data contained high-quality email and phone fields in most systems, and the compliance team required that every identity link be traceable to a specific rule. The 94.7% match rate achieved on first run demonstrated that the available identifiers were sufficient for the vast majority of customers.
The Resolution Algorithm
The identity-resolution model in dbt followed a priority-ordered merge strategy with four passes:
Pass 1 — Email exact match: All records sharing the same lowercased, trimmed email address were grouped. If a customer existed in multiple sources with the same email, the system with the highest priority (POS > E-commerce > Loyalty > CRM > Wholesale > Email marketing) contributed the 'surviving' record attributes.
Pass 2 — Phone exact match: Records not matched in Pass 1 were grouped by normalised phone number (E.164 format). This captured customers who shopped in-store (where email may not have been collected) and later interacted via the call centre.
Pass 3 — Loyalty card cross-reference: Records still unmatched were linked via loyalty card number, which appeared in both the POS and loyalty programme systems. Because a loyalty card is a strong identifier (physically presented at checkout), these matches were treated as high confidence.
Pass 4 — Graph-based transitive closure: After the first three passes, the model performed a graph-based transitive closure. For example, if Customer A (from POS) was linked to Customer B (from e-commerce) via email, and Customer B was linked to Customer C (from loyalty) via phone, then A, B, and C were all assigned the same canonical ID. This step was implemented using a BigQuery SQL iterative CTE, which resolved connected components in the identity graph.
The output of the model was a mapping table, map_canonical_customer, with columns for every source-specific identifier and a single canonical_customer_id. This table was materialised as a physical table in BigQuery and refreshed on every dbt run. Downstream mart models joined on canonical_customer_id, abstracting away the complexity of the underlying identifier logic.
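To illustrate the survivorship logic, here is a minimal sketch of Pass 1 in dbt-style SQL. The staging model names, column names, and the hash-based canonical ID are assumptions made for illustration; the priority ordering follows the article, and Passes 2 to 4 are omitted:

```sql
-- models/intermediate/int_identity_resolution.sql (Pass 1 only, illustrative)
-- Each source contributes candidate records with a survivorship priority
-- (POS > E-commerce > Loyalty > ...); records sharing a normalised email are
-- grouped, and the highest-priority source supplies the surviving record.

with candidates as (

    select
        1 as source_priority,
        'pos' as source_system,
        cast(pos_customer_id as string) as source_customer_id,   -- column name is an assumption
        lower(trim(email)) as email
    from {{ ref('stg_pos_customers') }}

    union all

    select
        2,
        'ecommerce',
        cast(ecommerce_uuid as string),
        lower(trim(email))
    from {{ ref('stg_ecommerce_customers') }}

    union all

    select
        3,
        'loyalty',
        cast(loyalty_card_number as string),
        lower(trim(email))
    from {{ ref('stg_loyalty_members') }}

    -- the call-centre CRM, wholesale, and email-marketing sources follow the same pattern

),

pass_1_email as (

    select
        c.*,
        first_value(source_customer_id) over (
            partition by email order by source_priority
        ) as surviving_record_id
    from candidates as c
    where email is not null

)

select
    -- deriving the canonical ID by hashing the surviving record is an
    -- illustrative choice, not necessarily what the project did
    to_hex(md5(surviving_record_id)) as canonical_customer_id,
    source_system,
    source_customer_id,
    email
from pass_1_email
```

Passes 2 and 3 repeat the same pattern on normalised phone numbers and loyalty card numbers for records left unmatched, and Pass 4 then merges any groups that still share an identifier.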
BigQuery Warehouse Design Considerations
BigQuery was selected as the central data warehouse for its serverless architecture, its native support for nested and repeated fields (useful for event-level data from e-commerce), and its tight integration with both dbt and Airbyte. Several design decisions were critical to the project's success:
Dataset organisation: Data was organised into four datasets: raw (Airbyte landing zone), staging (dbt staging models), intermediate (dbt business logic), and marts (consumer-facing tables). This separation enforced layer boundaries and simplified access control.
Partitioning and clustering: Fact tables (transactions, events) were partitioned by event_date and clustered by canonical_customer_id, reducing query costs by approximately 60% for common analytical queries that filter on customer and date range.
Cost management: BigQuery slot reservations were configured for the transformation workloads (which run predictably every 45 minutes), while on-demand pricing was used for ad-hoc analyst queries. This hybrid model kept the monthly BigQuery bill under budget while maintaining sub-hourly freshness.
Incremental models: Several dbt models were implemented as incremental models using dbt's insert_overwrite strategy, which replaces only the affected BigQuery partitions on each run. Rather than reprocessing the full 18-month history on every run, incremental models processed only the partitions that had changed since the last run, reducing average dbt run time from 38 minutes to under 12 minutes (a configuration sketch follows this list).
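The sketch below shows how the partitioning, clustering, and incremental settings combine in a single dbt model. The model name, columns, and the three-day lookback window are assumptions; the partition_by, cluster_by, and insert_overwrite options are standard dbt-bigquery configuration:

```sql
-- models/intermediate/int_customer_events.sql (illustrative sketch)
-- Partitioned by event_date and clustered by canonical_customer_id; rebuilt
-- incrementally so only recently changed partitions are reprocessed each run.
{{ config(
    materialized = 'incremental',
    incremental_strategy = 'insert_overwrite',
    partition_by = {'field': 'event_date', 'data_type': 'date'},
    cluster_by = ['canonical_customer_id']
) }}

select
    canonical_customer_id,
    event_date,
    channel,
    event_type,
    revenue
from {{ ref('stg_ecommerce_events') }}

{% if is_incremental() %}
  -- on incremental runs, only rebuild the last three days of partitions;
  -- the three-day lookback window is an illustrative choice
  where event_date >= date_sub(current_date(), interval 3 day)
{% endif %}
```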
Data Quality Monitoring with Elementary
Even the best pipeline can drift. Source systems change schemas, API credentials expire, and upstream data-quality regressions silently corrupt downstream marts if undetected. To address this, the project integrated Elementary, an open-source data-observability platform built natively for dbt.
Elementary was configured to monitor three categories of tests on every dbt model:
Freshness tests: Elementary tracked the maximum age of data in each source table and each mart. If any source failed to update within its expected 45-minute window, Elementary sent a Slack alert to the data engineering channel. This caught several incidents where the POS replica fell behind due to replication lag.
Row-count anomaly detection: Elementary profiled the row counts of every model over a rolling 14-day window and flagged deviations exceeding two standard deviations. When the e-commerce connector began returning zero rows due to an API rate-limit change, Elementary detected the anomaly within one pipeline cycle.
Custom schema tests: dbt's built-in testing framework was extended with custom tests for the identity-resolution model, including a uniqueness test on canonical_customer_id (to catch duplicate assignments) and a referential-integrity test ensuring every canonical ID in the mart had at least one source-system record backing it; a sketch of such a test follows below.
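As an illustration of the second custom test, the referential-integrity check can be expressed as a dbt singular test: a SQL file whose returned rows are treated as failures. The file name is invented; the model and column names follow the article:

```sql
-- tests/assert_canonical_ids_have_source_records.sql (illustrative singular test)
-- Fails if any canonical customer in the mart has no backing record in the
-- identity-resolution mapping table; dbt treats returned rows as failures.
select
    m.canonical_customer_id
from {{ ref('mart_customer_360') }} as m
left join {{ ref('map_canonical_customer') }} as map
    on m.canonical_customer_id = map.canonical_customer_id
where map.canonical_customer_id is null
```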
The combination of these tests provided a safety net that allowed the team to reduce manual spot-checking from a daily activity to a weekly review of Elementary's automated weekly report.
Orchestration with Airflow
Apache Airflow served as the pipeline orchestrator, managing the dependency graph between Airbyte syncs, dbt model runs, and Elementary test execution. The DAG was configured as follows:
| Step | Task | Schedule | Typical Duration |
| --- | --- | --- | --- |
| 1 | Airbyte sync (6 sources) | Every 45 min | |
| 2 | dbt run (staging → intermediate → mart) | After step 1 | Under 12 min |
| 3 | Elementary test + alert | After step 2 | |
| 4 | ML feature store refresh | After step 3 | |
The total end-to-end latency — from a new transaction appearing at the POS to that customer's updated profile landing in the mart — was 47 minutes at the 95th percentile. This sub-hourly cadence was a dramatic improvement over the previous weekly batch process and was fast enough for the CRM team's operational needs.
ML Model Improvements from Unified Data
One of the most impactful downstream effects of the Customer 360 mart was on the retailer's machine-learning churn-prediction model. Prior to the project, the model was trained on POS data alone — a six-month history of in-store transactions — because joining in e-commerce and loyalty data required a manual, one-off extract that was too fragile to automate.
With the unified mart, the data-science team gained access to 18 months of cross-channel history for every customer. They retrained the XGBoost churn model using features derived from the mart, including cross-channel engagement recency, loyalty-tier migration patterns, and e-commerce browse-to-purchase conversion rates. The retrained model achieved a 23% improvement in precision at the same recall threshold, which translated directly into more accurate churn-intervention targeting and an estimated annual savings of $180,000 in misdirected retention offers.
Limitations and Considerations
While the results were strong, several limitations and trade-offs should be acknowledged:
Deterministic matching ceiling: The 94.7% match rate is excellent for a first pass, but the remaining 5.3% of customers — those who interact through only one channel with no shared identifier — represent a blind spot. A future phase incorporating probabilistic matching (device fingerprinting, IP clustering) could raise the match rate to 98–99%, but would introduce complexity in auditability and regulatory compliance.
Real-time latency: A 47-minute pipeline is sub-hourly but not real-time. For use cases requiring true real-time personalisation (such as website product recommendations during an active session), a supplementary streaming layer (e.g., Pub/Sub to BigQuery streaming inserts) would be needed. The current architecture is optimised for operational analytics and ML feature serving rather than in-session decisioning.
Single-region dependency: Both BigQuery and Airbyte were deployed in a single GCP region. For global retailers with multi-region data-residency requirements, a more complex deployment with regional data zones and a centralised identity graph would be necessary.
Schema evolution risk: Although Elementary monitors for freshness and row-count anomalies, it does not automatically detect semantic schema changes (for example, a source system changing the definition of 'revenue' from gross to net). The team mitigates this with manual quarterly source-system reviews and dbt column-level documentation in dbt docs.
Scalability of the identity graph: The transitive-closure algorithm runs in BigQuery SQL using iterative CTEs. For the current customer base of approximately 2.3 million records, this runs in under two minutes. However, as the customer base grows beyond 10 million, the quadratic complexity of pairwise comparisons may require a move to a dedicated graph-processing engine such as Apache Spark GraphFrames or a purpose-built identity-resolution service.
Conclusion and Future Implications
This case study demonstrates that a modern data stack — Airbyte for ingestion, BigQuery for storage and compute, dbt for transformation, Airflow for orchestration, and Elementary for observability — can replace a four-hour weekly manual task with a fully automated, 47-minute pipeline. The key architectural decisions that drove this success were:
Adopting the medallion architecture to create clear separation of concerns between raw ingestion, business logic, and consumer-facing marts.
Implementing deterministic identity resolution with a transparent, rule-based algorithm that is fully auditable and compliant with data-governance requirements.
Using incremental dbt models to keep pipeline runtimes manageable as the historical dataset grows.
Layering Elementary's data-observability tests on top of dbt's native testing framework to catch regressions proactively rather than reactively.
The business outcomes speak for themselves: four analyst-hours per week recovered, a CRM team that adopted the mart as their single source of truth within six weeks, and a churn model that improved precision by 23% when retrained on the unified dataset. These results underscore a broader trend in analytics engineering: the shift from building data pipelines that merely move data to building data products that are reliable, timely, and directly consumable by business stakeholders.
Looking ahead, the retailer plans to extend the architecture in three directions. First, a real-time streaming layer using Google Pub/Sub and BigQuery streaming inserts will enable sub-minute freshness for in-session personalisation. Second, a probabilistic identity-resolution module (using device fingerprinting and behavioural similarity scoring) will attempt to close the remaining 5.3% identity gap. Third, the dbt project will be extended with a data-products catalogue, allowing business teams to discover and request new marts through a self-service portal.
For organisations facing similar omnichannel data fragmentation, the lesson is clear: the technology to build a unified Customer 360 is mature, open-source, and deployable in weeks rather than months. The harder challenge — and the one that determines success or failure — is establishing clear data contracts between source systems, designing identity-resolution logic that is both accurate and auditable, and investing in data-quality monitoring from day one.