Modern Marketing Data Stack with dbt + BigQuery

In 2026, marketing teams don't wrestle with data—they make decisions from it. GA4, Meta Ads, Google Ads, CRM, CDP, server-side GTM—each dumps into its own table. The team manually stitches data in spreadsheets, numbers shift every week, nobody trusts the source. This chaos vanishes with a modern data stack: BigQuery as the warehouse, dbt as the transformation layer, semantic layer as the metric graph. You version control your code, test every change, and metrics flow from a single source of truth. This article shows how dbt + BigQuery turns a marketing data pipeline into production-grade infrastructure.

Source mapping: Standardizing raw data paths

dbt's first job is source mapping—fitting raw data from different systems into the same schema. In BigQuery, analytics_123456.events_* comes from GA4, facebook_ads.ads_insights from the Meta API, crm.transactions from Shopify. Each has a different timestamp format, different user identifier, different currency column. You define these raw tables in dbt's sources.yml:

version: 2
sources:
  - name: ga4
    database: analytics_123456
    tables:
      - name: events_
        identifier: "events_*"
        loaded_at_field: event_timestamp
  - name: meta_ads
    database: facebook_ads
    schema: public
    tables:
      - name: ads_insights
        loaded_at_field: date_start

This definition tells dbt: "These are upstream sources, I don't own them, but I'll test their freshness." Running dbt source freshness checks when the last data arrived—if Meta API is delayed, an alert fires. Without source mapping, every model writes SELECT * FROM analytics_123456.events_20260614 directly; when the table name changes, 40 models break. With mapping, the reference becomes {{ source('ga4', 'events_') }}, and changes propagate from a single point.

GA4 event_timestamp is Unix microseconds, Meta ads date_start is an ISO string, CRM created_at is UTC datetime—all different formats. In source mapping, you extract a standard timestamp column: TIMESTAMP_MICROS(event_timestamp) AS event_time for GA4, PARSE_TIMESTAMP('%Y-%m-%d', date_start) AS event_time for Meta. This normalization feeds clean input downstream.

Modeling layer: Staging, intermediate, mart

dbt's power lies in layered modeling—staging, intermediate, and mart tiers. Staging models pull 1:1 from sources, only renaming and casting types. stg_ga4_events.sql:

SELECT
  TIMESTAMP_MICROS(event_timestamp) AS event_time,
  user_pseudo_id AS anonymous_id,
  event_name,
  (SELECT value.string_value FROM UNNEST(event_params) WHERE key = 'session_id') AS session_id,
  geo.country,
  device.category AS device_category
FROM {{ source('ga4', 'events_') }}
WHERE _TABLE_SUFFIX BETWEEN FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 90 DAY))
  AND FORMAT_DATE('%Y%m%d', CURRENT_DATE())

Staging delivers clean data with no business logic. Intermediate models layer in business logic: sessionization, attribution, funnel steps. int_sessions.sql aggregates GA4 events to session grain:

WITH session_events AS (
  SELECT
    session_id,
    MIN(event_time) AS session_start,
    MAX(event_time) AS session_end,
    COUNT(DISTINCT CASE WHEN event_name = 'page_view' THEN event_time END) AS pageviews,
    MAX(CASE WHEN event_name = 'purchase' THEN 1 ELSE 0 END) AS converted
  FROM {{ ref('stg_ga4_events') }}
  GROUP BY session_id
)
SELECT
  *,
  TIMESTAMP_DIFF(session_end, session_start, SECOND) AS duration_seconds
FROM session_events

Mart models are the final consumption layer—BI tools, Looker, internal dashboards query here. fct_marketing_performance.sql unifies all channels, calculates spend + revenue + ROAS. Each mart focuses on one business entity: dim_customers, fct_orders, fct_sessions. Mart naming convention is critical—dim_ for dimensions (customer, product), fct_ for facts (transaction, event), rpt_ for report aggregates.

Semantic layer: KPI definitions as code

The semantic layer pulls metric definitions into dbt—"what is revenue," "how is CAC calculated"—no longer a spreadsheet, now YAML in your codebase. With dbt v1.6+, you build your metric graph in metrics.yml:

version: 2
metrics:
  - name: revenue
    label: Revenue
    model: ref('fct_orders')
    calculation_method: sum
    expression: order_amount
    timestamp: order_date
    time_grains: [day, week, month, quarter]
    dimensions:
      - channel
      - country
      - device_category

  - name: cac
    label: Customer Acquisition Cost
    calculation_method: derived
    expression: "{{ metric('ad_spend') }} / {{ metric('new_customers') }}"
    timestamp: acquisition_date
    time_grains: [month, quarter]

With a semantic layer, your BI tool doesn't calculate CAC—dbt does. When Looker asks for CAC, dbt returns compiled SQL that joins spend and new customer tables, then divides. The definition is code, tracked in git history—"who changed the CAC formula and why" has an answer. The spreadsheet formula doesn't vanish; it has version control.

At Roibase, semantic layer setup is part of data analytics & insights engineering scope—not just metric definitions, but KPI tree mapping, dimension hierarchy, and grain standardization. For example, "revenue" is the sum of fct_orders.order_amount, but "recognized_revenue" filters the same table by recognized_at timestamp (SaaS subscription model). One table, two metrics, different business logic.

Exposures: Making downstream dependencies visible

An exposure answers dbt's "who uses this model" question. If a Looker dashboard queries fct_marketing_performance, you define it in exposures.yml:

version: 2
exposures:
  - name: marketing_dashboard
    type: dashboard
    maturity: high
    owner:
      name: Growth Team
      email: [email protected]
    depends_on:
      - ref('fct_marketing_performance')
      - ref('dim_customers')
    description: "Executive marketing dashboard — daily refresh, 90-day rolling window"
    url: https://looker.company.com/dashboards/123

Without exposure definitions, when you change fct_marketing_performance, you don't know which dashboard breaks. Looker shows zero metrics, you debug for 2 hours. With exposures, running dbt compile --select +exposure:marketing_dashboard shows all upstream models—impact analysis before you change anything.

Exposures aren't just BI tools—they include reverse ETL (Hightouch, Census):

exposures:
  - name: meta_capi_sync
    type: application
    maturity: high
    depends_on:
      - ref('dim_customers')
    description: "Meta Conversion API — incremental customer events, 5-minute delay"

This signals: "If you change dim_customers, you'll break the schema sent to Meta CAPI." In production, model update → CAPI sync error → attribution data loss is prevented by early warning.

Production pipeline: Incremental builds and test coverage

dbt in production doesn't run full refresh every day—it uses incremental models. fct_orders.sql reprocesses only the last 3 days:

{{ config(
    materialized='incremental',
    unique_key='order_id',
    partition_by={'field': 'order_date', 'data_type': 'date'},
    cluster_by=['customer_id', 'channel']
) }}

SELECT
  order_id,
  customer_id,
  order_date,
  order_amount,
  channel
FROM {{ ref('stg_shopify_orders') }}

{% if is_incremental() %}
WHERE order_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 3 DAY)
{% endif %}

Incremental builds cut BigQuery costs by 90%—scanning 50GB instead of 2TB. Partition + cluster boosts query performance: a WHERE customer_id = 'X' query hits only the relevant cluster, no full scan.

Test coverage is critical. In dbt's schema.yml, you write tests for every model:

models:
  - name: fct_orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: order_amount
        tests:
          - not_null
          - dbt_utils.expression_is_true:
              expression: ">= 0"
      - name: order_date
        tests:
          - dbt_utils.recency:
              datepart: day
              interval: 7

dbt test asserts these conditions in BigQuery—if order_amount goes negative, the build fails. In production, every commit runs through CI/CD: dbt run --select state:modified+ → dbt test --select state:modified+. Modified models and downstream dependencies run and test; if all passes, merge is allowed.

Orchestration: Airflow, Prefect, dbt Cloud

dbt itself isn't an orchestrator—Airflow or Prefect schedules it. Example Airflow DAG:

from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator
from airflow.operators.bash import BashOperator

dbt_run = BashOperator(
    task_id='dbt_run',
    bash_command='cd /opt/dbt && dbt run --profiles-dir .',
    dag=dag
)

dbt_test = BashOperator(
    task_id='dbt_test',
    bash_command='cd /opt/dbt && dbt test',
    dag=dag
)

dbt_run >> dbt_test

dbt Cloud is the alternative—managed orchestration, Web IDE, Slack alerts. Most enterprises prefer Airflow because there are tasks beyond dbt: upstream API pulls, downstream reverse ETL, snapshot tables.

Schedule strategy ties to data freshness. GA4 events have 24-hour processing lag (processing_date ≠ event_date); Meta Ads Insights API isn't real-time. Staging models trigger on source freshness—when GA4 sends a new partition, stg_ga4_events refreshes, cascading through intermediate → mart. An Airflow sensor checks BigQuery's _TABLE_SUFFIX:

wait_for_ga4 = BigQueryTableExistenceSensor(
    task_id='wait_for_ga4_partition',
    project_id='analytics_123456',
    dataset_id='events_',
    table_id=f"events_{yesterday.strftime('%Y%m%d')}",
    poke_interval=300
)

Once the partition is ready, the dbt chain triggers. This pattern solves late-arriving data—instead of blocking on API delays, the pipeline waits.

Tradeoffs: What dbt doesn't solve

dbt is a transformation engine, not a data loader. Who pulls data into BigQuery? Fivetran, Airbyte, custom Python scripts. dbt assumes raw data is already there in its source definitions. The pattern is ELT: Extract-Load-Transform. Different from ETL—the Transform happens inside the warehouse. dbt handles the T; EL is a separate toolchain.

dbt doesn't support real-time streaming. Kafka → BigQuery streaming insert → dbt incremental model chain introduces minute-level latency. Sub-second use cases (fraud detection, dynamic pricing) need stream processors—Flink, Spark Structured Streaming, or Materialize. dbt isn't the tool.

dbt's Python model support (v1.3+) is limited. You can do pandas dataframe manipulation:

def model(dbt, session):
    df = dbt.ref('stg_orders').to_pandas()
    df['log_amount'] = np.log1p(df['order_amount'])
    return df

But you're not training scikit-learn models here. BigQuery compute is expensive, Python runtime overhead is high. Complex transformations run faster in SQL. Feature engineering lives in dbt; model training happens in Vertex AI; inference in BigQuery ML. That's the pattern.

What to do now

If your marketing data still lives in manually stitched spreadsheets, the first move is setting up raw data flow to BigQuery. GA4 export, Meta/Google Ads API connectors (Fivetran, Supermetrics), CRM webhooks → BigQuery streaming insert. Once raw data is there, you create a dbt repository: staging models for source mapping, intermediate models for sessionization and attribution, mart models for final KPIs. In the first two weeks, fct_sessions and fct_orders are enough—dashboards point here, metrics stabilize. The semantic layer lands in week three, exposure mapping in week four. By week six, your production pipeline is git-controlled, test-covered, and incremental-optimized. Spreadsheets become read-only archives.