Marketing teams still say, "I can't tell how our campaign performed without checking the dashboard." Analysts write new SQL for every question. The CFO can't understand why CAC differs in every report. The problem isn't technical—you have a pipeline, sources are connected, data flows. The problem is architecture: there's no definition layer between raw tables and the dashboard. The dbt + BigQuery combination solves this: through source mapping, modeling layers, semantic layers, and exposures, you standardize data at the logic layer, not the visual layer.
Source Mapping: Binding Raw Data to Contract
Data flows into BigQuery from CRM, GA4, Meta Ads, Klaviyo. Every source uses different schemas, naming conventions, timestamp formats. dbt source mapping lets you declare these sources as code and test them. In a sources.yml file, you declare every table, set freshness controls, and test unique constraints.
Example source definition:
version: 2
sources:
- name: raw_ga4
database: analytics_lake
schema: raw_ga4_events
tables:
- name: events
freshness:
warn_after: {count: 6, period: hour}
error_after: {count: 12, period: hour}
columns:
- name: event_timestamp
tests:
- not_null
- name: user_pseudo_id
tests:
- not_null
This definition creates a contract: "Warn if GA4 events don't arrive in 6 hours, fail if they don't arrive in 12 hours." In production, this test ties to CI/CD—you catch source issues instantly. dbt docs auto-generate a lineage graph showing which dashboard depends on which source.
Without source mapping, analysts start with SELECT * FROM analytics_lake.raw_ga4_events.events. They don't know what each column means, there are no tests, no documentation. With dbt, you reference the source: {{ source('raw_ga4', 'events') }}. If the table name changes, you update it in one place—all downstream models adjust automatically.
Modeling Layer: Staging, Intermediate, Mart
dbt's power lies in modeling layers. You separate into three levels: staging (normalize raw format), intermediate (apply business logic), mart (final metric tables).
Staging layer: One 1:1 model per source. Only data type conversion, column naming, timestamp to UTC. No business logic.
-- models/staging/stg_ga4__events.sql
WITH source AS (
SELECT * FROM {{ source('raw_ga4', 'events') }}
)
SELECT
TIMESTAMP_MICROS(event_timestamp) AS event_at,
user_pseudo_id AS user_id,
event_name,
(SELECT value.string_value FROM UNNEST(event_params) WHERE key = 'page_location') AS page_url
FROM source
WHERE event_date >= CURRENT_DATE() - 90
Intermediate layer: Apply business logic. Define sessions, map product categories, apply attribution windows. These models don't go to end users—they feed downstream models.
-- models/intermediate/int_sessions.sql
WITH events AS (
SELECT * FROM {{ ref('stg_ga4__events') }}
),
session_windows AS (
SELECT
user_id,
event_at,
SUM(CASE WHEN TIMESTAMP_DIFF(event_at, LAG(event_at) OVER (PARTITION BY user_id ORDER BY event_at), MINUTE) > 30 THEN 1 ELSE 0 END)
OVER (PARTITION BY user_id ORDER BY event_at) AS session_index
FROM events
)
SELECT
user_id,
session_index,
MIN(event_at) AS session_start_at,
MAX(event_at) AS session_end_at,
COUNT(*) AS event_count
FROM session_windows
GROUP BY 1, 2
Mart layer: Final metric tables. These connect to dashboards, BI tools, Looker. Use fct_ (fact) or dim_ (dimension) prefixes.
-- models/marts/fct_daily_channel_performance.sql
SELECT
DATE(session_start_at) AS date,
traffic_source.medium AS channel,
COUNT(DISTINCT user_id) AS users,
SUM(revenue) AS revenue,
SAFE_DIVIDE(SUM(revenue), COUNT(DISTINCT user_id)) AS revenue_per_user
FROM {{ ref('int_sessions') }}
LEFT JOIN {{ ref('int_transactions') }} USING (user_id, session_index)
GROUP BY 1, 2
With this structure, analysts use fct_daily_channel_performance without touching staging or intermediate logic. When metric definitions change, you update one place—all dashboards stay consistent.
Semantic Layer: Code Your Metric Definitions
In BigQuery + dbt, "semantic layer" has two approaches: dbt metrics (deprecated 2023) or dbt semantic models (newer). A semantic model defines metrics in YAML, abstracting them from SQL. Tools like Looker, Tableau, and Mode read these definitions, calculating CAC, LTV, ROAS consistently.
Example semantic model:
# models/marts/semantic_models.yml
semantic_models:
- name: channel_performance
model: ref('fct_daily_channel_performance')
dimensions:
- name: date
type: time
type_params:
time_granularity: day
- name: channel
type: categorical
measures:
- name: total_revenue
agg: sum
expr: revenue
- name: total_users
agg: count_distinct
expr: user_id
metrics:
- name: revenue_per_user
type: derived
type_params:
expr: total_revenue / total_users
metrics:
- total_revenue
- total_users
With this, "revenue per user" calculates the same everywhere. An analyst selects "RPU" in Looker, the backend fetches it from dbt's semantic layer—no manual SQL. If definitions change (say, excluding canceled orders), you update one place.
Without a semantic layer, every dashboard rewrites "revenue / users." One report excludes refunds, another includes them. The CMO sees two numbers, trust erodes. Pairing this with a First-Party Data & Measurement Architecture is critical for production—you define attribution, consent, and TCF signals using the same framework.
Exposures: Track Data's Final Destinations
A dbt exposure answers: "which dashboard, ML pipeline, or operational system does this model feed?" You define them in exposures.yml:
exposures:
- name: marketing_dashboard
type: dashboard
maturity: high
url: https://lookerstudio.google.com/reporting/abc123
description: "CMO daily channel performance dashboard"
depends_on:
- ref('fct_daily_channel_performance')
owner:
name: Marketing Analytics Team
email: [email protected]
Exposures provide two things: impact analysis (which dashboards break if I change this model?) and stakeholder mapping (who owns this dashboard, who do I escalate to?).
In production, exposures flow like this: dbt build → test fails → check lineage graph for affected exposures → auto-post to Slack → dashboard owner gets early warning. Instead of users asking "why is the dashboard empty?", the CI/CD system tells you first.
Without exposures, the data team deploys models blindly, unaware of impact. Exposures tag every model as "live in production, be careful."
Incremental Models and Partitioning: Cost + Performance
Full table scans in BigQuery are expensive. 1 TB of data = $5 per query, 10 queries daily = $50, monthly = $1,500. dbt incremental models process only new rows—past data stays immutable.
{{ config(
materialized='incremental',
unique_key='event_id',
partition_by={'field': 'event_at', 'data_type': 'timestamp', 'granularity': 'day'},
cluster_by=['user_id', 'event_name']
) }}
SELECT * FROM {{ ref('stg_ga4__events') }}
WHERE event_at >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 2 DAY)
{% if is_incremental() %}
AND event_at > (SELECT MAX(event_at) FROM {{ this }})
{% endif %}
This config optimizes: process only the last 2 days per run, old data untouched. partition_by enables BigQuery partition pruning, cluster_by increases query selectivity. Same dataset, 90% lower cost.
In production, incremental models + dbt snapshots implement SCD Type 2: track historical changes in dimension tables (user segment shifts, product category remapping). When an analyst asks "what segment was user X in last month," the snapshot provides consistent answers.
Production Pipeline: CI/CD, Tests, Alerts
Your dbt project lives on GitHub, every commit triggers a CI pipeline:
- Lint:
sqlfluffchecks SQL formatting - Test:
dbt testruns schema tests (not_null, unique, foreign_key) and data tests (revenue > 0, session_duration < 24h) - Build:
dbt build --select state:modified+rebuilds only changed models - Deploy: On merge to production, tables update in BigQuery
Test failures block merges. Example data test:
-- tests/assert_no_negative_revenue.sql
SELECT * FROM {{ ref('fct_daily_channel_performance') }}
WHERE revenue < 0
Passing means 0 rows returned; failing means negative revenue anomaly detected—pipeline halts.
Alert scenario: Schedule a dbt Cloud job (daily 6 AM), use on-run-end hooks to notify Slack:
on-run-end:
- "{{ post_to_slack_on_failure() }}"
Deploying this pipeline with Data Analytics & Insights Engineering takes 4–6 weeks: source mapping + staging + intermediate + mart + semantic model + exposures + tests + CI/CD.
Tradeoff: Complexity vs Control
The dbt + BigQuery stack has a steep learning curve. SQL knowledge isn't enough—you need Jinja templating, YAML config, Git workflows, CI/CD. For small teams (1–2 people), this overhead feels heavy; direct BigQuery views + Looker Studio starts faster.
But at scale (10+ dashboards, 50+ sources, 5+ analysts), control evaporates without dbt. Every analyst writes their own SQL, metrics contradict, no tests, no docs. dbt lets you pay down technical debt instead of accumulating it.
An alternative: Looker LookML for semantic layers. LookML resembles dbt (code-based metric definitions) but creates vendor lock-in and struggles with non-BigQuery sources. dbt is open-source, portable, and moves between BigQuery, Snowflake, and Redshift.
Modern marketing data stacks begin with source mapping, scale with semantic layers, and monitor with exposures. dbt + BigQuery codifies all three, making data testable, versioned, repeatable. You guarantee metric consistency without glancing at dashboards.