
GA4 Direct Traffic Is Lying to You: How to Diagnose and Reclaim Misattributed Sessions


Ashwani Bhasin

If direct traffic is your second-largest channel in GA4, it’s almost certainly hiding misattributed sessions from email, AI chatbots, app deep links, and broken UTMs. Most of what GA4 calls “direct” isn’t direct at all, and the gap between what you see and what actually happened is wider than most analytics managers want to admit.

I’ve audited direct traffic on roughly forty GA4 properties over the last two years. The pattern is consistent: between 35% and 60% of (direct)/(none) sessions had a knowable source within the prior 30 days. That source was lost somewhere along the request chain. The bigger the brand, the worse the leak, because bigger brands have more email volume, more paid campaigns with sloppy redirects, more SSO walls, and more inbound from AI assistants.

This post is the playbook I actually use when a client asks why their direct traffic doubled. No theory, no “consider implementing.” Just the nine causes, the BigQuery to quantify them, the GTM fix to stop the bleed, and the workflow to reclassify what you already lost.

The 9 real causes of (direct)/(none)

Before you fix anything, you need a working mental model of how sessions land in the Direct bucket. GA4 assigns (direct)/(none) when a session_start event arrives with no UTM parameters, no gclid/gbraid/wbraid/dclid, no document.referrer, and no recent campaign in the user’s session state. Any one of these can be stripped.
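The rule above can be sketched as a small predicate. This only illustrates the decision logic as described here, not GA4's internal implementation; the function name looksDirect and the hasRecentCampaign flag are hypothetical.

```javascript
// Hypothetical sketch of the rule described above; GA4's real logic is internal.
var CLICK_IDS = ['gclid', 'gbraid', 'wbraid', 'dclid'];

function looksDirect(pageUrl, referrer, hasRecentCampaign) {
  var params = new URL(pageUrl).searchParams;

  // Any non-empty utm_* parameter counts as a campaign signal.
  var hasUtm = false;
  params.forEach(function (value, key) {
    if (key.indexOf('utm_') === 0 && value) hasUtm = true;
  });

  var hasClickId = CLICK_IDS.some(function (k) { return params.has(k); });

  // A session is Direct only when every signal is missing.
  return !hasUtm && !hasClickId && !referrer && !hasRecentCampaign;
}
```

Reading it the other way around is the useful part: each of the four conditions is a separate place where a signal can be lost before the session starts.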

Here are the nine causes I see in the wild, ranked by how much damage they typically do:

| # | Cause | Typical impact on Direct % | Where it shows up |
| --- | --- | --- | --- |
| 1 | Untagged email, SMS, push, in-app notifications | 15–30% | Spikes correlated with send times |
| 2 | AI assistant referrer stripping (ChatGPT, Claude, Perplexity, Gemini) | 5–15% and growing | Sessions with no referrer hitting deep content URLs |
| 3 | HTTPS → HTTP downgrade | 2–8% on legacy sites | Referrer drops when a secure page links to an insecure one |
| 4 | Meta referrer-policy set to no-referrer or same-origin | Varies, can be huge | Audit the <meta> tag and HTTP headers |
| 5 | Redirect chains that drop query strings | 5–20% on paid traffic | Click trackers, vanity domains, shorteners |
| 6 | App-to-web transitions (iOS/Android in-app browsers) | 5–10% on mobile-heavy sites | Sessions from instagram.com, facebook.com showing as direct |
| 7 | Single-page app pushState bugs that fire page_view without preserving campaign params | 3–10% | Internal navigation overwriting session source |
| 8 | Overzealous referral exclusion list | 2–7% | Payment providers, SSO domains added “to be safe” |
| 9 | Login walls and intermediate auth flows | 2–5% | Sessions reset after Okta/Auth0 round-trips |

Most teams chase #1 and stop there. That leaves at least half the leak unaddressed.

Untagged email, SMS, and lifecycle messaging

The obvious one. Every ESP I’ve worked with — Klaviyo, Iterable, Braze, HubSpot, Mailchimp — has a UTM auto-tagging setting buried somewhere, and it’s frequently off, partially configured, or applied inconsistently across campaign types. Transactional emails are the worst offender because marketing rarely owns them.

Quick test: pull the last 30 days of (direct)/(none) traffic, group by hour-of-day and day-of-week, and overlay your email send schedule. If the correlation is obvious to the naked eye, you have an email tagging problem.
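If you want a number rather than an eyeball check, a rough sketch of the overlay could look like this. The function name and inputs are hypothetical: both arrays are assumed to hold epoch milliseconds exported from your own data, and the window length is yours to choose.

```javascript
// Share of direct sessions that start within `windowMinutes` after any send.
// Inputs are assumed to be arrays of epoch-millisecond timestamps.
function shareNearSends(sessionStartsMs, sendTimesMs, windowMinutes) {
  if (sessionStartsMs.length === 0) return 0;
  var windowMs = windowMinutes * 60 * 1000;

  var near = sessionStartsMs.filter(function (ts) {
    return sendTimesMs.some(function (send) {
      return ts >= send && ts - send <= windowMs;
    });
  });

  return near.length / sessionStartsMs.length;
}
```

Compare the result with the share of total time those windows cover. If the send windows account for a small fraction of the period but a large fraction of direct sessions, the overlap is unlikely to be chance.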

AI assistant referrer stripping

This one is newer and growing fast. When ChatGPT, Claude, or Perplexity cite your page, the click that lands on your site often arrives with no referrer header, or with a referrer from a domain that doesn’t carry campaign context. Some assistants pass chat.openai.com or perplexity.ai as a referrer; others pass nothing. The result: a fast-growing slice of high-intent traffic dumped into Direct.

Identifying these sessions requires looking at landing page patterns. AI-sourced traffic tends to hit deep informational URLs (specific blog posts, documentation pages, comparison content) rather than the homepage. If your direct traffic to /blog/* URLs is growing faster than direct traffic to /, AI assistant leakage is the likely cause. We’ve started recommending that clients append a ?ref=ai parameter to the URLs they expose in llms.txt or structured data feeds, and set up a custom channel group keyed on it.
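A minimal sketch of that classification, for example as the basis of a custom dimension. The function name is hypothetical, and the hostname list is an assumption: chat.openai.com and perplexity.ai come from the examples above, the other entries are guesses you should replace with what appears in your own referrer data.

```javascript
// Hostnames are assumptions for illustration; extend them from your own logs.
var AI_REFERRER_HOSTS = ['chat.openai.com', 'chatgpt.com', 'perplexity.ai', 'www.perplexity.ai'];

function isAiAssistantVisit(pageUrl, referrer) {
  // Explicit marker: the ?ref=ai parameter added to URLs exposed to assistants.
  if (new URL(pageUrl).searchParams.get('ref') === 'ai') return true;

  // Otherwise fall back to the referrer hostname, when one was sent.
  if (!referrer) return false;
  var host;
  try {
    host = new URL(referrer).hostname;
  } catch (e) {
    return false; // malformed referrer
  }
  return AI_REFERRER_HOSTS.indexOf(host) !== -1;
}
```

Sessions that arrive with neither the marker nor a referrer still fall through, which is why the landing page pattern remains the main diagnostic.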

HTTPS → HTTP downgrades and meta referrer policies

If a secure page links to an insecure one, browsers strip the referrer by default. Less common now, but still relevant for clients with old subdomains or partner integrations on HTTP.

More common: a developer set <meta name="referrer" content="no-referrer"> or same-origin site-wide, often because someone read a security blog post about leaking auth tokens in URLs. The fix is to set it to strict-origin-when-cross-origin (which is also the modern browser default). This passes the origin to other sites without leaking the full path, so partners can still attribute clicks to you.

<!-- Correct for most marketing sites -->
<meta name="referrer" content="strict-origin-when-cross-origin">

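Check both the meta tag and the Referrer-Policy HTTP header, and make them agree. When they conflict, don’t assume the header wins: browsers apply the header first, a <meta name="referrer"> tag parsed afterwards replaces it for the rest of the document, and a referrerpolicy attribute or rel="noreferrer" on an individual link overrides both.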

Redirect chains that drop UTMs

This is the one that quietly destroys paid attribution. A campaign URL goes through a click tracker, then a vanity domain, then a 301 to the canonical product page. Somewhere in that chain, the query string gets dropped — usually because a developer wrote a redirect rule that doesn’t preserve query parameters.

Test it manually. Take a tagged URL from your last campaign, paste it into a redirect tracer (or just curl -IL), and watch what happens to the ?utm_* params at each hop. If they disappear at any step, you’ve found a leak.

curl -sIL "https://go.yourbrand.com/promo?utm_source=newsletter&utm_medium=email&utm_campaign=spring" \
  | grep -iE "^(location|HTTP)"
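Once you have the list of hop URLs from the curl output, finding the hop that drops the parameters can be automated. This is a sketch under assumptions: the function name is hypothetical, and each hop is expected as an absolute URL (relative Location headers would need resolving first).

```javascript
// Given the URL at each hop (the request URL, then each Location header),
// return the index of the first hop where a tracked parameter that was
// present earlier has gone missing, or -1 if nothing was dropped.
function firstHopDroppingParams(hopUrls, trackedKeys) {
  var seen = {};
  for (var i = 0; i < hopUrls.length; i++) {
    var params = new URL(hopUrls[i]).searchParams;
    for (var j = 0; j < trackedKeys.length; j++) {
      var key = trackedKeys[j];
      if (params.has(key)) {
        seen[key] = true;
      } else if (seen[key]) {
        return i;
      }
    }
  }
  return -1;
}
```

The returned index points at the destination of the broken redirect, so the rule to fix lives on the hop just before it.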

Quantifying the leak with BigQuery

You can’t prioritize what you can’t measure. If you have GA4’s BigQuery export enabled (and you should: the daily export is free for standard properties up to 1 million events per day, though BigQuery storage and query charges still apply), this query estimates what percentage of your direct sessions had a known source within the previous 30 days.

The logic: for every session that landed as (direct)/(none), look back 30 days at the same user_pseudo_id and check whether any earlier session had a real source. If yes, that direct session is “probably misattributed.”

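WITH events AS (
  SELECT
    user_pseudo_id,
    event_timestamp,
    (SELECT value.int_value FROM UNNEST(event_params) WHERE key = 'ga_session_id') AS session_id,
    (SELECT value.string_value FROM UNNEST(event_params) WHERE key = 'source') AS source,
    (SELECT value.string_value FROM UNNEST(event_params) WHERE key = 'medium') AS medium,
    (SELECT value.string_value FROM UNNEST(event_params) WHERE key = 'page_location') AS page_location
  FROM `your-project.analytics_XXXXXXXX.events_*`
  -- 60 days of data: 30 days of direct sessions plus a 30-day lookback for each.
  WHERE _TABLE_SUFFIX BETWEEN
    FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 60 DAY))
    AND FORMAT_DATE('%Y%m%d', CURRENT_DATE())
),
sessions AS (
  -- One row per session. Campaign params are not guaranteed to sit on the
  -- session_start event, so take the first non-null value seen in the session.
  SELECT
    user_pseudo_id,
    session_id,
    TIMESTAMP_MICROS(MIN(event_timestamp)) AS session_start_ts,
    ARRAY_AGG(source IGNORE NULLS ORDER BY event_timestamp LIMIT 1)[SAFE_OFFSET(0)] AS source,
    ARRAY_AGG(medium IGNORE NULLS ORDER BY event_timestamp LIMIT 1)[SAFE_OFFSET(0)] AS medium,
    ARRAY_AGG(page_location IGNORE NULLS ORDER BY event_timestamp LIMIT 1)[SAFE_OFFSET(0)] AS landing_page
  FROM events
  WHERE session_id IS NOT NULL
  GROUP BY 1, 2
),
direct_sessions AS (
  SELECT *
  FROM sessions
  WHERE (source = '(direct)' OR source IS NULL)
    AND (medium = '(none)' OR medium IS NULL)
    AND session_start_ts >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
),
prior_known_source AS (
  -- Flag each direct session whose user had a non-direct session in the prior 30 days.
  SELECT
    d.user_pseudo_id,
    d.session_id,
    d.landing_page,
    MAX(CASE
      WHEN s.source IS NOT NULL AND s.source != '(direct)'
      THEN 1 ELSE 0
    END) AS had_prior_source
  FROM direct_sessions d
  LEFT JOIN sessions s
    ON s.user_pseudo_id = d.user_pseudo_id
    AND s.session_start_ts < d.session_start_ts
    AND s.session_start_ts >= TIMESTAMP_SUB(d.session_start_ts, INTERVAL 30 DAY)
  GROUP BY 1, 2, 3
)
SELECT
  COUNT(*) AS direct_sessions,
  SUM(had_prior_source) AS likely_misattributed,
  -- SAFE_DIVIDE avoids a division-by-zero error when there are no direct sessions.
  ROUND(100 * SAFE_DIVIDE(SUM(had_prior_source), COUNT(*)), 1) AS pct_misattributed
FROM prior_known_source;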

This is a lower bound, not an upper bound. It only catches users who had a prior known session — first-time visitors with referrer stripping won’t show up. In practice, if this query returns 40%, your real leak is closer to 55–65%.

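For deeper diagnosis, group by landing page or hour-of-day to find patterns. A spike of direct sessions to the same landing page every Tuesday at 9am is usually your weekly newsletter going out untagged. A steady stream of direct sessions landing on /checkout/success points instead at a payment or SSO round-trip resetting the session.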

GTM recipe: preserve UTMs across redirects and login walls

Here’s the pattern I use to keep campaign parameters alive across SSO flows, paywalls, and any intermediate redirects that strip the query string.

The idea: on first page load, if the URL has UTM parameters, stash them in sessionStorage. On every subsequent page load (including after redirects), check sessionStorage and write the params back into the GA4 event payload if the current page lacks them.

In GTM, create a Custom JavaScript variable called cjs.persistedCampaign:

function() {
  var KEYS = ['utm_source','utm_medium','utm_campaign','utm_term','utm_content','gclid','gbraid','wbraid','msclkid'];
  var url = new URL(window.location.href);
  var hasAny = KEYS.some(function(k){ return url.searchParams.has(k); });

  // If current URL has UTMs, persist them
  if (hasAny) {
    var payload = {};
    KEYS.forEach(function(k){
      var v = url.searchParams.get(k);
      if (v) payload[k] = v;
    });
    payload._ts = Date.now();
    try {
      window.sessionStorage.setItem('aum_campaign', JSON.stringify(payload));
    } catch(e) {}
    return payload;
  }

  // Otherwise, try to restore from sessionStorage (within 30 min window)
  try {
    var stored = window.sessionStorage.getItem('aum_campaign');
    if (!stored) return undefined;
    var parsed = JSON.parse(stored);
    if (Date.now() - parsed._ts > 30 * 60 * 1000) return undefined;
    return parsed;
  } catch(e) {
    return undefined;
  }
}

Then in your GA4 Configuration tag (or every GA4 Event tag, depending on your setup), add these as event parameters:

  • campaign_source = {{cjs.persistedCampaign}}.utm_source (using a second helper variable per key)
  • campaign_medium = {{cjs.persistedCampaign}}.utm_medium
  • …and so on

Or pass them as a single JSON blob and split server-side. Either works.
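For the per-key route, the mapping from the persisted payload to event parameters could be sketched like this. campaign_source and campaign_medium are the names used above; campaign_name, campaign_term, and campaign_content are assumed here to follow the same pattern and should be checked against the gtag.js documentation. The function name is hypothetical.

```javascript
// Map the persisted UTM payload onto campaign_* parameter names.
// Only campaign_source and campaign_medium come from the article;
// the remaining names are assumptions to verify before use.
var UTM_TO_CAMPAIGN = {
  utm_source: 'campaign_source',
  utm_medium: 'campaign_medium',
  utm_campaign: 'campaign_name',
  utm_term: 'campaign_term',
  utm_content: 'campaign_content'
};

function toCampaignParams(persisted) {
  var out = {};
  if (!persisted) return out;
  Object.keys(UTM_TO_CAMPAIGN).forEach(function (utmKey) {
    // Skip missing keys so empty parameters are never sent.
    if (persisted[utmKey]) out[UTM_TO_CAMPAIGN[utmKey]] = persisted[utmKey];
  });
  return out;
}
```

Click IDs and the internal _ts field are deliberately left out of the mapping, so only the UTM values end up as event parameters.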

Why sessionStorage and not localStorage: you don’t want a UTM from three weeks ago hijacking a genuinely new session. The 30-minute timestamp check above mirrors GA4’s default session timeout.

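This approach breaks in three cases: when a redirect strips the query string before any page running GTM has loaded (nothing was ever stashed, so the redirect itself has to be fixed); when the redirect crosses domains and you don’t have cross-domain measurement configured (sessionStorage is scoped to a single origin); and when the browser opens the destination in a new tab that doesn’t inherit sessionStorage. For cross-domain flows, you also need to manually append the params to outbound links in your sGTM or in a Click trigger. If you’re doing this kind of work at scale, our GTM service has the patterns prebuilt.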
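Appending the params to an outbound link can be sketched as a pure function. The name decorateUrl is hypothetical, and the payload is assumed to have the shape returned by the Custom JavaScript variable above.

```javascript
// Append persisted campaign params to an outbound URL without overwriting
// values the link already carries. Keys starting with "_" (such as _ts)
// are internal bookkeeping and are skipped.
function decorateUrl(href, persisted) {
  var url = new URL(href);
  Object.keys(persisted || {}).forEach(function (key) {
    if (key.charAt(0) === '_') return;
    if (!url.searchParams.has(key)) url.searchParams.set(key, persisted[key]);
  });
  return url.toString();
}
```

Not overwriting existing values is a deliberate choice: a link that already carries its own utm_source is presumably tagged on purpose.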

Audit your referral exclusion list

Every GA4 property I’ve audited had at least one wrongly-excluded domain in the referral exclusion list (technically the “list unwanted referrals” config under Data Streams → Configure tag settings).

The rule is simple: only exclude domains that legitimately bounce a user back to your site mid-session (payment processors, SSO providers, your own subdomains). Excluding anything else hides real traffic.

Bad exclusions I’ve seen:

| Excluded domain | Why it was wrong |
| --- | --- |
| mail.google.com | Treats Gmail webmail clicks as direct instead of referral |
| t.co | Hides Twitter/X traffic that wasn’t UTM-tagged |
| linkedin.com | Hides organic LinkedIn referrals |
| bing.com | Someone confused referral exclusion with channel grouping |
| *.yourbrand.com (wildcard) | Hides legitimate subdomain referrals that should be tracked |

What you actually want excluded:

  • Your payment processor’s hosted checkout (checkout.stripe.com, paypal.com if redirected)
  • Your SSO provider (accounts.google.com if you use Google SSO, login.microsoftonline.com)
  • Auth0/Okta tenants where users round-trip back to your domain
  • Your own root domain and any subdomains you’ve set up for cross-domain measurement

Pull your current list. For every entry, ask: “does a logged-in or paying user pass through this domain and come back to mine?” If no, remove it.
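If the list is long, a first pass can be scripted. This sketch assumes you maintain your own list of known round-trip domains (payment, SSO, your own properties); the function name and both inputs are hypothetical, and the output is a list to review by hand, not a verdict.

```javascript
// Return exclusion-list entries that are not a known round-trip domain
// or a subdomain of one. These are the entries to question first.
function entriesToReview(exclusionList, roundTripDomains) {
  return exclusionList.filter(function (entry) {
    return !roundTripDomains.some(function (domain) {
      var suffix = '.' + domain;
      return entry === domain || entry.slice(-suffix.length) === suffix;
    });
  });
}
```

With roundTripDomains set to stripe.com and login.microsoftonline.com, an exclusion list containing checkout.stripe.com, mail.google.com, and t.co would leave the last two flagged for review.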

Setting referrer-policy correctly

Your referrer policy is a two-way street. If you strip referrers on outbound, your partners can’t attribute traffic to you and won’t prioritize you in their reports. If your partners strip referrers, you lose the data.

Recommended setting for marketing sites:

<meta name="referrer" content="strict-origin-when-cross-origin">

Or via HTTP header:

Referrer-Policy: strict-origin-when-cross-origin

This sends the full URL to same-origin requests, just the origin to cross-origin requests over HTTPS, and nothing on HTTPS→HTTP downgrades. It’s the modern browser default, but many sites override it with stricter values without realizing the attribution cost.
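Those three cases can be written down as a small function, which is handy for reasoning about what a given link will send. This is a simplified model of strict-origin-when-cross-origin for https and http URLs only, not a full implementation of the specification; the function name is hypothetical.

```javascript
// Simplified model of the Referer value sent under
// strict-origin-when-cross-origin. Returns '' when no referrer is sent.
function referrerSent(fromUrl, toUrl) {
  var from = new URL(fromUrl);
  var to = new URL(toUrl);

  // Same origin: full URL, minus the fragment and any credentials.
  if (from.origin === to.origin) {
    from.hash = '';
    from.username = '';
    from.password = '';
    return from.toString();
  }

  // HTTPS -> HTTP downgrade: nothing is sent.
  if (from.protocol === 'https:' && to.protocol === 'http:') return '';

  // Any other cross-origin request: origin only.
  return from.origin + '/';
}
```

For a link from https://shop.example.com/cart?step=2 to a partner site on HTTPS, the partner sees only https://shop.example.com/, which is enough for source attribution but not for page-level reporting.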

Test your current policy with the browser dev tools Network tab. Click an outbound link and check the Referer header on the destination request. If it’s missing or just an origin when you expected a full path, you know what to fix.

Reclassifying historical direct traffic in BigQuery

You can’t change what GA4 already recorded, but you can build a corrected view downstream. The approach: for every direct session, look back N days for the most recent known source and reassign.

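WITH events AS (
  SELECT
    user_pseudo_id,
    event_timestamp,
    (SELECT value.int_value FROM UNNEST(event_params) WHERE key = 'ga_session_id') AS session_id,
    (SELECT value.string_value FROM UNNEST(event_params) WHERE key = 'source') AS source,
    (SELECT value.string_value FROM UNNEST(event_params) WHERE key = 'medium') AS medium
  FROM `your-project.analytics_XXXXXXXX.events_*`
  -- Bound the scan. 90 days is a placeholder; widen it to cover the history you want.
  WHERE _TABLE_SUFFIX BETWEEN
    FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 90 DAY))
    AND FORMAT_DATE('%Y%m%d', CURRENT_DATE())
),
all_sessions AS (
  -- One row per session, using the first non-null source/medium seen in it.
  SELECT
    user_pseudo_id,
    session_id,
    TIMESTAMP_MICROS(MIN(event_timestamp)) AS session_ts,
    ARRAY_AGG(source IGNORE NULLS ORDER BY event_timestamp LIMIT 1)[SAFE_OFFSET(0)] AS source,
    ARRAY_AGG(medium IGNORE NULLS ORDER BY event_timestamp LIMIT 1)[SAFE_OFFSET(0)] AS medium
  FROM events
  WHERE session_id IS NOT NULL
  GROUP BY 1, 2
),
flagged AS (
  SELECT *, (source IS NOT NULL AND source != '(direct)') AS has_known_source
  FROM all_sessions
),
reclassified AS (
  -- Window functions replace the correlated subqueries, and source and medium
  -- are taken from the same prior session rather than looked up independently.
  SELECT
    user_pseudo_id,
    session_id,
    session_ts,
    source AS original_source,
    medium AS original_medium,
    has_known_source,
    LAST_VALUE(IF(has_known_source, source, NULL) IGNORE NULLS) OVER prior_sessions AS prior_source,
    LAST_VALUE(IF(has_known_source, IFNULL(medium, '(not set)'), NULL) IGNORE NULLS) OVER prior_sessions AS prior_medium
  FROM flagged
  WINDOW prior_sessions AS (
    PARTITION BY user_pseudo_id
    ORDER BY UNIX_SECONDS(session_ts)
    RANGE BETWEEN 2592000 PRECEDING AND 1 PRECEDING  -- N = 30 days, in seconds
  )
)
SELECT
  user_pseudo_id,
  session_id,
  session_ts,
  original_source,
  original_medium,
  COALESCE(prior_source, original_source) AS reclassified_source,
  COALESCE(prior_medium, original_medium) AS reclassified_medium
FROM reclassified
WHERE NOT has_known_source;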

Use this view in Looker Studio alongside (not instead of) GA4’s native
