We use cookies. You have options. Cookies help us keep the site running smoothly and inform some of our advertising, but if you’d like to make adjustments, you can visit our Cookie Notice page for more information.
We’d like to use cookies on your device. Cookies help us keep the site running smoothly and inform some of our advertising, but how we use them is entirely up to you. Accept our recommended settings or customise them to your wishes.
×

BigQuery: Getting Started With Your Google Analytics Data

Need some help getting to grips working with Google Analytics data in BigQuery? In this blog, we've compiled some example queries to help you get started. For further support or any questions/requests, please get in touch!

Google Analytics Dimensions and Metrics

Calculating Sessions

There are at least three BigQuery session calculations, the three main ones we use are:

  1. Total unique session IDs ignoring the session-break at midnight (fullVisitorId+visitId).

  2. Total unique session IDs including the session-break at midnight (fullVisitorId+visitStartTime).

  3. Count of sessions excluding sessions comprising of non-interaction hits only (totals.visits).

The most accurate session identifier is:

CONCAT(STRING(visitId),"-",fullVisitorId) AS sessionId

 

The hyphen prevents excel from truncating the ID.

Extract Session Level Custom Dimension with Hit Level Dimensions

The following query extracts the value of a session level scoped custom dimension stored against index 1, alongside the hit level dimension of pagePath.

The process is as follows:

  1. Union multiple daily tables using TABLE_DATE_RANGE
  2. Apply the FLATTEN function over the initial SELECT ensuring we have handled the multi date UNION first.
  3. SELECT the required fields from the FLATTENed table, concatenating the fullVisitorId and visitStartTime to output a session ID
  4. Use an window function with the MAX aggregation function to extract the required custom dimension value for each session.
SELECT
  --Concatenate the fullVisitorId with the visitStartTime to construct a sessionId
  CONCAT(fullVisitorId,"-", STRING(visitStartTime)) AS sessionId,
  --Extract the value where index = 1, evaluate the MAX aggregation over the sessionId
  MAX(IF(customDimensions.index=1,customDimensions.value,NULL)) OVER (PARTITION BY sessionId ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS uid,
  hits.page.pagePath
FROM
  --Unnest the hits record
  FLATTEN ( (
    SELECT
      *
    FROM
      --Union all tables within the specified date range
      TABLE_DATE_RANGE([projectName:datasetId.ga_sessions_],TIMESTAMP("20180201"),TIMESTAMP("20180202"))), hits)

 

Time On Page

Using the LEAD function, we can find the hits.time of a subsequent hit, enabling the calculation of time on page:

 SELECT
    *,
    ROUND((next_page_time-hits.time)/1000,2) AS timeOnPage_sec 
FROM (
    SELECT
      CONCAT(fullVisitorId,"-",STRING(visitStartTime)) AS sessionId,
      hits.hitNumber,
      hits.time,
      hits.page.pagePath AS pagePath,
      LEAD(hits.time) OVER (PARTITION BY fullVisitorId, visitStartTime ORDER BY hits.hitNumber) AS next_page_time
    FROM
      [projectName:datasetId.tableName]
    WHERE
      hits.type="PAGE" )

 

Next Page Path

When looking at individual user journeys, it is easy to see the next and previous page path in BigQuery. 

Using the same function as for time on page, we can create our next page path dimension:

SELECT
     CONCAT(fullVisitorId,"-",STRING(visitStartTime)) AS sessionId,
     hits.hitNumber,
     hits.time,
     hits.page.pagePath AS pagePath,
     LEAD(hits.page.pagePath) OVER (PARTITION BY fullVisitorId, visitStartTime ORDER BY hits.hitNumber) AS next_page_path
   FROM
     [projectName:datasetId.tableName]
   WHERE
     hits.type="PAGE"

 

Previous Page Path

For previous page path, the function required is LAG, rather than LEAD:

SELECT
    CONCAT(fullVisitorId,"-",STRING(visitStartTime)) AS sessionId,
     hits.hitNumber,
     hits.time,
     hits.page.pagePath AS pagePath,
     LAG(hits.page.pagePath) OVER (PARTITION BY fullVisitorId, visitStartTime ORDER BY hits.hitNumber) AS prev_page_path
   FROM
     [projectName:datasetId.tableName]
   WHERE
     hits.type="PAGE"

 

Replicating Segments

Session Level Segments

We can create session level segments using hit level dimensions by evaluating the presence of a particular dimension value across a users whole session, using window functions. This works in the same way as setting a condition on the "page" field in a GA segment.

SELECT
  --Create sessionId
  CONCAT(fullVisitorId,"-",STRING(visitId)) AS sessionId,
  MAX(IF(hits.page.pagePath CONTAINS "analytics",TRUE, FALSE)) OVER (PARTITION BY sessionId ORDER BY hits.time ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS analytics_pageSegment 
FROM
  TABLE_DATE_RANGE([projectName:datasetId.ga_sessions_], TIMESTAMP("START DATE"), TIMESTAMP('END DATE') )
WHERE
  --Only include PAGE hits
  hits.type="PAGE" )

 

Sequential Segments

The following query concatenates the page path for each page hit, for each session. This enables us to filter out sessions that don't match a particular sequence by filtering based on a regular expression.

SELECT
  *
FROM (
  SELECT
    --Create sessionId
    CONCAT(fullVisitorId,"-",STRING(visitId)) AS sessionId,
    --Concatenate the path for all touch points in a users session
    GROUP_CONCAT(hits.page.pagePath,">>") OVER (PARTITION BY sessionId ORDER BY hits.time ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS sequence
  FROM
    TABLE_DATE_RANGE([projectName:datasetId.ga_sessions_], TIMESTAMP("START DATE"), TIMESTAMP('END DATE') )
  WHERE
    --Only include PAGE hits
    hits.type="PAGE" )
WHERE
  REGEXP_MATCH(sequence, r'(funnel_step_1[^>>]+>>funnel_step_2)')
GROUP BY
  sessionId,
  sequence

 

Formatting Timestamps

Visit Start Time

visitStartTime is a Unix timestamp, to return this in a human readable format use SEC_TO_TIMESTAMP function:

SEC_TO_TIMESTAMP(visitStartTime) AS visitStartTime 

 

Hit Time

hits.time records the time in milliseconds since the visitStartTime. If you want to determine the time between hits occurring in different sessions, you need to combine the visitStartTime with hits.time to create a Unix timestamp for the hit time.

The following code returns this timestamp in a human readable format:

SEC_TO_TIMESTAMP(INTEGER(visitStartTime+ROUND(hits.time/1000,0))) as hitTimestamp 

 

 

Working with Dates

The following statements can be used in queries that require dynamic date ranges. 

First Date of Last Week

DATE(DATE_ADD(UTC_USEC_TO_WEEK(TIMESTAMP_TO_USEC(CURRENT_TIMESTAMP()),1),-7,"DAY"))

 

Last Date of Last Week

DATE(DATE_ADD(UTC_USEC_TO_WEEK(TIMESTAMP_TO_USEC(CURRENT_TIMESTAMP()),6),-7,"DAY"))

 

Last Date of Last Month

The following works by finding the negative integer value of today's day of the month (i.e. 02/11/2017 = "2") and subtracting this from today's date, returning the last date of last month.   

DATE(DATE_ADD(CURRENT_DATE(),-DAY(CURRENT_DATE()), "DAY"))

 

First Date of Last Month

For the date of the first day last month we first need to find the integer of the last date of last month, using the example above.

We can then subtract this integer from the last date of last month.

DATE(DATE_ADD(DATE(DATE_ADD(CURRENT_DATE(),-DAY(CURRENT_DATE()), "DAY")),-DAY(DATE(DATE_ADD(CURRENT_DATE() , -DAY(CURRENT_DATE()), "DAY")))+1,"DAY"))