Need some help getting to grips working with Google Analytics data in BigQuery? In this blog, we've compiled some example queries to help you get started. For further support or any questions/requests, please get in touch!
Google Analytics Dimensions and Metrics
Calculating Sessions
There are at least three BigQuery session calculations, the three main ones we use are:
-
Total unique session IDs ignoring the session-break at midnight (fullVisitorId+visitId).
-
Total unique session IDs including the session-break at midnight (fullVisitorId+visitStartTime).
-
Count of sessions excluding sessions comprising of non-interaction hits only (totals.visits).
The most accurate session identifier is:
CONCAT(STRING(visitId),"-",fullVisitorId) AS sessionId
The hyphen prevents excel from truncating the ID.
Extract Session Level Custom Dimension with Hit Level Dimensions
The following query extracts the value of a session level scoped custom dimension stored against index 1, alongside the hit level dimension of pagePath.
The process is as follows:
- Union multiple daily tables using TABLE_DATE_RANGE
- Apply the FLATTEN function over the initial SELECT ensuring we have handled the multi date UNION first.
- SELECT the required fields from the FLATTENed table, concatenating the fullVisitorId and visitStartTime to output a session ID
- Use an window function with the MAX aggregation function to extract the required custom dimension value for each session.
SELECT --Concatenate the fullVisitorId with the visitStartTime to construct a sessionId CONCAT(fullVisitorId,"-", STRING(visitStartTime)) AS sessionId, --Extract the value where index = 1, evaluate the MAX aggregation over the sessionId MAX(IF(customDimensions.index=1,customDimensions.value,NULL)) OVER (PARTITION BY sessionId ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS uid, hits.page.pagePath FROM --Unnest the hits record FLATTEN ( ( SELECT * FROM --Union all tables within the specified date range TABLE_DATE_RANGE([projectName:datasetId.ga_sessions_],TIMESTAMP("20180201"),TIMESTAMP("20180202"))), hits)
Time On Page
Using the LEAD function, we can find the hits.time of a subsequent hit, enabling the calculation of time on page:
SELECT *, ROUND((next_page_time-hits.time)/1000,2) AS timeOnPage_sec FROM ( SELECT CONCAT(fullVisitorId,"-",STRING(visitStartTime)) AS sessionId, hits.hitNumber, hits.time, hits.page.pagePath AS pagePath, LEAD(hits.time) OVER (PARTITION BY fullVisitorId, visitStartTime ORDER BY hits.hitNumber) AS next_page_time FROM [projectName:datasetId.tableName] WHERE hits.type="PAGE" )
Next Page Path
When looking at individual user journeys, it is easy to see the next and previous page path in BigQuery.
Using the same function as for time on page, we can create our next page path dimension:
SELECT CONCAT(fullVisitorId,"-",STRING(visitStartTime)) AS sessionId, hits.hitNumber, hits.time, hits.page.pagePath AS pagePath, LEAD(hits.page.pagePath) OVER (PARTITION BY fullVisitorId, visitStartTime ORDER BY hits.hitNumber) AS next_page_path FROM [projectName:datasetId.tableName] WHERE hits.type="PAGE"
Previous Page Path
For previous page path, the function required is LAG, rather than LEAD:
SELECT CONCAT(fullVisitorId,"-",STRING(visitStartTime)) AS sessionId, hits.hitNumber, hits.time, hits.page.pagePath AS pagePath, LAG(hits.page.pagePath) OVER (PARTITION BY fullVisitorId, visitStartTime ORDER BY hits.hitNumber) AS prev_page_path FROM [projectName:datasetId.tableName] WHERE hits.type="PAGE"
Replicating Segments
Session Level Segments
We can create session level segments using hit level dimensions by evaluating the presence of a particular dimension value across a users whole session, using window functions. This works in the same way as setting a condition on the "page" field in a GA segment.
SELECT --Create sessionId CONCAT(fullVisitorId,"-",STRING(visitId)) AS sessionId, MAX(IF(hits.page.pagePath CONTAINS "analytics",TRUE, FALSE)) OVER (PARTITION BY sessionId ORDER BY hits.time ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS analytics_pageSegment FROM TABLE_DATE_RANGE([projectName:datasetId.ga_sessions_], TIMESTAMP("START DATE"), TIMESTAMP('END DATE') ) WHERE --Only include PAGE hits hits.type="PAGE" )
Sequential Segments
The following query concatenates the page path for each page hit, for each session. This enables us to filter out sessions that don't match a particular sequence by filtering based on a regular expression.
SELECT * FROM ( SELECT --Create sessionId CONCAT(fullVisitorId,"-",STRING(visitId)) AS sessionId, --Concatenate the path for all touch points in a users session GROUP_CONCAT(hits.page.pagePath,">>") OVER (PARTITION BY sessionId ORDER BY hits.time ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS sequence FROM TABLE_DATE_RANGE([projectName:datasetId.ga_sessions_], TIMESTAMP("START DATE"), TIMESTAMP('END DATE') ) WHERE --Only include PAGE hits hits.type="PAGE" ) WHERE REGEXP_MATCH(sequence, r'(funnel_step_1[^>>]+>>funnel_step_2)') GROUP BY sessionId, sequence
Formatting Timestamps
Visit Start Time
visitStartTime is a Unix timestamp, to return this in a human readable format use SEC_TO_TIMESTAMP function:
SEC_TO_TIMESTAMP(visitStartTime) AS visitStartTime
Hit Time
hits.time records the time in milliseconds since the visitStartTime. If you want to determine the time between hits occurring in different sessions, you need to combine the visitStartTime with hits.time to create a Unix timestamp for the hit time.
The following code returns this timestamp in a human readable format:
SEC_TO_TIMESTAMP(INTEGER(visitStartTime+ROUND(hits.time/1000,0))) as hitTimestamp
Working with Dates
The following statements can be used in queries that require dynamic date ranges.
First Date of Last Week
DATE(DATE_ADD(UTC_USEC_TO_WEEK(TIMESTAMP_TO_USEC(CURRENT_TIMESTAMP()),1),-7,"DAY"))
Last Date of Last Week
DATE(DATE_ADD(UTC_USEC_TO_WEEK(TIMESTAMP_TO_USEC(CURRENT_TIMESTAMP()),6),-7,"DAY"))
Last Date of Last Month
The following works by finding the negative integer value of today's day of the month (i.e. 02/11/2017 = "2") and subtracting this from today's date, returning the last date of last month.
DATE(DATE_ADD(CURRENT_DATE(),-DAY(CURRENT_DATE()), "DAY"))
First Date of Last Month
For the date of the first day last month we first need to find the integer of the last date of last month, using the example above.
We can then subtract this integer from the last date of last month.
DATE(DATE_ADD(DATE(DATE_ADD(CURRENT_DATE(),-DAY(CURRENT_DATE()), "DAY")),-DAY(DATE(DATE_ADD(CURRENT_DATE() , -DAY(CURRENT_DATE()), "DAY")))+1,"DAY"))