We use cookies. You have options. Cookies help us keep the site running smoothly and inform some of our advertising, but if you’d like to make adjustments, you can visit our Cookie Notice page for more information.
We’d like to use cookies on your device. Cookies help us keep the site running smoothly and inform some of our advertising, but how we use them is entirely up to you. Accept our recommended settings or customise them to your wishes.

Apache Airflow and the Power of BigQuery for GA Data

You can do a lot in GA but sometimes it’s what you can't do that hurts the most. Every use case is different but many customers have additional information that they would like to use in Google Analytics for remarketing or to inform their reports. This can vary from data modeling such as lifetime value or simpler data like users identified using an internal query. We will discuss how this can be accomplished automatically at regular intervals using Apache Airflow.

The scope of this blog is to discuss in high level terms the tools involved in carrying out this task and how they can be used to accomplish our goal. We won't be pasting code here or going in to developer level detail, however for any additional questions you can reach out to [email protected]


BigQuery is a fully managed data warehousing tool that allows you to get answers to very specific questions you have about your Google Analytics data. Let's say for example you have an ecommerce retail store online that has Google Analytics tracking. You may want to get a list of users who added certain items to their basket and then did not purchase. These users can be difficult to get from Google Analytics but the task is easy with BigQuery. There are many other use cases like this where we can get a list of users we would like to remarket to but cannot get this into Google Analytics. Now with BigQuery ML you can run models over your data all in SQL without all the knowledge previously required in AI and programming making this all the more useful for remarketing.

Google Analytics

Query time uploads in Google Analytics are extremely useful for adding in data quickly and regularly. They allow you to upload additional information that you can match to your users by using the client ID or user ID. Here at Merkle|Periscopix we use this feature in conjunction with BigQuery and Apache Airflow in order to get the most out of it. Query time takes effect in Google Analytics almost immediately, but there are some limitations to this feature:

  • You have to be a GA360 customer to get access
  • You are only allowed a maximum of 1GB of data to use per data import

However, for the time being we have found that this restriction is not too limiting. In order to hold our extra data we will need a custom dimension that will hold the values we want to associate with the user. I am not going to go into how to set up the Query time import here as it is as straightforward as selecting the column you want to match users on (client ID/ user ID etc) and selecting the custom dimension index values you want to place data into, however if you need more information you can refer to Google's own brilliant documentation.

Example advance remarketing pipeline diagram

Remarketing pipeline diagram

Apache Airflow

Airflow is a platform that allows you to programatically author and schedule automated solutions. It is open sourced so anyone can use it. Airflow is also the underlying technology to Google's Cloud Composer and because of this it has a lot of really useful Google Cloud based operators. Airflow allows you to code in Python, specific operations and transformations that can be run on a scheduled basis with error handling/emailing and good logging. We love Airflow, its flexibility and openness generally means that there are no restrictions on the tool. The scope of this blog does not cover how to install or code on Airflow but if you are interested I would recommend starting with the documentation.

Getting back to our problem, we can generate really useful data using BigQuery, and Google Analytics has a feature to allow extra data in to it, so how do we combine the two? Here at Merkle|Periscopix we have developed a pipeline that will transfer your valuable user data in to Google Analytics for activation and reporting purposes on a regular basis using Airflow. So how does it work?:

  1. We use the Apache Airflow BigQuery operator to run our desired query and store the results in a table.

  2. We then use a BigQueryToGoogleCloudStorage operator to export our newly created table as a CSV to a bucket in Google Cloud Storage.

  3. Then the CSV is downloaded as an object to our Airflow machine in a dataFrame format with the help of Pandas.

  4. The schema is then converted to be the same as the expected schema from our custom data import in Google Analytics.

  5. Using the Google Analytics Management API we upload the CSV to GA for processing.

After this is done we set Airflow to run this DAG at any interval we like; daily, hourly etc. Now in Google Analytics we can create Segments of users using the value in the custom dimension and then turn them into an audience. Then we can push this data out to DV360 and SA360 for remarketing. Depending on how often you want to run this you may consider adding a task in Airflow that will clean up previous uploads in Google Analytics so you can free up some space in case you get close to the 1GB limit.

If you have some customer data from your CRM, campaign cost data or even want to develop some propensity scores, get in touch to see how we can support.