If you collect data in GA4, you certainly use exports to BigQuery.
So you may have already noticed that at a given moment, the data was not exactly the same between the 2 tools.
At Datama, we implemented a scenario for one of our clients (called Client A):
The GA4 data is queried in BigQuery and then connected to our Datama Detect tool, which automatically triggers an anomaly email if the data is incomplete.
Over the past few weeks, we’ve started to receive these detection emails more and more frequently.
After an initial analysis, we noticed a discrepancy between partitioned tables and intraday & interday events.
We previously set up a Scheduled Query (via BigQuery) to retrieve GA4 data at a fixed time. The result is a table partitioned into days.
Reminder: Intraday vs. Interday Events
Our Client Use Case
In theory, intraday table data is supposed to “transition” to the interday table after one day.
Therefore, we decided to copy the tables twice a day (at 09:00 AM and 06:PM) for 1 week to analyze the evolution of data completion.
We selected 4 events:
- Number of sessions
- Number of transactions
- Number of out-of-stock items (custom event)
- Item views (custom event)
Our Findings
First, let’s see how completed the data is after the first copy (Day+1 AM):
- While we could have hoped to be close to 100% of the data available, this is not the case for the majority of events (with a few exceptions). Most events even have 5 to 15% missing data, which is not negligible.
- For Transactions event, data is already 100% available for 3 dates (22.06 to 24.06), even before the switchover from intraday to interday
- Custom events (Out of stock, itemViews) tend to have a lower degree of completion at equivalent timing
Further analysis & Takeaways
Actually, this simple query provides that information:
SELECT
table_name,
creation_time,
CAST(creation_time AS DATETIME) AS creation_time
FROM
`nameofyourdataset.analytics_xxxx.INFORMATION_SCHEMA.TABLES`-- <--change name of dataset to your own
WHERE
table_name BETWEEN 'events_202402%' AND 'events_202408%'-- <--change period to what you need
AND table_name NOT LIKE '%intraday%'
ORDER BY
table_name ASC;
We applied this query to our case and here is what we learned:
Between May and June, tables began to be created later and later. As a result, our scheduled query started querying the data “too early”, triggering anomaly emails.
In mid-July, we changed the query time from 09:00 AM to 12:00 AM, which only partially solved the problem.
In fact, we can see here, for example, that shifting the query to 06:PM would enable us to have the interday data completed in all cases.
To ensure that this was not an isolated case for a single customer, we extended the search to 5 other clients, all with a GA4 360 account (unlike customer A with a free account).
We can see that the general trend is broadly the same.
Time of creation has been occurring later and later over the last 3 months (around 9 AM on average 6 months ago, now after 10AM).
Also, looking at the distribution of hours, we can see that:
- The time range varies from 3 AM to 6 PM! This shows how unpredictable GA4 can be.
- +70% of the tables are created between 6 AM and 10 AM, yet the completion time remains highly variable.
To Go Further…
It should be noted that these investigations were conducted with a free BigQuery account, and results may differ with a Premium account.
By the way, since April 24, GA4 360 now supports ‘Fresh daily‘ BigQuery export which enables companies to receive periodic batch updates of the full dataset, completed by a guaranteed time each day.
Another option is also to be triggering a job based on the completion of the intraday event. In other words, it involves obtaining GA4 data in BigQuery as soon as it is complete and the export is finished. This is the principle of Pub-Sub that we will experiment with soon.
Stay tuned!