Recommended approach to importing and parsing GA4 BigQuery event data in SQL Server

gudnpqoy  posted on 2023-04-04 in SQL Server

We are using Python to connect to BigQuery and download daily events_YYYYMMDD.csv files. Those files contain comma-separated columns, and some of the columns contain JSON data (some of which contains further nested JSON as well).

Each of those daily files contains approximately 1 million records.
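For context, the shape of those files is roughly what you would get by exporting the daily table with the nested columns serialized to JSON strings. A minimal sketch of such an export (the project, dataset, and bucket names are placeholders, not our real ones):

    EXPORT DATA OPTIONS (
      uri = 'gs://my-export-bucket/ga4/events_20230404_*.csv',  -- placeholder bucket/path
      format = 'CSV',
      overwrite = true,
      header = true
    ) AS
    SELECT
      event_date,
      event_timestamp,
      event_name,
      user_pseudo_id,
      -- nested/repeated RECORD columns cannot go straight to CSV,
      -- so they are serialized to JSON strings here
      TO_JSON_STRING(event_params)   AS event_params,
      TO_JSON_STRING(device)         AS device,
      TO_JSON_STRING(geo)            AS geo,
      TO_JSON_STRING(traffic_source) AS traffic_source,
      TO_JSON_STRING(items)          AS items
    FROM `my-project.analytics_123456789.events_20230404`;  -- placeholder project/dataset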

Using the Google GA4 Events Schema, we are building the GA4-related tables in SQL Server (a rough sketch of one of them follows the list below):

  • event
  • user
  • device
  • geo
  • app_info
  • traffic_source
  • stream and platform
  • ecommerce
  • items

What would be the recommended manner of parsing the .csv files (which can contain multiple levels of nested JSON) into the appropriate tables (listed above) in SQL Server?

We started to create an SSIS package to import the .csv file into a staging table, then write a script to parse the rows. However, upon doing so, we realized we would need a fairly complicated query containing multiple CROSS APPLY statements, and we are concerned about performance.
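For example, just pulling the event parameters out of one staged row already looks roughly like the following (the staging table and column names below are ours, and the real query needs several more CROSS APPLY levels for the nested values):

    -- dbo.ga4_staging is our staging table loaded by the SSIS package;
    -- event_params holds the raw JSON string from the CSV
    SELECT
        s.event_date,
        s.event_timestamp,
        s.event_name,
        p.[key]                                 AS param_key,
        JSON_VALUE(p.[value], '$.string_value') AS string_value,
        JSON_VALUE(p.[value], '$.int_value')    AS int_value
    FROM dbo.ga4_staging AS s
    CROSS APPLY OPENJSON(s.event_params) WITH (
        [key]   NVARCHAR(100) '$.key',
        [value] NVARCHAR(MAX) '$.value' AS JSON  -- keep the nested value object as JSON for the next level
    ) AS p;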


wnavrhmk1#

The best practice is to flatten your BigQuery GA4 export before pulling it out of BigQuery. You can find various approaches to flattening on the web, but the main idea is to get rid of the JSON in your event parameter/custom dimension columns. The typical solution is to create a separate column for each event parameter/custom dimension.
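A minimal sketch of that idea in BigQuery SQL, assuming you only need a couple of known parameters (the table name and parameter keys below are just examples):

    SELECT
      event_date,
      event_timestamp,
      event_name,
      user_pseudo_id,
      -- one flat column per event parameter / custom dimension you care about
      (SELECT value.string_value FROM UNNEST(event_params)
        WHERE key = 'page_location') AS page_location,
      (SELECT value.int_value FROM UNNEST(event_params)
        WHERE key = 'ga_session_id') AS ga_session_id
    FROM `my-project.analytics_123456789.events_*`  -- placeholder project/dataset
    WHERE _TABLE_SUFFIX = '20230404';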

After this, people typically just move the data into their preferred data store as-is. Any transformations in the ETL at that point are company-specific, and the ETL logic normally gets bloated over the years as it keeps being extended to improve the data.
