Google Maps Semantic History in BigQuery

In this post I talked about getting your location data from Google Maps and loading it into Google BigQuery. I mentioned that Google also provides its semantic history. This data structure includes Maps’ inferred place names and addresses and its guesses about mode of travel. It divides the data into two main event types, activitySegment and placeVisit.

placeVisit describes time spent stationary within a given latitude/longitude bound, and includes the Google Place ID, address, name, and a confidence score.

activitySegment includes start and end coordinates, as well as the waypoints and raw path, duration, and an activity type (“IN_PASSENGER_VEHICLE”, “IN_BUS”, “IN_TRAM”, et al.) with associated confidence score.

In order to load this data into BigQuery, I wanted to transform it to JSONL as before, but also to unify the two event types into a single structure so they can be queried together. Most of the important pieces, like location and duration, apply to both. This also allows for a single date partitioning scheme, so that you can query a single partition to get all activity for a single day. (You could easily store multiple people’s data in the same structure and continue to leverage the same partition scheme.)

Defining the Schema

The JSON schema / BigQuery table definition I decided on was:

dateDATELocal event date (adjusted for timezone)
typeSTRING“VISIT” for a placeVisit and “TRAVEL” for an activitySegment
latStartFLOATThe starting latitude coordinate of the event
lonStartFLOATThe starting longitude coordinate of the event
latEndFLOATThe ending latitude coordinate of the event (“VISIT” is the same as start)
lonEndFLOATThe ending longitude coordinate of the event (“VISIT” is the same as start)
moveTypeSTRINGThe activity type (“IN_VEHICLE”, etc.) “VISIT” always has “NONE”
distanceFLOATDistance traveled in meters during the activitySegment. 0 for “VISIT”.
semTypeSTRINGGoogle’s semantic type for the location. (“TYPE_WORK”, “TYPE_HOME”)
placeIdSTRINGGoogle’s place ID for the VISIT.
addressSTRINGThe geocoded address for the lat/lon pair for the VISIT.
nameSTRINGThe name of the location (“Wendy’s”, “Starbucks”, “The Colosseum”)
confidenceFLOATThe confidence score for the named location
timestampStartTIMESTAMPThe fixed UNIX epoch timestamp for the beginning of the event, in seconds
timestampEndTIMESTAMPThe fixed UNIX epoch timestamp for the end of the event, in seconds
durationINTEGER(timestampStart – timestampEnd), for convenience
latCenterFLOATVISITs have an extra lat/lon pair, which I loaded here just in case
lonCenterFLOATVISITs have an extra lat/lon pair, which I loaded here just in case
waypointPathGEOGRAPHYThe coordinates of any waypoints along the activitySegment
rawPathGEOGRAPHYA simplified linestring showing the path for activitySegments
pointStartGEOGRAPHYThe geography point for latStart/lonStart
pointEndGEOGRAPHYThe geography point for latEnd/lonEnd
pointCenterGEOGRAPHYThe geography point for latCenter/lonCenter
Tabular form of the schema.json file.

I kept the non-geographic columns to mirror the source data, but they’re redundant to ST_X(pointStart) or ST_Y(pointStart). Date, however, is not — whereas the raw timestamp values give the fixed start and end values for the activity, Date is localized to the appropriate timezone. In fact, using a nifty npm package called geo-tz, it will actually automatically look up the timezone in the lat/lon the event is in! So even as you move between timezones, it will still place the event in the appropriate local date. If it can’t find one, it just defaults to my local timezone, “America/Los_Angeles”.

Coding the Script

Unlike last time, I wanted both a one-stop shop and an idempotent module that I could run over and over again as I worked on the data, without having to use the UI to repeat manual steps. I wrote a nodejs script that performs the following steps:

  • Transform all semantic history files into JSONL format
  • Zip and load file to Google Cloud Storage
  • Create the BigQuery table according to the defined schema (dropping and rebuilding each time)
  • Load the GCS file to BigQuery
  • Perform any necessary post-processing in BigQuery to clean the data

There weren’t a lot of challenges here, just your regular data transformation pipeline. Google Takeout doesn’t have a native API (there are workarounds) so I didn’t bother writing a repeatable pipeline to refresh the data monthly or weekly. (And for those of you who read my book, note that despite writing the book in Python, I’m more comfortable in Node for personal projects. I chose Python for the book because it’s compatible across Jupyter notebooks, Dataflow, Google Cloud Functions, and App Engine and I didn’t want to be jumping around in languages. Despite that, I still ended up with Python, BQ SQL, ZetaSQL, MySQL SQL, Javascript, and Java…) A couple of interesting notes though.

While Takeout provides the regular location history as a single, massive file in JSON or KML, the semantic versions are segmented into monthly files, i.e. 2020_JANUARY.json, 2020_FEBRUARY.json, etc. I just used a single glob wildcard to grab all JSONs in the subfolders, so you can dump your “Semantic Location History” folder directly into the script folder. Don’t mix in any other JSONs, though because it just uses ‘./sem/**/*.json’ as the glob pattern.

There doesn’t seem to be any way to actually gzip files for use with the BigQuery Node SDK. I didn’t do a lot of research on this, but I couldn’t get the SDK to take the file if I gzipped it. The semantic files are much smaller — mine were 8MB unzipped and 2MB gzipped, so I didn’t fiddle too much with this and just commented it out.

Longitude/latitude are stored as E7 values in the Takeout JSON, so I simply divided by 1E7 to restore the normal decimal format. Timestamps are stored with millisecond precision, so I divided by 1000 to load them as BigQuery timestamp files.

Loading GEOGRAPHY types from JSONL is kind of interesting too. Loading ST_POINTs isn’t too bad; I just used a string template in the form `POINT(${lon} ${lat})` and BigQuery was happy. MULTIPOINTs and LINESTRINGs were a little harder, mostly because the data often repeats the same point or contains no unique points at all. The comma and parentheses are also strict, so it requires a bit of string finagling to get it working right. Ultimately I used these WKT string formats:

PointPOINT(30 10)
MultipointMULTIPOINT ((10 40), (40 30), (20 20), (30 10))
LinestringLINESTRING (30 10, 10 30, 40 40)
WKT formats BigQuery is okay with

You will need a GCP service account key with permissions to create/delete Google Cloud Storage objects and permission on BigQuery to run jobs, create, and delete tables in the dataset. Download it from the GCP IAM console and save it as service-account-key.json in the script folder to get access.

You also have to make your own GCS bucket, which you can do with gsutil or the Cloud console.

Running the Script

The script documents its progress, although it only takes about 45 seconds to run from my local machine with my file set. The majority of that time is spent in loading the file into BigQuery from GCS, which doesn’t take place on my machine anyway.

The script doesn’t have any real handling, so it’ll likely throw an UnhandledPromiseRejectionWarning if something goes wrong. If everything works correctly, you’ll be left with a yellow “Finished!” prompt and you can move into BigQuery.

Note also that there is an empty postprocessing step in which you could add any SQL you want to run to clean things up. For instance, some rows have starting and ending coordinates, but don’t have a distance, so you might want to UPDATE the ST_DISTANCE for those rows for cleaner distance statistics:

UPDATE `location.semantic_history`
SET distance = ST_DISTANCE(pointStart, pointEnd)
AND distance IS NULL

But now you have a neat table that contains all of your location history in an easily queryable form, with lots of helpful metadata.

Google Place IDs

Enabling the Places API and using an API call in the format{PLACE_ID}&key={GOOGLE_MAPS_KEY}

gives you all of the Google Maps data on a given place, including operating hours, price level, the geometry for the whole place, reviews, and a bunch of other stuff. You could join this data together with your location history for some really interesting queries like “how many places was I at after hours?” or “how many expensive restaurants do I go to per month?” I’ll tackle this part in a later post, using the Node SDK for Google Maps. You can cache this data too, so I can probably build some sort of rudimentary table that lets me collect and load places data wherever it exists.

Interesting Queries

Even with just duration, latitude, and names, I can do some pretty neat things with this data. Here are just a few examples I tried.

How many times do I go to Starbucks each week?

WITH weeks AS
     FROM UNNEST(GENERATE_DATE_ARRAY('2014-01-01','2021-01-01')) a),
     times AS
    (SELECT DATE_TRUNC(date,week) w, count(*) count
     FROM location.semantic_history
     WHERE LOWER(name) LIKE '%starbucks%'
     GROUP BY DATE_TRUNC(date,week))
SELECT weeks.w, IFNULL(times.count,0)
FROM weeks

Or any other place, just change the “WHERE LOWER(name) LIKE” to whatever clause you want. How about number of hours spent on planes?

SELECT SUM(duration)/3600 FROM location.semantic_history
WHERE moveType = 'FLYING'

SELECT rawPath from location.semantic_history
WHERE moveType = 'FLYING'

The second query can be passed directly to BigQuery Geo Viz for a full visualization of your flights. Google only began tracking the activity type somewhere in 2016, so data before that won’t have it. This also seems to be missing a bunch of flights I’ve taken since then, so it probably needs refinement. Or, how about all trips ordered by decreasing speed in miles per hour?

SELECT (ST_DISTANCE(pointStart,pointEnd)/duration)*2.23694 mph, * 
FROM location.semantic_history
WHERE distance > 200 AND duration > 0

Noting that the maps distance is sometimes wrong, I recalculated it to do this math. Even that yielded some amusing outliers, no doubt due to poor cellular or GPS signal. Apparently, on one New Year’s Eve, I traveled from SFO to the middle of the Pacific Ocean at 930 mph. (This appears to have been Google thinking I landed at SFO when I actually landed at LAX.)

Lastly, you can generate the Google Maps direction string for any trip — not the route you actually took, just the default route so you can visualize the locations:

SELECT FORMAT(',%s/%s,%s/', 
       CAST(latStart AS STRING), CAST(lonStart AS STRING), CAST(latEnd AS STRING), CAST(lonEnd AS STRING)) mapsString
FROM location.semantic_history

Using the Code

The code for all of this is available on github at Feel free to download and modify for your own use!