Google Maps Location History with BigQuery

Google Maps can be configured to record your location and show your history over time. If you don’t mind the privacy implications of this, it’s a powerful tool to see where you’ve been and how often you go to certain places. You can also use it to calculate historical travel times, visualize road trips, and so on.

Unfortunately, the stock interface leaves something to be desired. It’s hard to plot paths or to search by locations. You can’t get a list of all the times you’ve been to a certain place. There are some pretty interesting insights about your habits and behaviors that are locked up inside the interface.

I was thinking about the impact that quarantine has had on my travel habits. The stay-at-home order in California took effect here around March 19, but we had imposed mandatory work-from-home on the 16th. The only way I could see that in aggregate on Google Maps was the graph of visited locations.

Column graph showing number of places visited by day
Well, that’s depressing.

That got me thinking: is there a way to visualize this data over a longer period of time? Could I do more interesting things, like calculate the maximum distance traveled from home on any given day?

Google Takeout

It is possible to get your location history exported as a single JSON file. To do that, go to https://takeout.google.com/settings/takeout and click “Deselect All”. Then scroll down to Location History, re-check it, and make sure the settings look like this:

Options for Location History export

Export your data and the process will begin. It can take anywhere from a few minutes to several hours, but in my experience it was usually less than five minutes. Download the JSON file into a new folder somewhere on your desktop.

Loading to BigQuery

Unfortunately, the JSON that Google Takeout outputs isn’t directly compatible with BigQuery. BigQuery wants JSONL (newline-delimited JSON), which is basically a file with one JSON object per line, so that massive datasets can be split and loaded line-by-line without parsing the whole file as a single object. I guess Takeout didn’t get the message, because it generated one giant 600MB JSON object that choked several parsers when I tried to view it.

So. I guess we have to do some preprocessing on the file before BigQuery will accept it. Guess I’m breaking out the Node. I’m not sure how I’d want to handle this if the file were, say, 600GB, but we can handle 600MB locally without too much hassle. I just cobbled a quick script together that converts a history JSON file into a JSONL file.

const fs = require('fs');
const filename = './lh.json';

// Load the whole Takeout export into memory and grab the locations array.
const raw = fs.readFileSync(filename);
const locations = JSON.parse(raw).locations;

// Write one JSON object per line (JSONL) to lh.jsonl.
const newfile = filename.replace(/\.json$/, '.jsonl');
const stream = fs.createWriteStream(newfile);

locations.forEach(x => {
    const y = {
        timestamp: Math.floor(x.timestampMs / 1000),  // ms -> epoch seconds
        lat: x.latitudeE7 / 1e7,                      // E7 -> decimal degrees
        lon: x.longitudeE7 / 1e7,
        accuracy: x.accuracy
    };
    // foreshadowing: think about timezones...

    stream.write(`${JSON.stringify(y)}\n`);
});
stream.end();

All this does is read a file named lh.json that holds the JSON-formatted Google Takeout export, load it into memory, and write each location record out as a separate JSON object on its own line. I also converted the timestamps to epoch seconds and the latitude/longitude pairs to decimal degrees for easier GEOGRAPHY conversion. This has the side effect of making the output substantially smaller by shortening the field names; the resulting file was only 203MB. I didn’t bother to gzip it, but I also have a fast Internet connection. If you don’t, gzip the file now before the next step.
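To make that concrete, a single entry from the Takeout locations array turns into a single line of JSONL, roughly like this (the coordinates are made up for illustration):

// one entry from the Takeout export's locations array...
{ "timestampMs": "1493499000000", "latitudeE7": 377749000, "longitudeE7": -1224194000, "accuracy": 20, ... }

// ...becomes one line of lh.jsonl
{"timestamp":1493499000,"lat":37.7749,"lon":-122.4194,"accuracy":20}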

Then I transferred it directly to Google Cloud Storage for loading into BigQuery.

gsutil -m cp lh.jsonl 'gs://my-bucket/lh.jsonl'

After that, it’s a simple bq call to get the data loaded into BigQuery.

bq load --source_format=NEWLINE_DELIMITED_JSON \
        --time_partitioning_field=timestamp \
        location.history \
        gs://my-bucket/lh.jsonl \
        timestamp:TIMESTAMP,lat:FLOAT,lon:FLOAT,accuracy:INTEGER
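If you want a quick sanity check that everything landed, a throwaway query does the trick:

SELECT COUNT(*) AS num_points,
       MIN(timestamp) AS earliest,
       MAX(timestamp) AS latest
FROM location.history;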

After I did that, I added columns for a GEOGRAPHY point and for the date in my local timezone, and then I realized my error. The table is partitioned on the UTC timestamp, so the partitioning isn’t going to help when my local days are offset by seven or eight hours, and since I want to look at the data day-by-day, this could be a big problem. Well, how bad can it be…?
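(For reference, adding and backfilling those columns looks roughly like this; a sketch, with the timezone being Pacific since I’m in California:)

ALTER TABLE location.history
  ADD COLUMN point GEOGRAPHY,
  ADD COLUMN date DATE;

-- backfill the new columns from the existing lat/lon and timestamp
UPDATE location.history
SET point = ST_GEOGPOINT(lon, lat),
    date = DATE(timestamp, 'America/Los_Angeles')
WHERE true;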

SELECT * FROM location.history
WHERE timestamp BETWEEN '2017-01-01' AND '2017-01-02'

This query will process 246.2 KB when run.

SELECT * FROM location.history
WHERE date BETWEEN '2017-01-01' AND '2017-01-02'

This query will process 260.8 MB when run.

Oh, only a thousand times worse.

That bad. Luckily, this is easy enough to fix by migrating the data to a new table that is partitioned by date instead, and then deleting the old one.

CREATE TABLE location.history2 (
    date DATE,
    timestamp TIMESTAMP,
    lat NUMERIC,
    lon NUMERIC,
    accuracy INT64,
    point GEOGRAPHY)
PARTITION BY date;

INSERT INTO location.history2
SELECT date, timestamp, lat, lon, accuracy, point
FROM location.history;

DROP TABLE location.history;
-- bq cp location.history2 location.history
DROP TABLE location.history2;

Then I copied history2 back to history (that’s the commented-out bq cp step, which has to happen between the two DROPs). This can be done from the command line or the UI, so I just used the UI to copy the table. Back in business!

Querying the Data

There are lots of ways to skin this cat. I thought the most interesting way to look at this, rather than distance from home, was to calculate the longest straight-line distance between any two points inside the smallest convex polygon (the convex hull) that encompassed all of my locations for the day.
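The query below writes into a small summary table, location.history_date, so that needs to exist first. A minimal sketch that matches the columns the query produces:

CREATE TABLE location.history_date (
    date DATE,
    convex_hull GEOGRAPHY,
    max_distance FLOAT64);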

INSERT INTO location.history_date
SELECT date,
       ST_CONVEXHULL(ST_UNION_AGG(point)) convex_hull,

       -- the maximum distance between a hull and an identical copy of itself
       -- is the longest line you can draw inside that hull (its diameter)
       ST_MAXDISTANCE(ST_CONVEXHULL(ST_UNION_AGG(point)),
                      ST_CONVEXHULL(ST_UNION_AGG(point))) max_distance

FROM location.history
WHERE accuracy < 100
GROUP BY date;

This is a little abstract, but I had a couple of thoughts in mind. One, I’ve moved several times since I started collecting location data, which means that distance from “home” is arbitrary across time. Two, I also travel a lot, and I want to show relative motion: a day when I fly to Chicago will have a large value, but if I stay in Chicago the next several days, the values should be small again to reflect that I didn’t travel much during those 24-hour periods. This worked basically as I expected. For example, here’s the day I first saw Hamilton in Chicago:

From the excellent BigQuery geo viz at https://bigquerygeoviz.appspot.com/.

This isn’t totally accurate, and the “convexification” eliminates internal boundaries, but it’s close enough for our purposes, which is just to determine the max length of a line inside the shape.

Applying this to the table on a daily basis and calculating a 7-day median value generates this graph for 2020:

After traveling cross-country around New Year’s, it was back to the daily commute with exceptions here and there. Then, during the week of 3/23, quarantine began and mobility dropped to zero. There’s a brief period of activity as restrictions were lifted in mid-June, but since then… not going anywhere.
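For completeness, here’s one way to compute that 7-day rolling median in BigQuery: collect a trailing window of values with ARRAY_AGG and take the middle approximate quantile. This is a sketch; it assumes one row per date in history_date, and as far as I know PERCENTILE_CONT won’t accept a window frame, hence the detour.

SELECT date,
       max_distance,
       -- approximate median of the current day and the six days before it
       (SELECT APPROX_QUANTILES(d, 2)[OFFSET(1)]
        FROM UNNEST(window_vals) d) AS rolling_median
FROM (
    SELECT date,
           max_distance,
           ARRAY_AGG(max_distance) OVER (
               ORDER BY date
               ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS window_vals
    FROM location.history_date
    WHERE date BETWEEN '2020-01-01' AND '2020-12-31') AS daily
ORDER BY date;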

As you’ve no doubt realized, this is just the tip of the iceberg. The JSON export from the beginning also contains semantic location data: mode of travel with waypoints; inferred locations with names, addresses, and confidence values; and Google Place IDs, which you can use to categorize all the locations you’ve visited.

That was sure a lot of work to make me feel depressed about the fact that I haven’t gone anywhere in four months.