June 2020 – virtu.is

Google Maps can be configured to record your location and show your history over time. If you don’t mind the privacy implications of this, it’s a powerful tool to see where you’ve been and how often you go to certain places. You can also use it to calculate historical travel times, visualize road trips, and so on.

Unfortunately, the stock interface leaves something to be desired. It’s hard to plot paths or to search by locations. You can’t get a list of all the times you’ve been to a certain place. There are some pretty interesting insights about your habits and behaviors that are locked up inside the interface.

I was thinking about the impact that quarantine has had on my travel habits. The stay-at-home order in California took effect here around March 19, but we had imposed mandatory work-from-home on the 16th. The only way I could see that in aggregate on Google Maps was the graph of visited locations.

Column graph showing number of places visited by day — Well, that’s depressing.

That got me thinking, is there a way to visualize this data over a longer period of time? Could I do more interesting things like calculate the maximum distance traveled from home on any given day?

Google Takeout

It is possible to get your location history exported as a single JSON. To do that, go to https://takeout.google.com/settings/takeout and click “Deselect All”. Then scroll down to Location History, recheck it, and make sure the settings look like this:

Export your data, and it will begin the process. It can take several minutes to hours to do this, but in my experience it was usually less than five minutes. Download the JSON file into a new folder on your desktop somewhere.

Loading to BigQuery

Unfortunately, the JSON that Google Takeout outputs isn’t directly compatible with BigQuery. BigQuery uses JSONL, which is basically a file consisting of a separate JSON object per line. This is so you can load massive datasets into BigQuery without invoking a parser. I guess Takeout didn’t get the message, because it generated one giant 600MB JSON object that choked several parsers as I tried to view it.

So. I guess we have to do some preprocessing on the file before BigQuery will accept it. Guess I’m breaking out the Node. I’m not sure how I’d want to handle this if the file were, say, 600GB, but we can handle 600MB locally without too much hassle. I just cobbled a quick script together that converts a history JSON file into a JSONL file.

const fs = require('fs')
const filename = './lh.json'

let jsonr = fs.readFileSync(filename);
let jp = JSON.parse(jsonr);
let array = jp.locations;

let newfile = filename.replace(/.json$/g, '.jsonl')
let stream = fs.createWriteStream(newfile);
array.map(x =>
    { 
        let y = { timestamp: parseInt(x.timestampMs / 1000),
                  lat: x.latitudeE7 / 10000000,
                  lon: x.longitudeE7 / 10000000,
                  accuracy: x.accuracy };
        // foreshadowing: think about timezones...

        stream.write(`${JSON.stringify(y)}\n`)
    
});
stream.end();

All this does is take a file named lh.json that holds the JSON-formatted Google Takeout file, loads it into memory, and converts each line to a separate JSON object. I also preprocessed the timestamps and latitude/longitude pairs into decimals for easier GEOGRAPHY conversion. This also has the effect of making the result file substantially smaller by shortening the names — the resulting file was only 203MB. I didn’t bother to gzip it, but I also have a fast Internet connection. If you don’t, gzip the file now before the next step.

Then I transferred it directly to Google Cloud Storage for loading into BigQuery.

gsutil -m cp lh.jsonl 'gs://my-bucket/lh.jsonl'

After that, it’s a simple bq call to get the data loaded into BigQuery.

bq load --source_format=NEWLINE_DELIMITED_JSON \
        --time_partitioning_field=timestamp \
        location.history \
        gs://my-bucket/lh.jsonl \
        timestamp:TIMESTAMP,lat:FLOAT,lon:FLOAT,accuracy:INTEGER

After I did that, I added additional columns for GEOGRAPHY and for the date in my local timezone — and then I realized my error. The partitioning and clustering aren’t going to help if they are always off by seven or eight hours, and since I want to look at data day-by-day, this could be a big problem. Well, how bad can it be…?

SELECT * FROM location.history WHERE timestamp BETWEEN ‘2017-01-01’ AND ‘2017-01-02’	This query will process 246.2 KB when run.
SELECT * FROM location.history WHERE date BETWEEN ‘2017-01-01’ AND ‘2017-01-02’	This query will process 260.8 MB when run.

Oh, only a thousand times worse.

That bad. Luckily, this is easy enough to fix by migrating the data to a new table and deleting the old one.

CREATE TABLE 
location.history2  (date DATE,
                    timestamp TIMESTAMP,
                    lat NUMERIC,
                    lon NUMERIC,
                    accuracy INT64,
                    point GEOGRAPHY)
PARTITION BY date;


INSERT INTO location.history2
SELECT date, timestamp, lat, lon, accuracy, point
FROM location.history;

DROP TABLE location.history;
-- bq cp location.history2 location.history
DROP TABLE location.history2;

Then I copied history2 back to history — this can be done via command-line or the UI, so I just used the UI to copy the table. Back in business!

Querying the Data

There are lots of ways to skin this cat. I thought the most interesting way to look at this, rather than distance from home, was to calculate the maximum distance inside the smallest polygon that encompassed all of my locations for the day.

INSERT INTO location.history_date
SELECT date, 
ST_CONVEXHULL(ST_UNION_AGG(point)) convex_hull, 

-- the maximum distance between two equivalent hulls is the longest vertex
-- inside of that hull
ST_MAXDISTANCE(ST_CONVEXHULL(ST_UNION_AGG(point)),
               ST_CONVEXHULL(ST_UNION_AGG(point))) max_distance

FROM location.history
WHERE accuracy < 100
GROUP BY date

This is a little abstract, but I had a couple thoughts in mind. One, I’ve moved several times since I started collecting location data, which means that distance from “home” is arbitrary across time. Two, I travel a lot too, and I want to show relative motion. So while a day I go to Chicago will have a large value, if I stay in Chicago the next several days, the values should be small again to reflect that I didn’t travel as much during the 24-hour period. This worked basically as I expected. For example, here’s the day I first saw Hamilton in Chicago:

From the excellent BigQuery geo viz at https://bigquerygeoviz.appspot.com/.

This isn’t totally accurate, and the “convexification” eliminates internal boundaries, but it’s close enough for our purposes, which is just to determine the max length of a line inside the shape.

Applying this to the table on a daily basis and calculating a 7-day median value generates this graph for 2020:

After traveling cross-country around New Year’s, it was back to the daily commute with exceptions here and there. Then, during the week of 3/23, quarantine began and mobility dropped to zero. There’s a brief period of activity as restrictions were lifted in mid-June, but since then… not going anywhere.

As you’ve no doubt realized, this is the tip of the iceberg. The JSON export from the beginning also contains the semantic location data, which includes mode of travel with waypoints, inferred location with name, address, and confidence values, and the Google Place ID, which you can use to categorize all the locations.

That was sure a lot of work to make me feel depressed about the fact that I haven’t gone anywhere in four months.

The wizarding world is notoriously haphazard. If a muggle school ran like Hogwarts, it’d be sued out of existence within the first term. Law and order is a one-way ticket to Azkaban. And the sorting hat labels eleven year olds for the rest of their lives as brave, sneaky, intelligent, or… well, whatever Hufflepuff is.

A major consideration for machine learning models is the concept of “interpretability”, namely, can you understand and explain why the model makes certain choices. The sorting hat has almost no interpretability, except on the rare occasions that it stubbornly defends its own choices as accurate. Can we make something that chooses as well as the Sorting Hat does, but is also interpretable?

First of all, in machine learning terms, it’d be called a “Classifying Hat”, since the type of ML model used to predict a choice given a set of input features is a multi-class logistic regression. So let’s build ourselves a Classifying Hat.

To do this, we’ll use Google Cloud Platform’s BigQuery Machine Learning capabilities, because you can do it in pure SQL directly from Google Sheets, and it’s really easy to train and predict. Since this isn’t a serious analysis (if that wasn’t obvious) there’s no reason to go crazy with the feature engineering. I follow pretty much the same set of steps for all BQML models anyway.

1. Find the Data

First we’re gonna need some good old Harry Potter data. I want a mix of features that would presumably be helpful in prediction, but nothing so detail-oriented that it would be hard to replicate for arbitrary sorting (i.e., if you wanted to see how the Classifying Hat sorts you.)

I settled on a mix of sources:

This Kaggle dataset has names, genders, wand, birthright, eyes, and hair. Since house selection seems to be significantly related to blood status and pedigree, this will probably help a lot. You also receive your wand before you get sorted, so there’s probably some useful predictive capability there.
This remarkable Harry Potter aggression analysis weighting characters in the book based on how often they appear and how aggressive their actions are. Since personality type is clearly a factor in sorting, this will be helpful. (Of course, we don’t know cause vs. effect here! It’s possible that being labeled so early in life influences your later actions.)
This compilation of related datasets includes a graph showing the relationships among all of the characters. I ended up not using this. Similar to how Facebook works, you could use this to predict a house for someone who “isn’t on the network” by seeing who they associate with. I didn’t use it because the books cover only the time students are at Hogwarts, and by definition they are going to be closely associated with others in their house, since they share sleeping and living quarters. This would be far more useful on the wizarding social network (“Facegrimoire”?) where you could easily determine someone’s house affiliation based on their connections.

I also pulled scripts from the first few movies and the texts of the books, for a secondary analysis on how people’s vocabularies are related to their house selections. That can be done at a later date (yeah, okay, I’m never doing that).

2. Prepare the Data

Now that we have enough data to start with, we’ll get it into a format we can create a machine learning model from.

The Kaggle dataset, while being super useful, was not well formatted, having semicolon delimiters and a ton of jagged arrays. I preprocessed it in Google Sheets and generated a table with blank cells instead of “Unknown”s, and cleaned up overly descriptive labels to make them easier for the model to bucket.

Character names also didn’t match among the various files due to spelling, use of aliases, alternate names, or extra middle names, so I used the aggression dataset’s convention of assigning a four letter slug (i.e. “harr” for Harry Potter, “tonk” for Nymphadora Tonks, “vold” for Tom Riddle, and so on.) and matched up the data.

Since there are only a couple hundred rows, I opted to do this largely manually using Google Sheets, where I could edit all of the values directly. As we’ll see in a second, it’s possible to run BQML models directly out of Google Sheets, which is pretty neat.

3. Load the Data

I placed each filtered “table” into a separate sheet of the Google Sheets file (you can choose to load from a specific sheet). Then I copied the URL to the clipboard and loaded the tables into BigQuery.

I didn’t realize it at first, but when you make a table from a Google Sheet, it’s actually a federated source. That means that it’s actually querying live from the sheet whenever you access the data. This also means there isn’t really any transactional integrity, so if you are making changes to the sheet while the query is running it sometimes generates internal errors.

Note also that whatever schema you supply will govern how far out the table goes. So if you only define three columns but your sheet has twenty, the table will only have three columns. If you edit the schema then the columns will automatically get populated with data from the corresponding sheet columns. Kind of weird.

I prefer to use a view as an input to my machine learning models so I can tweak the view without rebuilding the model, and can also look at the raw input. The CREATE MODEL statement can take several minutes, even on small datasets, and it’s easier to do preliminary feature selection and extraction before getting there.

Accordingly, my final CREATE VIEW statement looked like this. I didn’t bother to specify column names in the table schema and am just transforming them back out via the view.

CREATE OR REPLACE VIEW `hp.input`
AS
SELECT
string_field_0 ID,
string_field_1 named,
string_field_2 gender,
string_field_3 role,
string_field_4 house,
string_field_5 wand_length,
string_field_6 wand_wood,
string_field_7 wand_animal,
string_field_8 patronus,
string_field_9 race,
string_field_10 blood,
string_field_11 hair,
string_field_12 eyes,
string_field_13 loyalty,
string_field_14 skills,
string_field_15 birthdate,
fullname,
(SELECT SUM(mentions) FROM hp.aggression a WHERE a.abb = h.string_field_0) mentions,
(SELECT AVG(agg_weighted) FROM hp.aggression a WHERE a.abb = h.string_field_0) agg_weighted,
FROM hp.houses h
WHERE string_field_0 IS NOT NULL

Views (non-materialized ones, anyway) are still federated, so this will pull all the data together from multiple sheets and transform it every time you SELECT from it. That convenience will come in handy later.

4. Build the Model

Now I was ready to actually create and train the model. This meant transforming and selecting my final feature set and filters and building up a model. Per the above, I knew I wanted a multi-class logistic regression. This is a supervised model, which means I have to label input data with the correct class and use that to train the model. The typical split is 80/20, but for datasets with fewer than 500 rows, BQML defaults to using all the rows as training data. This isn’t great, since it raises the chances of overfitting. However, only ninety-odd people have been assigned to houses, so it’s all we get. (I was not inclined to sift through reams of fan fiction to plausibly tag any others.)

My first model creation statement:

CREATE OR REPLACE MODEL `hp.hm`
OPTIONS (model_type='logistic_reg', input_label_cols=['house'])
AS
SELECT * EXCEPT (id) FROM `hp.input` WHERE house IS NOT NULL

5. Evaluate the Model

The model ran for a few minutes (no hatstall here) and spit out the new model object. I went to the model evaluation, and lo and behold:

First attempt at ML sorting — The first model’s confusion matrix.

Hm. That smells like overfitting. Did I do something wrong? I was careful to remove ID numbers and tags that might bias the model. You could make a case that name shouldn’t be in there, and I did try that variation, but you could also make an argument that bias by last name is intentional by the Sorting Hat. Weasley? Gryffindor. Potter? Gryffindor. Malfoy? Slytherin. That may not be fair, but it probably does accurately reflect how the hat works.

Nonetheless, it still seemed a bit too on-the-nose, so I went back and checked what other columns I might have missed. Well, “loyalty” sure stood out. It turns out that if your loyalty includes “Death Eaters”, you are overwhelmingly likely to be in Slytherin. I nixed that column. I also nixed “species”, which was basically all human except for Hagrid, who was listed as “half-giant”, and immediately caused all non-humans to go to Gryffindor too. (That might also be right.)

For my second attempt, I made a few more changes. I compromised on sending only last name instead of full name. I also bucketized wand length by 2, so that 8-10 inches was a group, 10-12, and so on. Lastly, I removed birthdate extractions entirely. I thought zodiac sign or decade might be interesting, but the data is too vague to be of much use.

CREATE OR REPLACE MODEL `hp.hm`
TRANSFORM
(
named,
gender,
ML.QUANTILE_BUCKETIZE(CAST(wand_length AS NUMERIC), 2) OVER() wand_length_bucketized,
wand_wood,
wand_animal,
blood,
hair,
eyes,
house,
)
OPTIONS (model_type='logistic_reg', input_label_cols=['house'])
AS
SELECT * EXCEPT (id, role, race) FROM `hp.input` WHERE house IS NOT NULL

A bit concerning; it’s still basically perfect.

If this were a real-world data set, I would now be extremely concerned about implicit bias and the possibility that the model was making judgments purely superficially in a totally unfair way. I do think that’s happening, but I also think the Sorting Hat does too (and this kind of proves it) so we’ve reached the standard for novelty level.

As a last step, I took surname out of the model just to see how much that was really skewing things.

I guess there really is a secret class system.

6. Make Some Predictions!

Making a prediction is easy: we just select from ML.PREDICT with our model, and it’ll add the target feature prediction to the table. We have a bunch of characters in the table who don’t have houses, either because it was never defined, they’re in schools other than Hogwarts, or they’re muggles. Who cares, let’s predict the rest of the table.

SELECT house, predicted_house, fullname 
FROM 
ML.PREDICT(MODEL `hp.hm`, (SELECT * FROM `hp.input`))
ORDER BY house NULLS FIRST

This yields some entertaining and plausible results such as sorting Dobby the house elf into Hufflepuff and Nurse Pomfrey into Slytherin. It also places all the Dursleys in Gryffindor, oddly. You can add your own examples and see how it makes its decisions, which is at least one thing the Sorting Hat can’t/won’t do!

Next Steps

I was recently playing around with the use of Google Forms to automatically drop rows into Google Sheets, and then to write some Apps Script to do things with that. It wouldn’t be too hard to take inputs from a Google Form and trigger an automatic BQML prediction based on new data. In other words, you could turn this ML classifying hat into a charming little quiz!

It would also be interesting to drop this into a k-means model and see if it segments people into four distinct houses, or if it has something else in mind.

The Google Sheet with my data is here, so if you want to give it a spin, all you need to do is make a copy to your own Google Drive and import it into a BigQuery table (use the rightmost sheet, “Houses”.) Have fun!

This analysis was used as the basis for a talk at SQL Saturday Los Angeles on June 13, 2020. It also draws from my book, BigQuery for Data Warehousing, out now!

Month: June 2020

Google Maps Location History with BigQuery

Google Takeout

Loading to BigQuery

Querying the Data

The Classifying Hat — An ML Harry Potter Sorting Hat with BigQuery

1. Find the Data

2. Prepare the Data

3. Load the Data

4. Build the Model

5. Evaluate the Model

6. Make Some Predictions!

Next Steps