virtu.is

Is using ChatGPT to write breaking your brain?

by Mark2025-06-242025-06-24 No comments

This MIT study seems to point that way. And it certainly reflects a feeling I’ve had a lot lately. What’s the cost when instead of pushing through writer’s block, we just route around it? I have to start writing “brain-only” again, per the study.

Look, I’m not opposed to LLM-assisted text generation. I’ll make a unique claim, which is that I used GPT (albeit minimally) to write a sentence in my book five years ago.

Originally, I kept going, opining GPT would soon write paragraphs and maybe even whole pages. Then I edited it out because I found it overly effusive and unlikely to age well. Turns out even my wildest prediction fell short.

I’m already inundated by LLM writing. I use that word with the etymological connotation of drowning. It’s weird and creates discontinuity. It’s opaque: I can’t read between the lines to connect with the writer. And it’s not (yet) a more refined version of an individual’s expressive capability. It lands same-same: not the organic evolution of an idea. Some statistical representation of that idea with the rough edges smashed in through extrusion. (“You’re absolutely right!” I hear GPT respond in my head, despite explicitly not being consulted.)

Two things. One — despite my use of em-dashes, which I’ve regretfully cut back on — this and all of my subsequent musings will be written without AI. I don’t think this makes me a luddite (perhaps a Luddite; more on that later.) Rather, we should do this for the same reason we do crosswords without dictionaries, go for a jog despite cars, or draw when we have cameraphones. We shouldn’t allow our physical strength to diminish solely because we can offload it to tools. The same is equally true for our cognition, which, after all, is also a physical process. An AI can simulate ambition; we still have the market cornered on endeavor.

Two: LLMs do not speak English. They do not speak language. They model it, right there in the name. (“You’re making a very valid point, and it’s one few people understand,” interjects GPT.) Language is just the veneer, the domain language over statistical embedding. And yet, even that take borders on dangerously reductive. LLMs can’t be dismissed as complicated math problems. To wilfully ignore their potential is to disadvantage oneself. Change is coming, for the better, and the worse. Inevitable.

So. If you’ve already grown tired of digesting all the AI pabulum out there… every book written before 2020 awaits. You’re set for a lifetime. On the other hand, if you, like me…

Wordle

The Map of Wordle

by Mark2022-01-082022-01-08 1 comment

I’ve been busy doing all kinds of Wordle analysis — as I mentioned in the last post I’ve been testing various strategies to assess the “difficulty” of Wordle. This led to some interesting speculation about which word is best to guess first. Obviously, the best word to guess first is the correct word, but you can’t do that better than chance (unless you have the oracle, in which case you’re cheating.)

Another interesting question I had is, how many possible responses are there to a guess? The entire set is { green, yellow, nothing } across five letters, so 243. However, five of those possibilities are invalid, since one yellow can’t happen — if you have four of the letters right, by definition the fifth letter cannot be out of position.

So I mapped every possible guess to every possible answer to look at how many responses each first guess gives, on average, against every answer. This is an answer set of (2315 * 2315), or about 5.3 million possible guess/answer combinations. I did a bunch of analysis I’ll share later, but then I got an interesting idea. Can we visually plot this map?

The image above is exactly that — a 2315×2315 image showing every guess along the X axis and every answer along the Y axis. I simply averaged the (r,g,b) values across the letters. So five yellows becomes pure yellow, the correct answer becomes pure green, no correct letters is black, and all other combinations generate some color in the yellow/green spectrum. The idea was simple enough. I quickly ran through the file and generated a P3 PPM, then converting it to PNG on the command line.

I expected to see a bright green diagonal line and sparser areas away from it. I also reasoned that guess/answer pairs would be sort of symmetrical, in that guessing A for B would give something similar to guessing B for A.

What I did not expect was this rich map of words colliding into the diagonal axes, pronounced squares coming out of the diagonal axis, and yellow and green swirls showing words that are probably related. This isn’t a map of how closely related words are to each other in any sense other than letters in their positions, and it only uses the words that can be actual Wordle answers, not the full guess set. So it really is a map of how Wordle starts, the very top of the decision tree in two dimensions.

I want to look at it more closely and talk about what generates the columns, but I was taken aback by the simple beauty of this image and wanted to share it before doing anything else. Enjoy 🙂

UPDATE: Since this image represents one word in each axis with a single pixel, I realized that the sizes of the squares probably correlates to the letters of the alphabet. I applied a hue modification to each letter combination, which produced this:

Same image — one pixel per guess, but colors have been tinted by the first letters of the words they represent.

This image is less helpful from a word perspective, but it sure is pretty!

Wordle

Wordle Strategies Analysis (Part I)

by Mark2022-01-062022-01-07 No comments

In my last post, I began to explore the inner workings of Wordle. As a friend pointed out, it has all the hallmarks of a good game, because even understanding the decision trees didn’t diminish the fun I had playing it this morning as usual.

I added a bunch of stuff to my script to allow for any solver algorithm to test every five-letter word and get the overall performance of any given strategy. Strategy is kind of an interesting thing with respect to this game, as there are a finite number of answers and they don’t shift too much over time. So there isn’t really any advantage to writing a general-purpose strategy that works on all 11 million five letter combos when there are only 8,938 words in the list.

Dictionaries

It’s probably worth going into some detail on what the potential word lists even are. Thanks to some sleuthing by Tom, we can actually see the answer list Wordle uses inside its code. The New York Times article on the Wordle phenomenon also tells us helpfully that the creator’s partner, Palak Shah, went through the full list and selected about 2,500 words familiar enough to be used as answers.

This is a similar approach to the New York Times’ own Spelling Bee, which is at least in my opinion aggravatingly arbitrary about which words it will accept. Luckily in Wordle’s case, fewer words is better, and it does accept the larger list in guesses, even though that means 80%+ of valid guesses can never be the answer. (That’s okay, since we already learned that gray letters are our friends, probably worth exploring more deeply.)

Our candidate dictionaries, therefore, are something like the following:

Collins Scrabble Words, the official word list used by Scrabble players in most of the world (notably, not the US.) As the article describes, this word list consists of lists merged together from British and American sources. Wikipedia says this list has 12,972 five-letter words, although I haven’t counted them yet.
OTCWL2, the shorter dictionary I chose for my solver script. This dictionary has 8,938 five-letter words, omitting the most obscure (and offensive) choices while leaving a lot of latitude to produce guesses.
Wordle’s full list of valid guesses, which has 10,657 options. I’m not sure what the source is, but it didn’t match with any of the other word lists I had available. (Perhaps some manual removals for various reasons?)
Palak Shah’s list, the hand-picked selection of 2,315 words (that’s six years, four months of daily games) that comprises all possible answers. WARNING: this list goes through the answers IN ORDER. Unless you want to ruin Wordle for yourself forever, don’t read it. I alphabetized the version attached to my script to avoid this horror, but not before realizing in checking the list that I’d spoiled the next several days for myself. (Appeal to Mr. Wardle — could you at least hide a hash algorithm in there to make it slightly harder to see all the answers?)

I don’t know how aligned the actual answer list is with word frequencies. Is it cheating on the solver’s part to only guess words that might be the answer, if that word list is missing some common words that might otherwise be good guesses? (I think this probably skirts the line between general- and specific-purpose?)

Solver Strategies

Well, as I’ve just learned, the best solver strategy would be to calculate the number of days since June 19, 2021, index to that line of the answer list, and guess that. Also eating my words somewhat here about the value of a general-purpose algorithm, since a specific-purpose algorithm now has perfect performance in the real world.

But let’s elevate ourselves from the actual Wordle to the platonic ideal of Wordle, where the answer could be any five-letter word from the list, and look at some actual strategies. I came up with a couple of obvious approaches to start.

Naive — use the same strategy to pick each next guess. For example, always choose the first valid answer, the last valid answer, or the valid answer from the center of remaining possibilities. (You could also choose non-deterministically, but that won’t give consistent metrics over runs.)
Naive with naive optimization — choose the next word without reference to anything except the list, but pick a word most or least similar to previous guesses by some criterion. For example, always choose the word with the fewest seen letters.
See all letters as quickly as possible — some people use a strategy to ignore the information from the first word and guess five new letters for the second word. My solver isn’t designed to do anything except use all information from previous clues, so I’m not directly testing this, but it has some interesting characteristics. For example, floor barring lucky guesses is always 3, even if you could reasonably get it in 2. But on average it should perform better because it avoids the long tail of 1 unknown letter and many possibilities (i.e., 🟩🟩⬜🟩🟩: CAFES, CAGES, CAKES, CANES, CAPES, CARES, CASES…) This statement should be equivalent to the statement “solver always plays in hard mode”, unless there are bugs. I haven’t verified that.
Letter distribution — choose the next word so as to maximize the most common letters remaining.
Decision tree pruning — choose the next word which eliminates the most possibilities given the available information. Recursively: do this all the way to the leaf nodes and pick the most promising tree. This is fairly standard game theory.

When getting at the objective “difficulty” of Wordle, the performance of naive strategies is pretty interesting. Let’s also define performance to give ourselves some sort of baseline. These statistics will apply to a given strategy’s results across the entire list of five-letter words (

Average, median, and range — it seems the most basic metric is just the average and median number of guesses the solver takes. It’s also useful to know the minimum and maximum number of guesses the solver may take, in case it has incredibly poor (or good) outlier stats.
Number (or percent) of failures — since the goal of Wordle is to guess the word in six or fewer guesses, the frequency with which the strategy can’t do that is also interesting.
Number (or percent) of successes in n=4,3,2… — it’s also good to know the distribution of successes across the three major levels. 1 guess is a fluke; 5 and 6 are successful but not especially exciting.

Total for these sequences is 8,938, the number of five-letter words in my arbitrary word list selection.

Naive Strategy Statistics

I ran three naive strategies to assess baseline performance. There’s no real calculation here — a player using this strategy would have perfect knowledge of how to play the game, perfect knowledge of the corpus, and yet select answers that are likely bad for other reasons — too obscure, poor remaining letter distribution, no educated guesses at the presence of double letters, et al. Note that despite this, all strategies will always use all information from previous clues to make their next guess, which makes them more effective than one might guess.

In the next post, I’ll go over the statistics for the naive strategies and begin to address the question “how ‘hard’ is Wordle?” As a spoiler, these naive solutions all have success rates in the 80% range, meaning that having perfect knowledge and terrible strategy still allows you to win most of the time.

Wordle

Analyzing (and Solving) Wordle

by Mark2022-01-052022-01-07 No comments

I published some proof-of-concept Python code today while thinking about the inner workings of Wordle. Plenty has already been written about the game itself, its origin and appeal, and some solvers have been written as well. While my verbal brain super appreciates that, I’ve been doing a lot of Sudoku lately and the similarities in problem solving were fascinating. I got thinking really hard about how Wordle is put together from a math and probability standpoint.

What are the building blocks?

Wordle combines elements of a lot of games I enjoy — Mastermind, Wheel of Fortune, crosswords, and so on. But it’s really none of these, and that’s what makes it so interesting.

Superficially, it seems most similar to Mastermind. But with Mastermind, all potential combinations are equally likely, in theory, anyway. When playing against a human opponent, they may choose patterns they expect the player to fail to think of, and there’s some Roshambo-like strategy in faking someone out. Unlike Wordle, though, no one agrees ahead of time that only certain patterns are acceptable. In fact, there’s no way (apart from Sudoku and its variants, more on that later) to achieve that ahead of time except on something like a language corpus. So there are immediately interesting constraints that everyone can understand without a complicated ruleset.

It’s also like Hangman or Wheel of Fortune in one interesting way. Obviously neither of those games have the “not in this position” mechanic, but they do have another thing that other word games are less likely to have, leading to the thing I find really interesting about this game.

Negative information

Wordle is interesting because even guessing words that don’t match at all provides useful information about what does remain. Wheel of Fortune and Hangman have this concept, in that you know some things that won’t fill the available letters, This is the source of the humor for most of the Wheel of Fortune “fail” videos — a lot of times the contestant’s guess can’t be accurate, because it contains letters that have already been guessed.

Wordle has another interesting feature that tripped me up in designing the solver — you don’t know how many of a given letter there are unless you guess that many copies. In Wheel of Fortune, naming a letter immediately places all of those letters and by definition you know there are no more. Some of the harder Wordle solves result from needing to guess a double letter — and this is also why you could still run out of turns while guessing entirely on negative information.

However, consider the following sequence. The answer is NYMPH, which conveniently has no always-vowels.

READS ⬛⬛⬛⬛⬛ 698 possibilities
FRUIT ⬛⬛⬛⬛⬛ 123 possibilities
BLOCK ⬛⬛⬛⬛⬛ 2 possibilities

The only words that fit these criteria are NYMPH and PYGMY — despite us never even guessing a single letter in any position! I wouldn’t even use the latter word in casual speech, so there’s only one real answer here. We derived this entirely through negative information.

Questions

Ultimately, what drove my interest is that it seems to be totally possible to get any given word in six guesses. Is that really the case? Are there sequences that you could reasonably guess with perfect information that wouldn’t get you there? What if you just pick a random or naive strategy and apply it blindly?

Rather than solving single words on a daily basis, I want to solve all the words. I’d love to understand the nooks and crannies of the English language that create really absurd patterns, like guessing NYMPH with no correct letters. I also wonder how this extends to longer patterns — does the game get harder, and then easier at higher levels? How difficult could it be to come up with a 15 letter word when there are so few to choose from?

As with many things, the creator seems to have hit on something that works really well — 5 letters in 6 guesses. What kind of tuning could be done to “prove” this is an optimal combination of a certain difficulty? Can you make it harder or easier (aside from the perfect knowledge aspect embodied in Wordle’s “hard mode”)? As I mentioned above, are there words that are just harder or easier depending on your strategy?

I left a couple of exercises for me or the reader to continue with in the readme — reproduced here:

How “good” is a given guess against the current word? Against all Wordle words? Against any English word?
What strategies are most effective across these sets?
How “hard” is this game? What’s the average tree depth to get to a single word?
How many ambiguous words have the same results right up to the solution (i.e., LIGER-TIGER)? What’s the maximum tree depth of ambiguous words?
Can naive solvers win every time with 6 guesses? What’s the min/max for any given algorithm?
What words minimize initial guesses? (Cracking the Cryptic suggested RISEN/OCTAL to cover top letter frequencies. Tom’s solver produced RAISE as the best first guess.)

No doubt someone else will do this work before I have a chance to get to all of it! If you’re one of those people, please take my scripts with my blessing and let me know how it goes!

Agile
D&D

Gamifying Standup with D&D

by Mark2020-09-252020-09-25 No comments

As day 200 of pandemic life approaches, even the introverted engineers among us are starting to lose their marbles. A couple of months ago I started a standup rotation where each person takes a week at being scrummaster. One of the engineers, Will, decided to take this to a whole new level, and became a dungeon/scrummaster, combining standup with a one-shot D&D campaign related to the actual work we were doing in the sprint.

We couldn’t find any evidence that roll20 has ever been used to facilitate standup, so I thought I’d document it for posterity. First, we rolled for initiative to determine the order people would give their standup update in. Then, Will modified the questions slightly to work for both standup and D&D:

What did you do? (Standup)
What do you do? (D&D)

Will played fast and loose with the rules, since we have nine people in our standup and most people didn’t want to go through the trouble of rolling a character. Of course I had to make one, who I named Torb and styled after our Slack chatbot:

As the week went on, we tried (and failed) to slay the client dragon, guarding its pile of revenue. One of the product managers Leeroy Jenkins-ed his way at the dragon, and promptly received a debilitating fire breath in return.

Ultimately Will drew this week’s standups to a close by having us fight a mimic disguised as product requirements. Once we untangled ourselves from the requirements, we saved the dragon’s servants from a life of tedious data entry.

I see no reason why you couldn’t do this every week, and have your team’s updates actually become their D&D actions. No one seems to have built a Jira/Roll20 integration yet.

The stats for the dragon. © Will Garrett, used with permission.

BigQuery

More Google Maps Semantic Location History Fun with BigQuery

by Mark2020-07-192020-07-19 No comments

In the last post, I covered the basics of how to transform and load your Google Maps data, including Google’s parsed history data, into BigQuery.

In this post, I’ll share a few interesting queries I’ve written to look at semantic history. These should work just fine for you too if you follow the steps in the previous post.

Visited U.S. States

This query will show when you first visited a state. Google Maps needs to have placed you at a specific location you visited for it to appear in this list, so if you just drive through a state it may not register. You could get that data with your raw location history, and I’ll probably write that up later, but this assures that you at least “went” somewhere in the state.

WITH statedate AS 
(SELECT sl.StateAbbrev state, 
        sh.timestampstart tstart, 
        sh.timestampend tend
FROM location.semantic_history sh
INNER JOIN location.statelookup sl
ON sh.address LIKE(CONCAT('%, ', sl.StateAbbrev, ' %')))
SELECT state, 
       COUNT(*) locations, 
       MIN(tstart) firstVisit, 
       MAX(tend) maxVisit
FROM statedate
GROUP BY state
ORDER BY 2

This query requires a supplementary table, which I named statelookup. I just grabbed the query from https://gist.github.com/esfand/9443427 and modified it for BigQuery (delete identity column and change the VARCHAR/CHAR columns to STRING.) Tons of the public data sets have states available so you could join to one of those too.

This will return a list of every state you’ve been to within your location history, ordered by the first time you went somewhere in that state. You can also tack on an ANY_VALUE(name) to pick a random place in that state that you went to. (It won’t actually be random, because repeated visits increase the weight, so statistically you’ll probably get a familiar place.)

Fill-Up Frequency

This query will show (almost) every time you went to a gas station, although it can be extended to any other Google Places category.

SELECT name, duration, date, pointstart 
FROM location.semantic_history
WHERE 'gas_station' IN UNNEST(locationTypes)
ORDER BY date

I noticed that it missed a bunch of rural places in which I have definitely purchased gas, so you might consider an OR filter with common names of gas stations.

OR REGEXP_CONTAINS(ARRAY_TO_STRING(locationTypes, ' '), '(phillips|valero|chevron|shell|sheetz|texaco)')

Add whatever gas station names are common in your area. Some may also be missing because you didn’t remain at the place long enough for Google to register it; I’ve seen that issue with other locations too. I think some of these show up as waypoint locations, but these queries don’t account for that.

Any Other Kind of Location

You can take the query above and replace ‘gas_station’ with a bunch of categories. Here are some examples:

airport
gym
grocery_or_supermarket
electronics_store
restaurant
movie_theater

The full list can be found here. You actually can use most of the types it says you can’t use in the type filter, like point_of_interest or food, because we obtained these locations from ID mapping, not from a search.

Google is pretty inclusive for these tags, so Target, Staples, and even a piano store I went to come up as electronics_store.

Favorite Restaurants

As a variation on this, you can see which restaurants you dine out at the most, and on average how often you visit them. Using placeid as the GROUP BY here ensures that different locations of the same chain are separated out.

DECLARE startdate DATE DEFAULT '2018-01-01';
DECLARE enddate DATE DEFAULT '2019-12-31';

SELECT ANY_VALUE(name),
       COUNT(*) visits,
       ROUND(DATE_DIFF(enddate, startdate, DAY) / COUNT(*),0) frequency
FROM location.semantic_history
WHERE 'restaurant' IN UNNEST(locationTypes)
AND date BETWEEN startdate AND enddate
GROUP BY placeid
ORDER BY 2 DESC

Lastly, replacing the visits column with a duration check will show you your favorite places ordered by how many hours you spent there. There probably won’t be a huge difference unless you frequently disappear into your favorite bars for hours on end.

ROUND(SUM(duration)/3600, 2) totalTime,

Wrap Up

Other than the state abbreviation table, these queries don’t use anything besides the semantic data upload we already had. We didn’t even use any GIS queries to plot maps or intersections, which would give us all kinds of power to join other kinds of data.

For instance, you could join to weather data to see your experience with outdoor temperatures over time, by your personal location. If you missed a heat wave at home, your personal weather history would show that. You could even plot how close you were to every reported hurricane. Lots of fun data mining yet to come!

BigQuery

Google Maps Semantic History in BigQuery

by Mark2020-07-102020-10-07 No comments

In this post I talked about getting your location data from Google Maps and loading it into Google BigQuery. I mentioned that Google also provides its semantic history. This data structure includes Maps’ inferred place names and addresses and its guesses about mode of travel. It divides the data into two main event types, activitySegment and placeVisit.

placeVisit describes time spent stationary within a given latitude/longitude bound, and includes the Google Place ID, address, name, and a confidence score.

activitySegment includes start and end coordinates, as well as the waypoints and raw path, duration, and an activity type (“IN_PASSENGER_VEHICLE”, “IN_BUS”, “IN_TRAM”, et al.) with associated confidence score.

In order to load this data into BigQuery, I wanted to transform it to JSONL as before, but also to unify the two event types into a single structure so they can be queried together. Most of the important pieces, like location and duration, apply to both. This also allows for a single date partitioning scheme, so that you can query a single partition to get all activity for a single day. (You could easily store multiple people’s data in the same structure and continue to leverage the same partition scheme.)

Defining the Schema

The JSON schema / BigQuery table definition I decided on was:

Column	Type	Description
date	DATE	Local event date (adjusted for timezone)
type	STRING	“VISIT” for a placeVisit and “TRAVEL” for an activitySegment
latStart	FLOAT	The starting latitude coordinate of the event
lonStart	FLOAT	The starting longitude coordinate of the event
latEnd	FLOAT	The ending latitude coordinate of the event (“VISIT” is the same as start)
lonEnd	FLOAT	The ending longitude coordinate of the event (“VISIT” is the same as start)
moveType	STRING	The activity type (“IN_VEHICLE”, etc.) “VISIT” always has “NONE”
distance	FLOAT	Distance traveled in meters during the activitySegment. 0 for “VISIT”.
semType	STRING	Google’s semantic type for the location. (“TYPE_WORK”, “TYPE_HOME”)
placeId	STRING	Google’s place ID for the VISIT.
address	STRING	The geocoded address for the lat/lon pair for the VISIT.
name	STRING	The name of the location (“Wendy’s”, “Starbucks”, “The Colosseum”)
confidence	FLOAT	The confidence score for the named location
timestampStart	TIMESTAMP	The fixed UNIX epoch timestamp for the beginning of the event, in seconds
timestampEnd	TIMESTAMP	The fixed UNIX epoch timestamp for the end of the event, in seconds
duration	INTEGER	(timestampStart – timestampEnd), for convenience
latCenter	FLOAT	VISITs have an extra lat/lon pair, which I loaded here just in case
lonCenter	FLOAT	VISITs have an extra lat/lon pair, which I loaded here just in case
waypointPath	GEOGRAPHY	The coordinates of any waypoints along the activitySegment
rawPath	GEOGRAPHY	A simplified linestring showing the path for activitySegments
pointStart	GEOGRAPHY	The geography point for latStart/lonStart
pointEnd	GEOGRAPHY	The geography point for latEnd/lonEnd
pointCenter	GEOGRAPHY	The geography point for latCenter/lonCenter

Tabular form of the schema.json file.

I kept the non-geographic columns to mirror the source data, but they’re redundant to ST_X(pointStart) or ST_Y(pointStart). Date, however, is not — whereas the raw timestamp values give the fixed start and end values for the activity, Date is localized to the appropriate timezone. In fact, using a nifty npm package called geo-tz, it will actually automatically look up the timezone in the lat/lon the event is in! So even as you move between timezones, it will still place the event in the appropriate local date. If it can’t find one, it just defaults to my local timezone, “America/Los_Angeles”.

Coding the Script

Unlike last time, I wanted both a one-stop shop and an idempotent module that I could run over and over again as I worked on the data, without having to use the UI to repeat manual steps. I wrote a nodejs script that performs the following steps:

Transform all semantic history files into JSONL format
Zip and load file to Google Cloud Storage
Create the BigQuery table according to the defined schema (dropping and rebuilding each time)
Load the GCS file to BigQuery
Perform any necessary post-processing in BigQuery to clean the data

There weren’t a lot of challenges here, just your regular data transformation pipeline. Google Takeout doesn’t have a native API (there are workarounds) so I didn’t bother writing a repeatable pipeline to refresh the data monthly or weekly. (And for those of you who read my book, note that despite writing the book in Python, I’m more comfortable in Node for personal projects. I chose Python for the book because it’s compatible across Jupyter notebooks, Dataflow, Google Cloud Functions, and App Engine and I didn’t want to be jumping around in languages. Despite that, I still ended up with Python, BQ SQL, ZetaSQL, MySQL SQL, Javascript, and Java…) A couple of interesting notes though.

While Takeout provides the regular location history as a single, massive file in JSON or KML, the semantic versions are segmented into monthly files, i.e. 2020_JANUARY.json, 2020_FEBRUARY.json, etc. I just used a single glob wildcard to grab all JSONs in the subfolders, so you can dump your “Semantic Location History” folder directly into the script folder. Don’t mix in any other JSONs, though because it just uses ‘./sem/**/*.json’ as the glob pattern.

There doesn’t seem to be any way to actually gzip files for use with the BigQuery Node SDK. I didn’t do a lot of research on this, but I couldn’t get the SDK to take the file if I gzipped it. The semantic files are much smaller — mine were 8MB unzipped and 2MB gzipped, so I didn’t fiddle too much with this and just commented it out.

Longitude/latitude are stored as E7 values in the Takeout JSON, so I simply divided by 1E7 to restore the normal decimal format. Timestamps are stored with millisecond precision, so I divided by 1000 to load them as BigQuery timestamp files.

Loading GEOGRAPHY types from JSONL is kind of interesting too. Loading ST_POINTs isn’t too bad; I just used a string template in the form `POINT(${lon} ${lat})` and BigQuery was happy. MULTIPOINTs and LINESTRINGs were a little harder, mostly because the data often repeats the same point or contains no unique points at all. The comma and parentheses are also strict, so it requires a bit of string finagling to get it working right. Ultimately I used these WKT string formats:

Point	POINT(30 10)
Multipoint	MULTIPOINT ((10 40), (40 30), (20 20), (30 10))
Linestring	LINESTRING (30 10, 10 30, 40 40)

WKT formats BigQuery is okay with

You will need a GCP service account key with permissions to create/delete Google Cloud Storage objects and permission on BigQuery to run jobs, create, and delete tables in the dataset. Download it from the GCP IAM console and save it as service-account-key.json in the script folder to get access.

You also have to make your own GCS bucket, which you can do with gsutil or the Cloud console.

Running the Script

The script documents its progress, although it only takes about 45 seconds to run from my local machine with my file set. The majority of that time is spent in loading the file into BigQuery from GCS, which doesn’t take place on my machine anyway.

The script doesn’t have any real handling, so it’ll likely throw an UnhandledPromiseRejectionWarning if something goes wrong. If everything works correctly, you’ll be left with a yellow “Finished!” prompt and you can move into BigQuery.

Note also that there is an empty postprocessing step in which you could add any SQL you want to run to clean things up. For instance, some rows have starting and ending coordinates, but don’t have a distance, so you might want to UPDATE the ST_DISTANCE for those rows for cleaner distance statistics:

UPDATE `location.semantic_history`
SET distance = ST_DISTANCE(pointStart, pointEnd)
WHERE type = 'TRAVEL'
AND distance IS NULL

But now you have a neat table that contains all of your location history in an easily queryable form, with lots of helpful metadata.

Google Place IDs

Enabling the Places API and using an API call in the format

https://maps.googleapis.com/maps/api/place/details/json?place_id={PLACE_ID}&key={GOOGLE_MAPS_KEY}

gives you all of the Google Maps data on a given place, including operating hours, price level, the geometry for the whole place, reviews, and a bunch of other stuff. You could join this data together with your location history for some really interesting queries like “how many places was I at after hours?” or “how many expensive restaurants do I go to per month?” I’ll tackle this part in a later post, using the Node SDK for Google Maps. You can cache this data too, so I can probably build some sort of rudimentary table that lets me collect and load places data wherever it exists.

Interesting Queries

Even with just duration, latitude, and names, I can do some pretty neat things with this data. Here are just a few examples I tried.

How many times do I go to Starbucks each week?

WITH weeks AS
    (SELECT DISTINCT(DATE_TRUNC(a, week)) w
     FROM UNNEST(GENERATE_DATE_ARRAY('2014-01-01','2021-01-01')) a),
     
     times AS
    (SELECT DATE_TRUNC(date,week) w, count(*) count
     FROM location.semantic_history
     WHERE LOWER(name) LIKE '%starbucks%'
     GROUP BY DATE_TRUNC(date,week))
     
SELECT weeks.w, IFNULL(times.count,0)
FROM weeks
LEFT OUTER JOIN times
USING(w)

Or any other place, just change the “WHERE LOWER(name) LIKE” to whatever clause you want. How about number of hours spent on planes?

SELECT SUM(duration)/3600 FROM location.semantic_history
WHERE moveType = 'FLYING'

SELECT rawPath from location.semantic_history
WHERE moveType = 'FLYING'

The second query can be passed directly to BigQuery Geo Viz for a full visualization of your flights. Google only began tracking the activity type somewhere in 2016, so data before that won’t have it. This also seems to be missing a bunch of flights I’ve taken since then, so it probably needs refinement. Or, how about all trips ordered by decreasing speed in miles per hour?

SELECT (ST_DISTANCE(pointStart,pointEnd)/duration)*2.23694 mph, * 
FROM location.semantic_history
WHERE distance > 200 AND duration > 0
ORDER BY mph DESC

Noting that the maps distance is sometimes wrong, I recalculated it to do this math. Even that yielded some amusing outliers, no doubt due to poor cellular or GPS signal. Apparently, on one New Year’s Eve, I traveled from SFO to the middle of the Pacific Ocean at 930 mph. (This appears to have been Google thinking I landed at SFO when I actually landed at LAX.)

Lastly, you can generate the Google Maps direction string for any trip — not the route you actually took, just the default route so you can visualize the locations:

SELECT FORMAT('https://www.google.com/maps/dir/%s,%s/%s,%s/', 
       CAST(latStart AS STRING), CAST(lonStart AS STRING), CAST(latEnd AS STRING), CAST(lonEnd AS STRING)) mapsString
FROM location.semantic_history
WHERE type = 'TRAVEL'

Using the Code

The code for all of this is available on github at https://github.com/mark-mucchetti/bq-location. Feel free to download and modify for your own use!

BigQuery

Google Maps Location History with BigQuery

by Mark2020-06-192020-09-25 No comments

Google Maps can be configured to record your location and show your history over time. If you don’t mind the privacy implications of this, it’s a powerful tool to see where you’ve been and how often you go to certain places. You can also use it to calculate historical travel times, visualize road trips, and so on.

Unfortunately, the stock interface leaves something to be desired. It’s hard to plot paths or to search by locations. You can’t get a list of all the times you’ve been to a certain place. There are some pretty interesting insights about your habits and behaviors that are locked up inside the interface.

I was thinking about the impact that quarantine has had on my travel habits. The stay-at-home order in California took effect here around March 19, but we had imposed mandatory work-from-home on the 16th. The only way I could see that in aggregate on Google Maps was the graph of visited locations.

Column graph showing number of places visited by day — Well, that’s depressing.

That got me thinking, is there a way to visualize this data over a longer period of time? Could I do more interesting things like calculate the maximum distance traveled from home on any given day?

Google Takeout

It is possible to get your location history exported as a single JSON. To do that, go to https://takeout.google.com/settings/takeout and click “Deselect All”. Then scroll down to Location History, recheck it, and make sure the settings look like this:

Export your data, and it will begin the process. It can take several minutes to hours to do this, but in my experience it was usually less than five minutes. Download the JSON file into a new folder on your desktop somewhere.

Loading to BigQuery

Unfortunately, the JSON that Google Takeout outputs isn’t directly compatible with BigQuery. BigQuery uses JSONL, which is basically a file consisting of a separate JSON object per line. This is so you can load massive datasets into BigQuery without invoking a parser. I guess Takeout didn’t get the message, because it generated one giant 600MB JSON object that choked several parsers as I tried to view it.

So. I guess we have to do some preprocessing on the file before BigQuery will accept it. Guess I’m breaking out the Node. I’m not sure how I’d want to handle this if the file were, say, 600GB, but we can handle 600MB locally without too much hassle. I just cobbled a quick script together that converts a history JSON file into a JSONL file.

const fs = require('fs')
const filename = './lh.json'

let jsonr = fs.readFileSync(filename);
let jp = JSON.parse(jsonr);
let array = jp.locations;

let newfile = filename.replace(/.json$/g, '.jsonl')
let stream = fs.createWriteStream(newfile);
array.map(x =>
    { 
        let y = { timestamp: parseInt(x.timestampMs / 1000),
                  lat: x.latitudeE7 / 10000000,
                  lon: x.longitudeE7 / 10000000,
                  accuracy: x.accuracy };
        // foreshadowing: think about timezones...

        stream.write(`${JSON.stringify(y)}\n`)
    
});
stream.end();

All this does is take a file named lh.json that holds the JSON-formatted Google Takeout file, loads it into memory, and converts each line to a separate JSON object. I also preprocessed the timestamps and latitude/longitude pairs into decimals for easier GEOGRAPHY conversion. This also has the effect of making the result file substantially smaller by shortening the names — the resulting file was only 203MB. I didn’t bother to gzip it, but I also have a fast Internet connection. If you don’t, gzip the file now before the next step.

Then I transferred it directly to Google Cloud Storage for loading into BigQuery.

gsutil -m cp lh.jsonl 'gs://my-bucket/lh.jsonl'

After that, it’s a simple bq call to get the data loaded into BigQuery.

bq load --source_format=NEWLINE_DELIMITED_JSON \
        --time_partitioning_field=timestamp \
        location.history \
        gs://my-bucket/lh.jsonl \
        timestamp:TIMESTAMP,lat:FLOAT,lon:FLOAT,accuracy:INTEGER

After I did that, I added additional columns for GEOGRAPHY and for the date in my local timezone — and then I realized my error. The partitioning and clustering aren’t going to help if they are always off by seven or eight hours, and since I want to look at data day-by-day, this could be a big problem. Well, how bad can it be…?

SELECT * FROM location.history WHERE timestamp BETWEEN ‘2017-01-01’ AND ‘2017-01-02’	This query will process 246.2 KB when run.
SELECT * FROM location.history WHERE date BETWEEN ‘2017-01-01’ AND ‘2017-01-02’	This query will process 260.8 MB when run.

Oh, only a thousand times worse.

That bad. Luckily, this is easy enough to fix by migrating the data to a new table and deleting the old one.

CREATE TABLE 
location.history2  (date DATE,
                    timestamp TIMESTAMP,
                    lat NUMERIC,
                    lon NUMERIC,
                    accuracy INT64,
                    point GEOGRAPHY)
PARTITION BY date;


INSERT INTO location.history2
SELECT date, timestamp, lat, lon, accuracy, point
FROM location.history;

DROP TABLE location.history;
-- bq cp location.history2 location.history
DROP TABLE location.history2;

Then I copied history2 back to history — this can be done via command-line or the UI, so I just used the UI to copy the table. Back in business!

Querying the Data

There are lots of ways to skin this cat. I thought the most interesting way to look at this, rather than distance from home, was to calculate the maximum distance inside the smallest polygon that encompassed all of my locations for the day.

INSERT INTO location.history_date
SELECT date, 
ST_CONVEXHULL(ST_UNION_AGG(point)) convex_hull, 

-- the maximum distance between two equivalent hulls is the longest vertex
-- inside of that hull
ST_MAXDISTANCE(ST_CONVEXHULL(ST_UNION_AGG(point)),
               ST_CONVEXHULL(ST_UNION_AGG(point))) max_distance

FROM location.history
WHERE accuracy < 100
GROUP BY date

This is a little abstract, but I had a couple thoughts in mind. One, I’ve moved several times since I started collecting location data, which means that distance from “home” is arbitrary across time. Two, I travel a lot too, and I want to show relative motion. So while a day I go to Chicago will have a large value, if I stay in Chicago the next several days, the values should be small again to reflect that I didn’t travel as much during the 24-hour period. This worked basically as I expected. For example, here’s the day I first saw Hamilton in Chicago:

From the excellent BigQuery geo viz at https://bigquerygeoviz.appspot.com/.

This isn’t totally accurate, and the “convexification” eliminates internal boundaries, but it’s close enough for our purposes, which is just to determine the max length of a line inside the shape.

Applying this to the table on a daily basis and calculating a 7-day median value generates this graph for 2020:

After traveling cross-country around New Year’s, it was back to the daily commute with exceptions here and there. Then, during the week of 3/23, quarantine began and mobility dropped to zero. There’s a brief period of activity as restrictions were lifted in mid-June, but since then… not going anywhere.

As you’ve no doubt realized, this is the tip of the iceberg. The JSON export from the beginning also contains the semantic location data, which includes mode of travel with waypoints, inferred location with name, address, and confidence values, and the Google Place ID, which you can use to categorize all the locations.

That was sure a lot of work to make me feel depressed about the fact that I haven’t gone anywhere in four months.

BigQuery
ML

The Classifying Hat — An ML Harry Potter Sorting Hat with BigQuery

by Mark2020-06-112020-10-07 No comments

The wizarding world is notoriously haphazard. If a muggle school ran like Hogwarts, it’d be sued out of existence within the first term. Law and order is a one-way ticket to Azkaban. And the sorting hat labels eleven year olds for the rest of their lives as brave, sneaky, intelligent, or… well, whatever Hufflepuff is.

A major consideration for machine learning models is the concept of “interpretability”, namely, can you understand and explain why the model makes certain choices. The sorting hat has almost no interpretability, except on the rare occasions that it stubbornly defends its own choices as accurate. Can we make something that chooses as well as the Sorting Hat does, but is also interpretable?

First of all, in machine learning terms, it’d be called a “Classifying Hat”, since the type of ML model used to predict a choice given a set of input features is a multi-class logistic regression. So let’s build ourselves a Classifying Hat.

To do this, we’ll use Google Cloud Platform’s BigQuery Machine Learning capabilities, because you can do it in pure SQL directly from Google Sheets, and it’s really easy to train and predict. Since this isn’t a serious analysis (if that wasn’t obvious) there’s no reason to go crazy with the feature engineering. I follow pretty much the same set of steps for all BQML models anyway.

1. Find the Data

First we’re gonna need some good old Harry Potter data. I want a mix of features that would presumably be helpful in prediction, but nothing so detail-oriented that it would be hard to replicate for arbitrary sorting (i.e., if you wanted to see how the Classifying Hat sorts you.)

I settled on a mix of sources:

This Kaggle dataset has names, genders, wand, birthright, eyes, and hair. Since house selection seems to be significantly related to blood status and pedigree, this will probably help a lot. You also receive your wand before you get sorted, so there’s probably some useful predictive capability there.
This remarkable Harry Potter aggression analysis weighting characters in the book based on how often they appear and how aggressive their actions are. Since personality type is clearly a factor in sorting, this will be helpful. (Of course, we don’t know cause vs. effect here! It’s possible that being labeled so early in life influences your later actions.)
This compilation of related datasets includes a graph showing the relationships among all of the characters. I ended up not using this. Similar to how Facebook works, you could use this to predict a house for someone who “isn’t on the network” by seeing who they associate with. I didn’t use it because the books cover only the time students are at Hogwarts, and by definition they are going to be closely associated with others in their house, since they share sleeping and living quarters. This would be far more useful on the wizarding social network (“Facegrimoire”?) where you could easily determine someone’s house affiliation based on their connections.

I also pulled scripts from the first few movies and the texts of the books, for a secondary analysis on how people’s vocabularies are related to their house selections. That can be done at a later date (yeah, okay, I’m never doing that).

2. Prepare the Data

Now that we have enough data to start with, we’ll get it into a format we can create a machine learning model from.

The Kaggle dataset, while being super useful, was not well formatted, having semicolon delimiters and a ton of jagged arrays. I preprocessed it in Google Sheets and generated a table with blank cells instead of “Unknown”s, and cleaned up overly descriptive labels to make them easier for the model to bucket.

Character names also didn’t match among the various files due to spelling, use of aliases, alternate names, or extra middle names, so I used the aggression dataset’s convention of assigning a four letter slug (i.e. “harr” for Harry Potter, “tonk” for Nymphadora Tonks, “vold” for Tom Riddle, and so on.) and matched up the data.

Since there are only a couple hundred rows, I opted to do this largely manually using Google Sheets, where I could edit all of the values directly. As we’ll see in a second, it’s possible to run BQML models directly out of Google Sheets, which is pretty neat.

3. Load the Data

I placed each filtered “table” into a separate sheet of the Google Sheets file (you can choose to load from a specific sheet). Then I copied the URL to the clipboard and loaded the tables into BigQuery.

I didn’t realize it at first, but when you make a table from a Google Sheet, it’s actually a federated source. That means that it’s actually querying live from the sheet whenever you access the data. This also means there isn’t really any transactional integrity, so if you are making changes to the sheet while the query is running it sometimes generates internal errors.

Note also that whatever schema you supply will govern how far out the table goes. So if you only define three columns but your sheet has twenty, the table will only have three columns. If you edit the schema then the columns will automatically get populated with data from the corresponding sheet columns. Kind of weird.

I prefer to use a view as an input to my machine learning models so I can tweak the view without rebuilding the model, and can also look at the raw input. The CREATE MODEL statement can take several minutes, even on small datasets, and it’s easier to do preliminary feature selection and extraction before getting there.

Accordingly, my final CREATE VIEW statement looked like this. I didn’t bother to specify column names in the table schema and am just transforming them back out via the view.

CREATE OR REPLACE VIEW `hp.input`
AS
SELECT
string_field_0 ID,
string_field_1 named,
string_field_2 gender,
string_field_3 role,
string_field_4 house,
string_field_5 wand_length,
string_field_6 wand_wood,
string_field_7 wand_animal,
string_field_8 patronus,
string_field_9 race,
string_field_10 blood,
string_field_11 hair,
string_field_12 eyes,
string_field_13 loyalty,
string_field_14 skills,
string_field_15 birthdate,
fullname,
(SELECT SUM(mentions) FROM hp.aggression a WHERE a.abb = h.string_field_0) mentions,
(SELECT AVG(agg_weighted) FROM hp.aggression a WHERE a.abb = h.string_field_0) agg_weighted,
FROM hp.houses h
WHERE string_field_0 IS NOT NULL

Views (non-materialized ones, anyway) are still federated, so this will pull all the data together from multiple sheets and transform it every time you SELECT from it. That convenience will come in handy later.

4. Build the Model

Now I was ready to actually create and train the model. This meant transforming and selecting my final feature set and filters and building up a model. Per the above, I knew I wanted a multi-class logistic regression. This is a supervised model, which means I have to label input data with the correct class and use that to train the model. The typical split is 80/20, but for datasets with fewer than 500 rows, BQML defaults to using all the rows as training data. This isn’t great, since it raises the chances of overfitting. However, only ninety-odd people have been assigned to houses, so it’s all we get. (I was not inclined to sift through reams of fan fiction to plausibly tag any others.)

My first model creation statement:

CREATE OR REPLACE MODEL `hp.hm`
OPTIONS (model_type='logistic_reg', input_label_cols=['house'])
AS
SELECT * EXCEPT (id) FROM `hp.input` WHERE house IS NOT NULL

5. Evaluate the Model

The model ran for a few minutes (no hatstall here) and spit out the new model object. I went to the model evaluation, and lo and behold:

First attempt at ML sorting — The first model’s confusion matrix.

Hm. That smells like overfitting. Did I do something wrong? I was careful to remove ID numbers and tags that might bias the model. You could make a case that name shouldn’t be in there, and I did try that variation, but you could also make an argument that bias by last name is intentional by the Sorting Hat. Weasley? Gryffindor. Potter? Gryffindor. Malfoy? Slytherin. That may not be fair, but it probably does accurately reflect how the hat works.

Nonetheless, it still seemed a bit too on-the-nose, so I went back and checked what other columns I might have missed. Well, “loyalty” sure stood out. It turns out that if your loyalty includes “Death Eaters”, you are overwhelmingly likely to be in Slytherin. I nixed that column. I also nixed “species”, which was basically all human except for Hagrid, who was listed as “half-giant”, and immediately caused all non-humans to go to Gryffindor too. (That might also be right.)

For my second attempt, I made a few more changes. I compromised on sending only last name instead of full name. I also bucketized wand length by 2, so that 8-10 inches was a group, 10-12, and so on. Lastly, I removed birthdate extractions entirely. I thought zodiac sign or decade might be interesting, but the data is too vague to be of much use.

CREATE OR REPLACE MODEL `hp.hm`
TRANSFORM
(
named,
gender,
ML.QUANTILE_BUCKETIZE(CAST(wand_length AS NUMERIC), 2) OVER() wand_length_bucketized,
wand_wood,
wand_animal,
blood,
hair,
eyes,
house,
)
OPTIONS (model_type='logistic_reg', input_label_cols=['house'])
AS
SELECT * EXCEPT (id, role, race) FROM `hp.input` WHERE house IS NOT NULL

A bit concerning; it’s still basically perfect.

If this were a real-world data set, I would now be extremely concerned about implicit bias and the possibility that the model was making judgments purely superficially in a totally unfair way. I do think that’s happening, but I also think the Sorting Hat does too (and this kind of proves it) so we’ve reached the standard for novelty level.

As a last step, I took surname out of the model just to see how much that was really skewing things.

I guess there really is a secret class system.

6. Make Some Predictions!

Making a prediction is easy: we just select from ML.PREDICT with our model, and it’ll add the target feature prediction to the table. We have a bunch of characters in the table who don’t have houses, either because it was never defined, they’re in schools other than Hogwarts, or they’re muggles. Who cares, let’s predict the rest of the table.

SELECT house, predicted_house, fullname 
FROM 
ML.PREDICT(MODEL `hp.hm`, (SELECT * FROM `hp.input`))
ORDER BY house NULLS FIRST

This yields some entertaining and plausible results such as sorting Dobby the house elf into Hufflepuff and Nurse Pomfrey into Slytherin. It also places all the Dursleys in Gryffindor, oddly. You can add your own examples and see how it makes its decisions, which is at least one thing the Sorting Hat can’t/won’t do!

Next Steps

I was recently playing around with the use of Google Forms to automatically drop rows into Google Sheets, and then to write some Apps Script to do things with that. It wouldn’t be too hard to take inputs from a Google Form and trigger an automatic BQML prediction based on new data. In other words, you could turn this ML classifying hat into a charming little quiz!

It would also be interesting to drop this into a k-means model and see if it segments people into four distinct houses, or if it has something else in mind.

The Google Sheet with my data is here, so if you want to give it a spin, all you need to do is make a copy to your own Google Drive and import it into a BigQuery table (use the rightmost sheet, “Houses”.) Have fun!

This analysis was used as the basis for a talk at SQL Saturday Los Angeles on June 13, 2020. It also draws from my book, BigQuery for Data Warehousing, out now!

BigQuery

Sample Data Warehouse Charter

by Mark2020-05-282020-10-06 No comments

In my book, I repeatedly emphasize the importance of chartering and running a project to encapsulate your data program. The first artifact in this process is the project charter, around which business stakeholders can agree about what your program is and what it should do.

Appendix B of the book provides an outline of a sample charter. You can download a blank version of the charter here: