Finding similar sequences of categorical data
How people spend their time is fascinating. It is our most precious resource and consequently our time-use reflect our priorities. Finding people who spend their days similar to our own can be enlightening — both because it’s intrinsically interesting and so we can understand ourselves through the lens of others.
A single day’s time-use can be represented as a string of activities. Here, every hour is depicted by a rectangle and its activity by a color.
This prototypical day tells us the person works a roughly 9-5 day and spends their free time relaxing or socializing. They most likely are not in school and do not have children requiring care. As a result it can be presumed they have a steady full time job, are college educated, and make above median income.
Answer the questions below to see who spends their day similar to you. The question-series yields common weekday sequences that may match your own; however the sequences are entirely editable to fit your exact day.
The below plots reflect the above 25 individuals measured on six attributes. The heights of the bars represent the frequency for that category.
In short, a day’s time use activities are reduced to a single string of characters and the edit distance between the strings is computed. More similar days have smaller distances.
There are many choices of edit distance method. Most rely on counting the number of insertions, deletions, and substitutions to transform one string to the other — with each method stipulating different costs for each operation. The below four are some of the most common:
Each of these have advantages and disadvantages. For the use case here, dynamic Hamming produced the most interesting clusters: it captured both the unique sequencing of the activities and their timing within the day.
This speaks to the main advantage of sequence techniques over comparing simple sums of time-use activities; the sequence techniques incorporate not just the prevalence of the activity but also when they occur within the day and which activities precede and follow.
The data comes from the American Time Use Survey covering over ~200,000 interviews between 2003 to 2018. The original ~100,000 weekdays of data needs to be condensed down into a dataset small enough to be quickly loaded by the browser.
Hierarchical clustering based on the edit distance is used to discover the seven most interesting and distinct clusters. While a smaller number of clusters results in the best cohesion-to-separation (measured via silhouette width), seven clusters provides the best balance between the numerical measure and subjective interpretations of cluster distinction.
The below dendrogram shows distances between the observations and how they coalesce at varying cluster k values. The colors represent the seven clusters.
Each cluster’s individual sequences reveal what makes the clusters differentiable. Cluster 1 consists of mostly socialites/relaxers; cluster 2 night workers; cluster 3 household activities; cluster 4 students; cluster 5 morning workers; cluster 6 early-morning workers; and cluster 7 evening workers.
The clusters are simplified to their modal sequence. These modes are used as the first step to classify the user’s input. The user input is first matched to the closet modal sequence (based on edit distance) and then matched to the top 25 sequences of 400 potential sequences that are a priori stratified sampled from the matched cluster.
Find the code here: github.com/joemarlo/ATUS-similarity