Integrating Clusters

Friday, September 12th, 02025 at 12:21 UTC

Recently, we’ve been working a lot with data mining, clustering and classifying people. For various projects, people want to know what their readers, players, volunteers and viewers are like. They want nice, neatly packaged-up groups that they can potentially sell to.

Organizations don’t know how to sell to you or me as individuals, but to a white male, 25-35, with a college degree? They have a strategy for that! It was never feasible to have as many different types of customers as there are customers, so simplifying things into generalized clusters made the workload, computing and budgeting easier.

The problem with clustering is that you begin to put all white men, aged 25-35, into the same bucket and assume they must enjoy the same type of music. Based on a few variables, you make broad assumptions to fill in the rest of the gaps.

Marketers aren’t the first to do this; look at the Zodiac. It is the Year of the Goat, so there are predictions about the children born this year. They will be prosperous and lucky… in 10 years’ time, will they statistically be more prosperous and lucky than others? Probably not. But we generalize and imagine these things because it is easy, not because it is correct.

A recent prototype involved a huge amount of effort from the volunteers and team members who will be using it for their daily operations. As we’ve designed it, we are doing our best not to boil people down to some five-star rating. That would be the easy route: show me all team members who are above 3.5 stars, as if they were somehow more intrinsically “good” than someone with a 3.4-star rating. (What about the number of reviews, or the standard deviation?) All our personalities, history, customs, backgrounds and individuality, summed up in a single scalar value.
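To make that concrete, here’s a quick sketch in Python; the names and ratings are made-up illustration data, not anything from the actual prototype:

```python
# A sketch of why a single scalar rating is lossy.
# The member names and review scores are hypothetical.
from statistics import mean, stdev

ratings = {
    "member_a": [4, 3, 4, 3, 4, 3, 4, 3, 4, 3],  # many consistent reviews
    "member_b": [5, 2],                           # two wildly different reviews
}

for name, scores in ratings.items():
    print(f"{name}: mean={mean(scores):.2f}, "
          f"n={len(scores)}, stdev={stdev(scores):.2f}")

# member_a: mean=3.50, n=10, stdev=0.53
# member_b: mean=3.50, n=2, stdev=2.12
# A “3.5 stars and above” filter treats these two people identically.
```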

We want to avoid falling into these traps. So what are the options? Well, you have as many possible clusters of people as there are people. An organization with 10,000 people has a minimum possible cluster size of 1, creating 10,000 clusters. Previously this was never feasible, but it is becoming more and more possible as our computational power increases.

We drew inspiration from the Bartle Taxonomy of players, which has 4 categories of gamers, and Claritas PRIZM, which has 68 segmentations. The more possible clusters, the more descriptive and nuanced each cluster can become. Claritas PRIZM describes households for direct marketing, so there is a trade-off: the segments must be fine-grained enough to be useful for marketing efforts, but not so fine-grained that no one can decide which category they want to sell their stuff to.

We’ve been exploring clustering for our clients. They want to know how their customers are acting, exploring and using their product, as well as how they can use this information to their advantage in the future. We proposed several clusters within their data. We could use something similar to the Bartle Test or Claritas PRIZM, but wouldn’t it be even more exciting to increase the number of clusters?

One of the big steps between algebra and calculus is the act of continuous integration. If you want to know the area under a curve, you can start by summing up lots and lots of small rectangles which fit under the curve; we know how to easily compute the area of a rectangle. If we continuously make them skinnier and skinnier, we can fit more and more under the curve, and with each reduction in size, the computed area becomes more and more accurate. Take this to the extreme, make them infinitely skinny, and we’re now creating integrals and entering the world of calculus.
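Here’s that idea in a few lines of Python, approximating the area under f(x) = x² on [0, 1] (whose exact value is 1/3) with ever-skinnier rectangles:

```python
# A sketch of the rectangle (Riemann-sum) idea from the paragraph above.
def riemann_area(f, a, b, n):
    """Sum n left-edge rectangles of width (b - a) / n under f."""
    width = (b - a) / n
    return sum(f(a + i * width) * width for i in range(n))

f = lambda x: x ** 2
for n in (10, 100, 1000, 100000):
    print(f"n={n:>6}: area = {riemann_area(f, 0, 1, n):.6f}")

# n=    10: area = 0.285000
# n=   100: area = 0.328350
# n=  1000: area = 0.332834
# n=100000: area = 0.333328   -> approaching the exact value, 1/3
```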

The ancients did something very similar when trying to work out the value of pi. They took a circle, put a square inside it so the corners touched, and another square completely outside the circle. They knew the area of the inside square and the area of the outside square, so the area of the circle had to be somewhere in between those two values. The next step was to change the square into a pentagon, which fit more snugly around the circle. Then a hexagon, and by continuing to add more and more sides, the shape began to look more and more like a circle. With each iteration, the bounds on the circle’s area got tighter and tighter. Taken to the extreme, a shape with infinitely many sides becomes a circle.
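The same squeeze, sketched in modern notation (which admittedly cheats by using trigonometric functions that already know about pi; the original constructions were purely geometric):

```python
# Bounding pi between the areas of inscribed and circumscribed polygons.
import math

def inscribed_area(n):
    """Area of a regular n-gon inscribed in a unit circle."""
    return n / 2 * math.sin(2 * math.pi / n)

def circumscribed_area(n):
    """Area of a regular n-gon circumscribed about a unit circle."""
    return n * math.tan(math.pi / n)

for n in (4, 5, 6, 12, 96):
    print(f"{n:>3} sides: {inscribed_area(n):.5f} < pi < {circumscribed_area(n):.5f}")

#   4 sides: 2.00000 < pi < 4.00000
#   5 sides: 2.37764 < pi < 3.63271
#   6 sides: 2.59808 < pi < 3.46410
#  12 sides: 3.00000 < pi < 3.21539
#  96 sides: 3.13935 < pi < 3.14268   (Archimedes worked with 96-sided polygons)
```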

In life, we are rarely dealing with the discrete; it’s usually continuous. So why not do the same thing for clusters of data?

What would be ideal is a system that is the best of both worlds: continuous values that allow everyone to be unique, but at the same time some sort of clusters that we can discuss and make inferences about.

Humans have the ability to do this: to hold a continuous concept in our heads and give it a discrete name. Colors are a good example (although culturally the mapping isn’t perfect).

The spectrum of color is a continuous set of values running from infrared, through red and the visible spectrum, out to violet and ultraviolet and onwards. There is no specific point at which red becomes orange, or orange becomes yellow, yet we have a cluster and a concept for each color. Culturally, we sort of know when colors change, and when we can’t agree we tend to create new clusters, like yellow-green or light-green. It flexes when it needs to. Every color has an exact value, but it also shares a cluster with many of its neighbors.

Rather than an enumerated list of possible clusters any given person could be in, we could begin to map out continuous variables and explain our customers in terms of those values. Then we can feed those values into some probability density function to say, with some probability, that a person is in one group or another, or in some state in between. It isn’t blue or purple, but some blue-purple color.
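As a rough sketch of what that could look like: model each named cluster as a Gaussian over one continuous variable and normalize the densities into probabilities. The hues, means and widths below are invented for illustration:

```python
# Soft cluster membership via per-cluster Gaussian densities.
# The cluster means and widths are hypothetical, not measured values.
import math

CLUSTERS = {  # name: (mean hue in degrees, standard deviation)
    "blue":   (230.0, 15.0),
    "purple": (275.0, 15.0),
}

def gaussian_pdf(x, mu, sigma):
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def memberships(hue):
    """Normalize each cluster's density at `hue` into a probability."""
    densities = {name: gaussian_pdf(hue, mu, sigma)
                 for name, (mu, sigma) in CLUSTERS.items()}
    total = sum(densities.values())
    return {name: d / total for name, d in densities.items()}

print(memberships(250.0))
# {'blue': 0.622..., 'purple': 0.377...}  -- a blue-purple, leaning blue
```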

No two people are the same, yet we tend to bundle them into just a few clusters. We already know that names are powerful and useful tools, but at the same time they can’t be rigid structures with hard boundaries. Having a limited group of named clusters is fine, as long as you realize the data underneath is continuous and flows.

Our color spectrum is an excellent example. In a database, we’d save the exact RGB value and call it by its cluster name: RED, ORANGE, YELLOW. This way we don’t lose the exact information, as we would have had we only kept the cluster name. Then in the future we could change the cluster boundaries, or add more facets/variables and create new clusters. Having the raw data keeps things fluid and dynamic and prevents the data from getting stale.
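A minimal sketch of that storage pattern; the hue boundaries here are invented, and they can be redrawn later precisely because the raw RGB survives:

```python
# Keep the exact value, derive the cluster name from it.
import colorsys

HUE_CLUSTERS = [  # (upper hue bound in degrees, cluster name)
    (20, "RED"), (45, "ORANGE"), (70, "YELLOW"), (170, "GREEN"),
    (260, "BLUE"), (330, "PURPLE"), (360, "RED"),  # reds wrap around the hue circle
]

def cluster_name(r, g, b):
    """Map an exact RGB value (0-255 channels) to its named cluster."""
    hue, _, _ = colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)
    degrees = hue * 360
    for bound, name in HUE_CLUSTERS:
        if degrees < bound:
            return name
    return "RED"

# Store both: the raw value for the future, the name for today’s reports.
record = {"rgb": (255, 140, 0), "cluster": cluster_name(255, 140, 0)}
print(record)  # {'rgb': (255, 140, 0), 'cluster': 'ORANGE'}
```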
