The Curse of Dimensionality
One of the nice things about the star schema (and its close relative, the snowflake schema) is that it forces you to consider all dimensions of your reportable data. Suppose you’re delivering ads over multiple mobile channels and want to report the number of deliveries over time. A first attempt might be to create a record (in star schema parlance, a row in a fact table) that has the following values:
- delivery attempts
- verified deliveries
These summarize groups of individual deliveries according to some selection criteria. These criteria are essentially dimensions. A first attempt at selecting useful dimensions might be the following:
- date
- hour
- delivery channel
- campaign
Simple enough. This reduces the individual deliveries into hourly records organized by delivery channel and campaign. If we’re dealing with large volumes, this makes reporting easier. Actually, it makes reporting feasible. In some cases, these dimensions might suffice. I seriously doubt it, though. In the real world, both the mobile operator and the ad sales organization will invariably find this scheme simplistic.
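To make the reduction concrete, here is a minimal sketch of that roll-up in Python. The event layout, channel names, and campaign name are made up for illustration; a real system would more likely do this in SQL or an ETL job.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical raw delivery events: (timestamp, channel, campaign, verified).
deliveries = [
    (datetime(2024, 1, 1, 9, 5), "sms", "spring_sale", True),
    (datetime(2024, 1, 1, 9, 40), "sms", "spring_sale", False),
    (datetime(2024, 1, 1, 9, 55), "video", "spring_sale", True),
    (datetime(2024, 1, 1, 10, 2), "sms", "spring_sale", True),
]


def summarize(events):
    """Roll events up into records keyed by (date, hour, channel, campaign)."""
    facts = defaultdict(lambda: {"attempts": 0, "verified": 0})
    for ts, channel, campaign, verified in events:
        key = (ts.date(), ts.hour, channel, campaign)  # the four dimensions
        facts[key]["attempts"] += 1                    # every event is an attempt
        if verified:
            facts[key]["verified"] += 1
    return dict(facts)


facts = summarize(deliveries)
for key, measures in sorted(facts.items()):
    print(key, measures)
```

Four events collapse into three records here. The key tuple is the thing to watch: every dimension added later becomes another element in it.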
Consider hourly reporting intervals. This may be sufficient for some channels but not for others, especially live or looped video. So, it’s likely that date and hour will need to be replaced with something like:
- day and time
- interval duration
where interval duration may be 30 minutes, 15 minutes, or even smaller.
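One way to derive the "day and time" value is to floor each delivery timestamp to the start of its interval. The sketch below assumes intervals are aligned to midnight and that the duration is given in whole minutes; the function name is my own.

```python
from datetime import datetime, timedelta


def interval_start(ts: datetime, minutes: int) -> datetime:
    """Floor a timestamp to the start of its reporting interval.

    Assumes intervals are aligned to midnight of the same day.
    """
    midnight = ts.replace(hour=0, minute=0, second=0, microsecond=0)
    elapsed = int((ts - midnight).total_seconds() // 60)  # whole minutes since midnight
    return midnight + timedelta(minutes=(elapsed // minutes) * minutes)


ts = datetime(2024, 1, 1, 9, 44, 10)
print(interval_start(ts, 60))  # hourly bucket
print(interval_start(ts, 30))  # 30-minute bucket
print(interval_start(ts, 15))  # 15-minute bucket
```

Storing the duration next to the interval start lets channels with different reporting granularities share one table.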
Delivery channel is another candidate for decomposition. Here, the channel is likely to break down into different kinds of services. SMS ads can be pushed by themselves to users or appended to existing messages (e.g., peer-to-peer SMS or operator-generated messages), so the simple “delivery channel” turns into:
- delivery channel
- operator service
If the delivery channel is content-based (as opposed to a messaging channel), we need to know the context of the delivery, for example:
- content type
- ad location with respect to content
Merely summarizing deliveries by campaign is insufficient, too; they need to be organized by all components of a campaign structure, such as:
- campaign / flight / creative
Finally, what about the location, demographics, and behavior of the subscriber who received the ad? We need to add the following:
- location
- demographic attributes (age, sex, etc.)
- behavioral attributes
So, our initial attempt to summarize deliveries according to four simple dimensions has now gotten extremely ugly. Does that mean we shouldn’t have started with those four dimensions but instead jumped directly into every dimension we could think of? Not necessarily. It does mean, though, that we need to plan ahead for extra dimensions and not be surprised when they become requirements.
By the way, apologies to anyone who found this page by searching for the phrase “curse of dimensionality”. This term is used in machine learning and statistics to convey the exponential growth of a problem space as dimensions are added. Suppose we were looking for patterns in our ad delivery data. If we represent our deliveries as points in n-space, where n is the number of dimensions we’re using to organize it, then most of the space will be empty and we’ll have a hell of a time looking for those patterns. But that’s a topic for another post.
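For a rough feel of how quickly the space empties out, here is a back-of-the-envelope sketch. The dimension cardinalities and the delivery count are invented for illustration. The number of cells is the product of the cardinalities, and a fixed number of deliveries can fill at most that many cells.

```python
from math import prod

# Hypothetical number of distinct values per dimension (made up for illustration).
cardinalities = {
    "hour_of_day": 24,
    "delivery_channel": 4,
    "campaign": 50,
    "location": 100,
    "age_band": 6,
    "sex": 2,
}
deliveries = 1_000_000  # hypothetical number of individual deliveries


def max_occupancy(n_points, sizes):
    """Return (cell count, upper bound on the fraction of non-empty cells)."""
    cells = prod(sizes)
    return cells, min(1.0, n_points / cells)


names = list(cardinalities)
for k in range(1, len(names) + 1):
    sizes = [cardinalities[name] for name in names[:k]]
    cells, frac = max_occupancy(deliveries, sizes)
    print(f"{k} dims: {cells:>9,} cells, at most {frac:.1%} occupied")
```

With these made-up numbers, six dimensions give 5,760,000 cells, so a million deliveries can touch at most about 17% of them, and that is the best case where no two deliveries share a cell. Real data clusters, so the occupied fraction would be lower still.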
