Finding Related Items (2011)

I’ve mentioned in the past (not in this blog) that C++ programs can beat databases by many, many orders of magnitude. I learned this while doing a specific fun project. I have permission from the company to discuss that project, and here is a description of that project.

I was working for a contract for ThisNext. Their database has a large number of records of different items, which a large number of people have tagged with arbitrary tags. Think millions of items, millions of distinct tags, and hundreds of millions of times that tags were applied to items. (Often the same tag has been applied to the same item multiple times.) Their idea was that this data could be used to discover which other items are somehow "related" to the current one, and possibly of interest if you find the current one interesting. This is used to populate related items, which is intentionally somewhat whimsical. For instance if you are interested in a pair of silk hotpants then perhaps you might be interested in velvet shorts, lumberjack hotpants or a Bo Peep Jumpsuit.

The suggestions are intentionally somewhat whimsical, and are generated by user activity. The idea behind them is that two items with tags in comment are likely to be related. The more common the tag, the less related they are. For every item they wanted the 30 most related items, sorted in the order of how related they are.

There is an obvious computational problem that you will face. Suppose a popular tag was applied to 30,000 different items. Then for 30,000 items you’re going to have to compare to 30,000 other items, which gives you 900,000,000 relationships. This clearly is a scalability problem. So their initial version had implemented this as they just counted the number of tags in common between items, and they skipped any tag that was used over 500 times.

This originally worked fairly well, but as time went by it slowly got worse. The first problem is that their program slowed down as they got more data. When I arrived it was taking a couple of hours to run and was affecting when they could start running backups. The second problem was that if you had an item that had only been tagged once with a popular tag, such as "sunglasses", then no suggestions would be generated for it. They wanted me to try to solve one or both, preferably both, problems.

Loading more...