Transaction Indexer

When you do web mining, you usually take into account several documents (or transactions) and try to find connections between words in these documents. A practical example of this is finding connected products in people’s shopping carts, used by Amazon for example to suggest new products. To achieve this, Association rules are extracted from the shopping carts history in the database using some algorithm. The most successful and popular algorithm for this is the Apriori algorithm.

Now, in order to start implementing the algorithm the data should be ordered and the number of transactions that contains each specific term needs to be computed. I chose to implement this in a separate mini-project and to publish the C++ source code here.

This transaction indexer (as I called it) will receive as input a text file with the following form:

<doc id="1">
bread milk butter eggs
<doc id="2">
milk tomatoes butter
<doc id="3">
bread eggs potatoes

And will output two files, a dictionary file containing all items (alphabetically ordered) and the transaction count for each item and an indexed transaction file that contains the item indexes instead of the actual items. They will look like this:
dictionary file:

bread 2
butter 2
eggs 2
milk 2
potatoes 1
tomatoes 1

and the indexed transaction file:

1 2 3 4
2 4 6
1 3 5

The program uses a Binary search tree to sort the dictionary and calculates the outputs recursively.

More information about specific methods can be found in the documentation file generated with Doxygen and included in the attached zip archive.

If I will have time I will also put on the rest of the implementation later. Download file here.