Subscribe to DSC Newsletter

Is there anyone who regularly does Market Basket Analysis / Frequent Item Sets, and can recommend any software to use (or not to use)? I am particulary interested in cost,ease of use,amount of data preprocessing required and how much data it can cope with. I have already tried Weka apriori algorithm and was wondering what else is out there.

Also, is there a 'common' format that shops store their sale data as. For example, I have shop ID and time stamps for each item and have to use these to determine which items are in the same basket (all items in a basket get the same time stamp). Is this 'the norm' (ie one item per row in the database and some common link)?

Thanks in advance for sharing your experience,


Views: 563

Replies to This Discussion

I have done over 30+ Market Basket Analysis jobs in the last 15 years.

There is usually some form of register ID or transaction ID along with store ID and time stamp. Also if the store has a loyalty card program the the Individual or household ID can be very useful.

About ten years ago I decided to write my own code to do this type of analysis. I did this because none of the tools that I found commercially available at the time would scale or could be run in parallel. I also found that most comercial could relied on the 20 year old concept from IBM (Anticedent and Consequent) when a Market Basket does not imply any order.

There are two pillars my code (unix shell script and perl) depend on. The data is sorted so that all the transactions are available and that i am only going to determine item pairs by applying "Yule's Q" in order to selct item pairs that go together with high odds. I will use Graphical techniques, transitive closure, to build unique groups of items, these groups partition the item universe.

If you are a coder you can easily write the pair counting your self as follows
1 limit the number of pair counts so that you do not run out of memory. My code wrote pair counts to disk when the number of pairs I was tracking reached 1M. (i just flushed the memory to disk and freed the memory)
2 inorder to do pair analysis you only need to track three things. The total number of baskets. How many baskets does each item fall into. Given a pair of items (a,b) how many baskets have both items a and b. (Becareful (a,b) = (b,a) I usually order the item numbers so that I am not doing double accounting.) This can be as bad as an (N*(N-1)+N+1) memory problem with N equal to the number of items.
3 if you are required to flush the counts to disk then gather them togeter before determining the ODDS of a pair occuring and applying Yule's Q to the ODDS.

Steps 1 and 2 will take 97% of the analysis time. Steps 3 and the following Transitive closure are very fast because the amount of data is greatly reduced. (If you can effectively divide the data into multiple streams preserving the basket then step 2 can be done in parallel.

I recently tested KXEN's KAR algorith to count pairs. This will work with the correct parameters set. However it will run out of memory, if you are not careful, since it does not cache intermediate counts to disk.


On Data Science Central

© 2021   TechTarget, Inc.   Powered by

Badges  |  Report an Issue  |  Privacy Policy  |  Terms of Service