Danny 'Jay' Donnell

On Code

Collaborative Filtering Using JRuby and Mahout

One of my labors of love is a small community blogging site called Yakkstr. It has a few hundred active users, who can subscribe to a post to be notified of new comments on it. I’ve wanted to build a collaborative filter that gives each user a list of posts they are likely to be interested in but may have missed (i.e., posts they didn’t subscribe to). This is a fairly standard collaborative filtering problem.

Note: The trick with Mahout is choosing the right classes for similarity, neighborhood, and recommender.

I’ll use users’ post subscriptions to determine which other users each user is most similar to. The easiest way to get data into Mahout is through a CSV file. The example below has two fields, user_id and post_id:

4,1
7,2
4,4
1,4
4,3
8,1
8,3
4,5
4,6
6,6
...
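To see what this file represents before handing it to Mahout, here is a minimal plain-Ruby sketch that groups the rows by user. The `load_subscriptions` helper and the `SAMPLE_CSV` constant are hypothetical names made up for illustration; they are not part of Mahout, and Mahout does its own parsing.

```ruby
require 'set'

# Hypothetical helper (not part of Mahout): parse "user_id,post_id" lines
# into a hash mapping each user ID to the set of post IDs they subscribed to.
def load_subscriptions(csv_text)
  subscriptions = Hash.new { |hash, user_id| hash[user_id] = Set.new }
  csv_text.each_line do |line|
    line = line.strip
    next if line.empty?
    user_id, post_id = line.split(',').map { |field| Integer(field) }
    subscriptions[user_id] << post_id
  end
  subscriptions
end

# The ten sample rows shown above.
SAMPLE_CSV = <<~CSV
  4,1
  7,2
  4,4
  1,4
  4,3
  8,1
  8,3
  4,5
  4,6
  6,6
CSV

load_subscriptions(SAMPLE_CSV).each do |user_id, posts|
  puts "user #{user_id}: #{posts.to_a.sort.inspect}"
end
```

Each user ends up as a set of post IDs, which is the shape of data that a set-based similarity measure works on.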

The full data from Yakkstr as of today includes 21k subscriptions. Getting this into Mahout is straightforward using the FileDataModel class.

Ruby (mahout_model.rb)
MahoutFile = org.apache.mahout.cf.taste.impl.model.file
model = MahoutFile.FileDataModel.new(java.io.File.new("subscriptions.txt"))

Next I need to choose a similarity metric. Since the data is binary (a user either has subscribed to a post or hasn’t), I’ll use TanimotoCoefficientSimilarity. The Tanimoto coefficient measures the similarity between two sets as the size of their intersection divided by the size of their union. We’ll also set up the neighborhood using NearestNUserNeighborhood, which gives us the N most similar users to a given user.

Ruby (mahout_similarity.rb)
MahoutSimilarity = org.apache.mahout.cf.taste.impl.similarity
similarity = MahoutSimilarity.TanimotoCoefficientSimilarity.new(model)
MahoutNeighborhood = org.apache.mahout.cf.taste.impl.neighborhood
neighborhood = MahoutNeighborhood.NearestNUserNeighborhood.new(5, similarity, model)
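To make the two ideas concrete, here is a small plain-Ruby sketch of a Tanimoto coefficient and a nearest-N lookup over the sample rows from earlier. It only illustrates the concepts and is not Mahout's implementation: the `tanimoto` and `nearest_neighbors` helpers are hypothetical, and dropping users with no overlap is a simplification I chose for the sketch.

```ruby
require 'set'

# Tanimoto coefficient of two sets: |A ∩ B| / |A ∪ B|.
def tanimoto(a, b)
  union_size = (a | b).size
  return 0.0 if union_size.zero?
  (a & b).size.to_f / union_size
end

# Hypothetical helper: the n users most similar to user_id, as
# [other_user_id, similarity] pairs, most similar first.
# Users with no posts in common are dropped (an assumption of this sketch).
def nearest_neighbors(subscriptions, user_id, n)
  mine = subscriptions.fetch(user_id)
  others = subscriptions.reject { |other_id, _| other_id == user_id }
  scored = others.map { |other_id, posts| [other_id, tanimoto(mine, posts)] }
  scored.select { |_, score| score > 0 }
        .sort_by { |other_id, score| [-score, other_id] }
        .first(n)
end

# The sample rows from the CSV above, grouped by user.
SUBSCRIPTIONS = {
  1 => Set[4],
  4 => Set[1, 3, 4, 5, 6],
  6 => Set[6],
  7 => Set[2],
  8 => Set[1, 3]
}

p nearest_neighbors(SUBSCRIPTIONS, 4, 2)
# prints [[8, 0.4], [1, 0.2]]
```

User 8 shares two of user 4's five posts, so the score is 2/5. Users 1 and 6 each share one, and the tie is broken by user ID in this sketch.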

The final step is to choose the Mahout recommender implementation. Since I want recommendations for a user based on binary preference data, I’ll use GenericBooleanPrefUserBasedRecommender.

Ruby (mahout_recommender.rb)
MahoutRecommender = org.apache.mahout.cf.taste.impl.recommender
recommender = MahoutRecommender.GenericBooleanPrefUserBasedRecommender.new(model, neighborhood, similarity)
# Ask for 10 recommendations for user 3
recommendations = recommender.recommend(3, 10)
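For intuition about what a user-based recommender does with binary data, here is a simplified plain-Ruby sketch: each post the user hasn't subscribed to is scored by summing the similarities of the neighbors who did subscribe to it. This is an illustration under my own assumptions, not Mahout's code. The `recommend_posts` helper, the tiny data set, and the precomputed neighbor similarities are all made up for the example.

```ruby
require 'set'

# Hypothetical helper: rank posts the user has not subscribed to by the
# summed similarity of the neighbors who are subscribed to them.
# neighbors is a list of [neighbor_id, similarity] pairs.
def recommend_posts(subscriptions, user_id, neighbors, how_many)
  already_subscribed = subscriptions.fetch(user_id)
  scores = Hash.new(0.0)
  neighbors.each do |neighbor_id, similarity|
    (subscriptions.fetch(neighbor_id) - already_subscribed).each do |post_id|
      scores[post_id] += similarity
    end
  end
  scores.sort_by { |post_id, score| [-score, post_id] }.first(how_many)
end

# Made-up data: user 1 shares two of four posts with user 2 (similarity 0.5)
# and one of four with user 3 (similarity 0.25).
EXAMPLE_SUBSCRIPTIONS = {
  1 => Set[10, 11],
  2 => Set[10, 11, 12, 13],
  3 => Set[10, 12, 14]
}
EXAMPLE_NEIGHBORS = [[2, 0.5], [3, 0.25]]

p recommend_posts(EXAMPLE_SUBSCRIPTIONS, 1, EXAMPLE_NEIGHBORS, 2)
# prints [[12, 0.75], [13, 0.5]]
```

Post 12 comes out on top because both neighbors subscribed to it, so it collects both similarity scores.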

That’s it: a collaborative filter using JRuby and Mahout in 10 lines of code (not counting the requires). Full source below.

Ruby (mahout_fullsource.rb)
require 'java'
require 'rubygems'
require 'pry'
# Load Mahout's dependencies, then the Mahout jars themselves
Dir.glob('mahout/lib/*.jar').each { |d| require d }
require 'mahout/mahout-core-0.5.jar'
require 'mahout/mahout-utils-0.5.jar'
require 'mahout/mahout-math-0.5.jar'
# Data model backed by a CSV file of user_id,post_id rows
MahoutFile = org.apache.mahout.cf.taste.impl.model.file
model = MahoutFile.FileDataModel.new(java.io.File.new("post_preferences.csv"))
# Tanimoto similarity suits binary (subscribed or not) data
MahoutSimilarity = org.apache.mahout.cf.taste.impl.similarity
similarity = MahoutSimilarity.TanimotoCoefficientSimilarity.new(model)
# Neighborhood of the 5 most similar users
MahoutNeighborhood = org.apache.mahout.cf.taste.impl.neighborhood
neighborhood = MahoutNeighborhood.NearestNUserNeighborhood.new(5, similarity, model)
# User-based recommender for boolean preference data
MahoutRecommender = org.apache.mahout.cf.taste.impl.recommender
recommender = MahoutRecommender.GenericBooleanPrefUserBasedRecommender.new(model, neighborhood, similarity)
# Ask for 10 recommendations for user 3 and print each post ID with its score
recommendations = recommender.recommend(3, 10)
recommendations.each { |item| puts "#{item.getItemID}: #{item.getValue}" }

I’ve long believed that the core value of JRuby, much like a killer app for an OS, will be the killer wrappers of good Java libraries. There are all these amazing Java libraries that are out of sight, out of mind for the average Rubyist. In a recent post I discussed Lucene, another awesome Java library for which there isn’t a good MRI equivalent.