Danny 'Jay' Donnell

On Code

Collaborative Filtering Using JRuby and Mahout

One of my labors of love is a small community blogging site called Yakkstr. There are a few hundred active users, and the site lets them subscribe to posts so they are notified when a post gets new comments. I’ve wanted to build a collaborative filter that gives users a list of posts they are likely to be interested in but may have missed (i.e., didn’t subscribe to). This is a fairly standard collaborative filtering problem.

Note: The trick with Mahout is choosing the right classes for similarity, neighborhood, and recommender.

I’ll use users’ post subscriptions to determine which other users they are most similar to. The easiest way to get data into Mahout is through a CSV file. The example below has two fields: user_id, post_id

4,1
7,2
4,4
1,4
4,3
8,1
8,3
4,5
4,6
6,6
...

The full data from Yakkstr as of today includes 21k subscriptions. Getting this into Mahout is straightforward: I’ll use the FileDataModel class.

Ruby (mahout_model.rb)
MahoutFile = org.apache.mahout.cf.taste.impl.model.file
model = MahoutFile.FileDataModel.new(java.io.File.new("subscriptions.txt"))
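
To see what that file represents before handing it to Mahout, here is a minimal plain-Ruby sketch (no Mahout involved) that parses the same user_id,post_id lines into a map from each user to the set of posts they subscribe to. The `load_subscriptions` helper and the inline sample string are illustrative assumptions, not part of the Yakkstr code or of FileDataModel.

```ruby
require 'set'

# Hypothetical helper: turn "user_id,post_id" lines into
# { user_id => Set[post_id, ...] }.
def load_subscriptions(lines)
  subscriptions = Hash.new { |hash, user| hash[user] = Set.new }
  lines.each do |line|
    next if line.strip.empty? # skip blank lines
    user_id, post_id = line.strip.split(',').map { |field| Integer(field) }
    subscriptions[user_id] << post_id
  end
  subscriptions
end

# The ten sample rows from the CSV excerpt above
sample = "4,1\n7,2\n4,4\n1,4\n4,3\n8,1\n8,3\n4,5\n4,6\n6,6\n"
subscriptions = load_subscriptions(sample.each_line)

subscriptions.each do |user, posts|
  puts "user #{user}: #{posts.to_a.sort.join(', ')}"
end
```

Because the data is binary, a set of post ids per user is all the information there is: no rating column is needed.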

Next I need to choose a similarity metric. Since the data is binary (a user either has subscribed to a post or hasn’t), I’ll use TanimotoCoefficientSimilarity, which is a good way to measure the similarity between sets. We’ll also set up the neighborhood using NearestNUserNeighborhood, which gives us the N most similar users to a given user.

Ruby (mahout_similarity.rb)
MahoutSimilarity = org.apache.mahout.cf.taste.impl.similarity
similarity = MahoutSimilarity.TanimotoCoefficientSimilarity.new(model)
MahoutNeighborhood = org.apache.mahout.cf.taste.impl.neighborhood
neighborhood = MahoutNeighborhood.NearestNUserNeighborhood.new(5, similarity, model)
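
For intuition about what these two classes compute, here is a minimal plain-Ruby sketch of the idea, not Mahout’s implementation. The Tanimoto coefficient of two sets is the size of their intersection divided by the size of their union. The `tanimoto` and `nearest_users` helpers are hypothetical names, and the data is the ten sample rows from the CSV excerpt above.

```ruby
require 'set'

# Tanimoto (Jaccard) coefficient: |A ∩ B| / |A ∪ B|
def tanimoto(a, b)
  union_size = (a | b).size
  return 0.0 if union_size.zero?
  (a & b).size.to_f / union_size
end

# Hypothetical helper: the n users most similar to `user`, best first.
# Ties are broken by user id so the order is deterministic.
def nearest_users(subscriptions, user, n)
  mine = subscriptions.fetch(user)
  (subscriptions.keys - [user])
    .map { |other| [other, tanimoto(mine, subscriptions[other])] }
    .sort_by { |other, score| [-score, other] }
    .first(n)
end

# user_id => subscribed post ids, taken from the sample rows
subscriptions = {
  4 => Set[1, 3, 4, 5, 6],
  8 => Set[1, 3],
  1 => Set[4],
  6 => Set[6],
  7 => Set[2]
}

nearest_users(subscriptions, 4, 3).each do |other, score|
  puts "user #{other}: #{score.round(2)}"
end
```

On this tiny sample, user 8 shares two of the five posts in the combined set with user 4, so their similarity is 2/5 = 0.4, which makes user 8 the nearest neighbor of user 4.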

The final step is to choose the Mahout recommender implementation. Since I want recommendations for a user based on binary preference data, I’ll use GenericBooleanPrefUserBasedRecommender.

Ruby (mahout_recommender.rb)
MahoutRecommender = org.apache.mahout.cf.taste.impl.recommender
recommender = MahoutRecommender.GenericBooleanPrefUserBasedRecommender.new(model, neighborhood, similarity)
# Ask for 10 recommendations for user 3
recommendations = recommender.recommend(3, 10)
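
As a rough picture of what a user-based recommender over boolean data does, here is a simplified plain-Ruby sketch. It is an assumption-laden illustration of the general technique, not Mahout’s code: find the nearest neighbors, then score each post the user hasn’t subscribed to by summing the similarity of the neighbors who have. The `tanimoto` and `recommend` helpers are hypothetical, and the data is the sample rows from the CSV excerpt above.

```ruby
require 'set'

# Tanimoto (Jaccard) coefficient: |A ∩ B| / |A ∪ B|
def tanimoto(a, b)
  union_size = (a | b).size
  return 0.0 if union_size.zero?
  (a & b).size.to_f / union_size
end

# Hypothetical helper: up to `how_many` [post_id, score] pairs for `user`,
# using the `neighborhood_size` most similar users as the neighborhood.
def recommend(subscriptions, user, how_many, neighborhood_size)
  mine = subscriptions.fetch(user)
  neighbors = (subscriptions.keys - [user])
              .map { |other| [other, tanimoto(mine, subscriptions[other])] }
              .sort_by { |other, score| [-score, other] }
              .first(neighborhood_size)

  # Each unseen post accumulates the similarity of every neighbor who has it.
  scores = Hash.new(0.0)
  neighbors.each do |other, similarity|
    (subscriptions[other] - mine).each { |post| scores[post] += similarity }
  end

  scores.sort_by { |post, score| [-score, post] }.first(how_many)
end

subscriptions = {
  4 => Set[1, 3, 4, 5, 6],
  8 => Set[1, 3],
  1 => Set[4],
  6 => Set[6],
  7 => Set[2]
}

recommend(subscriptions, 8, 2, 2).each do |post, score|
  puts "post #{post}: #{score.round(2)}"
end
```

Posts the user already subscribes to are never recommended, which matches the goal of surfacing posts they may have missed.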

That’s it, a collaborative filter using JRuby and Mahout in 10 lines of code (not counting the requires). Full source below.

Ruby (mahout_fullsource.rb)
require 'java'
require 'rubygems'
require 'pry'
Dir.glob('mahout/lib/*.jar').each { |d| require d }
require 'mahout/mahout-core-0.5.jar'
require 'mahout/mahout-utils-0.5.jar'
require 'mahout/mahout-math-0.5.jar'
MahoutFile = org.apache.mahout.cf.taste.impl.model.file
model = MahoutFile.FileDataModel.new(java.io.File.new("post_preferences.csv"))
MahoutSimilarity = org.apache.mahout.cf.taste.impl.similarity
similarity = MahoutSimilarity.TanimotoCoefficientSimilarity.new(model)
MahoutNeighborhood = org.apache.mahout.cf.taste.impl.neighborhood
neighborhood = MahoutNeighborhood.NearestNUserNeighborhood.new(5, similarity, model)
MahoutRecommender = org.apache.mahout.cf.taste.impl.recommender
recommender = MahoutRecommender.GenericBooleanPrefUserBasedRecommender.new(model, neighborhood, similarity)
recommendations = recommender.recommend(3, 10)
# Print each recommended post id with its estimated strength
recommendations.each { |r| puts "#{r.getItemID}: #{r.getValue}" }

I’ve long believed that the core value of JRuby, much like a killer app for an OS, will be the killer wrappers of good Java libraries. There are all these amazing Java libraries that are out of sight, out of mind for the average rubyist. In a recent post I discussed Lucene, another awesome Java library for which there isn’t a good MRI equivalent.

Setting Up Aquamacs for Clojure and General Goodness

tl;dr

If you’re already proficient with elisp and emacs, you can see my config file here. Everything I’m using was installed via package.el from the marmalade repo.

Setting up Aquamacs

Configuring emacs is getting easier and easier due to package.el and Marmalade.

The first step is to install package.el if you’re using emacs 23 or lower (package.el will be included with emacs 24).

$ cd ~/Library/Preferences/Aquamacs\ Emacs/
$ curl -o package.el http://repo.or.cz/w/emacs.git/blob_plain/1a0a666f941c99882093d7bd08ced15033bc3f0c:/lisp/emacs-lisp/package.el

Next, edit ~/Library/Preferences/Aquamacs Emacs/Preferences.el

Restart Aquamacs. You can now view all packages in marmalade with M-x package-list-packages. To install one, click on its name, which opens a pop-up, then click ‘install’.

Here are the packages I have installed.

  • anything
  • anything-config
  • anything-match-plugin
  • coffee-mode
  • clojurescript-mode
  • haml-mode
  • sass-mode
  • scss-mode
  • starter-kit-ruby
  • starter-kit-js
  • color-theme
  • color-theme-sanityinc-solarized

A quick rundown of my favorite features.

anything

I use anything in place of switch-buffer. Try it, it’s great. I’ve also added the anything-git-project function, which matches against all files in the git repo that the current buffer is in. This is like Cmd-T in TextMate.

ido

I’ve made ido much more friendly. Seeing is believing, so try it.

themes

I haven’t found the perfect theme, but color-theme-sanityinc-solarized is very close.

Now it’s clojure time

First install leiningen; instructions are here.

Install the swank-clojure plugin

$ lein plugin install swank-clojure 1.3.2

Install clojure-mode via package.el (M-x package-list-packages)

Now let’s create a new clojure project with lein

$ lein new test-project
$ cd test-project
$ lein deps

In Aquamacs, open one of the clojure files from the project then run (this can take a few seconds)

M-x clojure-jack-in

Yay, you’re now in slime with clojure, and the classpath for your lein deps is set up properly. I’m a big fan of paredit and have a hard time writing lisp without it these days. Give it a try if you aren’t familiar with it, but expect it to take a day or two to get used to. This Cheatsheet will help.

Lastly, you can see my entire emacs configuration here

JRuby and Lucene: K I S S I N G

I LOVE Ruby. The language is fun to use, the community is vibrant with good taste, and the ecosystem is diverse. JRuby is a major part of this diversity, and many of the rubyists I run into don’t take advantage of it. sadface

This is a mistake. Let’s see how easy it is to use Lucene in JRuby. Lucene is a search engine, but the components of a search engine (tokenizers, term frequencies, etc.) are very useful on their own. Below are some examples. I left out some of the code that sets up the modules in JRuby; you can check out the full runnable source on github.

Ruby (lucene.rb)
# Tokenization (This is hideous, it's Lucene's fault)
puts 'Tokens'
t = Lucene::StandardTokenizer.new(Lucene::Version::LUCENE_CURRENT, java.io.StringReader.new("I am 127.0.0.1"))
charTermAttribute = t.getAttribute Lucene::TokenAttributes::CharTermAttribute.java_class
while t.incrementToken
  puts charTermAttribute.to_s
end
puts
# searching the index
puts 'searching'
# assuming we already have an index at ./test.index
# check the github link to see the code for creating the index
index = Lucene::Store::FSDirectory.open(java.io.File.new('test.index'))
searcher = Lucene::Search::IndexSearcher.new(index)
t = Lucene::Index::Term.new("title", "some")
query = Lucene::Search::TermQuery.new(t)
docs = searcher.search(query, 10)
# iterate over the returned hits only; totalHits can exceed the 10 requested
docs.scoreDocs.each do |hit|
  puts searcher.doc(hit.doc).get("title")
end
puts
# get the term frequencies of a term
# (assumes the reader opens the same index directory as the searcher above)
reader = Java::OrgApacheLuceneIndex::IndexReader.open(index)
t = org.apache.lucene.index.Term.new("title", "some")
freqs = reader.term_docs(t)
term_count = 0
while freqs.next
  term_count += freqs.freq
end
puts "Term Count for 'some': " + term_count.to_s
puts
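
To make clear what the term-count loop above is adding up, here is a minimal plain-Ruby sketch of the same idea without Lucene: tokenize each document, then total how often a term occurs across all of them. The `tokenize` helper is a deliberately naive stand-in (it is not equivalent to StandardTokenizer), and the titles are made-up sample data, not the contents of the index on github.

```ruby
# Naive stand-in for a tokenizer: lowercase, then keep runs of letters and digits.
def tokenize(text)
  text.downcase.scan(/[a-z0-9]+/)
end

# Made-up titles standing in for the "title" field of indexed documents
titles = ["Some title", "Another title", "some more of some thing"]

# Total occurrences of each term across all titles
term_counts = Hash.new(0)
titles.each do |title|
  tokenize(title).each { |term| term_counts[term] += 1 }
end

puts "Term Count for 'some': #{term_counts['some']}"
```

This toy tokenizer is much cruder than a real one: for example, it breaks 127.0.0.1 into four separate numbers.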

Side Note: The Lucene API is awful, and its use is ugly even in JRuby, but it’s not hard to wrap it in a warm blanket of Ruby. I once helped do this, but the wrapper hasn’t been updated to use the latest version of Lucene. It’s a great library if you don’t care about that, so check it out.

Is Your Idea Clearly Expressed in Your Code?

The following passage from The Algorithm Design Manual jumped out at me as I read it.

“The heart of any algorithm is an idea. If your idea is not clearly revealed when you express an algorithm, then you are using too low-level a notation to describe it”



This is talking about the choice of English, pseudocode, or real code to express an algorithm, but it made me think of my frustrations with some programming languages, Java being one of them. Even in small examples, the noise of Java obscures the idea. The two pieces of code below do the exact same thing, and I instantly see what the Ruby version is doing. The Java code requires my brain to work at parsing out the relevant parts.

Java (clearly.java)
Collection<String> output = Collections2.transform(cur.toArray(), new Function<DBObject, String>() {
    public String apply(final DBObject input) {
        return input.toString();
    }
});
String o = Joiner.on(",").join(output);

Ruby (clearly.rb)
o = cur.map { |e| e.to_s }.join(",")

I wonder how much of a productivity hit this adds for each programmer who reads the code, and what effect it has on bug rates, given that there is a measurable relationship between LOC and bugs.

Side Note: This isn’t about static vs dynamic typing. The Scala code for this is virtually identical to the Ruby code.