Indexing and Searching del.icio.us

This blog was originally published on November 27th, 2007. It was published on TypePad and can be found here.

**Background**

I am a very heavy del.icio.us user. Whenever I get a new computer, the first thing I do is install the del.icio.us Firefox plugin. I use it in combination with a couple of other sharing programs in IT101 at Bentley to get my students thinking about distributed, web-based sharing.

Intellectually, I understand the benefit of tagging. I remember being really excited by folksonomies. The problem is that I can never remember the tags I apply to a story. Was it future learning education teaching time media literacy, or was it time? Well, come to find out it wasn't time, and I spent a long time looking for that link in my del.icio.us feed.

The virtue of a folksonomy is that you are crowd-sourcing the classification process. Maybe the YouTube video of the schmuck showing off shouldn't be tagged schmuck, but asshole. You can't really figure this out until you tap into the wisdom of the crowd and determine which tag is used most for classification.

I don't really take advantage of del.icio.us's sharing feature. I don't compare my bookmarks against anyone else's, and there the problem lies. If you rely on del.icio.us as a central repository for all the goodies you find, you need a pretty strict tagging methodology in order to find anything. I'd argue that you would need to implement a complex classification system mimicking Dewey or the LOC. If you weren't classifying based on Author - Publisher - Title - Date, what could you tag on?

Well, I have essentially given up on the idea of using del.icio.us's tags for organizing my bookmarks. I have built a Nutch search engine to crawl and index my del.icio.us bookmarks. Now if I need to find an article and I remember that it was on Time.com, I type in time and instantly find what I am looking for.

Using Nutch came out of an inability to find anything using del.icio.us's built-in search function. Their search doesn't actually look at the pages your bookmarks link to, just at your links as stored on del.icio.us. Indexing the page your bookmark points at sounds like a no-brainer to me, and I wonder why Yahoo (parent of del.icio.us) doesn't implement this. It takes a significant cognitive load off of the user. If he or she remembers that the article was on education (the word education is used 21 times in that article), Nutch will find it, whereas del.icio.us's search requires that you tagged that article with the word "education" in order to find it.
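To make the difference concrete, here is a minimal sketch of tag-only lookup versus a full-text inverted index. It is not how Nutch or del.icio.us is actually implemented; the bookmark data, the example.com URLs, and the function names are all made up for illustration.

```python
import re
from collections import defaultdict

# Hypothetical bookmarks: URL -> (tags I applied, text of the linked page).
BOOKMARKS = {
    "http://example.com/future-of-learning": (
        {"future", "learning"},
        "An essay on education, media literacy and how education is changing.",
    ),
    "http://example.com/sourdough": (
        {"cooking"},
        "A recipe for sourdough bread.",
    ),
}


def search_tags(term):
    """Tag-only search: matches only when the term was applied as a tag."""
    return sorted(url for url, (tags, _) in BOOKMARKS.items() if term in tags)


def build_index(bookmarks):
    """Inverted index: each lowercased word maps to the URLs whose text contains it."""
    index = defaultdict(set)
    for url, (_, text) in bookmarks.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(url)
    return index


def search_text(index, term):
    """Full-text search: look the term up in the inverted index."""
    return sorted(index.get(term.lower(), set()))


INDEX = build_index(BOOKMARKS)
print(search_tags("education"))         # tag search: nothing was tagged "education"
print(search_text(INDEX, "education"))  # full-text search finds the essay
```

The essay was never tagged "education", so the tag search comes back empty, while the content index still finds it because the word appears in the page itself.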

**Nutch Installation**

I installed Nutch on my Mac OS X Server machine. I downloaded the latest version of Nutch, 0.9. Nutch requires a JDK of at least 1.5 and an installation of Tomcat 5.

I created a JAVA_HOME variable of: JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Home/

And I installed tomcat5 via DarwinPorts: sudo port install tomcat5

I untar'd nutch.0.9.tar.gz into /usr/local/

I then edited /usr/local/nutch.0.9/conf/nutch-site.xml to set http.agent.name to delicious-thumbnail. According to del.icio.us/robots.txt, it is the only User-agent allowed to crawl. My nutch-site.xml file looks like:

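    <configuration>
      <property>
        <name>http.agent.name</name>
        <value>delicious-thumbnail</value>
        <description>TestNutch</description>
      </property>
      <property>
        <name>http.agent.description</name>
        <value>NutchSpiderman</value>
        <description>NutchSpiderman</description>
      </property>
      <property>
        <name>http.agent.url</name>
        <value>dataero.com</value>
        <description>dataero.com</description>
      </property>
      <property>
        <name>http.agent.email</name>
        <value>tom@dataero.com</value>
      </property>
    </configuration>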

I then changed directories into /usr/local/nutch.0.9/ and created a directory called seeds, with a file called urls inside it. The urls file contains one link: del.icio.us/mcgonagletom

Once the URL seed list was created, I edited /usr/local/nutch.0.9/conf/crawl-urlfilter.txt and changed the last line in the file from -. (which skips everything not explicitly accepted earlier in the file) to +. (which lets the crawl continue past the del.icio.us domain name, for example to time.com).
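Why does that one-character change matter so much? As I understand it, the filter file is a list of regular expressions prefixed with + (accept) or -, and the first pattern that matches a URL decides its fate. The sketch below imitates that behavior in Python; it is not Nutch's code, the rule lines are a shortened illustration rather than the real contents of crawl-urlfilter.txt, and parse_rules and accepts are names I made up.

```python
import re

# Illustrative rules only; the real crawl-urlfilter.txt has more lines.
COMMON_RULES = r"""
# skip non-web schemes
-^(file|ftp|mailto):
# accept the bookmark site itself
+^http://([a-z0-9]*\.)*del\.icio\.us/
"""

STRICT_RULES = COMMON_RULES + "-.\n"  # last line as shipped: skip everything else
OPEN_RULES = COMMON_RULES + "+.\n"    # last line after my edit: accept everything else


def parse_rules(text):
    """Turn '+regex' / '-regex' lines into (allow, compiled_pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comments carry no rule
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with + or -: " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """The first rule whose pattern is found in the URL decides; no match means skip."""
    for allow, pattern in rules:
        if pattern.search(url):
            return allow
    return False


strict = parse_rules(STRICT_RULES)
relaxed = parse_rules(OPEN_RULES)
print(accepts(strict, "http://www.time.com/article"))   # stops at the final -.
print(accepts(relaxed, "http://www.time.com/article"))  # let through by the final +.
print(accepts(relaxed, "mailto:tom@dataero.com"))       # an earlier - rule still wins
```

Because the earlier rules are checked first, switching the last line to +. only opens up URLs that nothing above it has already rejected.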

Now that those edits are done, you can crawl your del.icio.us bookmarks with:

bin/nutch crawl urls -dir crawl -depth 3 -topN 50
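The two numbers at the end of that command control how far the crawl goes. My understanding from the Nutch tutorial is that -depth is the number of fetch rounds out from the seed page and -topN is the maximum number of pages fetched in each round. The toy simulation below shows that idea on a made-up link graph; it is a simplification, since real Nutch picks the top pages by score, while this sketch just takes them in the order they were discovered.

```python
# Hypothetical link graph: page -> pages it links to.
LINKS = {
    "seed": ["a", "b", "c"],
    "a": ["d", "e"],
    "b": ["e", "f"],
    "c": [],
    "d": ["g"],
    "e": [],
    "f": [],
    "g": [],
}


def crawl(seeds, depth, top_n):
    """Run `depth` rounds, fetching at most `top_n` not-yet-fetched pages per round."""
    known = list(seeds)  # every URL discovered so far, in discovery order
    fetched = []
    for _ in range(depth):
        batch = [url for url in known if url not in fetched][:top_n]
        if not batch:
            break  # nothing left to fetch
        for url in batch:
            fetched.append(url)
            for link in LINKS.get(url, []):
                if link not in known:
                    known.append(link)  # newly discovered, eligible next round
    return fetched


print(crawl(["seed"], depth=3, top_n=2))
print(crawl(["seed"], depth=3, top_n=50))
```

With a small topN, pages discovered early can crowd out later ones in each round, and anything more than depth rounds away from the seed is never fetched at all.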

It took about 24 hours for my Nutch installation to crawl and index my bookmarks. I ended up with over 175 MB of text files in my index once the crawl was done.

If you are interested in search in general, I would highly recommend investigating the wonderful Lucene-based Nutch project. If you become particularly interested, you can even distribute your crawling and indexing across several machines using the wonderful distributed processing system Hadoop.