Indexing and Searching

This post was originally published on TypePad on November 27th, 2007.


Background

I am a very heavy del.icio.us user. Whenever I get a new computer, the first thing I do is install the Firefox plugin. I use it in combination with a couple of other sharing programs in IT101 at Bentley to get my students thinking about distributed, web-based sharing.

Intellectually, I understand the benefit of tagging. I remember being really excited by folksonomies. The problem is that I can never remember the tags I apply to a story. Was it "future learning education teaching time media literacy", or was it "time"? Well, come to find out it wasn't "time", and I spent a long time looking for that link in my feed.

The virtue of a folksonomy is that you are crowd-sourcing the classification process. Maybe the YouTube video of the schmuck showing off shouldn't be tagged "schmuck" but "asshole". You can't really figure this out until you tap into the wisdom of the crowd and determine which tag is used most for classification.

I don't really take advantage of del.icio.us's sharing features. I don't compare my bookmarks against anyone else's, and therein lies the problem. If you rely on del.icio.us as a central repository for all the goodies you find, you need a pretty strict tagging methodology in order to find anything. I'd argue that you would need to implement a complex classification system mimicking the Dewey Decimal or Library of Congress (LOC) system. If you weren't classifying based on Author - Publisher - Title - Date, what could you tag on?

Well, I have essentially given up on the idea of using del.icio.us's tags to organize my bookmarks. I have built a Nutch search engine to crawl and index them. Now if I need to find an article and I remember that it had something to do with time, I type in "time" and instantly find what I am looking for.

Using Nutch came out of an inability to find anything using del.icio.us's built-in search function. Their search doesn't actually look at what a bookmark links to, just at your links as hosted by del.icio.us. Indexing the page your bookmark points at sounds like a no-brainer to me, and I wonder why Yahoo (parent of del.icio.us) doesn't implement this. It takes a significant cognitive load off the user. If he or she remembers that the article was on education (the word education is used 21 times in that article), Nutch will find it, whereas del.icio.us's search only finds the article if you tagged it with the word "education".

Nutch Installation

I installed Nutch on my Mac OS X Server machine. I downloaded the latest version of Nutch, 0.9. Nutch requires a JDK of at least version 1.5 and an installation of Tomcat 5.

I set the JAVA_HOME variable to:

JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Home/

And I installed tomcat5 via Darwin Ports:

sudo port install tomcat5

I untar'd nutch.0.9.tar.gz into /usr/local/

And then I edited /usr/local/nutch.0.9/conf/nutch-site.xml to use the delicious-thumbnail user agent. According to the site's robots.txt, it is the only User-agent allowed to crawl. My nutch-site.xml file sets the following properties:

http.agent.name: value delicious-thumbnail, description TestNutch
http.agent.description: value NutchSpiderman, description NutchSpiderman
http.agent.url
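For anyone who wants to see the XML shape, here is a minimal sketch of such a file, written into a scratch directory by a short shell script so it can be inspected before anything is copied into conf/. The property names http.agent.name, http.agent.description, and http.agent.url are standard Nutch settings. Putting delicious-thumbnail under http.agent.name is an assumption based on the description above, and the http.agent.url value is deliberately left empty.

```shell
# Scratch directory: a stand-in for /usr/local/nutch.0.9/conf
# (assumption, so the sketch never touches a real install).
work="$(mktemp -d)"
site="$work/nutch-site.xml"

# Quoted 'EOF' keeps the shell from expanding anything inside the XML.
cat > "$site" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <!-- assumed: the user-agent string belongs in http.agent.name -->
    <name>http.agent.name</name>
    <value>delicious-thumbnail</value>
    <description>TestNutch</description>
  </property>
  <property>
    <name>http.agent.description</name>
    <value>NutchSpiderman</value>
    <description>NutchSpiderman</description>
  </property>
  <property>
    <!-- value intentionally left empty: fill in your own URL -->
    <name>http.agent.url</name>
    <value></value>
  </property>
</configuration>
EOF

# List the property names that were written.
grep '<name>' "$site"
```

Once the file looks right, it can be copied over conf/nutch-site.xml in the Nutch directory.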

I then changed directories into /usr/local/nutch.0.9/ and created a directory called seeds, with a file called urls inside seeds. The urls file contains a single link, the starting point for the crawl.
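The seed list step can be sketched in a few lines of shell. This version works in a scratch directory rather than the real Nutch directory, and the URL is a placeholder (http://example.com/bookmarks is hypothetical, not the link used in this post), so substitute your own bookmarks page.

```shell
# Scratch directory standing in for /usr/local/nutch.0.9 (assumption).
work="$(mktemp -d)"

# The seed directory holds one or more plain-text files of URLs, one per line.
mkdir -p "$work/seeds"

# Placeholder seed URL (hypothetical): replace with your own bookmarks page.
printf '%s\n' 'http://example.com/bookmarks' > "$work/seeds/urls"

# Show what the crawler would start from.
cat "$work/seeds/urls"
```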

Once the URL seed list was created, I edited /usr/local/nutch.0.9/conf/crawl-urlfilter.txt and changed the last line in the file from -. (which skips everything not explicitly accepted earlier in the file) to +. (which lets the crawler traverse past the seed's domain name and follow links to other sites).
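The same edit can be scripted. The sketch below runs against a shortened stand-in for crawl-urlfilter.txt in a scratch directory (the file contents here are illustrative, not a copy of the shipped file) and rewrites only the last line, and only when that line is exactly -. so an already-edited file is left alone. Rules in this file are tried top to bottom and the first match wins, which is why a final +. accepts anything the earlier lines did not reject.

```shell
# Scratch copy so the sketch never touches a real install (stand-in file).
work="$(mktemp -d)"
filter="$work/crawl-urlfilter.txt"

cat > "$filter" <<'EOF'
# skip file:, ftp:, and mailto: urls
-^(file|ftp|mailto):
# catch-all rule for anything not matched above
-.
EOF

# "$" addresses the last line; the substitution fires only if it is exactly "-."
# Writing to a new file and moving it avoids the non-portable "sed -i".
sed '$ s/^-\.$/+./' "$filter" > "$filter.new" && mv "$filter.new" "$filter"

# Print the last line to confirm the change.
tail -n 1 "$filter"
```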

Now that those edits are done, you can crawl your bookmarks with:

bin/nutch crawl seeds -dir crawl -depth 3 -topN 50

It took about 24 hours for my Nutch installation to crawl and index my bookmarks. Once the crawl was done, I had over 175 MB of text files in my index.

If you are interested in search in general, I would highly recommend investigating the wonderful Lucene-based Nutch project. If you become particularly interested, you can even distribute your crawling and indexing across several machines using Hadoop, the distributed processing system.