Nagendra Nagarajayya (Comparison searches made on 12/21/2010)
The test queried the Perl index, built from the contents of perl.org (including perldoc.perl.org and the forum messages at www.nntp.perl.org). Google is the best at search, while Lucene is very popular and returns very relevant search results. RankingAlgorithm compares very well to Google (nearly the same results for some queries) and returns much better results than Lucene for this mixture of documents, FAQs, etc.
The terms below were used in the test (they were chosen at random but are meaningful for Perl). The results for the first two query terms are shown in Fig 1a/Fig 1b/Fig 1c and Fig 2a/Fig 2b/Fig 2c; the outcomes for the rest are summarized in parentheses. You can also try this yourself with the demo at http://solr-ra.tgels.com/rankingsearcch.jsp by entering any Perl-related query. First select the Perl index in the index options, then choose the RankingAlgorithm library in Document mode and click search Solr-RA to get the results. Now go back, select the Lucene library, and click search Solr-RA to see the results using the Lucene library. To search the perl.org Google site index, go to perldoc.perl.org, enter a query term, click search, and scroll to the bottom of the page to find a Google site search text box; enter the query term there to see the results. The URL below, copied from the search, might also work:
Test Results for Query terms entered:
1. extracting words with regular expressions (same as Google, Lucene off), see Fig 1a/Fig 1b/Fig 1c to compare with Google and Lucene
2. regular expressions usage (similar to Google, Lucene off), see Fig 2a/Fig 2b/Fig 2c to compare with Google and Lucene
3. regular expressions questions (similar to Google, Lucene not bad)
4. regex lower case (close to Google, Lucene not bad)
5. return lower case (Google better but very close, Lucene not close)
6. embed perl (same as Google, Lucene less relevant)
7. how to embed perl in c programs (same as Google, Lucene less relevant)
8. embed perl in C (Lucene and Google better)
9. convert scalar to array (different but both relevant, Lucene not close)
10. convert string to date (similar to Google, Lucene not relevant)
11. date to milliseconds (very relevant, Google different, Lucene similar but not so relevant)
12. perl program arguments (same as or similar to Google, Lucene different)
13. extracting string from arrays (better than Google, Lucene not so good)
Screen Snapshots showing comparison between Google/RankingAlgorithm/Lucene
1. extracting words with regular expressions
Fig 1a
Fig 1b
Fig 1c
2. regular expressions usage
Fig 2a
Fig 2b
Fig 2c
Perl Index Creation Steps:
The Perl index was created by downloading the docs from perldoc.perl.org. Next, perl.org and nntp.perl.org were crawled using httrack. The downloaded files were then uploaded to Solr and indexed with Apache Tika/Solr Cell, using the Java code at http://tgels.com/downloads/docs/IndexingSolrWithTikaAndJava.html. A script, postfiles.sh, available under the example directory, was then used to submit HTML, PDF, Word, txt, etc. documents to Solr for indexing.
Steps to recreate index:
1. download the docs from perldoc.perl.org and extract them
2. crawl perl.org and nntp.perl.org with httrack
3. run postfiles.sh /pathtoperldocs http://localhost:8983/solr/perl (note: the test installation had multiple cores; perl is a core)
4. run postfiles.sh /path to perl.org files/ http://localhost:8983/solr/perl
5. run postfiles.sh /path to nntp.perl.org files/ http://localhost:8983/solr/perl
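The contents of postfiles.sh are not reproduced here, so the following is only a minimal Python sketch of what such a step might look like, assuming the files are sent one at a time to Solr Cell's /update/extract handler on the perl core. The function names (build_extract_url, iter_files, post_directory) and the choice of the file path as the document id are illustrative assumptions, not taken from the original script.

```python
import os
import urllib.parse
import urllib.request


def build_extract_url(core_url, doc_id, commit=False):
    """Build a Solr Cell (/update/extract) URL for one document.

    Assumption: the file path is used as the unique id via literal.id.
    """
    params = {"literal.id": doc_id}
    if commit:
        params["commit"] = "true"
    return core_url.rstrip("/") + "/update/extract?" + urllib.parse.urlencode(params)


def iter_files(root):
    """Yield every file under root in a stable, sorted order."""
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()
        for name in sorted(filenames):
            yield os.path.join(dirpath, name)


def post_directory(root, core_url):
    """Send each file under root to Solr for extraction, then commit once."""
    count = 0
    for path in iter_files(root):
        with open(path, "rb") as handle:
            body = handle.read()
        request = urllib.request.Request(
            build_extract_url(core_url, path),
            data=body,  # supplying data makes this a POST request
            headers={"Content-Type": "application/octet-stream"},
        )
        with urllib.request.urlopen(request) as response:
            response.read()
        count += 1
    # A single commit at the end is cheaper than committing after every file.
    with urllib.request.urlopen(core_url.rstrip("/") + "/update?commit=true") as response:
        response.read()
    return count
```

With a running Solr instance, a call such as post_directory("/pathtoperldocs", "http://localhost:8983/solr/perl") would correspond roughly to step 3 above.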
Default text field schema for the Perl index (Solr core):
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory"
            ignoreCase="true"
            words="stopwords.txt"
            enablePositionIncrements="true"
    />
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory"
            ignoreCase="true"
            words="stopwords.txt"
            enablePositionIncrements="true"
    />
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
Note: Stemming was turned off.
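To make the difference between the index and query analyzers concrete, here is a rough Python approximation of the chain: whitespace tokenizing, stop word removal, word-delimiter splitting, then lowercasing. It is a simplified sketch and not Solr's implementation: the STOPWORDS set is a placeholder (the real stopwords.txt is not shown), the query-side synonym filter is left out, only ASCII letters and digits are handled, and token positions are ignored.

```python
import re
from itertools import groupby

# Placeholder stop list; the real stopwords.txt is not shown in this document.
STOPWORDS = {"a", "an", "the", "to", "in", "with"}


def word_delimiter(token, catenate):
    """Approximate WordDelimiterFilter: split on non-alphanumerics,
    lower-to-upper case changes, and letter/digit boundaries."""
    parts = re.findall(r"[A-Z]*[a-z]+|[A-Z]+|[0-9]+", token)
    out = list(parts)
    if catenate:
        # catenateWords/catenateNumbers: also emit each run of adjacent
        # word parts (or number parts) joined together.
        for _, run in groupby(parts, key=str.isdigit):
            run = list(run)
            if len(run) > 1:
                out.append("".join(run))
    return out


def analyze(text, mode="index"):
    """Run the approximate chain; mode is "index" or "query"."""
    tokens = text.split()  # whitespace tokenizer
    tokens = [t for t in tokens if t.lower() not in STOPWORDS]  # ignoreCase="true"
    result = []
    for token in tokens:
        result.extend(word_delimiter(token, catenate=(mode == "index")))
    return [t.lower() for t in result]  # lowercase filter runs last


print(analyze("Embed mod_perl in C", "index"))  # ['embed', 'mod', 'perl', 'modperl', 'c']
print(analyze("Embed mod_perl in C", "query"))  # ['embed', 'mod', 'perl', 'c']
```

The only difference this sketch models between the two analyzers is catenation: at index time the joined form (modperl) is stored next to the parts, so a query for either form can match, while the query side emits only the parts.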