Thursday, December 20, 2007

Indexing, Searching documents - Full-text

Lucene, Xapian and Swish-e.org are open source solutions. A comprehensive list.

PostgreSQL full-text search (tsearch2) is better than MySQL full-text search. Postgres performs as well as Lucene; MySQL doesn't come close.

Yes, Lucene is specifically designed for search, but there are many advantages to using something like PostgreSQL if it performs on par. The details of the search can be described more articulately in SQL than in a search grammar. Additionally, it would allow us to later join the search results against "other" data for the purposes of simple intersection as well as altering the relevance based on some piece of data known outside of Lucene.
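To make the "join the search results against other data" point concrete, here is a rough sketch of what such a query could look like. It only builds the SQL text and parameters; it does not talk to a database. The table and column names (documents, body_tsv, popularity, boost) are made up for illustration, the %(name)s placeholders assume a psycopg2-style driver, and the functions used (plainto_tsquery, ts_rank, the @@ operator) are the built-in full-text ones from PostgreSQL 8.3 onward rather than the older tsearch2 contrib names.

```python
def build_search_query(terms, min_boost=None):
    """Return (sql, params) for a ranked full-text search joined to other data.

    All table and column names here are hypothetical.
    """
    sql = [
        "SELECT d.id, d.title,",
        # Relevance from the full-text rank, adjusted by data kept outside the index.
        "       ts_rank(d.body_tsv, q) * COALESCE(p.boost, 1.0) AS score",
        "FROM documents AS d",
        "CROSS JOIN plainto_tsquery('english', %(terms)s) AS q",
        "LEFT JOIN popularity AS p ON p.document_id = d.id",
        "WHERE d.body_tsv @@ q",
    ]
    params = {"terms": terms}
    if min_boost is not None:
        # Simple intersection with a condition on the joined table.
        sql.append("  AND COALESCE(p.boost, 1.0) >= %(min_boost)s")
        params["min_boost"] = min_boost
    sql.append("ORDER BY score DESC")
    sql.append("LIMIT 20")
    return "\n".join(sql), params


if __name__ == "__main__":
    query, args = build_search_query("full text search", min_boost=1.5)
    print(query)
    print(args)
```

With a driver such as psycopg2, the returned pair would be passed straight to cursor.execute(query, args). The point is only that filtering and re-ranking against other tables happen in the same statement as the text match.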

If we go ahead with database-based indexing, it would be better to take a look at Sphinx. It is being used at Curse.

Sphinx is a full-text search engine, distributed under GPL version 2. Commercial license is also available for embedded use.

Generally, it's a standalone search engine, meant to provide fast, size-efficient and relevant fulltext search functions to other applications. Sphinx was specially designed to integrate well with SQL databases and scripting languages. Currently built-in data sources support fetching data either via direct connection to MySQL or PostgreSQL, or using XML pipe mechanism (a pipe to indexer in special XML-based format which Sphinx recognizes).
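For the XML pipe route, the idea is that some program prints documents on stdout and the Sphinx indexer reads them. Below is a minimal sketch of such a producer, assuming the xmlpipe2 layout (sphinx:docset, sphinx:schema, sphinx:field, sphinx:document) described in the Sphinx documentation; the older xmlpipe format is different. The field names and sample documents are invented, and the function name build_docset is just a helper for this example.

```python
from xml.sax.saxutils import escape, quoteattr


def build_docset(fields, docs):
    """Build an xmlpipe2-style document stream as a string.

    fields: list of full-text field names (hypothetical, e.g. "title", "body").
    docs:   list of (integer_id, {field_name: text}) pairs.
    """
    lines = [
        '<?xml version="1.0" encoding="utf-8"?>',
        "<sphinx:docset>",
        "<sphinx:schema>",
    ]
    for name in fields:
        lines.append("<sphinx:field name=%s/>" % quoteattr(name))
    lines.append("</sphinx:schema>")
    for doc_id, values in docs:
        lines.append('<sphinx:document id="%d">' % doc_id)
        for name in fields:
            # Escape &, < and > so the stream stays well-formed XML.
            text = escape(values.get(name, ""))
            lines.append("<%s>%s</%s>" % (name, text, name))
        lines.append("</sphinx:document>")
    lines.append("</sphinx:docset>")
    return "\n".join(lines)


if __name__ == "__main__":
    sample = [
        (1, {"title": "Lucene & Xapian", "body": "Notes on full-text search."}),
        (2, {"title": "Sphinx", "body": "Standalone search engine."}),
    ]
    print(build_docset(["title", "body"], sample))
```

A script like this would be named as the pipe command of an xmlpipe2 source in the Sphinx configuration; the exact configuration keys are worth checking against the documentation for the Sphinx version in use.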



Xapian is highly recommended and reportedly works well under heavy load.

You should take a look at Xapian (http://www.xapian.org). I've messed with Lucene (I'm also not a Java fan) and TSearch2 GiST/GIN (I've been a PostgreSQL DBA for 5 years), and neither seemed as simple or scalable as Xapian. I mostly use the python bindings, and I was able to handle thousands of queries per second with a concurrency level of 10 against a 16GB Xapian db containing millions of documents. It's feature-full (http://www.xapian.org/features.php), indexing and searching are incredibly fast, administration is very straightforward (http://www.xapian.org/docs/admin_notes.html), and it scales quite well (http://www.xapian.org/docs/scalability.html). It even has a remote backend for distributed searching and indexing (http://www.xapian.org/docs/remote.html). If I was implementing a large-scale full text searching solution right now, I'd definitely use Xapian. By the way, thanks for writing such a great book.

Also found some useful information on mod_python memory usage.

I often see the complaint by people about mod_python’s memory overhead, but when you query them about it, they more often than not have no basis for the claim and are usually just repeating what someone else has said. Since you have a large site using it and have made this comment, I would be quite interested to hear from you directly what basis you have for pointing out the memory overhead of mod_python. As much as we would like to address memory overhead issues in mod_python, it seems no one running real sites ever comes to the mod_python mailing list to share their experiences.

It seems MySQL is better at replication than Postgres. Needs further investigation. MySQL Cluster and Slony might be completely different.

My experience with MySQL vs Postgres is that these days, depending on what you are doing, Postgres can easily beat MySQL on a single-box system, but when it comes to replication MySQL wins hands down. Slony is a complete dog. Once you move past a single box, Postgres has severe problems.

Alfresco is an open source Enterprise Document Management System.

Soumen Chakrabarti may be the right person to talk about all these.
