Using Lucene – External Indexes

time to read 3 min | 414 words

Lucene is a document indexing engine, that is its sole goal, and it does so beautifully. The interesting bit about using Lucene is that it probably wouldn’t be your main data store, but it is likely to be an important piece of your architecture.

The major shift in thinking with Lucene is that while indexing is relatively expensive, querying is free (well, not really, but you get my drift). Compare that to a relational database, where it is usually the inserts that are cheap, but queries are usually what cause us issues. RDBMS are also very good in giving us different views on top of our existing data, but the more you want from them, the more they have to do. We hit that query performance limit again. And we haven’t started talking about locking, transactions or concurrency yet.

Lucene doesn’t know how to do things you didn’t set it up to do. But what it does do, it does very fast.

Add to that the fact that in most applications, reads happen far more often than write, and you get a different constraint system. Because queries are expensive on the RDBMS, we try to make few of them, and we try to make a single query do most of the work. That isn’t necessarily the best strategy, but it is a very common one.

With Lucene, it is cheap to query, so it makes a lot more sense to perform several individual queries and process their results together to get the final result that you need. It may require somewhat more work (although there are things like Solr that would do it for you), but it is results in a far faster system performance overall.

In addition to that, since the Lucene index is important, but can always be re-created from source data (it may take some time, though), it doesn’t require all the ceremony associated with DB servers. Instead of buying an expensive server, get a few cheap ones. Lucene scale easily, after all. And since you only use Lucene for indexing, your actual DB querying pattern shift. Instead of making complex queries in the database, you make them in Lucene, and you only hit the DB with queries by primary key, which are the fastest possible way to get the data.

In effect, you outsourced your queries from the expensive machines to the cheap ones, and for a change, you actually got better performance overall

Tweet Share Share 18 comments

Tags:

Design

Comments

10 Mar 2010
10:57 AM

Chuck

We've been using Lucene since Sept 2009. My first app used the API directly, but I migrated to NHibernate Search because I realized that I was re-hydrating domain objects.

Everyone has been pleased with the results, so of course they ask for more features: Spatial!

(BTW...Enjoying learning Boo...IQuackFu has to be my favorite interface name...ever).

10 Mar 2010
14:01 PM

Grimace of Despair

Maybe there exists already something like it, but what about a scalable MSSQLucene, which does index synchronizing in the background?

10 Mar 2010
14:04 PM

Simon

This makes me thing of things like CQRS, stash the domain model in the RDBMS and then use something else to present all of the views, in this case Lucene.

Interesting.

10 Mar 2010
15:41 PM

Laurent Kempé

Funny!

Exactly what we implemented in a project that should become public in the following weeks.

10 Mar 2010
17:38 PM

Justin

"Compare that to a relational database, where it is usually the inserts that are cheap, but queries are usually what cause us issues."

What are you basing this on? In my experience the opposite is true, writes are more expensive than reads in an RDBMS especially if your writing to tables that have indexes and constraints. Reads(selects) are usually very cheap comparatively and if the proper indexes are in place and being used by the query plan, nearly "free" as well especially if as in MS SQL server you lower the isolation level to "READ UNCOMMITTED" or at least "SNAPSHOT" depending on the situation. This has to do with physical read vs write performance characteristics of persistent memory hardware(hard drives) and that populating indexes is more expensive than finding results in them as it is a form of pre-computing values. Of course you can always forgoe indexes and constraints to speed inserts and slow selects.

Transactional consistency is another point, in a typical RDBMS like MSSQL the indexes are consistent with the data, such that once an insert or update completes you are guaranteed the associated indexes are updated as well. I haven't looked at what you've done with Lucene and your datastore, but I would imagine it would be quite involved to make sure the index in Lucene is populated fully after a write in the datastore so that subsequent query are returning "the truth". This may not be important for certain uses cases but in my experience most business apps expect and rely on the query results to reflect the current data store state in order to make business logic decisions.

10 Mar 2010
20:14 PM

Dmitry

I agree with Justin.

Also, relational databases are not very efficient with parallel writes into the same table. While multiple reads can be resolved using snapshots a or lower isolation level, long simultaneous writes into the same table are very likely to block each other.

10 Mar 2010
22:17 PM

Chris Carter

I think Ayende was referring to the kind of reads that you'd be using Lucene for, ie searching across clusters of related entities using various methods, from directly equal to very fuzzy. Ie, the kinds of things you do a lot of in modern applications but tend to shoehorn into a relational query that ends up .

Correct me if I'm wrong though.

10 Mar 2010
22:21 PM

Terrance A. Snyder

Query = SOLR/Lucene

Reading Records = Memcached/Velocity

Updating Records = MSMQ => Memcached => DB/Disk/Cloud/whatever

You are right that 99% of the effort is inverse in most places. All that focus is on the "writing". People think its the most important part, it's not, users dont see it and dont care if its a text file or a clustered SQL instance. They just care that it works for them, works fast (i hate that statement), and can adapt to their needs.

The whole point is to avoid the slowest part of the system. Either (A) your code, or (B) a physical disk spinning around in circles getting 100,000 peoples profile information.

It still amazes me to see forms being developed by large companies that contain 12+ fields that mimic the database design bleeding through to the UI. Give me a single search box please anyday.

10 Mar 2010
23:46 PM

Justin

There seems to be a pervasive view that a RDBMS must fufill a read(select) from disk and instead you should use memcached for reads.

I work mostly with MSSQL, but like MSSQL most mature RDBMS's have a page cache where the engine will keep pages in memory up to it's given limits and fufill read request from that cache and not disk.

What this can mean is an typical RDBMS can perform like an in memory data store given you provided it enough ram to hold the entire DB in memory as you would have to say with memcached. If ram space is smaller than database size it will try to cache most used pages(which are typically indexes).

You can also typically scale out the read side of an RDBMS with things like log shipping in MSSQL to have a single writeable server with many read-only secondaries fulfilling selects for the application.

Of course an typcial RDBMS will be slower at writes than an in-memory db, but that is simply physics, since durable media is currently slower than the volatile type.

Also another another pervasive view is an RDBMS cannot perform certain queries well, such as "document" or "fuzzy" type searches. This is more of an vendor implementation issue than an architectectural one. MSSQL has a ok full text indexer fully integrated. But beyond that it is not very difficult to implement your own inverted index in the relational model using two tables and a set of triggers. We did this in our application to implement quick person name searches with alternates and phonetic matches i.e. "bob" and "robert" etc.. Hierarchical searches can be done using the nested set approach instead of recursion, and so on. There is no reason a RDBMS cannot implement various forms of indexing specific to the data type of the column. Just looks at MSSQL xml and spatial indexes as an example.

11 Mar 2010
10:12 AM

Ayende Rahien

Justin,

That only holds true if you are making relatively simple queries and have a LOT of indexing on your RDBMS.

As a simple example, answer the question what are the most commented posts with the tag "NHibernate" is a pretty expensive opeartion in the DB using normal relational models.

But inserting adding a new tag, or adding a new comment, are very cheap operations.

11 Mar 2010
10:14 AM

Ayende Rahien

Justin,

In addition to that, using READ UNCOMMITTED is a good way of looking at corrupt data. I had to chase that once or twice, and I would never do so again.

For Raven, there is an explicit model for knowing whatever an index is stale or not.

11 Mar 2010
10:16 AM

Ayende Rahien

Justin,

Regarding the in memory cache, that only works if your working set can be held in memory, many times it cannot.

I think that you also forget the cost of SQL licenses, if I can pay zero for external index, vs. pay a minimum of 6,000$ (per CPU!) for a read slave, I know what I'll choose.

11 Mar 2010
17:22 PM

Justin

"As a simple example, answer the question what are the most commented posts with the tag "NHibernate" is a pretty expensive opeartion in the DB using normal relational models."

Something like:

Select top 10 Post.Title, count(*) FROM Posts

JOIN PostComments

ON Post.PostID=PostComments.PostID

WHERE Post.PostID in

(SELECT PostID FROM PostTags where PostTags.Tag='NHibernate')

GROUP BY Post.Title

ORDER BY count(*) desc

Is not a very expensive query in my experience, we do stuff much more complex all the time with sub second response from large databases with many concurrent users.

"In addition to that, using READ UNCOMMITTED is a good way of looking at corrupt data. "

Not unless the version of MSSQL you used had a bug, but we have been using it in production since SQL 2000 without issue. All it does it let you read rows without locking the entire result set and they may be involved in another transaction, but it won't give you a corrupt row. This is from the MSSQL BOL:"Read uncommitted (the lowest level where transactions are isolated only enough to ensure that physically corrupt data is not read)".

READ UNCOMMITTED is great for many types of queries such as end users running searches, and is actually the only equivalent isolation many "NoSQL" db's provide. You just have to use it in the right situation or you can get undesired results, but not "corrupt data". If you need a more transactionally consistent view in you selects then move up to SNAPSHOT, which is barely more expensive in a read heavy DB, an is basically what Oracle has been doing for ages but was introduced in MSSQL in 2005. Don't use READ COMMITED or above unless you absolutely have to, this will causes locking issues.

"Regarding the in memory cache, that only works if your working set can be held in memory, many times it cannot." This is true of any database regardless of type, it's not specific to RDBMS.

"I think that you also forget the cost of SQL licenses, if I can pay zero for external index, vs. pay a minimum of 6,000$ (per CPU!) for a read slave, I know what I'll choose."

No I don't forget, and I cannot argue here, other than in my experience you get what you pay for. I don't need NH Prof either but it can save me time and time is money. I don't need MSSQL either, I can cobble something together with various free alternatives, or try and make my own DBMS like your are doing, nothing wrong with that, I just choose to pay for a mature DBMS and focus my development time on the application layer.

Granted the free alternatives are getting better all the time, and my make commercial DBMS obsolete, but that is not the case currently IMO.

11 Mar 2010
21:47 PM

Paul

Free alternatives - there's an quaint term. I remember when Mysql was 'free' and had a cheap commercial licence. Then Sun bought it and wacked up the price 10 times. If there's something you can be guranteed in life - that's death, taxes, and Oracle swallowing up great open source database products.

11 Mar 2010
22:27 PM

Terrance A. Snyder

I have to disagree with memcached "Regarding the in memory cache, that only works if your working set can be held in memory, many times it cannot.".

Memcached is the storage for your thick objects stored as id lookups. Rather than serializing/deserializing/loading/querying for these via a database operation 99% of the time (you still do it once when you prime-the-cache).

Simply use Lucene/SOLR to index your data. The indexing of the data will prime the cache and allow you to effectively query it. Once you get your ids back (faceted search/etc) you can then read your thick object from cache to display.

The other option is to save your object to Lucene, which is the approach you are using (Netwtonsoft JSON it looks like) and use Lucene for storage and retrieval.

Novel idea, I'll probably explore it to see the scale. Basically one notch higher than EAV and one below RDBMS.

12 Mar 2010
08:07 AM

Ayende Rahien

Justin,

sub second response

You need to define what sub-second means. When I am talking about Lucene, I am talking about single digit milliseconds or less, in most cases.

And that query is going to be pretty expensive when you have enough data and activity on the database. Add a few of those type of queries, and you get to an unacceptable response rates.

All it does it let you read rows without locking the entire result set

Now, it doesn't. It let you read UNCOMMITTED data. That is, by definition, corrupt, because you are looking into data that may be inconsistent. Looking into what other transactions are doing (when they may be rollbacked, or only half done), is a great way to violate the consistency guarantees that you need to provide.

I am not talking about corrupt data in the physical sense, I am talking about corrupt data in terms that you get a row that make no sense whatsoever.

If you want, you can use things like WITH(NOLOOK) or WITH(SKIPPAST) to avoid locking, but using READ UNCOMMITTED is bad as a general practice.

actually the only equivalent isolation many "NoSQL" db's provide

I don't know where you got that impression from. All the NoSQL dbs that I am aware of will not let you see uncommited data.

12 Mar 2010
09:53 AM

Rafal

Reading uncommited data is a bad practice, I've seen many problems caused by data being visible in one transaction and not visible in another because they haven't been commited yet. Much better solution is to avoid locking by using the 'READPAST' hint in MS SQL Server - this way you don't touch any records that are locked at the moment.

12 Mar 2010
17:59 PM

Justin

"You need to define what sub-second means. When I am talking about Lucene, I am talking about single digit milliseconds or less, in most cases."

You need to define what size dataset and what type of query will Lucene return a result in less than one millisecond. Going by your Post with Tag example I recreated a slightly more complex query involving 5 table joins to get to the "Comments" and 1.7million "Post" equivalents and MSSQL returned the result in 65ms including network round trip.

As a more complex example our own inverted indexing system for Person Names returned all "Posts" made by someone named "Robert Smith" in 97ms, this includes "bob smith" and "bobby smith" etc. This kind of inverted index search is what a document indexer like Lucene is based on but again can be recreated in the relational model see www.csc.uvic.ca/courses/csc485d/200909/IRSql.pdf for an introduction.

"Now, it doesn't. It let you read UNCOMMITTED data. That is, by definition, corrupt, because you are looking into data that may be inconsistent."

Now we are arguing semantics, but I disagree, uncommitted rows are not corrupt, transactions are logical, corruption is physical, the row may be rolled back or otherwise modified at a later time, but it is wholly written and the data is readable and not corrupt at the time the query processor read the row. This can be bad at a logical level depending on the use case, but it will not break things at a physical level like a corrupt row would. For queries such as end users running ad hoc searches, READ UNCOMMITTED is usually the way to go, since they are simply looking for hits and not expecting the app to lock that data set so they can edit it exactly as it was on the search results screen nor do they expect the 100th search result to be in the same state it was when the 1st search result was read, they know the results are transient based on when the search was executed in a multiuser system, if they run the search again they my get different results.

Again if you can't confirm READ UNCOMMITTED is ok for the use case, then step up to SNAPSHOT(which is what I used for the examples) and get the MVCC concurrency model with nearly the same performance on a read heavy db.

"I don't know where you got that impression from. All the NoSQL dbs that I am aware of will not let you see uncommited data. "

Many NoSQL's don't give you any isolation beyond the "row" equivalent(document, item, whatever), you can't wrap multiple row updates up into one transaction. This is essentially READ UNCOMMITTED when reading data from them. To quote from Wikipedia: "NoSQL systems often provide weak consistency guarantees such as eventual consistency and transactions restricted to single data items, even though one can impose full ACID guarantees by adding a supplementary middleware layer "

Look at CouchDB for an example, single document updates are atomic but thats it, if you run an update on 10000 documents and half way through read the 1st documented updated, it will be in the new state not the old, this is READ UNCOMMITTED. Also run a run a view over 10000 documents, then update one of those documents while the view is reading results, the view could return that updated document in the new or old state based on when the view engine hit that document, but not as it was when the view was started, this is equivalent to READ UNCOMMITTED.

Comment preview

Comments have been closed on this topic.

Markdown turns plain text formatting into fancy HTML formatting.

Phrase Emphasis

*italic*   **bold**
_italic_   __bold__

Links

Inline:

An [example](http://url.com/ "Title")

Reference-style labels (titles are optional):

An [example][id]. Then, anywhere
else in the doc, define the link:
  [id]: http://example.com/  "Title"

Images

Inline (titles are optional):

![alt text](/path/img.jpg "Title")

Reference-style:

![alt text][id]
[id]: /url/to/img.jpg "Title"

Headers

Setext-style:

Header 1
========
Header 2
--------

atx-style (closing #'s are optional):

# Header 1 #
## Header 2 ##
###### Header 6

Lists

Ordered, without paragraphs:

1.  Foo
2.  Bar

Unordered, with paragraphs:

*   A list item.
    With multiple paragraphs.
*   Bar

You can nest them:

*   Abacus
    * answer
*   Bubbles
    1.  bunk
    2.  bupkis
        * BELITTLER
    3. burper
*   Cunning

Blockquotes

> Email-style angle brackets
> are used for blockquotes.
> > And, they can be nested.
> #### Headers in blockquotes
> 
> * You can quote a list.
> * Etc.

Horizontal Rules

Three or more dashes or asterisks:

---
* * *
- - - -

Manual Line Breaks

End a line with two or more spaces:

Roses are red,   
Violets are blue.

Fenced Code Blocks

Code blocks delimited by 3 or more backticks or tildas:

```
This is a preformatted
code block
```

Header IDs

Set the id of headings with {#<id>} at end of heading line:

## My Heading {#myheading}

Tables

Fruit    |Color
---------|----------
Apples   |Red
Pears	 |Green
Bananas  |Yellow

Definition Lists

Term 1
: Definition 1
Term 2
: Definition 2

Footnotes

Body text with a footnote [^1]
[^1]: Footnote text here

Abbreviations

MDD <- will have title
*[MDD]: MarkdownDeep

Oren Eini

Oren Eini

CEO of RavenDB