Reviewing Resin: Part III
In the previous part, I started looking at UpsertTransaction, but got sidetracked into the utils functions. Let us focus back on this. The key parts of UpsertTransaction are:
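(I'm not quoting the source here; this is a rough sketch of the shape, pieced together from the rest of this post, with placeholder types rather than the real Resin classes.)

```csharp
using System.Collections.Generic;

// A sketch of the shape only, reconstructed from the rest of this post;
// the types here are placeholders, not the actual Resin classes.
public class UpsertTransaction
{
    // the DocumentStream: supplies the documents, gives them ids, hashes them
    private readonly IEnumerable<Dictionary<string, string>> _documents;

    // writes the serialized documents out to the *.rdoc / *.da files (covered below)
    private readonly DocumentWriter _storeWriter;

    // one trie builder per field name, fed term by term as documents are analyzed
    private readonly Dictionary<string, TrieBuilder> _tries = new Dictionary<string, TrieBuilder>();

    public UpsertTransaction(IEnumerable<Dictionary<string, string>> documents, DocumentWriter storeWriter)
    {
        _documents = documents;
        _storeWriter = storeWriter;
    }
}

// empty stand-ins so the sketch is self-contained
public class DocumentWriter { }
public class TrieBuilder { }
```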
Let us see what they are. The DocumentStream is the source of the documents that will be written in this transaction; its job is to get the documents to be indexed, to give them a unique id if they don’t already have one, and to hash them.
I’m not sure yet what the point is there, but we have this:
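(A stand-in to show the idea rather than the actual snippet: a fast, non-cryptographic 64-bit hash of the document's pk. FNV-1a below is purely illustrative and not necessarily the algorithm Resin uses.)

```csharp
public static class DocHashSketch
{
    // Illustration only: a 64-bit FNV-1a over the primary key value. The point is
    // that it is fast, fixed size, and not cryptographic - two different pk values
    // can, in principle, collide.
    public static ulong Hash(string primaryKey)
    {
        const ulong offsetBasis = 14695981039346656037UL;
        const ulong prime = 1099511628211UL;

        var hash = offsetBasis;
        foreach (var ch in primaryKey)
        {
            hash ^= ch;
            hash *= prime;
        }
        return hash;
    }
}
```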
Which sounds bad. The likelihood of a collision is small, but this isn’t a cryptographic hash, so it is likely very easy to break deliberately. For example, look at what happened to MurmurHash.
I think that this is later used to handle some partitioning in the trie, but I’m not sure yet. We’ll look at the _storeWriter later. Let us see what the UpsertTransaction does. It builds a trie, then pushes each of the documents from the stream through the trie. The code is doing a lot of allocations, but I’m going to stop harping on that from now on.
The trie is called for each term for each document with the following information:
The code isn’t actually using a tuple; I just collapsed a few classes to make it clear what the input is.
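Something along these lines, by my reading; the member names below are my guesses rather than the actual classes:

```csharp
// My guess at the collapsed shape of the input; the member names are not from the
// source, but they line up with the (DocumentId, Count) postings described below.
public readonly struct TermUpsert
{
    public readonly string Field;     // the field the term belongs to
    public readonly string Term;      // the analyzed token itself
    public readonly int DocumentId;   // the document it appeared in
    public readonly int Count;        // number of occurrences in that document

    public TermUpsert(string field, string term, int documentId, int count)
    {
        Field = field;
        Term = term;
        DocumentId = documentId;
        Count = count;
    }
}
```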
This is what will eventually allow the trie to do lookups on a term and find the matching document, I’m assuming.
That method is going to start a new task for that particular field name, if it is new, and push the new list of words for that field into the work queue for that task. The bad thing here is that we are talking about a blocking task, so if you have a lot of fields, you are going to spawn off a lot of threads, one per field name.
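To make the pattern concrete, here is a minimal sketch of "one blocking worker per field name"; this is the shape of the problem, not Resin's actual builder:

```csharp
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

// A minimal sketch of "one blocking worker per field name" - not Resin's actual
// code, just the pattern being described.
public class PerFieldWorkersSketch
{
    private readonly Dictionary<string, BlockingCollection<IList<string>>> _queues =
        new Dictionary<string, BlockingCollection<IList<string>>>();
    private readonly List<Task> _workers = new List<Task>();

    public void Add(string field, IList<string> words)
    {
        if (!_queues.TryGetValue(field, out var queue))
        {
            queue = new BlockingCollection<IList<string>>();
            _queues[field] = queue;

            // LongRunning effectively means a dedicated thread per field name
            _workers.Add(Task.Factory.StartNew(
                () => Drain(field, queue), TaskCreationOptions.LongRunning));
        }

        queue.Add(words);   // hand this batch of words to that field's worker
    }

    private static void Drain(string field, BlockingCollection<IList<string>> queue)
    {
        foreach (var words in queue.GetConsumingEnumerable())
        {
            // build up this field's trie from the incoming words
        }
    }
}
```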
What I know now is that we are going to have a trie per field, and it is likely, based on the design decisions made so far, that a trie isn’t a small thing.
Next, the UpsertTransaction needs to write the document. This is done by taking the document we are processing and turning it into a dictionary of short to string. I’m not sure how it is supposed to handle multiple values for the same field, but I’ll ignore that for now. That dictionary is then saved into a file, and its position and length are returned.
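In other words, something with roughly this shape (my sketch of the idea, assuming the returned (Position, Size) pair is what ends up in the *.da file described below):

```csharp
using System.IO;

// Sketch: append one serialized document to the .rdoc stream and report where it
// landed, so the (Position, Size) pair can be recorded in the *.da file later.
public static class DocumentStoreSketch
{
    public static (long Position, int Size) Append(Stream rdoc, byte[] serializedDocument)
    {
        var position = rdoc.Position;                            // where this document starts
        rdoc.Write(serializedDocument, 0, serializedDocument.Length);
        return (position, serializedDocument.Length);
    }
}
```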
I know that I said that I won’t talk about performance, but I looked at the serialization code and saw that it is using compression, like this. This is done on a field-by-field basis, while you could probably benefit from compressing all of the fields together.
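(I'm paraphrasing here rather than quoting; the record layout and the helper below are my stand-ins, with the compression helper itself shown in the next snippet.)

```csharp
using System.Collections.Generic;
using System.IO;
using System.Text;

// Sketch: each field value is compressed on its own before being written out.
// The layout (key, length, bytes) is my guess; Compress is sketched in the next snippet.
public static class FieldWriterSketch
{
    public static void WriteFields(BinaryWriter writer, Dictionary<short, string> document)
    {
        foreach (var field in document)
        {
            var raw = Encoding.Unicode.GetBytes(field.Value);   // one byte[] per field
            var compressed = CompressionSketch.Compress(raw);   // and another one (next snippet)

            writer.Write(field.Key);
            writer.Write(compressed.Length);
            writer.Write(compressed);
        }
    }
}
```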
Those are a lot of allocations, and then we go a bit deeper:
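(Again a sketch, not the literal code; I'm assuming DeflateStream here, the actual compressor may well be different.)

```csharp
using System.IO;
using System.IO.Compression;

// The "deeper" part: a fresh MemoryStream per call, plus the ToArray copy at the end.
public static class CompressionSketch
{
    public static byte[] Compress(byte[] input)
    {
        using (var output = new MemoryStream())                  // allocation per call
        {
            using (var deflate = new DeflateStream(output, CompressionMode.Compress, leaveOpen: true))
            {
                deflate.Write(input, 0, input.Length);
            }
            return output.ToArray();                             // and a full copy on the way out
        }
    }
}
```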
First, we have the allocation of the memory stream, then the ToArray call, and that happens, per field, per document. Actually, if we go up, we’ll see:
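(A guess at the calling code, based on the description and on the comment below about going from a stream to a byte array and back to a stream.)

```csharp
using System.Collections.Generic;
using System.IO;

// One level up: the whole document goes into yet another MemoryStream, gets copied
// out to a byte[], and is then wrapped back into a stream before hitting the file.
public static class DocumentSerializerSketch
{
    public static byte[] Serialize(Dictionary<short, string> document)
    {
        using (var ms = new MemoryStream())                      // per-document MemoryStream
        using (var writer = new BinaryWriter(ms))
        {
            FieldWriterSketch.WriteFields(writer, document);     // the per-field loop from above
            writer.Flush();
            return ms.ToArray();                                 // per-document byte[] copy
        }
    }

    public static void Store(Stream rdoc, Dictionary<short, string> document)
    {
        var bytes = Serialize(document);
        using (var wrapped = new MemoryStream(bytes))            // ...and back into a stream again
        {
            wrapped.CopyTo(rdoc);
        }
    }
}
```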
So it is allocations all the way down.
Okay, let us focus on what is going on in terms of files:
- "write.lock" – this one is pretty obvious
- *.da – stands for document address. Holds a series of (long Position, int Size) document addresses. I assume that this is using the same sort as something else, but I’m not sure yet. The fact that the entries are fixed size means that we can easily skip into it.
- *.rdoc – documents are stored here. Contains the actual serialized data for the documents (the Dictionary<short, Field>); this is the target for the addresses that are held by the “*.da” files.
- *.pk – holds document hashes. Holds a list of document pk hashes and, I’m assuming, a flag saying whether each one is deleted. From context, it looks like the hash is a way to update documents across transactions.
- *.kix – key index. Text file holding the names of all the fields across the entire transaction.
- *.pos – posting file. This one holds the postings that were built during the transaction. This is basically just List<(int DocumentId, int Count)>, but I’m not sure how they are used just yet. It looks like this is how Resin is able to get the total term frequency per document. It looks like this is also sorted.
- *.tri – the trie files that actually contain the specific values for a particular field. The name pattern is “{indexVersion}-{fieldName}.tri”. That means that your field names are limited to valid file names, by the way.
The last part of the UpsertTransaction is the commit, which essentially boils down to this:
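(I won't reproduce the snippet; conceptually, and with a placeholder for the actual flushing, my reading is that it amounts to something like this.)

```csharp
using System.IO;

// Conceptual sketch only, not the literal code: flush everything written under the
// new index version, then release the write lock so the new files become current.
public class CommitSketch
{
    public void Commit(string directory)
    {
        FlushAllWriters();   // placeholder: close the per-field tries and the .rdoc/.da/.pos/.pk/.kix writers

        // I assume releasing the lock is what finishes the transaction
        File.Delete(Path.Combine(directory, "write.lock"));
    }

    private void FlushAllWriters()
    {
        // placeholder for the real flushing logic
    }
}
```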
I think that this was a very insightful read; I have a much better understanding of how Resin actually works. I’m going to speculate wildly, and then use my next post to check further into that.
Let us say that we want to search for all users who live in New York City. We can do that by opening the “636348272149533175-City.tri” file. The 636348272149533175 is the index version, by the way.
Using the trie, we search for the value of New York City. The trie value actually gives us a (long Position, int Size) into the 636348272149533175.pos file, which holds the postings. Basically, we now have an array of (int DocumentId, int Count) of the documents that matched that particular value.
If we want to retrieve those documents, we can use the 636348272149533175.da file, which holds the addresses of the documents. Again, this is effectively an array of (long Position, int Size) that we can index into using the DocumentId. This points to the location in the 636348272149533175.rdoc file, which holds the actual document data.
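To make the walkthrough concrete, here is a sketch of that read path under my assumptions about the record layouts (12-byte fixed-size records in the .da file, (int, int) pairs in the .pos file); the trie lookup itself is elided:

```csharp
using System.IO;
using System.Text;

// Sketch of the read path: trie -> .pos -> .da -> .rdoc. The record layouts are my
// assumptions, matching the descriptions above, not a quote from the Resin source.
public static class ReadPathSketch
{
    // Step 1 (not shown): the trie lookup for "New York City" yields a
    // (long Position, int Size) pointer into the .pos file.

    // Step 2: read the postings for that term, an array of (int DocumentId, int Count).
    public static (int DocumentId, int Count)[] ReadPostings(Stream pos, long position, int size)
    {
        pos.Seek(position, SeekOrigin.Begin);
        using (var reader = new BinaryReader(pos, Encoding.UTF8, leaveOpen: true))
        {
            var count = size / 8;                                // 4 bytes id + 4 bytes count per entry (assumed)
            var postings = new (int DocumentId, int Count)[count];
            for (var i = 0; i < count; i++)
            {
                postings[i] = (reader.ReadInt32(), reader.ReadInt32());
            }
            return postings;
        }
    }

    // Step 3: the .da file is fixed-size records, so we can seek straight to a DocumentId.
    public static (long Position, int Size) ReadDocumentAddress(Stream da, int documentId)
    {
        da.Seek(documentId * 12L, SeekOrigin.Begin);             // 8 bytes position + 4 bytes size (assumed)
        using (var reader = new BinaryReader(da, Encoding.UTF8, leaveOpen: true))
        {
            return (reader.ReadInt64(), reader.ReadInt32());
        }
    }

    // Step 4: pull the serialized document bytes out of the .rdoc file.
    public static byte[] ReadDocument(Stream rdoc, long position, int size)
    {
        rdoc.Seek(position, SeekOrigin.Begin);
        var buffer = new byte[size];
        rdoc.Read(buffer, 0, size);
        return buffer;
    }
}
```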
I’m not sure yet what the point of *.pk and *.kix is, but I’m sure we’ll figure it out in the next post.
Comments
Multiple values for a field is something I've used very, very frequently in Lucene. It's a feature many aren't even aware exists in Lucene. I think it's an awesome feature. I simplified the problem space just a tiny bit by excluding it from Resin. I regret that decision a little bit.
Going from a stream to a byte array and then back to a stream is just silly coding. It was written in a proof of concept state of mind.
Even though I haven't taken the time for proper profiling in a while, you can witness the amount of work the GC has to put up with while writing just by watching task mgr's view of the CPU activity, which is a saw pattern going up and down, up and down. Lucene's CPU graph is smooth.
Many allocation hotspots are easily removed. In other places there is a need for an immutable structure, a char array.
Very good pointers again.