Reviewing ResinPart II
In the first pat of this series, I looked into how Resin is tokenizing and analyzing text. I’m still reading the code from the tests (this is because the Tests folder sorted higher then the Resin folder, basically) and I now moved to the second file, CollectorTests.
That one has a really interesting start:
There are a lot of really interesting things here, UpsertTransaction, document structure, issuing queries, etc. UpsertTransaction is a good place to start looking around, so let us poke in. When looking at it, we can se a lot of usage in the Utils class, so I’ll look at that first.
This is probably a bad idea. While using the current time ticks seems like it would generate ever increasing values, that is actually not the case, certainly not with local time (clock shift, daylight saving, etc). Using that for the purpose of generating a file id is probably a mistake. It is better to use our own counter, and just keep track of the last one we used on the file system itself.
Then we have this:
It took me a while to figure out what was going on there, and then more to frantically search where this is used. Basically, this is used in fuzzy searches, and it will allocate a new instance of the string on each call. Given that fuzzy search is popular in full text search usage, and that this is called a lot during any such search, this is going to allocate like crazy. It would be better to move the entire thing to using mutable buffers, instead of passing strings around.
Then we go to the locking, and I had to run it a few times to realize what is going on.
And this isn’t the way to do this at all. Basically, this relies on the file system to fail when you are trying to copy a file into an already existing file. However, that is a really bad way to go about doing that. The OS and the file system already have locking primitives that you can use, and they are going to be much better then this option. For example, consider what happens after a crash, is the directory locked or not? There is no real way to answer that, since the process might have crashed, leaving the file in place, or it might be doing things, expected that this is locked.
Moving on, we have this simple looking method:
I know I’m harping on that, but this method is doing a lot of allocations by using lambdas, and depending on the number of files, the delegate indirection can be quite costly. For that matter, there is also the issue of error handling. If there is a lock file in this directory when this is called, this will throw.
Our final code for this post is:
I really don’t like this code, it is something that look like it is cheap, but it will:
- Sort all the index files in the folder
- Open all of them
- Read some data
- Sum over that data
Leaving aside that the deserialization code has the typical issue of not checking that the the entire buffer was read, this can cause a lot of I/O on the system, but luckily this function is never called.
Okay, so we didn’t actually get to figure out what UpsertTransaction is, we’ll look at that in the next post.
Comments
Last time I profiled Lucene ~seven years ago there was nothing but string allocations. That and exceptions. When I was prototyping Resin I therefore figured I could afford to care little about those. Lambda's were a great way to move forward fast, to resolve complicated matters swiftly.
I figure now it's time to zoom in on the algorithms and of course also the memory management.
Very good pointers.
The current scheme around Util.GetNextChronologicalFileId is indeed a big fat hack that will not do at all once pressure is on nor when in a distributed environment.
What I want is to pop a number from a monotonically increasing number series in a lock free but thread safe manner. The Ticks has played it's part as a stand-in. What's a really solid alternative?
@Marcus
only thread-safe? is the db only supposed to be a single-node db?
The popping of a number should be thread-safe is what I mean, to allow many write threads in parallel.
I'm going for Resin to either have support directly in the API for a distributed scenario or for it to be treated as a local DB in the way Elasticsearch uses a local db (Lucene) on all of its nodes.
To resolve the Tick collision case one could use the same mechanisms as when resolving hash collisions. But then we are back to the other requirement, where Resin needs to be able to live in an abide by the rules of a distributed system.
To solve that one could synchronise the nodes' machine clocks with a time service. After that, is it still problematic in your view to use ticks as timestamps?
Marcus,
Interlocked.Increment
is the way to go here.Marcus, You might want to look at what it takes to handle time properly in the
Google TrueTime
paper. It is complex and hard.Instead, you want to use something like
Snowflake
orSequentialUuid
or something like that.What we did was define a vector clock, so each node has its own incrementing value, and we tell them apart using node id.
Comment preview