Reviewing Resin: Part II

Jul 12 2017

Reviewing ResinPart II

time to read 3 min | 564 words

In the first pat of this series, I looked into how Resin is tokenizing and analyzing text. I’m still reading the code from the tests (this is because the Tests folder sorted higher then the Resin folder, basically) and I now moved to the second file, CollectorTests.

That one has a really interesting start:

There are a lot of really interesting things here, UpsertTransaction, document structure, issuing queries, etc. UpsertTransaction is a good place to start looking around, so let us poke in. When looking at it, we can se a lot of usage in the Utils class, so I’ll look at that first.

This is probably a bad idea. While using the current time ticks seems like it would generate ever increasing values, that is actually not the case, certainly not with local time (clock shift, daylight saving, etc). Using that for the purpose of generating a file id is probably a mistake. It is better to use our own counter, and just keep track of the last one we used on the file system itself.

Then we have this:

It took me a while to figure out what was going on there, and then more to frantically search where this is used. Basically, this is used in fuzzy searches, and it will allocate a new instance of the string on each call. Given that fuzzy search is popular in full text search usage, and that this is called a lot during any such search, this is going to allocate like crazy. It would be better to move the entire thing to using mutable buffers, instead of passing strings around.

Then we go to the locking, and I had to run it a few times to realize what is going on.

And this isn’t the way to do this at all. Basically, this relies on the file system to fail when you are trying to copy a file into an already existing file. However, that is a really bad way to go about doing that. The OS and the file system already have locking primitives that you can use, and they are going to be much better then this option. For example, consider what happens after a crash, is the directory locked or not? There is no real way to answer that, since the process might have crashed, leaving the file in place, or it might be doing things, expected that this is locked.

Moving on, we have this simple looking method:

I know I’m harping on that, but this method is doing a lot of allocations by using lambdas, and depending on the number of files, the delegate indirection can be quite costly. For that matter, there is also the issue of error handling. If there is a lock file in this directory when this is called, this will throw.

Our final code for this post is:

I really don’t like this code, it is something that look like it is cheap, but it will:

Sort all the index files in the folder
Open all of them
Read some data
Sum over that data

Leaving aside that the deserialization code has the typical issue of not checking that the the entire buffer was read, this can cause a lot of I/O on the system, but luckily this function is never called.

Okay, so we didn’t actually get to figure out what UpsertTransaction is, we’ll look at that in the next post.

Tweet Share Share 7 comments

Tags:

More posts in "Reviewing Resin" series:

(20 Jul 2017) Part VI – Analyzing I/O and being unfair
(19 Jul 2017) Part V
(18 Jul 2017) Part IV
(14 Jul 2017) Part III
(12 Jul 2017) Part II
(11 Jul 2017) Part I

Comments

12 Jul 2017
10:38 AM

Marcus Lager

Last time I profiled Lucene ~seven years ago there was nothing but string allocations. That and exceptions. When I was prototyping Resin I therefore figured I could afford to care little about those. Lambda's were a great way to move forward fast, to resolve complicated matters swiftly.

I figure now it's time to zoom in on the algorithms and of course also the memory management.

Very good pointers.

12 Jul 2017
12:12 PM

Marcus Lager

The current scheme around Util.GetNextChronologicalFileId is indeed a big fat hack that will not do at all once pressure is on nor when in a distributed environment.

What I want is to pop a number from a monotonically increasing number series in a lock free but thread safe manner. The Ticks has played it's part as a stand-in. What's a really solid alternative?

12 Jul 2017
12:21 PM

Stephen Ward

@Marcus

only thread-safe? is the db only supposed to be a single-node db?

12 Jul 2017
12:43 PM

Marcus Lager

The popping of a number should be thread-safe is what I mean, to allow many write threads in parallel.

I'm going for Resin to either have support directly in the API for a distributed scenario or for it to be treated as a local DB in the way Elasticsearch uses a local db (Lucene) on all of its nodes.

12 Jul 2017
12:58 PM

Marcus Lager

To resolve the Tick collision case one could use the same mechanisms as when resolving hash collisions. But then we are back to the other requirement, where Resin needs to be able to live in an abide by the rules of a distributed system.

To solve that one could synchronise the nodes' machine clocks with a time service. After that, is it still problematic in your view to use ticks as timestamps?

12 Jul 2017
13:04 PM

Oren Eini

Marcus, Interlocked.Increment is the way to go here.

12 Jul 2017
13:08 PM

Oren Eini

Marcus, You might want to look at what it takes to handle time properly in the Google TrueTime paper. It is complex and hard.

Instead, you want to use something like Snowflake or SequentialUuid or something like that.

What we did was define a vector clock, so each node has its own incrementing value, and we tell them apart using node id.

Comment preview

Comments have been closed on this topic.

Oren Eini

Oren Eini

CEO of RavenDB