In the first pat of this series, I looked into how Resin is tokenizing and analyzing text. I’m still reading the code from the tests (this is because the Tests folder sorted higher then the Resin folder, basically) and I now moved to the second file, CollectorTests.
That one has a really interesting start:
There are a lot of really interesting things here, UpsertTransaction, document structure, issuing queries, etc. UpsertTransaction is a good place to start looking around, so let us poke in. When looking at it, we can se a lot of usage in the Utils class, so I’ll look at that first.
This is probably a bad idea. While using the current time ticks seems like it would generate ever increasing values, that is actually not the case, certainly not with local time (clock shift, daylight saving, etc). Using that for the purpose of generating a file id is probably a mistake. It is better to use our own counter, and just keep track of the last one we used on the file system itself.
Then we have this:
It took me a while to figure out what was going on there, and then more to frantically search where this is used. Basically, this is used in fuzzy searches, and it will allocate a new instance of the string on each call. Given that fuzzy search is popular in full text search usage, and that this is called a lot during any such search, this is going to allocate like crazy. It would be better to move the entire thing to using mutable buffers, instead of passing strings around.
Then we go to the locking, and I had to run it a few times to realize what is going on.
And this isn’t the way to do this at all. Basically, this relies on the file system to fail when you are trying to copy a file into an already existing file. However, that is a really bad way to go about doing that. The OS and the file system already have locking primitives that you can use, and they are going to be much better then this option. For example, consider what happens after a crash, is the directory locked or not? There is no real way to answer that, since the process might have crashed, leaving the file in place, or it might be doing things, expected that this is locked.
Moving on, we have this simple looking method:
I know I’m harping on that, but this method is doing a lot of allocations by using lambdas, and depending on the number of files, the delegate indirection can be quite costly. For that matter, there is also the issue of error handling. If there is a lock file in this directory when this is called, this will throw.
Our final code for this post is:
I really don’t like this code, it is something that look like it is cheap, but it will:
- Sort all the index files in the folder
- Open all of them
- Read some data
- Sum over that data
Leaving aside that the deserialization code has the typical issue of not checking that the the entire buffer was read, this can cause a lot of I/O on the system, but luckily this function is never called.
Okay, so we didn’t actually get to figure out what UpsertTransaction is, we’ll look at that in the next post.