Ayende @ Rahien

Oren Eini aka Ayende Rahien CEO of Hibernating Rhinos LTD, which develops RavenDB, a NoSQL Open Source Document Database.

You can reach me by:

oren@ravendb.net

+972 52-548-6969

, @ Q j

Posts: 6,887 | Comments: 49,276

filter by tags archive
time to read 1 min | 142 words

imageYou can now read the Inside RavenDB directly in your browser. 

I’m really happy about this, not just because you can browse the full book online (or download to PDF) completely free. The main point is that now I can link directly to the specific part in the book where I’m discussing (in depth) certain features of RavenDB.

I think that this is going to make answering questions about RavenDB’s internal and behavior a lot easier and more approachable.

It also means, of course, that you can use Google to find information from the book.

I’m also currently working on updating the book for RavenDB 5.0. Although I’ll admit that in some cases I’m writing about features that haven’t yet seen the light of day.

time to read 1 min | 112 words

Consider the following C code snippet:

image

This code cannot be written in C#. Why? Because you can’t use ‘+’ on bool, and you can’t cast bools. So I wrote this code, instead:

And then I changed it to be this code:

Can you tell why I did that? And what is the original code trying to do?

For that matter (and I’m honestly asking here), how would you write this code in C# to get the best performance?

Hint:

image

time to read 4 min | 731 words

Product recommendations is a Big Thing. The underlying assumption is that there are patterns in the sales of products, so we can detect and recommend what products usually go together. That gives us a very nice way to give accurate recommendations to users about products that they might want to purchase.

Here is a great example of how this may look like, from Amazon:

image
As an aside, I’m really happy to see the grouping of my book with Release It~ and Writing High Performance .Net Core books. 

An interesting question is can we get this kind of behavior in RavenDB? If we were using SQL, we could probably write some queries to handle this. I wrote about this a decade ago with NHiberante, and the queries are… complex. They also have non trivial amount of runtime costs. With RavenDB, however, we can do things differently. We can use RavenDB’s map/reduce feature to handle this.

The key observation is that we want to gather, for each product, the products that were also purchased with it. We’ll use the sample dataset to test things out. There, we have an Orders collection and each order has a list of Lines that were purchased in the order. Given that information, we can use the following index definition:

Let’s break this index apart to its constituent parts. In the map, we project an entry for each line, which has the Product that is being purchased as well as all the other products that were purchased in the same order. We use this to create a link between the various products that are sold together. In the reduce, we group by the product that was sold, and aggregate the sales of related products to get the final tally.

The end result will looks like so:

image

You can see some interesting design decisions in how I built this index. We keep track of the number of orders for each product, as well as the number of times it was purchased along side each related product. This means that we can very easily implement related products, but also filter outliers. If someone purchased the “Inside RavenDB” book to learn RavenDB, but at the same time also bought the Hungry Caterpillar for their child, you probably don’t want to put recommend each other. The audiences are quite different (even though telling my own 4 years old daughter about RavenDB usually puts her to sleep pretty quickly Smile).

We can use the number of joint sales as a good indication of whatever the products are truly related, all the while using the users tell us what matter. And the best part, you don’t have to go out of your way to get this information. This is based on pretty much just the data that you are already collecting.

Because this is a map/reduce index in RavenDB, the computation happens at indexing time, not at runtime. This means that the cost of querying this information is minimal, and RavenDB will make sure that it is always up to do.

In fact, we can go to the Map/Reduce Visualizer page in RavenDB to see how this works. Let’s take a peek, shall we?

image

Here we can see a visual representation of two orders for the same product, as well as a few others. This is exactly the kind of thing we want to explore. Let’s look a bit deeper, just for products/51-A:

image

You can see how for the first order (bottom left), we have just one additional product, (products/14-A) while the second has a couple of them. We aggregate that information (Page #593) for all the 490 orders that fit there. There is also the top level (Page #1275) which aggregate the data from all the leaves.

When we query, we will get the data from the top, so even if we have a lot of orders, we don’t actually need to run any costly computation. The data is already pre-chewed for us and immediately (and cheaply) available.

time to read 1 min | 86 words

I’ll be speaking at the Progressive.NET conference later this week. I’ll be speaking about the nastiest bugs that weren’t my fault. This is a very cathartic talk to give, because I get to go in depth into all the ways I tripped and fell.

This is based on a decade of running RavenDB in production and running into the strangest situations that you can think of.

On the menu:

  • Linux and memory management
  • Windows and the printer
  • The mysterious crash on the ARM robot
  • The GC that smacked me

And much more…

time to read 2 min | 400 words

Hadi’s had an interesting Tweet:

This sounds reasonable on the surface, but it runs into hard issues with the vast differences between any amount of money vs. free. There have been quite a few studies on this. A good reading on the subject can be found here. Take a note of the Amazon experience. Shipping that is free increases sales. Shipping that is 0.2$ does not increase sales. Note that from a practical standpoint, there is no real difference between the prices, but it matters, quite a lot.

Leaving aside the psychology of free, there are also other issues. Let’s say that there is a tool that can save someone on my team 10 minutes a day, it is priced at 1$ / year. By any measure you care to name, that is more than worth it.

But the process of actually getting this purchased is likely to be more expensive than the actual cost of the tool. In my case, the process is “go talk to Oren”. In other environments, that may involve “talk to your boss, that will submit it to accounts payable, which will pay it in 60 days”.

And at that point, charging 1$ just doesn’t matter. You could charge 50$, and the “cost” for the person making the purchase would be pretty much the same. Note that this is for corporate environment, the situation is (slightly) different for sales directly to consumers, but not significantly so. Free stills trumps everything.

When talking about freemium models, you hear quotes that are bad. For example, Evernote had a < 2% conversion rate, and that is a wildly successful example. Dropbox has a rate of about 4%, and the average seems to be 1%. And that is for businesses who are focused on optimizing the freemium funnel. That takes a lot of time and effort, mind you.

I don’t think that there is a practical option for turning this around, and I say that as someone who would love it if that were at all possible.

time to read 3 min | 474 words

RavenDB, as of 4.0, requires that the document identifier will be a string. In fact, that has always been the requirement, but in previous versions, we allowed you to pretend that this isn’t the case. That has led to… some complexities, because people had a number id in their model, but inside RavenDB that was represented as a string, always.

I just got the following question:

In my entities, can I have the Id property of any type instead string to avoid primitive obsession? I would use a generic Id<tentity> type for ids. This type can be converted into string before saving in DB by calling ToString() and transformed from string into Id<tentity> (when fetching from DB) by invocation of static method like public Id<tentity> FromString(string id).

The short answer for this is that no, there is no way to do this. A document id in your model has to be a string.

The longer answer is that you can absolutely do this, but you have to understand the divergence of your entity model vs. the document model. The key is that RavenDB doesn’t actually require that your model would have an Id property. It is usually defined, because it makes things easier, but that isn’t required. RavenDB is perfectly happy managing the document key internally. Combine that with the ability to modify how documents are converted to entities, and you have a solution. Let’s look at the code…

And here is how it looks like:

image

The idea is that we customize a few things inside of RavenDB.

  • We tell the serializer that it should ignore the UserId property
  • We tell RavenDB that after creating an entity from the server, we should setup the Id property as we want it.
  • We do the same just before we store the entity in the server, just to be sure that we got the complete package.
  • We disable the usual identity generation logic for the documents we care about and tell RavenDB that it should ignore trying to set the identity property on the document on its own.

The end result is that we have an entity with a strongly typed identifier in our model. It took a bit of work, but not overly so.

That said, I would suggest that you should either have a string identifier property in your model or not have one at all (either option takes no code in RavenDB). Having an identifier and jumping through hoops like that tend to make for awkward experience. For example, RavenDB has no idea about this property, so if you need to support queries as well, you’ll need to extend the query support. It’s possible, but shows that there is additional complexity that can be avoided.

time to read 2 min | 256 words

RavenDB is highly concurrent distributed database. That means that we take the idea of race conditions, multiple that by network hiccups and then raise to the power of hair pulling. Now, we have architectural structure to help with a lot of that, but sometimes you need to write and verify what happens when a particular sequence of events in a five node cluster happens. For fun, you may need to orchestrate a particular order of operations across multiple disparate processes (sometimes on different machines). As you can imagine, that is… challenging.

I wanted to give you a hint of some of the techniques that we use to handle this. We have code that looks like this, sprinkled throughout our code base (Rachis is the name of our Raft cluster implementation):

This is where a leader connects to a follower to setup their relationship:

image

This is called during leader election:

image

These methods are implemented in the following manner:

image

In other words, they will set a ManualResetEvent that we setup as part of our testing infrastructure. The code isn’t even being run on production release, but it allow us to very carefully structure the exact sequence of events that we need to expose specific behaviors in the system.

time to read 2 min | 346 words

I run into this post, in which the author describe how they got ERROR 1000294 from IBM DataPower Gateway as part of an integration effort. The underlying issue was that he sent JSON to the endpoint in an order that it wasn’t expected.

After asking the team at the other end to fix it, the author got back an estimation of effort for 9 people for 6 months (4.5 man years!). The author then went and figured out that the fix for the error was somewhere deep inside DataPower:

Validate order of JSON? [X]

The author then proceeded to question the competency  / moral integrity of the estimation.

I believe that the author was grossly unfair, at best, to the people doing the estimation. Mostly because he assumed that unchecking the box and running a single request is a sufficient level of testing for this kind of change. But also because it appears that the author never considered once what is the reason this setting may be in place.

  • The sort order of JSON has been responsible for Remote Code Execution vulnerabilities.
  • The code processing the JSON may not do that in a streaming fashion, and therefor expect the data in a particular order.
  • Worse, the code may just assume the order of the fields and access them by index. Change the order of the fields, and you may reverse the Creditor and Debtor fields.
  • The code may translate the JSON to another format and send it over to another system (likely, given the mentioned legacy system.

The setting is there to protect the system, and unchecking that value means that you have to check every single one of the integration points (which may be several layers deep) to ensure that there isn’t explicit or implied ordering to the JSON.

In short, given the scope and size of the change:  “Fundamentally alter how we accept data from the outside world”, I can absolutely see why they gave this number.

And yes, for 99% of the cases, there isn’t likely to be any different, but you need to validate for that nasty 1% scenario.

time to read 2 min | 339 words

imageI used the term “Big Red Sales Button” in a previous post, and got a question about it. You can see an illustration of that on the right.

The Big Red Sales Button (BRSB from now on) is a metaphor used to discuss how sales can impact an organization. It is common for the sales team to run into new customer requirements. Some of them are presented as absolute requirements (they usually aren’t).

I have found that the typical response of the sales person at this point is to reply “of course we can do that”, go back to the office and hit the BRSB and notify the dev team that they have $tooShortTimeFrame to implement said feature.

In one very memorable case, I remember going over contract details and trying to figure out what we need to do there. Right there, in a minimum seven figures contract, there was a clause that explained what the core functionality of the system and the set of features that were required for it to be accepted.

Most of it was pretty normal business application, nothing too strange. But section 5.1.3.c was interesting. In it, in dense legalese, there was a requirement to solve the traveling salesman problem. To be fair, it wasn’t actually that problem, it was a scheduling problem and I used the traveling salesman as the name for it because it is easier than trying to explain NP complete issues to layman.

I’ll spoil the ending of this post and reveal that I did not solve an NP complete problem. I cheated like hell and actually solved the issue they had (if you limit the solution space, you drastically simplify the cost of a solution).

Sometimes, the BRSB is used for a good purpose. If you have something that can help close a major sale, and it isn’t outrageous to implement it, for example. But in many cases, it is open for abuse.

FUTURE POSTS

No future posts left, oh my!

RECENT SERIES

  1. re (22):
    19 Aug 2019 - The Order of the JSON, AKA–irresponsible assumptions and blind spots
  2. Design exercise (6):
    01 Aug 2019 - Complex data aggregation with RavenDB
  3. Reviewing mimalloc (2):
    22 Jul 2019 - Part II
  4. Production postmortem (26):
    07 Jun 2019 - Printer out of paper and the RavenDB hang
  5. Reviewing Sled (3):
    23 Apr 2019 - Part III
View all series

Syndication

Main feed Feed Stats
Comments feed   Comments Feed Stats