Ayende @ Rahien

Hi!
My name is Oren Eini
Founder of Hibernating Rhinos LTD and RavenDB.
You can reach me by email or phone:

ayende@ayende.com

+972 52-548-6969


Daisy chaining data flow in RavenDB

time to read 4 min | 685 words

I have talked before about RavenDB's MapReduce indexes and their ability to output results to a collection, as well as RavenDB's ETL processes and how we can use them to push data to another database (either another RavenDB database or a relational one).

Bringing these two features together can be surprisingly useful when you start talking about global distributed processing. A concrete example might make this easier to understand.

Imagine a shoe store (we’ll go with Gary’s Shoes) that needs to track sales across a large number of locations. Because sales must be processed regardless of the connection status, each store hosts a RavenDB server to record its sales. Here is the geographic distribution of the stores:

[Image: geographic distribution of the stores]

To properly manage this chain of stores, we need to be able to look at data across all stores. One way of doing this is to set up external replication from each store location to a central server. This way, all the data is aggregated into a single location. In most cases, this would be the natural thing to do. In fact, you would probably want two-way replication of most of the data so you could figure out if a given store has a specific shoe in stock by just looking at the local copy of its inventory. But for the purpose of this discussion, we’ll assume that there are enough shoe sales that we don’t actually want to have all the sales replicated.

We just want some aggregated data. But we want this data aggregated across all stores, not just at one individual store. Here’s how we can handle this: we’ll define an index that would aggregate the sales across the dimensions that we care about (model, date, demographic, etc.). This index can answer the kind of queries we want, but it is defined on the database for each store so it can only provide information about local sales, not what happens across all the stores. Let’s fix that. We’ll change the index to have an output collection. This will cause it to write all its output as documents to a dedicated collection.
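To make that concrete, here is a rough sketch of what such an index might look like, written in the JavaScript index syntax discussed later on this page. The Sales collection, its fields and the groupBy/aggregate reduce form are illustrative assumptions, and the output collection itself is an option on the index definition rather than part of the script:

```javascript
// Hypothetical MapReduce index: aggregate local sales by model and day.
map('Sales', sale => {
    return {
        Model: sale.Model,
        Day: sale.SoldAt.substring(0, 10), // assuming an ISO date string
        Count: 1
    };
});

groupBy(x => ({ Model: x.Model, Day: x.Day }))
    .aggregate(g => ({
        Model: g.key.Model,
        Day: g.key.Day,
        Count: g.values.reduce((sum, v) => sum + v.Count, 0)
    }));

// The index definition would also name an output collection (say, "DailySales"),
// so every reduce result is written back as a regular document.
```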

Why does this matter? These documents will be written to solely by the index, but given that they are documents, they obey all the usual rules and can be acted upon like any other document. In particular, this means that we can apply an ETL process to them. Here is what this ETL script would look like.

[Image: the ETL script]

The script sends the aggregated sales (the collection generated by the MapReduce index) to a central server. Note that we also added some static fields that will be helpful on the remote server, so we can tell which store each aggregated sale came from. At the central server, you can work with these aggregated sales documents to look at each store's details, or you can aggregate them again to see the state across the entire chain.
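As a sketch of the kind of script the screenshot shows (the store id, region and collection names here are placeholders; RavenDB ETL scripts send documents to the target with a loadTo<Collection>() call):

```javascript
// Hypothetical ETL script attached to the DailySales collection
// (the collection produced by the MapReduce index above).
loadToDailySales({
    Model: this.Model,
    Day: this.Day,
    Count: this.Count,
    // Static fields, so the central server can tell which store this came from.
    Store: 'stores/gary-tlv',
    Region: 'IL'
});
```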

The nice thing about this approach is the combination of features and their end result. At the local level, you have independent servers that can work seamlessly with an unreliable network. They also give store managers a good overview of their local state and what is going on inside their own stores.

At the same time, across the entire chain, we have ETL processes that will update the central server with details about sales statuses on an ongoing basis. If there is a network failure, there will be no interruption in service (except that the sales details for a particular store will obviously not be up to date). When the network issue is resolved, the central server will accept all the missing data and update its reports.

The entire process relies on features that already exist in RavenDB and are easily accessible. The end result is a distributed, highly reliable and fault-tolerant MapReduce process that gives you an aggregated view of sales across the entire chain at very little cost.

Code that? It is cheaper to get a human

time to read 3 min | 597 words

Rafal had a great comment on my previous post:

Much easier with humans in the process - just tell them to communicate and they will figure out how to do it. Otherwise they wouldn't be in the shoe selling business. Might be shocking for the tech folk, but just imagine how many pairs of shoes they would have to sell to pay for a decent IT system with all the features you consider necessary. Of course at some point the cost of not paying for that system will get higher than that…

This relates to having a chain of shoe stores that needs to sync data and operations among the different stores.

Indeed, putting a human in the loop can in many cases be a great thing. A good example of that is order processing. If I can write just the happy path, I can be done very quickly. Anything not in the happy path? Let a human deal with that. That cuts costs dramatically, and it allows you to make intelligent decisions on the spot, with full knowledge of the specific case. It is also quite efficient, since most orders fall into the happy path. It also means that I can come back in a few months, figure out the most common reasons for falling off the happy path, and add them to the software, significantly reducing the amount of work I hand off to humans.

I wish that more systems were built like that.

It is also quite easy to understand why they aren't built with this approach. Humans are expensive. Let's assume that we can pay minimum wage; in the States, that would translate to about 20,000 USD. Note that I'm talking about the actual cost of employment; this calculation includes salary, taxes, insurance, facilities, etc. If I need this to be a 24/7 operation, I have to at least triple it (without accounting for vacation, sick leave, etc.).

At the same time, an x1e.16xlarge machine on AWS with 64 cores and 2 TB of memory will set me back about 40,000 USD a year. And it will likely be able to process transactions much faster than the two minimum wage employees that the same amount of money will get me.

Consider the shoe store and the misdirected check scenario: we need to ensure that the people actually receiving the check understand that it was meant for another store and take some form of action. That means that we can't just take Random Joe Teenager off the street. So another aspect to consider is the training cost. That usually means getting higher quality people and training them on your policies, all of which takes quite a bit of time and effort. Especially if you want to have consistent behavior across the board.

Such a system, taken to the extreme, results in rigid policy without a lot of room for independent action on the part of the people doing the work. I wish I could say that taking it to the extreme was rare, but all you have to do is visit the nearest government office, bank or post office to see common examples of people working within a very narrow set of parameters. The metric for that, by the way, is the number of times per hour that you hear: "There is nothing I can do, these are the rules."

When that is the case, it is much cheaper to have a rigid and inflexible system running on a computer, even accounting for the cost of actually building the system itself.

RavenDB 4.1 features: JavaScript Indexes

time to read 3 min | 600 words

Note: This feature is an experimental one. It will be included in 4.1, but it will be behind an experimental feature flag. It is possible that this will change before full inclusion in the product.

RavenDB now supports multiple operating systems, and we have spent a lot of effort to bring the RavenDB client APIs to more platforms. C#, JVM and Python are already done; Go, Node.js and Ruby are in various beta stages. One of the things this brought up was our indexing structure. Right now, if you want to define a custom index in RavenDB, you use C# Linq syntax to do so. When RavenDB was primarily focused on .NET, that was a perfectly fine decision. However, as we push for more platforms, we want to avoid forcing users to learn the C# syntax when they create indexes.

With no further ado, here is a JavaScript index in RavenDB 4.1:
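As an illustrative sketch (the Employees collection and its fields are made up, assuming the map('Collection', fn) form):

```javascript
// A map-only JavaScript index: one function per collection.
map('Employees', employee => {
    return {
        Name: employee.FirstName + ' ' + employee.LastName,
        Country: employee.Address.Country
    };
});
```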

As you can see, this is a pretty simple translation between the two. It does make a certain set of operations easier, since the JavaScript option is a lot more imperative. Consider the case of this more complex index:
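Something along these lines, as a sketch (the Orders/Companies shapes and the load(id, collection) helper for related documents are assumptions):

```javascript
// Hypothetical index over Orders: a full function body, a related document
// load and Array.reduce to aggregate the order lines.
map('Orders', order => {
    // Load the related company document (load(id, collection) assumed here).
    const company = load(order.Company, 'Companies');

    // Aggregate the order lines as part of building the index entry.
    const total = order.Lines.reduce(
        (sum, line) => sum + line.Quantity * line.PricePerUnit * (1 - line.Discount),
        0);

    return {
        Company: company.Name,
        Country: company.Address.Country,
        Total: total
    };
});
```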

You can see here the interplay of a few features. First, instead of just selecting a value to index, we can use a full fledged function. That means that you can run complex computations during indexing more easily. Features such as loading related documents are there, and you can see how we use reduce to aggregate information as part of the indexing function.

JavaScript's dynamic nature gives us a lot of flexibility. If you want to index fields dynamically, just do so, as you can see here:
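For example, a sketch along these lines (the Products collection and its Attributes array are hypothetical):

```javascript
// Hypothetical: index every attribute of a product as its own field,
// without knowing the attribute names up front.
map('Products', product => {
    const entry = { Name: product.Name };
    for (const attr of (product.Attributes || [])) {
        entry[attr.Name] = attr.Value; // field name decided at indexing time
    }
    return entry;
});
```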

MapReduce indexes work along the same concept. Here is a good example:
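A sketch of what that looks like, assuming the groupBy(...).aggregate(...) reduce form (the Orders shape is illustrative):

```javascript
// Hypothetical MapReduce pair: the same map() shape, plus a reduce
// expressed as groupBy(...).aggregate(...).
map('Orders', order => {
    return { Company: order.Company, Count: 1, Total: order.Total };
});

groupBy(x => ({ Company: x.Company }))
    .aggregate(g => ({
        Company: g.key.Company,
        Count: g.values.reduce((count, v) => count + v.Count, 0),
        Total: g.values.reduce((total, v) => total + v.Total, 0)
    }));
```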

The indexing syntax is the only thing that changed. The rest is all the same. All the capabilities and features that you are used to are still there.

JavaScript is used extensively in RavenDB, not surprisingly. That is how you patch documents, do projections and manage subscriptions. It is also a very natural language for handling JSON documents. I think it is pretty fair to assume that anyone who uses RavenDB will have at least a passing familiarity with JavaScript, and that makes it easier to understand how indexing works.

There is also the security aspect. JavaScript is much easier to control and handle in an embedded fashion. The C# indexes allow users to write their own code that RavenDB will run. That code can, in theory, do anything; this is why index creation is an admin level operation. With JavaScript indexes, we can allow users to run their computations without worrying that they will do something that they shouldn't. Hence, the access level required for creating JavaScript indexes is much lower.

Using JavaScript for indexing does have some performance implications. The C# code is faster, generally, but not much faster. The indexing function isn’t where we usually spend a lot of time when indexing, so adding a bit of additional work there (interpreting JavaScript) doesn’t hurt us too badly. We are able to get to speeds of over 80,000 documents / second using JavaScript indexes, which should be sufficient. The C# indexes aren’t going anywhere, of course. They are still there and can provide additional flexibility / power as needed.

Another feature that might be very useful is the ability to attach additional sources to an index. For example, you may really want to compute a sum using lodash. You can add the lodash.js file as an additional file to the index, and that would expose the library to the indexing functions.
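A rough sketch of what that might look like, assuming lodash.js has been attached to the index as an additional source and exposes the usual _ object:

```javascript
// Hypothetical index that relies on lodash.js being attached to the index
// as an additional source, exposing the usual _ object.
map('Orders', order => {
    return {
        Company: order.Company,
        Total: _.sum(order.Lines.map(l => l.Quantity * l.PricePerUnit))
    };
});
```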

RavenDB 4.1 features: SQL Migration Wizard

time to read 2 min | 234 words

One of the new features coming up in 4.1 is the SQL Migration Wizard. Its purpose is very simple: to get you started faster and with less work. In many cases, when you start using RavenDB for the first time, you'll first need to put some data in to play with. We have the sample data, which is great to start with, but you'll want to use your own data and work with that. This is what the SQL Migration Wizard is for.

You start it by pointing it at your existing SQL database, like so:

[Image: pointing the wizard at an existing SQL database]

The wizard will analyze your schema and suggest a document model based on that. You can see how this looks here:

[Image: the suggested document model]

In this case, you can see that we are taking a linked table (employee_privileges) and turning it into an embedded collection. You also have additional options and you'll be able to customize it all.
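Conceptually, the result is something like this (the field names are invented for illustration, shown as a plain object the way the Node.js client would see the document):

```javascript
// Hypothetical resulting document: the linked employee_privileges rows
// become an embedded array on the employee document itself.
const employee = {
    FirstName: 'Jane',
    LastName: 'Doe',
    Privileges: [
        { Name: 'CanApproveRefunds' },
        { Name: 'CanOpenRegister' }
    ]
};
```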

The point of the migration wizard is not so much to actually do the real production migration but to make it easier for you to start playing around with RavenDB with your own data. This way, the first step of “what do I want to use it for” is much easier.

Roadmap for RavenDB 4.1

time to read 2 min | 227 words

We are gearing up to start work on the next release of RavenDB, following the 4.0 release. I thought this would be a great time to talk about the kind of things that we want to do there. This is going to be a minor point release, so we aren't going to shake things up.

The current plan is to release 4.1 about 6 months after the 4.0 release, in the July 2018 timeframe.

Instead, we are planning to focus on the following areas:

  • Performance
    • Moving to .NET Core 2.1 for the performance advantages this gives us.
    • Starting to take advantage of new features in .NET Core 2.1, such as Span<T>.
    • Updating the JavaScript engine for better query / patch performance.
  • Wildcard certificates via Let's Encrypt, which can simplify cluster management when RavenDB generates the certificates.
  • Restoring highlighting support

We are also going to introduce the notion of experimental features. That is, features that are ready from our perspective but still need some time out in the sun getting experience in production. For 4.1, we have the following features slated for experimental inclusion:

  • JavaScript indexes
  • Distributed counters
  • SQL Migration wizard

I'll have a dedicated post about each of these topics, because I cannot do them justice in just a few words.

Open sourcing code is a BAD default policy

time to read 11 min | 2011 words

I ran into this Medium post that asks: why is this code open-sourced? Let's flip the question. The premise of the post is interesting, given that the author argues that the default mode for code should be open source. I find myself in the strange position of being a strong open source adherent who very strongly disagrees with pretty much every point in this article. Please sit tight, this may take a while; this article really annoyed me.

Just to set the stage: I have been working on open source software for the past 15 years. The flagship product that we make is open source and available on GitHub, and we practice a very open development process. I was also very active in a number of high profile open source projects for many years, and I have built and released quite a few open source projects on my own. I feel that I'm quite qualified to speak from experience on this subject.

The quick answer for why the default for a codebase shouldn’t be open source is that it costs. In fact, there are several very different costs around that.

The most obvious one is the reputation cost for the individual developer. If you push bad stuff out there (like this 100+ line method), that can have a real impact on people's perception of you. There is a very different model for internal interaction inside the team versus stuff that is shown externally, without the relevant context. A lot of people don't like this exposure to external scrutiny. That leads to things like: "clean up the code before we can open source it." You can argue that this is something that should have been done in the first place, but that doesn't change the fact that this is a real concern and adds more work to the process.

Speaking of work, just throwing code over the wall is easy. I'm going to assume that the purpose isn't just to do that. The idea is to make something useful, and that means that aside from the code itself, there are also a lot of other aspects that need to be handled. For example, engaging the community, writing documentation, ensuring that the build process can run on a wide variety of machines. Even if the project is only deployed on Ubuntu 16.04, we still need to update the build script for that macOS guy. Oh, this is open source and they sent us a PR to fix that. Great, truly. But who is going to maintain that over time?

Open source is not an idyllic landscape where you dump your code and someone else comes along to garden it for you.

And now, let me see if I can answer the points from the article in detail:

  • Open-source code is more accessible - Maintainers can get code reviews … consumers from anywhere in the world … can benefit from something I was lucky enough to be paid for building.

First, drive-by code reviews are rare. As in, they happen extremely infrequently. I know that because I do them for interesting projects, and I explicitly invited people to do the same for my projects and got very little response. People who are actually using the software will go in and look at the code (or some parts of it), and that can be very helpful, but expecting that just because your code is open source you'll get reviews and help is setting yourself up for failure.

There is also the interesting tidbit there about consumers benefiting from something that the maintainers were paid to build. That part is a pretty important one, because there is a side in this discussion that hasn't been introduced. We had maintainers and consumers, but what about the party that ends up paying the bills? I mean, given that this is paid work, this isn't the property of the maintainer; it belongs to the people who actually paid for the work. So any discussion of the benefits of open sourcing the code should start from the benefits for these people.

Now, I’m perfectly willing to agree (in fact, I do agree, since my projects are in the open) that there are good and valid reasons to want to open source a project and community feedback is certainly a part of that. But any such discussion should start with the interests of the people paying for the code and how it helps them. And part of that discussion should involve the real and non trivial costs of actually open sourcing a project.

  • Open-source code keeps us healthy - Serotonin and Oxytocin are chemicals in the brain that make you feel happiness and love. Open source gives you that.

I did a bad job summarizing this part, quite intentionally. Mostly because I couldn’t quite believe what I was reading. The basic premise seems to be that by putting your code out there you open yourself to the possibility of someone seeing your code and sending you a “Great Job” email and making your day.

I… guess that can happen. I certainly enjoy it when it happens, sure. Why would I say no to something like that?

Well, to start with, it happens, sure, but it isn’t a major factor in the decision making process. I’ll argue that if you think that compliments from random strangers are so valuable, just get in and out of Walmart in a loop. There are perfect strangers there that will greet you every single time. Why wouldn’t you want to do that?

More to the point, even assuming that you have a very popular project and lots of people write to tell you how awesome you are, this gets tiring fast. What is worse is throwing code over the wall and expecting the pat on the back. But no one cares; actually getting them to care takes a whole lot of additional work.

And we haven't even mentioned the other side of an open source project: the users who believe that just because your code is open source, they are entitled to all your time and effort (for free), expect you to fix any issues they find (immediately, of course), and are quite rude and obnoxious about it. There aren't a lot of them, but literally any open source project with anything but the smallest of followings will have to handle them at some point. And often dealing with such a disappointed user means dealing with abuse. That can be exhausting and painful.

Above, I pointed out a piece of code in the open that is open to critique. This is a piece of code that I wrote, so I feel comfortable telling you that it isn't so good. But imagine that I took your code and did that? It is very easy to get offended by this, even when there was no intent to offend.

  • Open-source code is more maintainable – Lots of tools are free for OSS projects

So? This is only ever valuable if you assume that tooling is expensive (it isn't). The article mentions tools such as Travis-CI, Snyk, Codecov and Dependencies.io that offer a free tier for open source projects. I went ahead and priced these services for a year on the default plans for each. The total yearly cost of all of them was around $8,000. That sounds like a lot of money, but only if you assume that you are an individual working for free. Assuming that you are actually getting paid, the cost of such tools and services is minuscule compared to other costs (such as developer salaries).

So admittedly, this is a very nice property of open source projects, but it isn’t as important as you might imagine it would be. In a team of five people, if the effort to open source the project is small, only taking a couple of weeks, it will take a few years to recoup that investment in time (and I’m ignoring any additional effort to run the open source portion of the project).

  • Open-source code is a good fit for a great engineering culture

Well, no. Not really. You can have a great engineering culture without open source, and you can have a really crappy engineering culture with open source. They sometimes go in tandem, but they aren't really related. Investing in the engineering culture is probably going to be much more rewarding for a company than just open sourcing projects. Of particular interest to me is this quote:

Engineers are winning because they can autonomously create great projects that will have the company’s name on it: good or bad…

No, engineers do not spontaneously create great projects. That comes from hard work, guidance and a lot of surrounding infrastructure. Working in open source doesn't mean that you don't need coordination, a high level vision and good attention to detail. This isn't a magic sauce.

What is more, and this is really hammering the point home: good or bad. Why would a company want to attach its name to something that can be good or bad? That seems like a very unnecessary gamble. So in order to avoid publicly embarrassing the company, there will be a need to do the work to make sure that the result is good. But the alternative to that is not having a bad result; the alternative is to not open source the code.

Now, you might argue that such a thing is not required if the codebase is good to begin with, and I’ll agree. But then again, you have things like this that you’ll need to deal with. Also, be sure that you cleaned up both the code and the commit history.

  • Just why not

The author goes on to gush about the fact that there are practically no reasons not to go open source, and that projects such as frameworks, languages, operating systems and databases are all open source and very successful.

I think that this gets to the heart of the matter. There is the implicit belief that the important thing about an open source project is the code. That couldn't be further from the truth. Oh, of course, the code is the foundation of the project, but foundations can be replaced (see: Firefox, OpenSSL → BoringSSL, React, etc.).

The most valuable thing about an open source project is the community. The contributors and users are what make a project unique and valuable. In other words, to misquote Clinton: it's the community, stupid.

And a community doesn't just spring up from nowhere; it takes effort, work and a whole lot of time to build. And only when you have a community of sufficient size will you start to see an actual return on investment for your efforts. Until that point, all of that is basically a sunk cost.

I'm an open source developer; pretty much all the code I have written in the past decade or so is under one open source license or another and is publicly available. And with all that experience behind me, I can tell you what really annoyed me the most about this article. It isn't an article about promoting open source. It is an article that, I feel, promotes just throwing code over the wall and expecting flowers to grow. That isn't the right way to do things. And it really bugged me that in all of this article there wasn't a single word about the people who actually paid for this code to be developed.

Note that I'm not arguing for closed source solutions for things like IP, trade secrets, secret sauce and the like. These are valid concerns and need to be addressed, but that isn't the issue. The issue is that open sourcing a project (vs. throwing the code on GitHub) is something that should be done in a forthright manner, with a clear understanding of the costs, risks and ongoing investment involved. This isn't a decision you make because you don't want to pay for a private repository on GitHub.

Times are hard

time to read 2 min | 277 words

One of the things RavenDB does is allow you to define a backup task that will be executed on a given schedule (such as every Saturday at midnight). However, as it turns out, specifying the right time is actually a pretty hard thing to do. The problem is what to do when you have multiple time zones involved:

  • UTC
  • The server local time
  • The operator’s local time
  • The business hours of the application using the database

In some cases, you might have a server in Germany being managed from Japan with users primarily from South Africa. There are at least four different options for when Saturday’s midnight is, and the one sure thing is that it will happen when you least want it to.

Because of that, RavenDB takes the simple position that the only time it cares about is the server's own time. An operator is free to define the schedule as they wish, but only the server local time is relevant. We still need to make the operator's job easier, though, and we do that using the following method:

[Image: the backup schedule configuration]

The operator can specify the schedule using CRON syntax (which should be familiar to most admins). We translate the CRON expression to a human readable string, but we also show the next backup date in the server's time (when it will actually run), in the operator's local time (which, as you can see, is a bit different from the server) and the duration until then. The latter is actually really important, because it gives the operator an intuitive understanding of when the backup is going to run next.
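For example, a schedule for "every Saturday at midnight" might be expressed along these lines (a sketch; the rendering details are only indicative):

```javascript
// A hypothetical backup schedule: every Saturday at midnight, *server local* time.
// CRON fields: minute  hour  day-of-month  month  day-of-week  (6 = Saturday)
const backupSchedule = '0 0 * * 6';

// Alongside the raw expression, the operator is shown a human readable form
// ("at 00:00, only on Saturday"), the next run in server time, the next run
// in the operator's local time and the duration until then.
```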

Avoid a standalone DbService process

time to read 3 min | 520 words

The trigger for this post is the following question in the RavenDB mailing list. Basically, given a system that is composed of multiple services (running as separate processes), the question is whether to have each service use its own DocumentStore or to have a separate service (DbService) process that will encapsulate all access to RavenDB. The idea, as I understand it, is to avoid the DocumentStore creation because it is expensive.

The quick answer here is simple: <blink>Don't ever do that!</blink>*

* Yes, I’m old.

That is all, you don’t need to read the rest of this post.

Oh, you are still here? Well, as long as you are here, let me explain my reasoning for such a reaction.

DocumentStore isn't actually expensive to create. In fact, for most purposes, it is quite cheap. It holds no network resources on its own (connection pooling is handled by a global pool anyway). All it does is manage the HTTP cache on the client, cache things like serialization information, etc.

The reason we recommend that you not create document stores all the time is that we saw people creating a document store for the purpose of using a single session and then disposing of it. That is quite wasteful; it forces us to allocate more memory and prevents the use of caching entirely. But creating a few document stores, one for each service that you have? That is cheap to do.
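For example, with the Node.js client (still in beta, as mentioned in an earlier post here), a per-service singleton is about this much code (the URL and database name are placeholders):

```javascript
// Minimal sketch with the RavenDB Node.js client: one DocumentStore per
// service process, created at startup and reused for every session.
const { DocumentStore } = require('ravendb');

const store = new DocumentStore('https://rvn-1.example.local:8080', 'Shop');
store.initialize();

module.exports = store;

// Elsewhere in the same service, open short-lived sessions off the shared store:
// const session = store.openSession();
```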

What really triggered this post is the idea of having a separate process just to host the DocumentStore, the DbService process. This is a bad idea. Let me count the ways.

Your service process needs some data, so it will go to the DbService (over HTTP, probably) and ask for it. Your DbService will then call to RavenDB to get the data using the normal session and return the data to the original service. That service will process the data, maybe mutate it and save it back. It will have to do that by sending the data back to the DbService process, which will create a new session and save it to RavenDB.

This adds another round trip to every database query, and it means that you can't natively express queries inside your service (since you need to send them to the DbService). It creates strong ties between all the services you have and the DbService, as well as a single point of failure. Even if you have multiple copies of DbService, you now need to write the code to do automatic failover between them. Adding a field to a class in one service means that you have to deploy a new DbService that recognizes the new field, for example.

In terms of client code, aside from having to write awkward queries, you also need to deal with serialization costs, and you have to write your own logic for change tracking, unit of work, etc.

In other words, this has all the disadvantages of a repository pattern with the added benefit of making many remote calls and seriously complicating deployment.

Soliciting feedback about RavenDB 4.0 and TODOs for 4.1

time to read 1 min | 76 words

With RavenDB 4.0 out and about for a few months already, we have been mostly focused on finishing up the release. That meant working on documentation (the book is already past the 500 pages mark!), additional clients, helping clients to go to production with 4.0 and gathering feedback.

In fact, that is the point of this post: I would really like to know your thoughts about RavenDB 4.0 and what should go into the next version.

Rejection, dejection and resurrection, oh my!

time to read 4 min | 607 words

Regardless of how good your software is, there is always a point where we can put more load on the system than it is capable of handling.

One such case is when you are firing about a hundred requests per second, regardless of whether the previous requests have completed, while at the same time throttling the I/O so the requests can't be completed fast enough.

What happens then is known as a convoy. Requests start piling up; as more and more work is waiting to be done, we fall further and further behind. The typical way this ends is when you run out of resources completely. If you are using a thread per request, you end up with all your threads blocked on some lock. If you are using async operations, you consume more and more memory as you hold the async state of each request until it is completed.

We put a lot of pressure on the system, and we want to know that it responds well. And the way to do that is to recognize that there is a convoy in progress and handle it. But how can you do that?

The problem is that you are currently in the middle of processing a set of operations in a transaction. We can obviously abort it, and roll back everything, but the problem is that we are now in the second stage. We have a transaction that we wrote to the disk, and we are waiting for the disk to come back and confirm that the write is successful while already speculatively executing the current transaction. And we can’t abort the transaction that we are currently writing to disk, because there is no way to know at what stage the write is. 

So we now need to decide what to do, and we chose the following set of behaviors. When running a speculative transaction (a transaction that runs while the previous transaction is being committed to disk), we observe the amount of memory that is used by this transaction. If the amount of memory being used is too high, we stop processing incoming operations and wait for the previous transaction to come back from the disk.

At the same time, we might still be getting new operations to complete, but we can't process them. At this point, after we have waited long enough to start worrying, we begin proactively rejecting requests, telling the client immediately that we are in a timeout situation and that they should fail over to another node.
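This is not RavenDB's actual code, but the decision logic described above boils down to something like the following sketch (the thresholds and the state shape are invented for illustration):

```javascript
// Illustrative only: the back-pressure decision described above, with made-up
// thresholds and a simplified view of the state the server would track.
const MAX_SPECULATIVE_MEMORY_BYTES = 256 * 1024 * 1024; // pause intake past this
const MAX_WAIT_BEFORE_REJECT_MS = 5000;                 // reject new requests past this

// state: { speculativeMemoryBytes, msWaitingOnPreviousCommit }
function backPressureDecision(state) {
    if (state.speculativeMemoryBytes <= MAX_SPECULATIVE_MEMORY_BYTES) {
        return 'process';               // keep executing incoming operations
    }
    if (state.msWaitingOnPreviousCommit < MAX_WAIT_BEFORE_REJECT_MS) {
        return 'pause';                 // wait for the previous transaction to hit disk
    }
    return 'reject-and-failover';       // tell clients to go to another node right now
}

// Example: a speculative transaction holding 300 MB after waiting 6 seconds.
console.log(backPressureDecision({
    speculativeMemoryBytes: 300 * 1024 * 1024,
    msWaitingOnPreviousCommit: 6000
})); // -> 'reject-and-failover'
```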

The key problem is that I/O is, by its nature, highly unpredictable, and may be impacted by many things. On the cloud, you might hit your IOPS limits and see a drastic drop in performance all of a sudden. We considered a lot of ways to actually manage it ourselves, by limiting what kind of I/O operations we’ll send at each time, queuing and optimizing things, but we can only control the things that we do. So we decided to just measure what is going on and react accordingly.

Beyond being proactive about incoming requests, we also make sure to surface these kinds of details to the user:

[Image: the I/O slowness notification shown to the admin]

Knowing that the I/O system may be giving us this kind of response can be invaluable when you are trying to figure out what is going on. And we made sure that this is very clearly displayed to the admin.
