Ayende @ Rahien



Daisy chaining data flow in RavenDB

time to read 4 min | 685 words

I have talked before about RavenDB’s MapReduce indexes and their ability to output results to a collection, as well as RavenDB’s ETL processes and how we can use them to push some data to another database (a RavenDB database or a relational one).

Bringing these two features together can be surprisingly useful when you start talking about global distributed processing. A concrete example might make this easier to understand.

Imagine a shoe store (we’ll go with Gary’s Shoes) that needs to track sales across a large number of locations. Because sales must be processed regardless of the connection status, each store hosts a RavenDB server to record its sales. Here is the geographic distribution of the stores:

[Map: the geographic distribution of the stores]

To properly manage this chain of stores, we need to be able to look at data across all stores. One way of doing this is to set up external replication from each store location to a central server. This way, all the data is aggregated into a single location. In most cases, this would be the natural thing to do. In fact, you would probably want two-way replication of most of the data so you could figure out if a given store has a specific shoe in stock by just looking at the local copy of its inventory. But for the purpose of this discussion, we’ll assume that there are enough shoe sales that we don’t actually want to have all the sales replicated.

We just want some aggregated data, but we want it aggregated across all stores, not just at one individual store. Here’s how we can handle this: we’ll define an index that aggregates the sales across the dimensions that we care about (model, date, demographic, etc.). This index can answer the kind of queries we want, but it is defined on each store’s database, so it can only provide information about local sales, not what happens across all the stores. Let’s fix that. We’ll change the index to have an output collection. This will cause it to write all its output as documents to a dedicated collection.
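The post doesn’t show the index itself, so here is a minimal sketch of what such a MapReduce index with an output collection might look like using the .NET client; all the class, field, and collection names are illustrative assumptions, not the actual code:

    using System;
    using System.Linq;
    using Raven.Client.Documents.Indexes;

    public class Sale
    {
        public string Model { get; set; }
        public DateTime At { get; set; }
        public decimal Price { get; set; }
    }

    public class Sales_DailyByModel : AbstractIndexCreationTask<Sale, Sales_DailyByModel.Result>
    {
        public class Result
        {
            public string Model { get; set; }
            public DateTime Date { get; set; }
            public int Count { get; set; }
            public decimal Total { get; set; }
        }

        public Sales_DailyByModel()
        {
            // Map: one entry per sale, keyed by model and day.
            Map = sales => from sale in sales
                           select new Result
                           {
                               Model = sale.Model,
                               Date = sale.At.Date,
                               Count = 1,
                               Total = sale.Price
                           };

            // Reduce: aggregate the entries for each (model, day) pair.
            Reduce = results => from r in results
                                group r by new { r.Model, r.Date } into g
                                select new Result
                                {
                                    Model = g.Key.Model,
                                    Date = g.Key.Date,
                                    Count = g.Sum(x => x.Count),
                                    Total = g.Sum(x => x.Total)
                                };

            // The key piece: the reduce output is written as documents to this
            // collection, where they behave like any other documents.
            OutputReduceToCollection = "AggregatedSales";
        }
    }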

Why does this matter? These documents are written solely by the index, but given that they are documents, they obey all the usual rules and can be acted upon like any other document. In particular, this means that we can apply an ETL process to them. Here is what such an ETL script would look like.

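The script itself was shown as an image in the original post. RavenDB ETL transform scripts are written in JavaScript; a minimal sketch of what this one might look like, with illustrative collection, field, and store names:

    // ETL transform script, defined on the AggregatedSales collection.
    // It runs for each document in the collection and sends the result
    // to the target database on the central server.
    loadToAggregatedSales({
        Model: this.Model,
        Date:  this.Date,
        Count: this.Count,
        Total: this.Total,

        // Static fields, so the central server can tell which store
        // each aggregated sale came from.
        Store:  "stores/chicago",
        Region: "us-midwest"
    });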

The script sends the aggregated sales (the collection generated by the MapReduce index) to a central server. Note that we also added some static fields so that the remote server can tell which store each aggregated sale came from. At the central server, you can work with these aggregated sales documents to see each store’s details, or you can aggregate them again to see the state across the entire chain.

The nice thing about this approach is the combination of features and their end result. At the local level, you have independent servers that can work seamlessly with an unreliable network. They also give store managers a good overview of their local state and what is going on inside their own stores.

At the same time, across the entire chain, we have ETL processes that will update the central server with details about sales statuses on an ongoing basis. If there is a network failure, there will be no interruption in service (except that the sales details for a particular store will obviously not be up to date). When the network issue is resolved, the central server will accept all the missing data and update its reports.

The entire process relies on features that already exist in RavenDB and are easily accessible. The end result is a distributed, highly reliable, and fault-tolerant MapReduce process that gives you an aggregated view of sales across the entire chain at very little cost.

RavenDB 4.1 Features: Highlighting

time to read 2 min | 238 words

This is actually an old feature that didn’t make the cut to enter 4.0. It is now back, and it is roaring. This is the kind of feature that is useful if you are utilizing RavenDB’s search capabilities. Let us assume that you want to search for something, but instead of querying for “give me all the active users” you want to actually… search. For example, you want to search for all employees with a BA in their bio. However, you don’t want to just get the matches, you want to show the user why each result matched.

That is the problem that highlighting is meant to solve. Consider the following query:

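The query was shown as a studio screenshot. A sketch of that kind of query in RQL, assuming the sample Northwind data and a hypothetical full-text index over the employee bio (the Notes field), might look like this:

    from index 'Employees/ByNotes'
    where search(Notes, 'BA')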

Which returns the following results:

[Screenshot: the query results in the studio]

Why did we get these particular employees? Let’s find out:

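In RQL, asking for the highlights might look something like this; the fragment length and count parameters are illustrative, and the field may need to be indexed with term vectors for highlighting to be available:

    from index 'Employees/ByNotes'
    where search(Notes, 'BA')
    include highlight(Notes, 128, 1)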

Now we are asking the server to highlight for us the reason for the match. You can see this in the studio directly, in the Highlight tab:

[Screenshot: the Highlight tab in the studio]

Using this approach, you can enrich the search results and provide a nicer experience for your users.

Inside RavenDB 4.0: Book update

time to read 1 min | 66 words

Just to let you know, the book is pretty much edited; that means you won’t have to suffer through my horrible sentence structure.

You can read this here.

What remains to be done now is for me to go over the book again, verify that there aren’t any issues, and we are done.

In other words, we are now “Done, Done” in the “Done, Done, Done” scale.

Code that? It is cheaper to get a human

time to read 3 min | 597 words

Rafal had a great comment on my previous post:

Much easier with humans in the process - just tell them to communicate and they will figure out how to do it. Otherwise they wouldn't be in the shoe selling business. Might be shocking for the tech folk, but just imagine how many pairs of shoes they would have to sell to pay for a decent IT system with all the features you consider necessary. Of course at some point the cost of not paying for that system will get higher than that…

This relates to having a chain of shoe stores that needs to sync data and operations among the different stores.

Indeed, putting a human in the loop can in many cases be a great thing. A good example of that is order processing. If I only have to write the happy path, I can be done very quickly. Anything not in the happy path? Let a human deal with it. That cuts down costs enormously, and it allows you to make intelligent decisions on the spot, with full knowledge of the specific case. It is also quite efficient, since most orders fall into the happy path. It also means that I can come back in a few months, figure out the most common reasons for falling off the happy path, and add them to the software, significantly reducing the amount of work I hand off to humans.

I wish that more systems were built like that.

It is also quite easy to understand why systems aren’t built with this approach: humans are expensive. Let’s assume that we can pay minimum wage; in the States, that would translate to about 20,000 USD a year. Note that I’m talking about the actual cost of employment: this calculation includes the salary, taxes, insurance, facilities, etc. If I need 24/7 coverage, I have to at least triple it (without accounting for vacation, sick leave, etc.).

At the same time, an x1e.16xlarge machine on AWS with 64 cores and 2 TB of memory will set me back about 40,000 USD a year. And it will likely be able to process transactions much faster than the two minimum wage employees that the same amount of money will get me.

Consider the shoe store and the misdirected check scenario: we need to ensure that the people actually receiving the check understand that it is meant for another store and take some form of action. That means that we can’t just take Random Joe Teenager off the street. So another aspect to consider is the training cost. That usually means getting higher quality people and training them on your policies, all of which takes quite a bit of time and effort, especially if you want consistent behavior across the board.

Such a system, taken to the extreme, results in rigid policy without a lot of room for independent action on the part of the people doing the work. I wish I could say that taking it to the extreme was rare, but all you have to do is visit the nearest government office, bank, or post office to see common examples of people working within a very narrow set of parameters. The metric for that, by the way, is the number of times per hour that you hear: “There is nothing I can do, these are the rules.”

In such a system, it is much cheaper to have a rigid and inflexible system running on a computer, even with the cost of actually building the system itself.

Data ownership: The story of an invoice

time to read 3 min | 478 words

Let’s talk about Gary, and Gary’s Shoes. Gary runs a chain of shoe stores across the nation. As part of refreshing their infrastructure, Gary wants to update all the software across the entire chain. The idea is to have unified billing, inventory, sales, and time tracking for the entire chain.

Gary doesn’t spend a lot of time on this (after all, he has to sell shoes); he just installed a sync service between all the stores and HQ to sync up all the data. Well, I call it a sync service. What it actually turned out to be is that the unified system is a set of Excel files in a shared Dropbox folder.

Feel free to go wash your face, have a drink, take a Xanax. I know it might be a shock to see something like this.

Surprisingly enough, this isn’t the topic of my post. Instead, I want to talk about data ownership here.

Imagine that one of Gary’s stores in Chicago sold a bunch of shoes, then issued an invoice to the customer. They dutifully recorded the order in the Orders.xlsx file with the status “Pending Payment”.

That customer, however, accidentally sent the check to the wrong store. No biggie, right? The clerk at the second store can just go ahead and update the order in the shared system, marking it as “Paid in full”.

As it turns out, this is a problem. And the easiest way to explain why is data ownership. The owner of this particular record is the original store. You might say that this doesn’t matter; after all, the change happened in the same system. But the problem is that this is almost always not the case.

In addition to the operational “system” (those shared Excel files), there are other things. The store manager still has a Post-it note to call that customer and ask about the missing payment. The invoice that was generated needs to be closed, etc. Just updating it in the system isn’t going to cause all of that to happen.

The proper way to handle this is to call the owner of the data (the original store) and let them know that the check arrived at the wrong store. At this point, the data owner can decide how to handle that new information, apply whatever workflows need to run, etc.

I intentionally used what looks like a toy example, because it is easy to get bogged down in the details. But in any distributed system, there are local processes that can be quite important. If you go ahead and update an owner’s information behind their back, you are guaranteed to break something. And I haven’t even begun to talk about the chance of conflicts, of course.

How to really fail a coding interview

time to read 1 min | 154 words

Our current interview question is from this post. We use it between the phone interview and the actual interview to get a feel for a candidate’s abilities. You can learn a surprising amount of information from even a small amount of code.

Note that one of the primary goals of such a question isn’t to tell you “You should really hire this candidate” but rather “You really shouldn’t.” To clarify, this is a “do it on your own, with the whole internet at your disposal” kind of question. Typically we give a week or so to answer it.

Sometimes we get a very clear signal from the code, like in the case of this code:


But I think the crowning glory was this code:

I picked two of the worst offenders, but there were more. Some things I can sort of let slide, and some things I’ll just say no to.

DotNetRocks show on RavenDB with Kamran Ayub

time to read 1 min | 107 words

Kamran Ayub did a great DotNetRocks show about RavenDB 4.0. Kamran is also doing the RavenDB 4.0 course on PluralSight, so he knows his stuff.

I have to say, it is… strange to listen to a podcast about RavenDB. I found myself nodding along quite often, and the outside perspective is pretty awesome.

Kamran also tested the same application on RavenDB 3.5 and RavenDB 4.0, seeing a 20x performance improvement. Best quote from the show, as far as I’m concerned:

So fast you aren’t sure it actually worked.

Kamran also has a follow-up post with some numbers and more details here.

Listen to the show here.

RavenDB online bootcamp is now updated to 4.0

time to read 1 min | 136 words

In addition to the book and the documentation, we are also working on making it easier to get started with RavenDB. The RavenDB Bootcamp is a self-directed course meant to give you an easy way to start using RavenDB.

This is a guided tour, walking you through the fundamentals of getting RavenDB up and running, how to put data in and query it, and how you can use indexing and MapReduce. These are short lessons, providing practical experience and guidance on how to start using RavenDB.

You can also register to get a lesson a day.

This is now updated to RavenDB 4.0, smoothing the learning curve and making it even simpler to get started.

Performance optimization starts at the business process level

time to read 3 min | 447 words

I had an interesting discussion today about optimization strategies. This is a fascinating topic, and quite rewarding as well, mostly because it is so easy to see your progress. You have a number, and if it goes in the right direction, you feel awesome.

Part of the discussion was how the use of a certain technical solution was able to speed up a business process significantly. What really interested me was that I felt that there was a lot of performance still left on the table because of the limited nature of the discussion.

It is easier if we do this with a concrete example. Imagine that we have a business process such as underwriting a loan. You can see how that looks below:

[Diagram: the loan underwriting checks running serially]

This process is set up so that there is a series of checks that the loan must pass before approval. The lender wants to speed up the process as much as possible. In the case we discussed, the optimizations were mostly about the speed at which we can move the loan application from one stage to the next. The idea is to keep all parts of the system as busy as possible and maximize throughput. The problem is that there is only so much we can do with a serial process like this.

From the point of view of the people working on the system, it is obvious that you need to run the checks in this order. There is no point in doing anything else. If there is not enough collateral, why should we run the legal status check, for example?

Well, what if we changed things around?

[Diagram: the loan underwriting checks running concurrently]

In this mode, we run all the checks concurrently. If most of our applicants are valid, this means that we can significantly speed up the time to loan approval. Even if a significant number of people are going to be denied, the question now becomes whether it is worth the trouble (and expense) to run the additional checks.
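To make the difference concrete, here is a minimal C# sketch of fanning the checks out concurrently; the check methods and types are hypothetical stand-ins, not code from the actual system:

    using System.Linq;
    using System.Threading.Tasks;

    public class LoanApplication { /* application details elided */ }

    public class Underwriting
    {
        // Hypothetical checks; each returns true when the application passes.
        Task<bool> CheckCollateralAsync(LoanApplication app) => Task.FromResult(true);
        Task<bool> CheckLegalStatusAsync(LoanApplication app) => Task.FromResult(true);
        Task<bool> CheckCreditHistoryAsync(LoanApplication app) => Task.FromResult(true);

        public async Task<bool> ApproveAsync(LoanApplication app)
        {
            // Start all the checks up front instead of awaiting each in turn.
            var checks = new[]
            {
                CheckCollateralAsync(app),
                CheckLegalStatusAsync(app),
                CheckCreditHistoryAsync(app),
            };

            // Total latency is now the slowest single check,
            // not the sum of all of them.
            var results = await Task.WhenAll(checks);
            return results.All(passed => passed);
        }
    }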

At this point, it is a business decision, because we are mucking about with the business process itself. Don’t get too attached to this example; I chose it because it is simple and makes the difference in the business processes obvious.

The point is that not thinking about this at that level completely blocks you from what is a very powerful optimization. There is only so much you can do within the box, but if you can get a different box…

RavenDB 4.1 Features: Counting my counters

time to read 3 min | 501 words

[Screenshot: a document describing a software release, with counters for downloads and ratings]

Documents are awesome; they allow you to model your data in a very natural way. At the same time, there are certain things that just don’t fit into the document model.

Consider the simple case of counting. This seems like it would be very obvious, right? As simple as 1+1. However, you need to also consider concurrency and distribution. Look at the image above. What you can see there is a document describing a software release. In addition to tracking the features that are going into the release, we also want to count various statistics about it. In this example, you can see how many times the release was downloaded, how many times it was rated, etc.

I’ll admit that the stars rating is a bit cheesy, but it looks good and actually tests that we have good Unicode support.

Except for a slightly nicer way to show numbers on the screen, what does this feature give you? It means that RavenDB now natively understands how to count things. You can increment (or decrement) a value without modifying the whole document. It also means that RavenDB will automatically handle concurrency on the counters, even when running in a distributed system. This makes the feature suitable for cases where you:

  • want to increment a value
  • don’t mind (and usually explicitly want) concurrent updates
  • may need to handle a very large number of operations

The case of the download counter or the rating votes is a classic example. Two separate clients may increment either of these values at the same time as a third user is modifying the parent document. All of that is handled by RavenDB: the data is updated, distributed across the cluster, and the final counter values are tallied.
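As a sketch, incrementing counters from the .NET client might look like this; the URL, database name, document id, and counter names are all illustrative:

    using Raven.Client.Documents;

    using (var store = new DocumentStore
    {
        Urls = new[] { "http://localhost:8080" },
        Database = "Releases"
    }.Initialize())
    using (var session = store.OpenSession())
    {
        // Increment counters without loading or rewriting the parent document.
        var counters = session.CountersFor("releases/4.1");
        counters.Increment("downloads");       // delta defaults to 1
        counters.Increment("rating-votes", 1);

        // Counter changes participate in the session's transaction.
        session.SaveChanges();
    }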

Counters cannot cause conflicts, and the only operations you are allowed to perform on them are incrementing and decrementing the counter value. These are cumulative operations, which means that we can easily handle concurrency at the local node or cluster level by merging the values.

Other operations (deleting a counter, deleting the parent document) are of course not cumulative, but they are much rarer and don’t typically need any sort of cooperative concurrency.

Counters are not standalone values but are strongly associated with their owning document. Much like the attachments feature, this means you have a structured way to add additional data types to your documents. Use counters to, well… count. Use attachments to store binary data, etc. You are going to see a lot more of this in the future, since there are a few things in the pipeline that we are already planning to add.

You can use counters in a single operation (incrementing a value) or in a batch (incrementing multiple values, or even modifying counters and documents together). In either case, the operation is transactional and will ensure full ACIDity.
