Ayende @ Rahien


Data ownership: The story of an invoice

time to read 3 min | 478 words

Let’s talk about Gary, and Gary’s Shoes. Gary runs a chain of shoe stores across the nation. As part of refreshing their infrastructure, Gary wants to update all the software across the entire chain. The idea is to have unified billing, inventory, sales and time tracking for the entire chain.

Gary doesn’t spend a lot of time on this (after all, he has to sell shoes); he just installed a sync service between all the stores and HQ to sync up all the data. Well, I call it a sync service. What it actually turned out to be is that the unified system is a set of Excel files in a shared Dropbox folder.

Feel free to go and wash your face, have a drink, take a Xanax. I know seeing something like this might be a shock.

Surprisingly enough, this isn’t the topic of my post. Instead, I want to talk about data ownership here.

Imagine that one of Gary’s stores in Chicago sold a bunch of shoes, then issued an invoice to the customer. They dutifully recorded the order in the Orders.xlsx file with the status “Pending Payment”.

That customer, however, accidentally sent the check to the wrong store. No biggie, right? The clerk at the second store can just go ahead and update the order in the shared system, marking it as “Paid in full”.

As it turns out, this is a problem. And the easiest way to explain why is data ownership. The owner of this particular record is the original store. You might say that this doesn’t matter; after all, the change happened in the same system. But that is almost never actually the case.

In addition to the operational “system” itself, there are other things. The store manager still has a Post-it note to call that customer and ask about the missing payment. The invoice that was generated needs to be closed, etc. Just updating it in the system isn’t going to cause all of that to happen.

The proper way to handle this is to call the owner of the data (the original store) and let them know that the check arrived at the wrong store. At this point, the data owner can decide how to handle that new information, apply whatever workflows need to be done, etc.
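
In system terms, the equivalent is to send the information to the owning service and let it run its own workflow, rather than reaching into its data and mutating it. A minimal sketch, with all names hypothetical:

    // Hypothetical message bus abstraction, just for illustration.
    public interface IMessageBus
    {
        void Send(string destination, object message);
    }

    // The receiving store reports the fact; it does not mark the order as paid.
    public class CheckReceivedAtWrongStore
    {
        public string OrderId { get; set; }
        public string ReceivingStore { get; set; }
        public decimal Amount { get; set; }
    }

    public class StoreClerk
    {
        private readonly IMessageBus _bus;

        public StoreClerk(IMessageBus bus) => _bus = bus;

        public void OnMisdirectedCheck(string owningStore, string orderId, decimal amount)
        {
            // Notify the data owner and let it decide: close the invoice,
            // clear the Post-it note, run whatever workflow applies.
            _bus.Send(owningStore, new CheckReceivedAtWrongStore
            {
                OrderId = orderId,
                ReceivingStore = "stores/springfield",
                Amount = amount
            });
        }
    }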

I intentionally used what looks like a toy example, because it is easy to get bogged down in the details. But in any distributed system, there are local processes that happen which can be quite important. If you go ahead and update their information behind their back, you are guaranteed to break something. And I haven’t even begun to talk about the chance of conflicts… of course.

How to really fail a coding interview

time to read 1 min | 154 words

Our current interview question is from this post. We use that between the phone interview and the actual interview to get a feel for a candidate’s abilities. You can learn a surprising amount from even a small amount of code.

Note that one of the primary goals of such a question isn’t to tell you “You should really hire this candidate” but to tell you “You really shouldn’t”. To clarify, this is a “do it on your own, and you’ve got the whole internet at your disposal” kind of question. Typically we give a week or so to answer it.

Sometimes we get a very clear signal, like in the case of this code:


But I think the crowning glory was this code:

I picked two of the worst offenders, but there were more. Some things I can sort of let slide, and some things I’ll just say no to.

DotNetRocks show on RavenDB with Kamran Ayub

time to read 1 min | 107 words

Kamran Ayub did a great DotNetRocks show about RavenDB 4.0. Kamran is also doing the RavenDB 4.0 course on Pluralsight, so he knows his stuff.

I’ve got to say, it is… strange to listen to a podcast about RavenDB. I found myself nodding along quite often, and the outside perspective is pretty awesome.

Kamran also tested the same application on RavenDB 3.5 and RavenDB 4.0, seeing a 20x performance improvement. The best quote from the show, as far as I’m concerned:

So fast you aren’t sure it actually worked.

Kamran also has a follow-up post with some numbers and more details here.

Listen to the show here.

RavenDB online bootcamp is now updated to 4.0

time to read 1 min | 136 words

In addition to the book and the documentation, we are also working on making it easier to get started with RavenDB. The RavenDB Bootcamp is a self-directed course meant to give you an easy way to start using RavenDB.

This is a guided tour, walking you through the fundamentals of getting RavenDB up and running, how to put data in and query it, and how you can use indexing and MapReduce. These are short lessons, providing practical experience and guidance on how to start using RavenDB.

You can also register to get a lesson a day.

This is now updated to RavenDB 4.0, smoothing the learning curve and making it even simpler to get started.

Performance optimization starts at the business process level

time to read 3 min | 447 words

I had an interesting discussion today about optimization strategies. This is a fascinating topic, and quite rewarding as well. Mostly because it is so easy to see your progress. You have a number, and if it goes in the right direction, you feel awesome.

Part of the discussion was how the use of a certain technical solution was able to speed up a business process significantly. What really interested me was that I felt that there was a lot of performance still left on the table because of the limited nature of the discussion.

It is easier if we do this with a concrete example. Imagine that we have a business process such as underwriting a loan. You can see what that looks like below:

[Image: the loan underwriting process, with checks run one after another]

This process is set up so that there is a series of checks the loan must go through before approval. The lender wants to speed up the process as much as possible. In the case we discussed, the optimizations were mostly about the speed at which we could move the loan application from one stage to the next. The idea is to keep all parts of the system as busy as possible and maximize throughput. The problem is that there is only so much we can do with a serial process like this.

From the point of view of the people working on the system, it is obvious that you need to run the checks in this order. There is no point in doing anything else. If there is not enough collateral, why should we run the legal status check, for example?

Well, what if we changed things around?

[Image: the same underwriting process, with all checks run concurrently]

In this mode, we run all the checks concurrently. If most of our applications are valid, this means that we can significantly speed up loan approval. Even if a significant number of people are going to be denied, the question now becomes whether it is worth the trouble (and expense) to run the additional checks.
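
Just as a sketch, assuming each check is an async operation (all method names here are made up), the difference between the two processes might look like this:

    using System.Linq;
    using System.Threading.Tasks;

    public class LoanApplication { /* application details */ }

    public class Underwriting
    {
        // Hypothetical checks; each returns true if the application passes.
        Task<bool> CheckCollateralAsync(LoanApplication app) => Task.FromResult(true);
        Task<bool> CheckLegalStatusAsync(LoanApplication app) => Task.FromResult(true);
        Task<bool> CheckCreditHistoryAsync(LoanApplication app) => Task.FromResult(true);

        // Serial: latencies add up, but a failed check means the
        // later (and more expensive) checks are never paid for.
        public async Task<bool> ApproveSeriallyAsync(LoanApplication app)
        {
            return await CheckCollateralAsync(app)
                && await CheckLegalStatusAsync(app)
                && await CheckCreditHistoryAsync(app);
        }

        // Concurrent: total latency is just the slowest check, but we pay
        // for every check even when one of them would have failed early.
        public async Task<bool> ApproveConcurrentlyAsync(LoanApplication app)
        {
            var results = await Task.WhenAll(
                CheckCollateralAsync(app),
                CheckLegalStatusAsync(app),
                CheckCreditHistoryAsync(app));
            return results.All(passed => passed);
        }
    }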

At this point, it is a business decision, because we are mucking about with the business process itself. Don’t get too attached to this example; I chose it because it is simple and the difference in the business processes is obvious.

The point is that not thinking at this level completely blocks you from a very powerful optimization. There is only so much you can do within the box, but if you can get a different box…

RavenDB 4.1 Features: Counting my counters

time to read 3 min | 501 words

[Image: a software release document, with counters for downloads and ratings]

Documents are awesome; they allow you to model your data in a very natural way. At the same time, there are certain things that just don’t fit into the document model.

Consider the simple case of counting. This seems like it would be very obvious, right? As simple as 1+1. However, you also need to consider concurrency and distribution. Look at the image above. What you can see there is a document describing a software release. In addition to tracking the features that are going into the release, we also want to count various statistics about it. In this example, you can see how many times the release was downloaded, how many times it was rated, etc.

I’ll admit that the stars rating is a bit cheesy, but it looks good and actually tests that we have good Unicode support.

Except for a slightly nicer way to show numbers on the screen, what does this feature give you? It means that RavenDB now natively understands how to count things. This means that you can increment (or decrement) a value without modifying the whole document. It also means that RavenDB will be able to automatically handle concurrency on the counters, even when running in a distributed system. This makes the feature suitable for cases where you:

  • want to increment a value
  • don’t care about (and usually explicitly want) concurrent updates
  • may need to handle a very large number of operations

The case of the download counter or the rating votes is a classic example. Two separate clients may increment either of these values at the same time that a third user is modifying the parent document. All of that is handled by RavenDB: the data is updated, distributed across the cluster, and the final counter values are tallied.

Counters cannot cause conflicts and the only operation that you are allowed to do to them is to increment / decrement the counter value. This is a cumulative operation, which means that we can easily handle concurrency at the local node or cluster level by merging the values.

Other operations (deleting a counter, deleting the parent document) are of course non cumulative, but are much rarer and don’t typically need any sort of cooperative concurrency.

Counters are not standalone values but are strongly associated with their owning document. Much like the attachments feature, this means that you have a structured way to add additional data types to your documents. Use counters to, well… count. Use attachments to store binary data, etc. You are going to see a lot more of this in the future, since there are a few things in the pipeline that we are already planning to add.

You can use counters as a single operation (incrementing a value) or in a batch (incrementing multiple values, or even modifying counters and documents together). In all cases, the operation is transactional and will ensure full ACIDity.
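
To make that concrete, here is a sketch of what this could look like from the C# client, following the session API pattern (the feature hasn’t shipped yet, so treat the exact names as an assumption; store is an already opened DocumentStore):

    using (var session = store.OpenSession())
    {
        // Increment counters without loading or rewriting the whole document.
        var counters = session.CountersFor("releases/4.1");
        counters.Increment("Downloads");     // +1
        counters.Increment("Stars", 5);      // arbitrary delta

        // Counter changes and document changes can share a single batch.
        session.SaveChanges();               // one transactional commit
    }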

RavenDB 4.1 Features: JavaScript Indexes

time to read 3 min | 600 words

Note: This feature is an experimental one. It will be included in 4.1, but it will be behind an experimental feature flag. It is possible that this will change before full inclusion in the product.

RavenDB now supports multiple operating systems, and we have spent a lot of effort to bring the RavenDB client APIs to more platforms. C#, JVM and Python are already done; Go, Node.JS and Ruby are in various beta stages. One of the things this brought up was our indexing structure. Right now, if you want to define a custom index in RavenDB, you use C# Linq syntax to do so. When RavenDB was primarily focused on .NET, that was a perfectly fine decision. However, as we push for more platforms, we want to avoid forcing users to learn the C# syntax when they create indexes.

Without further ado, here is a JavaScript index in RavenDB 4.1:
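
A minimal example of the idea, deployed through the C# client (the index name and fields are just for illustration, and store is an already opened DocumentStore):

    using Raven.Client.Documents.Indexes;
    using Raven.Client.Documents.Operations.Indexes;

    var definition = new IndexDefinition
    {
        Name = "Employees/ByFullName",
        Maps =
        {
            // The map function is plain JavaScript rather than C# Linq.
            @"map('Employees', employee => ({
                  FullName: employee.FirstName + ' ' + employee.LastName
              }))"
        }
    };

    store.Maintenance.Send(new PutIndexesOperation(definition));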

As you can see, this is a pretty simple translation between the two. It does make a certain set of operations easier, since the JavaScript option is a lot more imperative. Consider the case of this more complex index:

You can see here the interplay of a few features. First, instead of just selecting a value to index, we can use a full-fledged function. That means that you can run complex computations during indexing more easily. Features such as loading related documents are there, and you can see how we use reduce to aggregate information as part of the indexing function.

JavaScript’s dynamic nature gives us a lot of flexibility. If you want to index fields dynamically, just do so, as you can see here:
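
Along the lines of this sketch, which uses the createField helper to emit one index field per property; treat the exact option names as an assumption, since the syntax may change before release:

    var definition = new IndexDefinition
    {
        Name = "Products/ByAttributes",
        Maps =
        {
            @"map('Products', p => {
                  // Emit an index field for each property of the Attributes object.
                  return {
                      _: Object.keys(p.Attributes || {}).map(key =>
                          createField(key, p.Attributes[key],
                              { indexing: 'Search', storage: false, termVector: null }))
                  };
              })"
        }
    };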

MapReduce indexes work along the same lines. Here is a good example:
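
For instance, a count-per-category index might be written like this (collection and field names are illustrative):

    var definition = new IndexDefinition
    {
        Name = "Products/ByCategory",
        Maps =
        {
            @"map('Products', p => ({ Category: p.Category, Count: 1 }))"
        },
        Reduce = @"groupBy(x => x.Category)
                       .aggregate(g => ({
                           Category: g.key,
                           Count: g.values.reduce((sum, val) => sum + val.Count, 0)
                       }))"
    };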

The indexing syntax is the only thing that changed. The rest is all the same. All the capabilities and features that you are used to are still there.

JavaScript is used extensively in RavenDB, not surprisingly. That is how you patch documents, do projections and manage subscriptions. It is also a very natural language for handling JSON documents. I think it is pretty fair to assume that anyone who uses RavenDB will have at least a passing familiarity with JavaScript, so that makes it easier to get how indexing works.

There is also the security aspect. JavaScript is much easier to control and handle in an embedded fashion. The C# indexes allow users to write their own code that RavenDB will run. That code can, in theory, do anything. This is why index creation is an admin-level operation. With JavaScript indexes, we can allow users to run their computations without worrying that they will do something they shouldn’t. Hence, the access level required for creating JavaScript indexes is much lower.

Using JavaScript for indexing does have some performance implications. The C# code is faster, generally, but not much faster. The indexing function isn’t where we usually spend a lot of time when indexing, so adding a bit of additional work there (interpreting JavaScript) doesn’t hurt us too badly. We are able to get to speeds of over 80,000 documents / second using JavaScript indexes, which should be sufficient. The C# indexes aren’t going anywhere, of course. They are still there and can provide additional flexibility / power as needed.

Another feature that might be very useful is the ability to attach additional sources to an index. For example, you may want to compute a sum using lodash. You can add the lodash.js file as an additional source to the index, and that would expose the library to the indexing functions.
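
A sketch of how that might look, assuming the index definition exposes an AdditionalSources map (the name is taken from the feature description, so treat it as an assumption):

    using System.Collections.Generic;
    using System.IO;
    using Raven.Client.Documents.Indexes;

    var definition = new IndexDefinition
    {
        Name = "Orders/Totals",
        Maps =
        {
            // _.sumBy comes from the lodash source attached below.
            @"map('Orders', o => ({
                  Total: _.sumBy(o.Lines, l => l.Quantity * l.PricePerUnit)
              }))"
        },
        AdditionalSources = new Dictionary<string, string>
        {
            ["lodash.js"] = File.ReadAllText("lodash.js")
        }
    };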

The project is free, but I’ll charge you for reporting bugs

time to read 4 min | 607 words

The impetus for this post is this tweet:

I cannot express enough how much I object to this statement. I absolutely understand the reasoning, by the way. Drive-by bug reports can be frustrating. Users of open source projects can have unreasonable expectations. Any open source project maintainer can tell you about posts saying “Your code SUCKS” or people calling your phone at odd hours with hard-to-understand accents (and I’m not really one who can complain here) and demanding that you fix stuff. This sucks, absolutely. And in a popular project, you might run into a lot of “user error” bug reports, or even outright “fix my code” issues.

Nevertheless, I think that this approach focuses on one side of the issue: how much burden it puts on the maintainers of the project. What it misses is that there is really valuable information contained in a bug report. It might be something that the software is not capable of doing, a wrong usage (no, you cannot use this class concurrently) or a real bug. Regardless, a bug report is valuable in and of itself. Someone put in the time to actually use your software, identified a problem (for their use case) and reported it.

That doesn’t mean that you are in any way (if you are an OSS project) obliged to fix this issue. In fact, I believe that even if you just threw the code on GitHub because you didn’t have anything else to do with it, bugs are still valuable.

Bug reports involve effort, and if you have a live OSS project, you want to respect that and answer those bugs. At a minimum, they tell you what users are doing with your software. That might not motivate you to fix those issues, but it is good to know anyway.

Sometimes you will get a good bug, either about an “obvious” missing piece of functionality that you can add or a “wow, we have that?” issue that is critical to fix. Bugs that never impacted you are also interesting. It might be a race condition that you were lucky to never hit, a silent miscalculation that was never noticed, data corruption that will hit you in the future or even a security issue. Regardless, it is interesting.

All of the above doesn’t mean that you have to do something about any of these. In fact, even if you don’t ever intend to go back to this project, bugs are very useful. Not for you, but for other people. If someone comes to your project and sees the posted bugs, they can figure out what not to do. They can learn about possible workarounds and confirm that this is a bug / limitation and they aren’t going crazy.

As for how to actually handle such bugs in an OSS project: if you aren’t interested in fixing a bug, it is perfectly fine not to. A free OSS project can absolutely have policies on what are acceptable bugs, and closing “fix my code” issues is a good policy in general.

For real issues in the code that you aren’t interested in fixing, it is okay to say: “Send me a pull request for this”.

For nasty replies, I found that: “I’ll be happy to refund your money” usually puts things in perspective.

Inside RavenDB 4.0: The book is done

time to read 1 min | 166 words

The Inside RavenDB 4.0 book is done. That means that all of the content is there and it covers every aspect of RavenDB.

There is still quite a bit to be done (editing and re-reads, mostly), but the hardest part (for me) is done. I got it all out of my head and into a format where others can look at it.

You can read the draft release here.

The book covers:

  1. Welcome to RavenDB
  2. Setting up RavenDB
  3. Document modeling
  4. Client API usage
  5. Batch processing with subscriptions
  6. Distributed RavenDB Clusters
  7. Scaling RavenDB
  8. Sharing data and ETL processes
  9. Querying
  10. Indexing in RavenDB
  11. Map Reduce and aggregations
  12. Managing and understanding indexes
  13. Securing your RavenDB cluster
  14. Encrypting your data
  15. Production deployments
  16. Monitoring and troubleshooting
  17. Backup and restore
  18. Operational recipes

A total of 18 chapters and 570 pages so far.

I’m still missing an index, an intro and a bunch of other stuff, but these are now more technical in nature. No need for the creative juices to pump to get them done.

Feedback is welcome, I would really appreciate it. You can read it here.

RavenDB 4.1 Features: SQL Migration Wizard

time to read 2 min | 234 words

One of the new features coming up in 4.1 is the SQL Migration Wizard. Its purpose is very simple: to get you started faster and with less work. In many cases, when you start using RavenDB for the first time, you’ll need to first put some data in to play with. We have the sample data, which is great to start with, but you’ll want to use your own data and work with that. This is what the SQL Migration Wizard is for.

You start it by pointing it at your existing SQL database, like so:

[Image: pointing the SQL Migration Wizard at an existing SQL database]

The wizard will analyze your schema and suggest a document model based on it. You can see what this looks like here:

[Image: the document model the wizard suggests based on the schema]

In this case, you can see that we are taking a linked table (employee_privileges) and turning it into an embedded collection. You also have additional options, and you’ll be able to customize it all.
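
As a hypothetical sketch of the outcome, the employees table and its employee_privileges link table collapse into a single document model:

    using System.Collections.Generic;

    // Relational: employees plus an employee_privileges link table.
    // Document: one Employee document that owns its privileges outright.
    public class Employee
    {
        public string Id { get; set; }                  // e.g. "employees/1-A"
        public string Name { get; set; }
        public List<string> Privileges { get; set; }    // was employee_privileges
    }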

The point of the migration wizard is not so much to actually do the real production migration but to make it easier for you to start playing around with RavenDB with your own data. This way, the first step of “what do I want to use it for” is much easier.
