Ayende @ Rahien

Hi!
My name is Oren Eini
Founder of Hibernating Rhinos LTD and RavenDB.
You can reach me by email or phone:

ayende@ayende.com

+972 52-548-6969

, @ Q j

Posts: 6,630 | Comments: 48,356

filter by tags archive

Daisy chaining data flow in RavenDB

time to read 4 min | 685 words

I have talked before about RavenDB’s MapReduce indexes and their ability to output results to a collection as well as RavenDb’s ETL processes and how we can use them to push some data to another database (a RavenDB database or a relational one).

Bringing these two features together can be surprisingly useful when you start talking about global distributed processing. A concrete example might make this easier to understand.

Imagine a shoe store (we’ll go with Gary’s Shoes) that needs to track sales across a large number of locations. Because sales must be processed regardless of the connection status, each store hosts a RavenDB server to record its sales. Here is the geographic distribution of the stores:

img06

To properly manage this chain of stores, we need to be able to look at data across all stores. One way of doing this is to set up external replication from each store location to a central server. This way, all the data is aggregated into a single location. In most cases, this would be the natural thing to do. In fact, you would probably want two-way replication of most of the data so you could figure out if a given store has a specific shoe in stock by just looking at the local copy of its inventory. But for the purpose of this discussion, we’ll assume that there are enough shoe sales that we don’t actually want to have all the sales replicated.

We just want some aggregated data. But we want this data aggregated across all stores, not just at one individual store. Here’s how we can handle this: we’ll define an index that would aggregate the sales across the dimensions that we care about (model, date, demographic, etc.). This index can answer the kind of queries we want, but it is defined on the database for each store so it can only provide information about local sales, not what happens across all the stores. Let’s fix that. We’ll change the index to have an output collection. This will cause it to write all its output as documents to a dedicated collection.

Why does this matter? These documents will be written to solely by the index, but given that they are documents, they obey all the usual rules and can be acted upon like any other document. In particular, this means that we can apply an ETL process to them. Here is what this ETL script would look like.

img07

The script sends the aggregated sales (the collection generated by the MapReduce index) to a central server. Note that we also added some static fields that will be helpful on the remote server so as to be able to tell which store each aggregated sale came from. At the central server, you can work with these aggregated sales documents to each store’s details, or you can aggregate them again to see the state across the entire chain.

The nice things about this approach are the combination of features and their end result. At the local level, you have independent servers that can work seamlessly with an unreliable network. They also give store managers a good overview of their local states and what is going on inside their own stores.

At the same time, across the entire chain, we have ETL processes that will update the central server with details about sales statuses on an ongoing basis. If there is a network failure, there will be no interruption in service (except that the sales details for a particular store will obviously not be up to date). When the network issue is resolved, the central server will accept all the missing data and update its reports.

The entire process relies entirely on features that already exist in RavenDB and are easily accessible. The end result is a distributed, highly reliable and fault tolerant MapReduce process that gives you aggregated view of sales across the entire chain with very little cost.

RavenDB 4.1 FeaturesHighlighting

time to read 2 min | 238 words

This s actually an old feature, that didn’t make the cut to enter 4.0. This is now back, and it is roaring. This is the kind of feature that is useful if you are utilizing RavenDB’s search capabilities. Let us assume that you want to search for something, but instead of querying for “give me all the active users” you want to actually… search. For example, you want to search for all employees with a BA in their bio. However, you don’t want to just get the matches, you want to show the user why this was matches.

That is the problem that highlighting is meant to solve. Consider the following query:

image

Which returns the following results:

image

Why did we get this particular employees?  Let’s find out:

image

Now we are asking the server to highlight for us the reason for the match. You can see this in the studio directly, in the Highlight tab:

image

Using this approach, you can enrich the search result and provide nicer experience for your users.

Inside RavenDB 4.0Book update

time to read 1 min | 66 words

Just to let you know, the book is pretty much edited, that means that you won’t have to suffer through my horrible sentence structure.

You can read this here.

What remains to be done now is for me to go over the book again, verify that there aren’t any issues, and we are done.

In other words, we are now “Done, Done” in the “Done, Done, Done” scale.

DotNetRocks show on RavenDB with Kamran Ayub

time to read 1 min | 107 words

Kamran Ayub did a great DotNetRocks show about RavenDB 4.0. Kamran is also being the RavenDB 4.0 course on PluralSight, so he knows his stuff.

I got to say, it is… strange to listen to a podcast about RavenDB. I found myself nodding along quite often and the outside perspective is pretty awesome.

Kamran also tested the same application on RavenDB 3.5 and RavenDB 4.0, seeing 20x performance improvement. Best quote from the show as far as I’m concerned:

So fast you aren’t sure it actually worked.

Kamran also have a follow up post with some numbers and more details here.

Listen to the show here.

RavenDB online bootcamp is now updated to 4.0

time to read 1 min | 136 words

imageIn addition to the book and the documentation, we are also working on making it more accessible to get started with RavenDB. The RavenDB Bootcamp is a self directed course meant to give you an easy way to start using RavenDB.

This is a guided tour, walking you through the fundamentals of getting RavneDB up and running, how to put data in and query it, how you can use indexing and MapReduce. These are short lessons, providing practical experience and guidance on how to start using RavenDB.

You can also register to get a lesson a day.

This is now updated to RavenDB 4.0, smoothing the learning curve and making it even simpler to get started.

RavenDB 4.1 FeaturesCounting my counters

time to read 3 min | 501 words

imageDocuments are awesome, they allow you to model your data in a very natural way. At the same time, there are certain things that just don’t fit into the document model.

Consider the simple case of counting. This seems like it would be very obvious, right? As simple as 1+1. However, you need to also consider concurrency and distribution. Look at the image on the right. What you can see there is a document describing a software release. In addition to tracking the features that are going into the release, we also want to count various statistics about the release. In this example, you can see how many times a release was downloaded, how many times it was rated, etc.

I’ll admit that the stars rating is a bit cheesy, but it looks good and actually test that we have good Unicode support Smile.

Except for a slightly nicer way to show numbers on the screen, what does this feature gives you? It means that RavenDB now natively understand how to count things. This means that you can increment (or decrement) a value without modifying the whole document. It also means that RavenDB will be able to automatically handle concurrency on the counters, even when running in a distributed system. This make this feature suitable for cases where you:

  • want to increment a value
  • don’t care (and usually explicitly desire) concurrency
  • may need to handle very large number of operations

The case of the download counter or the rating votes is a classic example. Two separate clients may increment either of these values at the same time a third user is modifying the parent document. All of that is handled by RavenDB, the data is updated, distributed across the cluster and the final counter values are tallied.

Counters cannot cause conflicts and the only operation that you are allowed to do to them is to increment / decrement the counter value. This is a cumulative operation, which means that we can easily handle concurrency at the local node or cluster level by merging the values.

Other operations (deleting a counter, deleting the parent document) are of course non cumulative, but are much rarer and don’t typically need any sort of cooperative concurrency.

Counters are not standalone values but are strongly associated with their owning document. Much like the attachments feature, this means that you have a structured way to add additional data types to you documents. Use counters to, well… count. Use attachments to store binary data, etc. You are going to see a lot more of this in the future, since there are a few things in the pipeline that we are already planning to add.

You can use counters as a single operation (incrementing a value) or in a batch (incrementing multiple values, or even modifying counters and documents together). In all cases, the operation is transactional and will ensure full ACIDity.

RavenDB 4.1 featuresJavaScript Indexes

time to read 3 min | 600 words

Note: This feature is an experimental one. It will be included in 4.1, but it will be behind an experimental feature flag. It is possible that this will change before full inclusion in the product.

RavenDB now supports multiple operating systems and we spend a lot of effort to bring RavenDB client APIs to more platforms. C#, JVM and Python are already done, Go, Node.JS and Ruby are in various beta stages. One of the things that this brought up was our indexing structure. Right now, if you want to define a custom index in RavenDB, you use C# Linq syntax to do so. When RavenDB was primarily focused on .NET, that was a perfectly fine decision. However, as we are pushing for more platforms, we wanted to avoid forcing users to learn the C# syntax when they create indexes.

With no further ado, here is a JavaScript index in RavenDB 4.1:

As you can see, this is pretty simple translation between the two. It does make certain set of operations easier, since the JavaScript option is a lot more imperative. Consider the case of this more complex index:

You can see here the interplay of a few features. First, instead of just selecting a value to index, we can use a full fledged function. That means that you can run your complex computation during index more easily. Features such as loading related documents are there, and you can see how we use reduce to aggregate information as part of the indexing function.

JavaScript’s dynamic nature gives us a a lot of flexibility. If you want to index fields dynamically, just do so, as you can see here:

MapReduce indexes work along the same concept. Here is a good example:

The indexing syntax is the only thing that changed. The rest is all the same. All the capabilities and features that you are used to are still there.

JavaScript is used extensively in RavenDB, not surprisingly. That is how you patch documents, do projections and manage subscription. It is also a very natural language to handle JSON documents. I think that it is a pretty fair to assume that anyone who uses RavenDB will have at least a passing familiarity with JavaScript, so that make it easier to get how indexing work.

There is also the security aspect. JavaScript is much easier to control and handle in an embedded fashion. The C# indexes are allowing users to write their own code that RavenDB will run. That code can, in theory, do anything. This is why index creation is an admin level operation. With JavaScript indexes, we can allow users to run their computation without worrying that they will do something that they shouldn’t. Hence, the access level required for creating JavaScript indexes is much lower.

Using JavaScript for indexing does have some performance implications. The C# code is faster, generally, but not much faster. The indexing function isn’t where we usually spend a lot of time when indexing, so adding a bit of additional work there (interpreting JavaScript) doesn’t hurt us too badly. We are able to get to speeds of over 80,000 documents / second using JavaScript indexes, which should be sufficient. The C# indexes aren’t going anywhere, of course. They are still there and can provide additional flexibility / power as needed.

Another feature that might be very useful is the ability to attach additional sources to an index. For example, you may really like a sum using lodash. You can add the lodash.js file as an additional file to an index, and that would expose the library to the indexing functions.

Inside RavenDB 4.0The book is done

time to read 1 min | 166 words

The Inside RavenDB 4.0 book is done. That means that all of the content is there and it covers every aspect of RavenDB.

imageThere is still quite a bit to be done (editing and re-reads, mostly), but the the hardest part (for me) is done. I got it all out of my head and into a format where others can look at this.

You can read the draft release here.

The book cover:

  1. Welcome to RavenDB
  2. Setting up RavenDB
  3. Document modeling
  4. Client API usage
  5. Batch processing with subscriptions
  6. Distributed RavenDB Clusters
  7. Scaling RavenDB
  8. Sharing data and ETL processes
  9. Querying
  10. Indexing in RavenDB
  11. Map Reduce and aggregations
  12. Managing and understanding indexes
  13. Securing your RavenDB cluster
  14. Encrypting your data
  15. Production deployments
  16. Monitoring and troubleshooting
  17. Backup and restore
  18. Operational recipes

A total of 18 chapters and 570 pages so far.

I’m still missing an index, intro and a bunch of stuff, but these are now more technical in nature. No need for creative juices to pump to get them working.

Feedback is welcome, I would really appreciate it. You can read it here.

RavenDB 4.1 featuresSQL Migration Wizard

time to read 2 min | 234 words

One of the new features coming up in 4.1 is the SQL Migration Wizard. It’s purpose is very simple, to get you started faster and with less work. In many cases, when you start using RavenDB for the first time, you’ll need to first put some data in to play with. We have the sample data which is great to start with, but you’ll want to use you own data and work with that. This is what the SQL Migration Wizard is for.

You start it by pointing it at your existing SQL database, like so:

image

The wizard will analyze your schema and suggest a document model based on that. You can see how this looks like here:

image

In this case, you can see that we are taking a linked table (employee_privileges) and turning that into an embedded collection.  You also have additional options and you’ll be able to customize it all.

The point of the migration wizard is not so much to actually do the real production migration but to make it easier for you to start playing around with RavenDB with your own data. This way, the first step of “what do I want to use it for” is much easier.

Roadmap for RavenDB 4.1

time to read 2 min | 227 words

imageWe are gearing up to start work on the next release of RavenDB, following the 4.0 release. I thought this would be a great time to talk about what are the kind of things that we want to do there. This is going to be a minor point release, so we aren’t going to shake things up.

The current plan is to release 4.1 about 6 months after the 4.0 release, in the July 2018 timeframe.

Instead, we are planning to focus on the following areas:

  • Performance
    • Moving to .NET Core 2.1 for the performance advantages this gives us.
    • Start to take advantage of the new features such as Span<T>, etc in .NET Core 2.1.
    • Updating the JavaScript engine for better query / patch performance.
  • Wild card certificates via Let’s Encrypt, which can simplify cluster management when RavenDB generates the certificates.
  • Restoring highlighting support

We are also going to introduce the notion of experimental features. That is, features that are ready from our perspective but still need some time out in the sun getting experience in production. For 4.1, we have the following features slated for experimental inclusion:

  • JavaScript indexes
  • Distributed counters
  • SQL Migration wizard

I have a dedicated post to talk about each of these topics, because I cannot do them justice in just a few words.

FUTURE POSTS

  1. Distributed compare-exchange operations with RavenDB - 10 hours from now
  2. I WILL have order: How Lucene sorts query results - about one day from now
  3. I WILL have order: How Noise sorts query results - 4 days from now

There are posts all the way to May 28, 2018

RECENT SERIES

  1. RavenDB 4.1 features (4):
    22 May 2018 - Highlighting
  2. Inside RavenDB 4.0 (10):
    22 May 2018 - Book update
  3. RavenDB Security Report (5):
    06 Apr 2018 - Collision in Certificate Serial Numbers
  4. Challenge (52):
    03 Apr 2018 - The invisible concurrency bug–Answer
View all series

Syndication

Main feed Feed Stats
Comments feed   Comments Feed Stats