Ayende @ Rahien

Refunds available at head office

Interview questions: Large text viewer

I mentioned that we are looking for more people to work for us. This time, we are looking for WPF people for working on the profiler, as well as another hush-hush project that we’ll hopefully be able to reveal in December.

Because we are now looking for mostly UI people, that gives us a different set of challenges to deal with. How do I get a good candidate when my own WPF knowledge is limited to “Um.. dependency properties, man, that the bomb, man, yeah!”.

Add to that the fact that by the time people get an interview here, we want to be sure that they can code, and that presents an interesting problem. So we came up with questions like this one. Another question we have is the large text viewer.

We need a tool that can work with text files (logs) of huge size (1 GB – 10 GB). We want to be able to open and search through such a file.

Nitpicker corner: I usually use this tool for that. The purpose of the question isn’t actually to get such a tool, it is to see what kind of code the candidate writes.

We are looking for someone with a lot of skill in the UI side of things, so the large text file stuff is somewhat of a red herring, except that we want to see what they can do beyond just slapping a few text boxes around.
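For the search half of the problem, here is a minimal sketch of one reasonable approach (an assumption, not the answer expected from candidates): stream the file instead of loading it, so memory use stays flat no matter how big the log is. The names `LargeTextSearch` and `Find` are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper, not part of any real tool mentioned in the post.
public static class LargeTextSearch
{
    // Streams the file line by line, so memory use is bounded by the
    // longest line rather than by the size of the file.
    public static IEnumerable<(long LineNumber, string Text)> Find(string path, string term)
    {
        long lineNumber = 0;
        // File.ReadLines is lazy: it reads from disk as the enumeration advances.
        foreach (var line in File.ReadLines(path))
        {
            lineNumber++;
            if (line.IndexOf(term, StringComparison.OrdinalIgnoreCase) >= 0)
                yield return (lineNumber, line);
        }
    }
}
```

Because the results are produced lazily, a UI could run this on a background thread and show hits as they arrive. Displaying the file itself would still need some form of virtualized rendering, which is where the UI skill comes in.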

RavenDB 3.0–Release Plans

Well, I just committed the last feature for RavenDB 3.0, and all the tests are passing. What we are doing now is just working through the studio and running verification tests. Tomorrow or Sunday we are going to go live on our own systems with RavenDB 3.0. And shortly after that we’ll do a release candidate, followed by the actual release.

Interview question: That annoying 3rd party service

We are still in hiring mode, and today we have completed a new question for the candidates. The task itself is pretty simple: create a form for logging in or creating a new user.

Seems simple enough, I think. But there is a small catch. We’ll provide you the “backend” for this task, which you have to work with. The interface looks like this:

public interface IUserService
{
    void CreateNewUser(User u);

    User GetUser(string userId);
}

public class User
{
    public string Name { get; set; }
    public string Email { get; set; }
    public byte[] Sha1HashedPassword { get; set; }
}

The catch here is that we provide that as a dll that includes the implementation for this, and as this is supposed to represent a 3rd party service, we made it behave like one. Sometimes the service will take a long time to run. Sometimes it will throw an error (ThisIsTuesdayException), sometimes it will take a long time to run and throw an error, etc.
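To make "slow and sometimes failing" concrete from the caller's side, here is a minimal sketch of a wrapper that applies a timeout and a retry budget to any blocking call. It is only an illustration under my own assumptions: `FlakyCall` and `WithTimeoutAndRetry` are hypothetical names, and nothing here is part of the provided dll.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical helper for calling an unreliable, blocking service.
public static class FlakyCall
{
    // Runs the blocking call off the calling thread, gives up waiting after
    // `timeout`, and tries again until `maxAttempts` attempts have been made.
    public static async Task<T> WithTimeoutAndRetry<T>(Func<T> call, TimeSpan timeout, int maxAttempts)
    {
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++)
        {
            var work = Task.Run(call);
            var finished = await Task.WhenAny(work, Task.Delay(timeout));
            if (finished != work)
            {
                // We stop waiting, but the abandoned call keeps running in the background.
                last = new TimeoutException($"Attempt {attempt} timed out");
                continue;
            }
            try
            {
                return await work; // rethrows if the service call failed
            }
            catch (Exception e)
            {
                last = e;
            }
        }
        throw new InvalidOperationException("Service call failed after all attempts", last);
    }
}
```

A form could then call something like `await FlakyCall.WithTimeoutAndRetry(() => service.GetUser(id), TimeSpan.FromSeconds(3), 3)` without freezing the UI thread. Retrying `CreateNewUser` blindly is a different matter, since a call that timed out may still have succeeded.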

Now, the question is, what is it that I’m looking to learn from the candidate’s code?

Hanselminutes Podcast: Inside RavenDB with Michael Yarichuk

Go and listen to this podcast:

Scott chats with Michael Yarichuk about RavenDB. Michael works with Ayende and the RavenDB team on their document database. Scott is trying to learn about document databases and Michael helps him along the path, exploring those computer science concepts that make document databases unique.

Azure DocumentDB

On Friday, Microsoft came up with Azure DocumentDB. You might say that I have a small interest in such things, so I headed over there to see what I can learn about this project.

Aside from being somewhat annoyed with the name, this seems to be a very different animal from RavenDB, and something that was built to serve a different niche. One of the things that we put first with RavenDB is ease of use, development and deployment for business applications. The ADB design appears to be built around a different goal, around very big datasets.

Nitpicker corner: Yes, I know this is a preview, and I know that there are going to be changes. And I repeat, I have no knowledge about this project beyond the documentation and several hours of playing with it.

That said, I do have a fair bit of experience in this area. So I feel that I can speak with confidence about the topic.

ADB is supposed to be a highly scalable system that stores documents. So far, so good, I can certainly understand that need. But it has made drastically different design choices, some of which I feel very strongly about. I'll try to explore the issues that I have with it, and contrast that with what you can do with RavenDB.

This post has two parts. The first talks about conceptual issues. The second talks about the currently published limits, and their implications for general use of ADB.

TLDR;

  • No sorting option, or a good paging story
  • SQL Injection, without any other alternative
  • Hard to deploy and to keep current with your codebase
  • Poor development story & no testing story
  • Poor client API
  • Lots of table scans
  • Limited queries and few optimization options
  • Single document transactions (from the client)
  • No cross collection transactions at all
  • Very small document sizes allowed

Also see the “What is this for?” section below.

For a document database platform that doesn’t have any of those issues, and runs in Azure, see RavenHQ on Azure.

Transactions – ADB says that it has transactions, and for a very limited meaning of the word, I believe it means it. Transactions in ADB mean that only a single document can be saved with a guarantee that it will either be saved or not. That is great, in the sense that at least you won’t have data corruption, but that isn’t really something that means much. Even MongoDB can satisfy that bar.

Oh, sure, you can get actual transactions if you write JS code that runs as a “stored procedure” inside ADB. This means that you can send data to the server and have your JS Stored Procedure make multiple operations in a single transaction. Which is just slightly better (although see my comments on those stored procedures later), but that is still limited to operations inside the same collection.

A trivial example for transactions in a document database would be to add a new comment, and update the comment count. You cannot do that in ADB. Not in a single transaction. I don’t know about you, but most of the interesting use cases happen when you are working with multiple document types. Sure, you can put all your documents inside the same collection, but have fun trying to work with that in the long term.

In contrast, RavenDB fully supports actual transactions that can span multiple documents (even in different collections, which I never would have believed would count as an accomplishment). RavenDB can even support DTC and transactions that span multiple interactions with the server. You know, the kind of transactions you actually want to use. For more, see the documentation on RavenDB transactions.

Management – it honestly feels like someone missed all the points that made people want to ditch SQL in the first place. ADB has the concepts of triggers, user defined functions (more on that travesty later, when I discuss queries) and stored procedures. You can define them in JS, and you create something that looks like this:

image

Let me count the ways that this is going to cause problems for you.

  • Business logic in the database, because we haven’t learned anything about that in the past.
  • Code that you cannot run or test independently. Just debugging something like that is going to be hard.
  • No way to actually manage deployment or make sure that this code is in sync with the rest of your codebase.
  • Didn’t we already learn that triggers are a source for a lot of pain? Are they really necessary for you to do things?

Yes, you have a DB that is schema less, but those kinds of things are actually important. They define what you can do with the database, and not having a good way to move those around, and most importantly, not having a way to tie them to the source control system you are using is going to be a giant PITA.

Sorry, that isn’t actually something that you can delay doing for later. You need a good development story, and as I see it, the entire development story all around here is just going to be hard. You would have to manually schlep things around between development and production. And that isn’t just about the SP or UDFs. There are a lot of settings that you’re going to have to deal with. For example, the configuration per collection, which you’ll want to make sure is the same (otherwise you get some very subtle and hard to understand bugs).

For that matter, there doesn’t seem to be a development story. You are probably expected to just run another ADB instance on Azure and use that. This means a lot of latency in development, and that also means that you can’t have separate databases per developer, which is a standard practice. This means having to write a lot of code just to manage those things, and you are right back again at the good old days of “who didn’t update the schema script” and failed deployments.

In contrast, RavenDB makes it very easy to handle your indexes & transformers in your code and deploy them as a single step. That also means that they are versioned in the same place as your code, so you don’t have to worry about moving between dev & prod. We spent a lot of time thinking and working around this specific area, because this is a common pain point in relational databases, and we weren’t willing to accept that being the case in our database. For more information, please see the documentation about index management in RavenDB.

Indexing – there are several things here that bother me. By default, everything is indexed, and in the same transaction. This is a great decision, for a demo system. But in a real world system, the overhead of indexing everything is prohibitive, especially in a high write system. So ADB allows you to specify the paths that you will include or exclude from indexing, as well as whether indexing should be within the same transaction or lazy.

The problem with that is that those are per collection settings and there doesn’t appear to be any way to modify them after the fact. So you start running your system in production, realize that the cost of indexing is high, so you need to change the indexing strategy for a collection. The only way to do that is to create a new collection, with a new indexing strategy, move all the data there, then delete the old one. For even more fun, consider the case where you have production and development environments. In production, you have a different indexing strategy than in development (where the ‘index everything’ mode is still on). That means that when you push things to production, your system will fail silently, because you won’t be indexing the fields you thought were indexed.

This needs reiteration: the way this currently works, you start running with the default indexing option, which is expensive. As long as you don’t have any performance requirements (for example, during development), that is just fine. But when you actually have a lot of data there, or have a lot of writes, that is when you’ll figure out that those things need to be changed. At that point, you are pretty much screwed, because you need to pull all the data out, create a new collection with the new indexing options, and write it all back. That is a horrible experience, especially because you’ll likely need to do that under pressure with users breathing down your neck and management complaining about the performance.

For that matter, indexing in general scares me. Again, I don’t actually have any knowledge of the internal operations, but there is a lot of stuff there that just doesn’t make sense. It looks like the precision of the indexes used is up to 3 characters (by default) per value. I’m guessing that this is done to reduce the amount of space used by the indexing, at least that is what the docs say. The problem is that when you do that, you do a lookup by the first 3 characters, then you have to do a flat search over all the other values. That is going to cause problems.
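To illustrate the concern, here is a toy model of an index that keeps only the first few characters of each value. It is purely a guess at the behavior described above and says nothing about how ADB is actually implemented; `TruncatedIndex` and its members are hypothetical names.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical model of an index that stores only the first `precision`
// characters of each value. A guess for illustration, not ADB's code.
public class TruncatedIndex
{
    private readonly int precision;
    private readonly Dictionary<string, List<string>> buckets =
        new Dictionary<string, List<string>>(StringComparer.Ordinal);

    public TruncatedIndex(int precision)
    {
        this.precision = precision;
    }

    private string KeyFor(string value)
    {
        return value.Length <= precision ? value : value.Substring(0, precision);
    }

    public void Add(string value)
    {
        var key = KeyFor(value);
        if (!buckets.TryGetValue(key, out var bucket))
        {
            bucket = new List<string>();
            buckets[key] = bucket;
        }
        bucket.Add(value);
    }

    // Finds the bucket by prefix, then compares full values one by one.
    // `scanned` reports how many full-value comparisons were needed.
    public bool Contains(string value, out int scanned)
    {
        scanned = 0;
        if (!buckets.TryGetValue(KeyFor(value), out var bucket))
            return false;
        foreach (var candidate in bucket)
        {
            scanned++;
            if (string.Equals(candidate, value, StringComparison.Ordinal))
                return true;
        }
        return false;
    }
}
```

With a precision of 3, "Guitar", "Guillotine" and "Guinness" all land in the "Gui" bucket, so finding any one of them means walking that bucket. The more values share a prefix, the closer the lookup gets to a flat scan.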

It is also indicated that you cannot do any range searches except on numeric values. Which has interesting implications if you want to do searches on something like a date range, or time spans, an incredibly common operation.

In contrast, RavenDB indexes are always using the full value, so you are getting an O(logN) search behavior, and not a fallback to O(N) behavior. Range searches are possible on any value, numeric, date time, time span, string, etc. For more information, see the RavenDB documentation about searching with RavenDB.

Queries – Speaking of problems. Let me talk for a moment about ADB SQL. It looks nice on the surface, and it certainly would be familiar to most people. It also contains a lot of hidden traps.

For example, the docs talk about being able to do joins, but you are only actually able to do “joins” into the sub documents, not into other collections, or even documents in the same collection. A query such as:

 

SELECT c.Name as CustomerName, o.Total, o.Date
FROM Orders o
JOIN Customers c ON c.Id = o.CustomerId

Can’t be executed on ADB. So the whole notion of “joins” is actually limited to what you can do in a single document and the sub documents it contains. That makes it very limited.

The options for filtering (where clause) are also interesting. Mostly because of the wide range they allow. It is very easy to create queries that cannot be computed using indexes. That means that your query is now running table scans. Lots & lots of table scans. Sure, you don’t have tables, but O(N) is still O(N), and when N is large, as it is apparently the expected case here, you are going to be pretty much dead in the water.

Another thing that I can’t wrap my head around is the queries shown. There is no way to pass parameters to the query. None. This appears to be the case because 30+ years of working with SQL has shown that there is absolutely no issue with putting users’ input directly into the query. And since complex queries require you to use the raw ADB SQL, you are pretty much guaranteed to have SQL Injection attacks.

Sure, you might not get caught by Little Bobby Tables (you can’t modify data via SQL), but you are still exposed and can leak important data. This query works just fine, and will return all products:

SELECT * FROM Products p WHERE p.Name = "testing" OR 1 = 1 -- "

I’ll assume that you understand how I got there. This is a brand new database engine, but ADB is bringing very old issues back into the future. Not only that, we don’t have any way around that. I guess you are going to have to write your own parameter scrubbing code, and make sure to use it everywhere.
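For anyone who wants it spelled out, here is a minimal sketch of how the query above comes about, plus one very blunt form of scrubbing. The class and method names are hypothetical, and the check is my own assumption about what a safe subset might look like. It is not based on ADB's actual escaping rules, which I have not verified.

```csharp
using System;

// Hypothetical query helpers, for illustration only.
public static class ProductQueries
{
    // The pattern to avoid: user input pasted straight into the query text.
    public static string BuildUnsafe(string productName)
    {
        return "SELECT * FROM Products p WHERE p.Name = \"" + productName + "\"";
    }

    // A deliberately strict scrubber: refuse anything that could close the
    // string literal, instead of trying to escape it.
    public static string BuildChecked(string productName)
    {
        if (productName.IndexOf('"') >= 0 || productName.IndexOf('\\') >= 0)
            throw new ArgumentException("Quotes and backslashes are not allowed", nameof(productName));
        return BuildUnsafe(productName);
    }
}
```

Passing `testing" OR 1 = 1 -- ` to `BuildUnsafe` produces exactly the query shown above, while `BuildChecked` rejects it. Rejecting input is crude (legitimate product names can contain quotes), which is part of why hand-rolled scrubbing is a poor substitute for real query parameters.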

In general, queries are limited. Severely limited, actually. Take a look at the following query:

SELECT * FROM Products p 
WHERE p.Type = "Beer"
AND p.Maker = "Guinness"
AND p.Discontinued = false 
AND p.Price > 10 AND p.Price < 100

You can’t run it in ADB. It is too complex to run. Note that this is about as trivial a query as you can get, in pretty much any reasonable business system.

Continuing on with the problems for business apps theme, there doesn’t appear to be any good way to do things like paging. When you issue a query, you can specify the number of items to take and you can repeat a query by passing a continuation. But that doesn’t really help when you need to actually page with the user. So you show the data to the user, then want to go to the next page… you have to pass the continuation token all the way around, and hope that it will remain valid for the duration. For that matter, the current client API does paging at the server level, but it will fetch all the results for a query, even if it takes hours to do so.

There is no way to actually get the total number of items that match the query. So you can’t show the user something like: “You have 250 new emails”, nor can you show them “Page 1 … 50”.

Another troubling omission is the total lack of anything that would allow you to actually query your documents in a particular order. If I want to get the latest orders in descending order (or in fact, in any well defined order), I am out of luck. There is no way of doing that. This is a huge deal, because this isn’t just something that you can try papering over. This is a core functionality that you need in pretty much any application. And it is just not there. There is some indication that this is being worked on, but I’m surprised that this isn’t here already. Distributed sorting is a non trivial problem, of course, so I’ll reserve further judgment until I see what they have.

ADB’s queries are highly limited, so I expect a workaround for that is going to be to push functionality into UDFs. Note that UDFs don’t have access to any context, so they can’t load additional documents. What they can do is utterly destroy any chance you’ll ever have for optimizing a query. The moment that a UDF is involved, you don’t really have a choice about how to execute a query, you pretty much have to go to a table scan. Maybe filtering some stuff based on the other filters in the query, but in many cases, that means that you’ll have to run your UDF over millions of records. Because UDFs are permitted to perform non pure operations (like the current time), you can’t even cache their values, or do anything smart around that. You’ll always have to execute the UDF, regardless of the amount of data you have to go through. I don’t expect that to perform very well.

In contrast, RavenDB was explicitly designed to give you both flexibility and performance in queries. There are no table scans in RavenDB, and complex queries are expected, encouraged and are handled properly. Queries across multiple documents (and in other collections) are possible, and quite easy to do. Common operations, like paging or sorting are part of the core functionality, and are both very easy to use and come with no additional costs. Complex things like full text search, spatial queries, facets and many more are right there for you to use. For more information, see the RavenDB documentation about querying in RavenDB, spatial searches in RavenDB and how RavenDB actually indexes the data to allow complex operations.

Data types – ADB data types are the ones defined in the JSON spec. In particular, it doesn’t have native support for date times. The ADB documentation suggests that you do custom serialization to handle that, which renders things like asking “Give me all the orders for this customer for 2014” very hard, leaving aside the issues of querying for orders in a particular month, which is not possible either, since you can only do range searches on numeric data. Dates, in particular, are a very complex topic, and not actually handling this in the database is going to open you up for a lot of issues down the road. And dates are kind of an important type to have.

In contrast, RavenDB handles complex (including user defined) types in a well defined manner. And has full support for dates, operations on dates, etc. It seems silly to mention, to be fair, because it seems so basic to have that. For more information, you can read the documentation about dates in RavenDB.

Aggregation – this one is simple, you don’t have any. That means that you cannot get the total number of unread emails, or the total sum of orders per customer, or the maximum order for this month. This whole functionality just isn’t there.

In contrast, RavenDB has explicit support for counting the number of results for a query as well as map/reduce indexes. Those give you a powerful aggregation framework, which executes the work in the background. When you query, you get the pre-computed results very quickly, without having to do any work at query time. For more information, you can read about Map/Reduce in RavenDB and dynamic aggregation queries.

Set operations – another easy one, it is just not there. You can do some operations in a stored procedure, but you have 5 seconds to run, and that is it. If you need to do something like: Split FullName to FirstName and LastName, get ready to write a lot of code, and wait for a long time for this to complete. For that matter, something as simple as “delete all inactive users” is very hard to do as well.

In contrast, RavenDB has explicit support for set based updates and deletes. You can specify a query that matches a set of results that would either be deleted or patched using a JS script. For more operations, read the documentation about Set Based Operations.

Client API – this is still a preview, so that is somewhat unfair, but the client API is very primitive. Basically, it is a very thin wrapper around the REST API, and it does a poor job at that. The REST API supports paging, but the C# client API does not, for example. There is no concept of unit of work, change tracking, client side behavior or anything at all that would actually make this work nicely. There is also an interesting design decision to go async for all operations except queries.

With queries, you actually issue an async REST call, but you are going to be waiting on that query synchronously. This is probably because of the IQueryable interface and its assumption that the query is sync. But that is a very bad thing to do in terms of mixing sync and async work. It is easy to get into problems such as deadlocks, self lock and just plain weirdness.

In contrast, RavenDB has carefully designed client APIs (for .NET, JVM, etc), which fully expose the power of RavenDB. They have been designed to be intuitive, easy to use and guide you into the pit of success. RavenDB also has separate sync and async APIs, including fully async queries. For more information, read the documentation about the client API.

Self links – when issuing any operation whatsoever to the database, you have to use something called the object link, or self link. For example, in my test database, the Products collection link is: dbs/frETAA==/colls/frETANSmEAA=/

You have to use links like that whenever you make any operation whatsoever. For fun, those are going to be unique per database, so creating a Products collection in another database would result in a different collection link. That means that I can’t just store them in configuration. So you’ll probably have to read them from the database every time you need to use them (maybe with some caching?). This is just silly. This is making it very hard to look at what is going on and see what the system is doing (for example, by watching what is going on in Fiddler).

In contrast, RavenDB applies human readable names whenever possible. For more information, see the documentation about the efforts to make sure that everything in RavenDB is human readable and easily debuggable. One such place is the id generation strategy.

Development and testing – in this day and age, people are connected to the internet through most of their day to day life. That doesn’t mean that they are always connected, or that you can actually rely on the network, or that the latency is acceptable. There is no current development story for ADB. No way to run your own database and develop while you are offline (on the train or at 30,000 feet in the air). That means that every call to ADB has to go over the internet, and that means, in turn, that there is no local development story at all. It means a lot more waiting from the point of view of the developer (also see next point), it means that there is just no testing story.

If you want to run code to test your ADB usage, you have to set up (and pay for) a whole new ADB instance, make sure that it is set up exactly the same way as your production instance, and run it against that. It means that tests not only have to go outside your process, but across the internet to a remote server. This pretty much kills the notion of fast tests.

In contrast, RavenDB has an excellent development and testing story. You don’t pay for development or CI instances, and you can run tests against RavenDB using an in memory mode embedded inside your process. This has been heavily optimized to allow fast running tests. We are developers, and we care about making other developers’ lives easy. It shows. For more information, see the documentation about unit testing RavenDB.

Joins are for your code – because ADB doesn’t actually support joins beyond the document scope, or any other option like that, it means that if you want to do something trivial, like show a customer a list of their orders, you are actually going to have to do the join in your own code, not in the database. In fact, let us take a silly scenario, let us say that we want to show a list of new employees as well as their managers, so we can have a chat with them about how they are settling in.

If we were using SQL, we would be using something like this:

SELECT emp.Id as EmpId, emp.Name as EmpName, mngr.Id as ManagerId, mngr.Name as ManagerName
FROM Employees emp
JOIN Managers mngr ON emp.ManagerId = mngr.Id
WHERE emp.JoinedAt > '2014-06-01'

That is pretty easy, right? How do you do something like that in ADB? Well, you start with the first query:

SELECT emp.Id as EmpId, emp.Name as EmpName, emp.ManagerId as ManagerId
FROM Employees emp
WHERE emp.JoinedAt > '2014-06-01'

And then, for each of the returned managers’ ids, we have to issue a separate query (ADB doesn’t have support for IN). This pattern of usage is called SELECT N+1, and it is a very well known anti pattern, even leaving aside the fact that you have to manually do the join in your own code, with all that this implies. This sort of operation will effectively kill the performance of any application, because you are very chatty with the database.

In contrast, RavenDB contains several ways to load related items. From including a related document to projecting it via a transformer, you can very easily and efficiently get all the data you need, in a single query to RavenDB. In fact, RavenDB applies a Safe By Default approach and limits the number of times you can call the server (configurable) to prevent just this case. We’ll error if you go over the budget of remote calls you are allowed to make. This gives you an early chance to catch performance problems. For more information, see the documentation about includes, transformers and the Safe By Default approach practiced by RavenDB.
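The shape of that client side join, and why it is chatty, can be sketched with an in memory stand-in for the database that simply counts round trips. Everything here (`CountingStore`, `ClientSideJoin`, the method names) is hypothetical and exists only for illustration. It is not the ADB client API.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public class Employee
{
    public string Name { get; set; }
    public string ManagerId { get; set; }
}

// In memory stand-in for a remote database. Every query method counts
// as one round trip to the server.
public class CountingStore
{
    private readonly List<Employee> employees;
    private readonly Dictionary<string, string> managerNames;

    public int RoundTrips { get; private set; }

    public CountingStore(List<Employee> employees, Dictionary<string, string> managerNames)
    {
        this.employees = employees;
        this.managerNames = managerNames;
    }

    public List<Employee> QueryNewEmployees()
    {
        RoundTrips++;
        return employees.ToList();
    }

    public string QueryManagerName(string managerId)
    {
        RoundTrips++;
        return managerNames[managerId];
    }
}

public static class ClientSideJoin
{
    public static List<(string Employee, string Manager)> Run(CountingStore store)
    {
        var rows = new List<(string Employee, string Manager)>();
        foreach (var emp in store.QueryNewEmployees())                    // 1 query
            rows.Add((emp.Name, store.QueryManagerName(emp.ManagerId))); // + 1 per employee
        return rows;
    }
}
```

For N new employees this makes N + 1 round trips, and each one pays the full network latency. Caching manager lookups inside the loop would reduce the count when employees share a manager, but the join still lives in application code.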

Limits - reading the limits for ADB makes for some head scratching. Yes, I know that we are talking about the preview mode only. I’m aware that you can ask to increase those limits. Nevertheless, those limits likely reflect real trade offs made in the system. So increasing those limits for a particular use case means that you’ll have to pay the price for that elsewhere.

For example, let us take this blog post. It is over 22KB in size. But I can’t store this blog post in ADB. That is because documents are limited to 16KB in size. This is utterly ridiculous. I just checked a few of our databases, and a common size for documents is 4 – 8 KB, this is true. But larger documents appear all the time. Even if you exclude blog posts as BLOBs of text, we have order documents with multiple order lines that are easily past that size. Among our users, we see every document size possible, from hundreds of KB to several MB.

I reached out to Codealike, one of our customers, who were also featured in one of Azure’s case studies, to hear from them what their situation was. Out of 1.6 million documents in one of their databases, about 90% are in the 500 KB range.

I’m assuming that a large part of this limitation is the fact that by default, everything is indexed. You can’t index everything and have large documents and have reasonable performance. So this limit was introduced. I’m also assuming that there are other issues here (to be able to fit into pages? low level technical stuff?). Regardless, this is just an utterly ridiculously low limit. The problem is that even raising this limit by x5 or x10 is still not enough. And I’m assuming that they didn’t choose this limit out of thin air, that there is a technical reason for it.

Another issue is the number of stored procedures and UDFs that you have available. You get 5 of each, and that is it. So you don’t get to actually express anything complex there. You also get to use only a single UDF per query, and to use a maximum of 3 AND / OR clauses in a query. I’m assuming that the reasoning here is that the more clauses you have, the more complex it is to run the query, especially in a distributed environment. So they put a hard limit on that.

Those limits together, along with not supporting sorting, basically render ADB an interesting curiosity, but not a real contender for a generally applicable database.

What is this for?

After going over the documentation, there is one thing that I couldn’t find. What is the primary use case for ADB? 

It looks more like a solution in search of a problem than the other way around. It appears that this is used by several MS systems to store 100s of TB of data, and process millions of queries. Sheer data size isn’t really interesting, we have customers that have multiple TB of data. And millions of queries per day isn’t really something to brag about (10 million queries per day translates to about 115 queries per second, or about 20 – 30 queries per second per node).

What interests me is what sort of data do you put there? The small size limitation makes it pretty much unsuitable for storing actual complex documents. You have to always be aware of the size you are storing, and that puts a serious crimp in how you can work with this. The limited queries and the inability to sort also lead me to believe that this is a very purpose built tool.

OneNote’s server side is apparently one such use case, but from the look of things, I would expect that this is the other way around. That ADB is actually the backend of OneNote that Microsoft has decided to make public (like Dynamo in Amazon’s case).
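The per second figure is just the daily total divided by the number of seconds in a day:

```latex
\frac{10{,}000{,}000\ \text{queries/day}}{24 \times 60 \times 60\ \text{s/day}}
  = \frac{10{,}000{,}000}{86{,}400}
  \approx 115.7\ \text{queries/s}
```

Taken at face value, 20 – 30 queries per second per node would correspond to a cluster of roughly four to six nodes. That node count is inferred from the two figures rather than stated anywhere in the published numbers.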

Some of those limitations are probably going to be alleviated by using additional Microsoft tools. So the new Search Server (presumably that one has complex searching & sorting available) would allow you to do some proper queries, and HDInsight might be used for doing aggregation.

You aren’t going to be able to get the “show me the count of unread emails for this user” from Hadoop, not when the data is constantly changing. And using a secondary search server will introduce high latencies for the indexing. That is leaving aside the additional operational complexity of having to manage multiple systems (and the communication between them) just to get things done.

Here are a few things that would be hard to build in ADB, as it stands today:

  • This blog – the posts are too big, can’t sort posts by date, can’t do “complex” queries (tag & date & published & not deleted)
  • Logging – I actually thought that this would be a great use case, but we actually need to show logs by date, as well as be able to search using multiple fields (more than 3) or do contains queries.
  • Orders system – important orders with a lot of line items will be rejected because of the size limitation.

In fact, I don’t know what would work there. What kind of data are you putting there? This isn’t good for bulk data work, because the ingest rate is really small (~500 writes / second? The debug version of RavenDB does 2,500 writes per second on my dev laptop without even using the bulk insert API) and there isn’t a good way to work with large amounts of data at once. It isn’t good for business applications, for the reasons outlined above.

I guess that if you patched this and the search server and Hadoop together you would get something that might be able to serve. But I think that the complexity involved is going to be very high, and I just don’t see where this would be a great solution.

In short, what is the problem that this is trying to solve? What application would be a perfect fit for this?

With RavenDB, the answer is simple: it is a general purpose database focused on OLTP applications. Until you have an answer, you can use RavenDB on Azure today using RavenHQ on Azure.

Inside RavenDB 3.0

I’ve been working for a while on seeing where we can improve RavenDB, and one of the things that I wanted to address is having an authoritative source to teach people about RavenDB. Not just documentation, those are very good for reference, but not so good to give you a guided tour and actually impart knowledge. That is what I wanted to do, to take the last five years or so of working on and with RavenDB and distill them.

The result is about a hundred pages or so (and likely to grow to three or four hundred pages). In other words, I slipped up and started churning out a book :-).

You can download the alpha version using the following link (which will be valid for the next two weeks). I want to emphasis that this is absolutely unedited, and there are likely to be error for zpelling in grammar*. Those will be fixed down the line, currently I’m mostly focused on getting the content out. Here is also the temporary cover.

Cover

Comments are welcome. And yes, this will be an actual book, in the end, which you can hold in your hand and hopefully beat someone over the head if they need to smarten up.

* The errors in that particular sentence were intentional.

Comments (13)

NHibernate 4.0 released!

You can get it here!

This release has a lot of bug fixes, fits better into the .NET 4.x ecosystem, and is good stuff in general.

Probably my favorite one is the SQL Server 2012 paging, which is so much easier to understand.

Happy hibernating…

Comments (6)

It is a good day, celebrate it

It is a good day, so I decided to share some joy.

For today only, we offer a 21% discount on all our products. You can get it using coupon code: bzeiglglay

This applies to RavenDB (Standard, Enterprise and ISV), RavenDB Professional & Production Support, NHibernate Profiler and Entity Framework Profiler.

This offer will be valid for 24 hours only.

Comments (4)

Question 6 is a trap, a very useful one

In my interview questions, I give candidates a list of 6 questions. They can either solve 3 questions from 1 to 5, or they can solve question 6.

Stop for a moment and ponder that. What do you assume about the relative complexity of those questions?

 

 

 

 

 

 

Questions 1 – 5 should take anywhere between 10 – 15 minutes and an hour & a half, max. Question 6 took me about 8 hours to do, although that included some blogging time about it.

Question 6 requires that you create an index for a 15 TB CSV file, and allow efficient searching on it.

Questions 1 – 5 are basically gate keeper questions. If you answer them correctly, we have a high opinion of you and you get an interview. Answering question 6 correctly pretty much says that we are past the “do we want you?” stage and into the “how do we get you?” stage.

But people don’t answer question 6 correctly. In fact, by this time, if you answer question 6, you have pretty much ruled yourself out, because you are going to show that you don’t understand something pretty fundamental.

Here are a couple of examples from the current crop of candidates. Remember, we are talking about a 15 TB CSV file here, containing about 400 billion records.

Candidate 1’s code looked something like this:

foreach(var line in File.ReadLines("big_data.csv"))
{
       var fields = line.Split(',');
       var email = fields[2];
       File.WriteAllText(Md5(email), line);
}

On the plus side, this doesn’t load the entire data set to memory, and you can sort of do quick searches. Of course, this also generates 400 billion files, and takes more space than the original file. Also, on NTFS, you have a max of about 4 billion files per volume, and other file systems have similar limitations.

So that isn’t going to work, but at least he had some idea about what is going on.

Candidate 2’s code, however, was:

// prepare
string[] allData = File.ReadAllLines("big_data.csv");
var users = new List<User>();
foreach(var line in allData)
{
     users.Add(User.Parse(line));
}
using (var output = File.Create("big_data.xml"))
{
     new XmlSerializer(typeof(List<User>)).Serialize(output, users);
}

// search by:

using (var input = File.OpenRead("big_data.xml"))
{
     var users = (List<User>)new XmlSerializer(typeof(List<User>)).Deserialize(input);
     users.AsParallel().Where(x => x.Email == "the@email.wtf");
}

So take the 15 TB file, load it all to memory (fail #1), convert all 400 billion records to entity instances (fail #2), write it back as xml (fail #3,#4,#5). Read the entire (greater than) 15 TB XML file to memory (fail #6), try to do a parallel brute force search on a dataset of 400 billion records (fail #7 – #400,000,000,000).

So, dear candidates 1 & 2, thank you for making it easy to say, thanks, but no thanks.
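
For contrast, here is a rough sketch of a direction that at least respects the size of the data: stream the file once, keep one small fixed size entry per record (a hash of the email plus the byte offset of the line), sort the entries, and answer a search with a binary search followed by a single seek into the original file. This is an illustration under assumptions, not the reference answer: it assumes naive comma splitting with no quoted fields, UTF-8 text, and an index small enough to sort in memory. The names CsvOffsetIndex and IndexEntry are made up for the example.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

// One fixed size entry per record: a hash of the key column and where the line starts.
public struct IndexEntry : IComparable<IndexEntry>
{
    public ulong KeyHash;
    public long Offset;

    public int CompareTo(IndexEntry other)
    {
        int byHash = KeyHash.CompareTo(other.KeyHash);
        return byHash != 0 ? byHash : Offset.CompareTo(other.Offset);
    }
}

public static class CsvOffsetIndex
{
    // FNV-1a (64 bit); any stable hash function would do here.
    public static ulong Hash(string key)
    {
        unchecked
        {
            ulong hash = 14695981039346656037UL;
            foreach (byte b in Encoding.UTF8.GetBytes(key))
            {
                hash ^= b;
                hash *= 1099511628211UL;
            }
            return hash;
        }
    }

    // Streams the file once and records the byte offset of every line.
    public static List<IndexEntry> Build(string path, int keyColumn)
    {
        var entries = new List<IndexEntry>();
        var line = new List<byte>();
        long position = 0, lineStart = 0;
        using (var stream = new BufferedStream(File.OpenRead(path)))
        {
            int b;
            while ((b = stream.ReadByte()) != -1)
            {
                position++;
                if (b != '\n') { line.Add((byte)b); continue; }
                AddEntry(entries, line, lineStart, keyColumn);
                line.Clear();
                lineStart = position;
            }
        }
        AddEntry(entries, line, lineStart, keyColumn); // the last line may lack a newline
        entries.Sort();
        return entries;
    }

    private static void AddEntry(List<IndexEntry> entries, List<byte> line, long offset, int keyColumn)
    {
        string[] fields = Encoding.UTF8.GetString(line.ToArray()).TrimEnd('\r').Split(',');
        if (fields.Length > keyColumn)
            entries.Add(new IndexEntry { KeyHash = Hash(fields[keyColumn]), Offset = offset });
    }

    // Binary search on the hash, then read only the candidate lines from disk.
    public static List<string> Find(string path, List<IndexEntry> index, int keyColumn, string key)
    {
        ulong hash = Hash(key);
        int lo = 0, hi = index.Count;
        while (lo < hi)
        {
            int mid = lo + (hi - lo) / 2;
            if (index[mid].KeyHash < hash) lo = mid + 1; else hi = mid;
        }

        var matches = new List<string>();
        using (var stream = File.OpenRead(path))
        {
            for (int i = lo; i < index.Count && index[i].KeyHash == hash; i++)
            {
                stream.Seek(index[i].Offset, SeekOrigin.Begin);
                string line = new StreamReader(stream, Encoding.UTF8).ReadLine();
                // Different keys can share a hash, so check the real value.
                if (line != null && line.Split(',')[keyColumn] == key)
                    matches.Add(line);
            }
        }
        return matches;
    }
}
```

At 400 billion records, the 16 byte entries alone come to roughly 6.4 TB, so the in-memory List and Sort above would have to become sorted runs written to disk and merged, with the binary search running against the index file. The shape of the idea stays the same: one pass over the data, a compact sorted index, and a handful of seeks per query.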

Comments (22)

Troubleshooting, when F5 debugging can’t help you

You might have noticed that we have been doing a lot of work on the operational side of things, to make sure that we give you as good a story as possible with regards to the care & feeding of RavenDB. This post isn’t about that. This post is about your applications and systems, and how you are going to react when !@)(*#!@(* happens.

In particular, the question is what do you do when this happens?

This situation can crop up in many disguises. For example, you might be seeing high memory usage in production, or experiencing growing CPU usage over time, or see request times go up, or any of a hundred and one different production issues that make for a hell of a night (somehow, they almost always happen at nighttime).

Here is how people usually think about it.

The first thing to do is to understand what is going on. About the hardest thing to handle in these situations is when we have an issue (high memory, high CPU, etc) and no idea why. Usually all the effort is spent just figuring out what and why. The problem with this process for troubleshooting issues is that it is very easy to jump to conclusions and have an utterly wrong hypothesis. Then you have to go through the rest of the steps to realize it isn’t right.

So the first thing that we need to do is gather information. And this post is primarily about the various ways that you can do that. In RavenDB, we have actually spent a lot of time exposing information to the outside world, so we’ll have an easier time figuring out what is going on. But I’m going to assume that you don’t have that.

The be-all and end-all tool for this kind of error is WinDBG. This is the low level tool that gives you access to pretty much anything you can want. It is also very archaic and not very friendly at all. The good thing about it is that you can load a dump into it. A dump is a capture of the process state at a particular point in time. It gives you the ability to see the entire memory contents and all the threads. It is an essential tool, but also the last one I want to use, because it is pretty hard to use. Dump files can be very big, multiple GB are very common. That is because they contain the full memory dump of the process. There are also mini dumps, which are easier to work with, but don’t contain the memory dump, so you can watch the threads, but not the data.

The .NET Memory Profiler is another great tool for figuring things out. It isn’t usually so good for production analysis, because it uses the Profiler API to figure things out, but it has a wonderful feature of loading dump files (ironically, it can’t handle very large dump files because of memory issues :-)) and gives you a much nicer view of what is going on there.

For high CPU situations, I like to know what is actually going on. And looking at the stack traces is a great way to do that. WinDBG can help here (take a few mini dumps a few seconds apart), but again, that isn’t so nice to use.

Stack Dump is a tool that takes away a lot of the pain of having to deal with that, because it just outputs all the threads’ information. We have used it successfully in the past to figure out what is going on.

For general performance stuff (“requests are slow”), we need to figure out where the slowness actually is. We have had reports that run the gamut from “things are slow, client machine is loaded” to “things are slow, the network QoS settings throttle us”. I like to start by using Fiddler to figure those things out. In particular, the statistics window is very helpful:

image

The obvious things are the bytes sent & bytes received. We have had a few cases where a customer was actually sending 100s of MB in either or both directions, and was surprised it took some time. If those values are fine, you want to look at the actual performance listing. In particular, look at things like TCP/IP connect, time from the client sending the request to the server starting to get it, etc.

If you found the problem is actually at the network layer, you might not be able to immediately handle it. You might need to go a level or two lower, and look at the actual TCP traffic. This is where something like Wireshark comes into play, and it is useful to figure out if you have specific errors at that level (for example, a bad connection that causes a lot of packet loss will impact performance, but things will still work).

Other tools that are very important include Resource Monitor, Process Explorer and Process Monitor. Those give you a lot of information about what your application is actually doing.

Once you have all of that information, you can form a hypothesis and try to test it.

If you own the application in question, the best way to improve your chances of figuring out what is going on is to add logging. Lots & lots of logging. In production, having the logs to support what is going on is crucial. I usually have several levels of logging. For example, what is the traffic in/out of my system. Next there is the actual system operations, especially anything that happens in the background. Finally, there are the debug/trace endpoints that will expose internal state and allow you to tweak various things at runtime.
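
As a minimal sketch of the cheapest kind of “expose internal state” logging, the snippet below captures a one line snapshot of the current process using only standard .NET APIs. The DebugSnapshot name and the output format are assumptions made for the example; where the line ends up (a log file, a debug endpoint) is up to you.

```csharp
using System;
using System.Diagnostics;

public static class DebugSnapshot
{
    // Captures a few cheap, always available numbers about the current process.
    public static string Capture()
    {
        using (var process = Process.GetCurrentProcess())
        {
            return string.Format(
                "{0:o} workingSet={1:N0} managedHeap={2:N0} threads={3} gen0={4} gen1={5} gen2={6}",
                DateTime.UtcNow,
                process.WorkingSet64,        // memory as the OS sees it
                GC.GetTotalMemory(false),    // managed heap only, without forcing a collection
                process.Threads.Count,
                GC.CollectionCount(0),
                GC.CollectionCount(1),
                GC.CollectionCount(2));
        }
    }
}
```

Writing this line to the log every 30 seconds or so costs next to nothing, and when memory or CPU starts climbing at 3 AM, you can see when the trend started and whether it is the managed heap or something else that is growing.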

Having good working knowledge of how to properly utilize the above mentioned tools is very important, and should be considered to be much more useful than learning a new API or a language feature.

There is no WE in a Job Interview

This is a pet peeve of mine. When interviewing candidates, I’m usually asking some variant of “tell me about a feature you developed that you are proud of”. I’m using this question to gauge several metrics. Things like what is the candidate actually proud of, what was he working on, are they actually proud of what they did?

One of the more annoying tendencies is for a candidate to give a reply in the form of “what we did was…”, in particular if s/he goes on to never mention something that s/he specifically did. And no, “led the Xyz team in…” is a really bad example. I’m not hiring your team; if I were, I might actually be interested in that. I’m actually interested in the candidate, personally. And if the candidate won’t tell me what it was that s/he did, I’m going to wonder if they played Solitaire all day.

Comments (23)

A tale of two interviews

We’ve been trying to find more people recently, and that means sifting through people. Once that process is done, we ask them to come to our offices for an interview. We recently had two interviews with people that were diametrically opposed to one another. Just to steal my own thunder, we decided not to go forward with either one of them. Before inviting them to an interview, I have them do a few coding questions at home. Those are things like:

  • Given a big CSV file (that fits in memory), allow speedy queries by name or email. The application will run for long periods of time, and startup time isn’t very important.
  • Given a very large file (multiple TB), detect which 4MB ranges have changed in the file between consecutive runs of your program.
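
To make the second task concrete, here is a minimal sketch of one way to approach it: hash every 4MB range, compare each hash with the one saved by the previous run, and report the ranges that differ. This is an illustration under assumptions, not a model answer: the ChangedRanges name and the side file holding one hash per line are made up for the example, the first run reports every range as changed, and a file that shrinks is not reported.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Security.Cryptography;

public static class ChangedRanges
{
    public const int RangeSize = 4 * 1024 * 1024;

    // Returns the starting offsets of the 4MB ranges that differ from the previous run,
    // then saves the current hashes to stateFile for the next run.
    public static List<long> Detect(string dataFile, string stateFile)
    {
        string[] previous = File.Exists(stateFile) ? File.ReadAllLines(stateFile) : new string[0];
        var current = new List<string>();
        var changed = new List<long>();
        var buffer = new byte[RangeSize];

        using (var sha = SHA256.Create())
        using (var stream = File.OpenRead(dataFile))
        {
            int read;
            while ((read = FillBuffer(stream, buffer)) > 0)
            {
                string hash = Convert.ToBase64String(sha.ComputeHash(buffer, 0, read));
                int range = current.Count;
                if (range >= previous.Length || previous[range] != hash)
                    changed.Add((long)range * RangeSize);
                current.Add(hash);
            }
        }

        File.WriteAllLines(stateFile, current);
        return changed;
    }

    // Stream.Read may return fewer bytes than requested, so loop until
    // the buffer is full or the file ends.
    private static int FillBuffer(Stream stream, byte[] buffer)
    {
        int total = 0;
        while (total < buffer.Length)
        {
            int read = stream.Read(buffer, total, buffer.Length - total);
            if (read == 0) break;
            total += read;
        }
        return total;
    }
}
```

Only one 4MB buffer is ever held in memory, and the state is one short hash per range (about 262,144 of them per TB), which is what makes this workable for a file that will never fit in RAM.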

We’ll call the first one Joe. Joe has a lot of experience, he has been doing software for a long time, and has already had the chance to be a team lead in a couple of previous positions. He sent us some really interesting code. Usually I get a class or three in those answers. In this case, we got something that looked like this:

The main problem I had with his code was just finding where something is actually happening. I dismissed the over architecture as someone trying to impress in an interview, “See all my beautiful ARCHITECTURE!”, and looked at the actual code that does the task at hand, which wasn’t bad.

Joe was full of confidence, he was articulate and well spoken, and appeared to have a real passion for the architecture he sent us. “I’ve learned that it is advisable to put proper architecture first” and “That is now my default setting”. I disagree with those statements, but I can live with that. What bothered me was that we asked a question along the lines of “how would you deal with a high memory situation in an application”. What followed was several minutes of very smooth talk about motivating people, giving them the kind of support they need to do the job, etc. Basically, about the only thing it was missing was a part on “the Good of the People” and I would have considered whether to vote for him. What was glaringly missing from my point of view was anything concrete and actionable.

On the other hand, we have Moe. He is a bit younger, but he already worked with NoSQL databases, which was a major plus. Admittedly, that was as a user of, instead of a developer of, but you can’t have it all. Moe’s code made me sit up and whistle. I setup an interview for the very next day, because looking at the code, there was someone there I wanted to talk to. It was very much to the point, and while it had idiosyncrasies, it showed a lot of promise. Here is the architecture for Moe’s code:

So Moe shows up at the office, and we start the interview process. And right from the get go it is obvious that Moe is one of those people who don’t do too well in stressful situations like interviews. That is part of the reason why we ask candidates to write code at home, because it drastically reduces the level of stress they have to deal with.

So I start talking, telling him about the company and what we do. The idea is that hopefully this gives him time to compose himself. Then I start asking questions, and he gives mostly the right answers, but they are lacking focus. I’m assuming that this is probably nervousness, so I bring up his code and go over that with him. He is much more comfortable talking about that. He had an O(logN) solution at one point, and I had to drive him toward an O(1) solution for the same problem, but he got there fairly quickly.

I then asked him what I considered to be a fairly typical question: “What areas do you have complete mastery of?” This appears to have stumped him, since he took several minutes to give an answer which basically boiled down to “nothing”.

Okay… this guy is nervous, and he is probably underestimating himself, so let us try to focus the question. I asked whether he was any good with HTML5 (not at all), then whether he was good with server side work (has done some work there, but not an expert), and how he would deal with a high memory situation (look at logs, but after that he was stumped). When asked about the actual code he wrote for our test, he said that these were some of the hardest tasks he ever had to deal with.

That summed up to promising, but I’ve a policy of believing people when they tell me bad things about themselves. So this ended up being a negative, which was very frustrating.

The search continues…

Comments (14)

Complex indexing, simplified

RavenDB indexes are Turing complete, which means that you can do whatever you want with them. This is a very powerful feature, but it also comes with a heavy burden. You can get yourself into some serious trouble. Take a look at this index:

image

 

We ran into it during a troubleshooting session with a customer. And it was frankly quite hard to figure out what was going on.

Luckily, I could just throw this into RavenDB 3.0, and look at the indexing options:

image

This turned the above index into this:

image

Which was much clearer, but we could improve it a bit by removing the into clauses, so I ended up with:

image

Now, just from the following, can someone tell me what is the likely issue with this kind of index?

Comments (8)

We’re hiring… come work for us

unnamed

It seems that recently we have been going in rounds. Get more people, train them, and by the time they are trained, we already need more people :-).

I guess that this is a good problem to have. At any rate, we are currently looking for an experienced, well rounded developer.

This job availability is for our offices in Hadera, Israel. If you aren’t from Israel, this isn’t for you.

This job is primarily for work on our Profilers line of products. Here is the laundry list:

  • Awesome .NET skills
  • Experience in UI development using WPF, MVVM style
  • Understanding how computers work and how to make them dance
  • History with concurrency & multi threading (concurrent work history not required)
  • Architecture / design abilities

I would like to see open source history, or projects that you can share (in other words, your own projects, not an employer’s code that you try to give us to look at).

Please contact us at jobs@hibernatingrhinos.com if you are interested.

Digging into MSMQ

I got into a discussion online about MSMQ and its performance. So I decided to test things out.

What I want to do is to check a few things, in particular, how many messages I can push to and from MSMQ in various configurations.

I created a non transactional queue, and then I ran the following code:

var sp = Stopwatch.StartNew();
int count = 0;
while (sp.Elapsed.TotalSeconds < 10)
{
    var message = new Message
    {
        BodyStream = new MemoryStream(data)
    };
    queue.Send(message);
    count++;
}

Console.WriteLine(sp.Elapsed);
Console.WriteLine(count);

This gives me 181,832 messages in 10 seconds, or 18,183 messages per second. I tried doing the same in a multi threaded fashion, with 8 threads writing to MSMQ, and got an insufficient resources error, so we’ll do all of this in single threaded tests.

Next, the exact same code, except for the Send line, which now looks like this:

queue.Send(message, MessageQueueTransactionType.Single);

This gives me 43,967 messages in 10 seconds, or 4,396 messages per second.

Next I added DTC, which gave me a total of 8,700 messages in ten seconds, or 870 messages per second! Yeah, DTC is evil.

Now, how about reading from it? I used the following code for that:

while (true)
{
    try
    {
        Message receive = queue.Receive(TimeSpan.Zero);
        receive.BodyStream.Read(data, 0, data.Length);
    }
    catch (MessageQueueException e)
    {
        Console.WriteLine(e);
        break;
    }
}

Reading from a transactional queue, we get 5,955 messages per second for 100,000 messages. Using a non transactional queue, it can read about 16,000 messages a second.

Note that those are pretty piss poor “benchmarks”, they are intended more to give you a feel for the numbers than anything else.  I’ve mostly used MSMQ within the context of DTC, and it really hit the performance hard.

Comments (8)

On site Architecture & RavenDB consulting availabilities: Malmo & New York City

I’m going to have availability for on site consulting in Malmo, Sweden  (17 Sep) and in New York City, NY (end of Sep – beginning of Oct).

If you want me to come by and discuss what you are doing (architecture, NHibernate or RavenDB), please drop me a line.

I’m especially interested in people who need to do “strange” things with data and data access. We are building a set of tailored database solutions for customers now, and we have seen customers show a 750x improvement in performance when we gave them a database that was designed to fit their exact needs, instead of having to contort their application and their database into a seven dimensional loop just to try to store and read what they needed.

New RavenDB site design poll

As part of the 3.0 release, we are also going to do a full redesign of our website, and we would like to have your opinion on the matter.

Please take a look at the options and vote for your favorite: http://99designs.com/web-design/vote-u95dcs

Note, we’ll be changing the studio look & feel to match the website as well.

RavenDB 3.0 Ops: Live Tracing & Logging for production

You might have noticed a theme here :-) in where we are pushing RavenDB 3.0. This is actually an effect of how we structured our work plans: we did a lot of the new features (Voron, for example) early on, because they require a lot of time to mature. We are now mostly completing the work related to the user interface and exposing operational data.

This feature comes back to the old black box issue. What is the system doing? Usually, you have no way to tell. Now, you could enable debug logging, but that is a very invasive operation, requiring you to update the config file on the server, perhaps to restart the server, and isn’t really something that you can just do. This is especially true when we are talking about production system under load, where adding full logging can be very expensive.

You can now set a dynamic logging listener on a running instance, including a production instance:

image

Which then gives you a live streaming view of the log:

image

Think about it like doing a tail on a log file, except that this allows you to dynamically configure what logs you are going to watch, and it will only log while you are watching. This is perfect for situations such as “what is this thing doing now?”.

Having access to the log file is great, but it usually has too much information. That is why we also added the ability to get a peek into what requests are actually executing now. This is also a production level feature, which will cause RavenDB to output all the requests so you can see them:

image

This can be very helpful in narrowing down “what are the clients asking the server to do”.

Like the production log watch, this is also a feature that is meant for production, so you can subscribe to this information, and you’ll get it. But if there are no subscribers, there is no cost to this.
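
The “no subscribers, no cost” idea can be sketched in a few lines. This is not RavenDB’s actual implementation, just an illustration of the pattern; the LiveLog name is made up for the example. The caller hands over a delegate that builds the message, and that delegate is only invoked when someone is actually watching.

```csharp
using System;
using System.Collections.Generic;

public class LiveLog
{
    private readonly object sync = new object();
    // Replaced as a whole on every change, so writers can read it without taking a lock.
    private volatile Action<string>[] sinks = new Action<string>[0];

    // Start watching; dispose the returned handle to stop.
    public IDisposable Subscribe(Action<string> sink)
    {
        lock (sync)
        {
            var updated = new List<Action<string>>(sinks) { sink };
            sinks = updated.ToArray();
        }
        return new Subscription(this, sink);
    }

    private void Unsubscribe(Action<string> sink)
    {
        lock (sync)
        {
            var updated = new List<Action<string>>(sinks);
            updated.Remove(sink);
            sinks = updated.ToArray();
        }
    }

    // The message is passed as a delegate so that it is only built when someone is watching.
    public void Write(Func<string> buildMessage)
    {
        Action<string>[] current = sinks;
        if (current.Length == 0)
            return; // nobody is watching: the message is never even formatted

        string message = buildMessage();
        foreach (var sink in current)
            sink(message);
    }

    private sealed class Subscription : IDisposable
    {
        private readonly LiveLog owner;
        private readonly Action<string> sink;

        public Subscription(LiveLog owner, Action<string> sink)
        {
            this.owner = owner;
            this.sink = sink;
        }

        public void Dispose()
        {
            owner.Unsubscribe(sink);
        }
    }
}
```

With no watchers, a call such as log.Write(() => "GET " + url + " took " + elapsed) costs one array length check, which is why this kind of tracing can be left in a production code path.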

The HTTP trace feature can be used to watch a single database (which can be very useful on RavenHQ) or all databases (where you’ll need a server admin level of access). To watch the production log, you’ll need to be a server admin.

Comments (4)

RavenDB 3.0 Days in Sweden

We are going to do a European version of the RavenDB Conf in just over a month, coming to both Malmo and Stockholm for a full day event.

You can see the full details here, but the basic idea is that we are going to be talking about RavenDB 3.0, including showing off all the new stuff, then show a real world use case for managing high scalability systems with RavenDB. We’ll go in depth into the codebase, then hear about how to make the best use of transformers and indexes, and end the day with a look forward into what has been slowly cooking in our labs and the grand finale: a full guide on how best to build RavenDB applications in RavenDB 3.0.

We are actually going to arrive a day early, so if you are located in Malmo, and want us to come to do some on site RavenDB consulting or training on Sep 17, contact us (support@ravendb.net) and we’ll set it up.

You can register to the event using the following link.

RavenDB support guarantees

As part of the 3.0 release of RavenDB, we are going to do a remap of our support contracts. We’ll make a formal announcement about it later, but the idea is to offer the following levels:

  • Standard – about $500 a year per server, business day availability, maximum response within 2 business days.
  • Professional – about $2,000 a year per server, business day availability, maximum response within the same business day.
  • Enterprise – about $6,000 a year per server, 24x7, maximum response within two hours.

In addition to that, we’ll continue to have community support via the mailing list. That said, I want to make it clear what kind of support guarantees we are giving on the mailing list:

  • None
  • Whatsoever

Put simply, the community mailing list is just that, a way for the community to discuss RavenDB. We look at it, and we try to help, but there is no one assigned to monitor the mailing list. This is pretty much the team waiting for the code to compile or the current test run to complete and deciding to check the mailing list instead of Twitter or the latest cat video.

Any support on the mailing list is provided on an ad hoc basis, and should absolutely not be something that you rely on. In particular, people with EMERGENCY or PRODUCTION ISSUE aren’t going to get any special treatment. If you need support, and if you run critical systems, you probably do, you need to purchase it. We provide guarantees and follow through for the commercial support packages.

I’m writing this post after an exchange of words on the mailing list, when a user complained that I went offline at 1 AM on a Saturday night and did not continue to provide him free support.

Comments (9)

Guids are evil nasty little creatures that make me cry

You might have noticed that I don’t like Guids all that much. Guids seem like a great solution when you need to generate an id for something. And then reality intervenes, and you have an incomprehensible system problem.

Leaving aside the size of the Guid, or the fact that it is not sequential, two pretty major issues with an identifier, the major problem is that it is pretty much opaque for the users.

This was recently thrown in my face again as part of a question in the RavenDB mailing list. Take a look at the following documents. Do you think that those two documents belong to the same category or not?

image

One of the problems that we discovered was that the user was searching for category 4bf58dd8d48988d1c5941735, and the document’s category was 4bf58dd8d48988d14e941735. And it drove everyone crazy trying to figure out how it could be that this wasn’t working.

Here are those Guids again:

  • 4bf58dd8d48988d1c5941735
  • 4bf58dd8d48988d14e941735

Do you see it? I’m going to be putting some visual space and show you the difference.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

Here they are:

  • 4bf58dd8d48988d1 c5 941735
  • 4bf58dd8d48988d1 4e 941735

And if that isn’t enough for you to despise Guids, feel free to read them to someone else over the phone, or try to find them in a log file. Especially when you have to deal with several of those dastardly things.
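
Since eyeballing two such ids is exactly the problem, a few lines of code can do the comparison instead. This is a small sketch, with the IdDiff name made up for the example: it prints a marker line with a caret under every position where the two ids differ.

```csharp
using System;

public static class IdDiff
{
    // Returns a marker line with '^' under every position where the two ids differ.
    public static string Mark(string a, string b)
    {
        var marker = new char[Math.Max(a.Length, b.Length)];
        for (int i = 0; i < marker.Length; i++)
        {
            bool same = i < a.Length && i < b.Length && a[i] == b[i];
            marker[i] = same ? ' ' : '^';
        }
        return new string(marker).TrimEnd();
    }

    public static void Main()
    {
        string searched = "4bf58dd8d48988d1c5941735";
        string stored = "4bf58dd8d48988d14e941735";
        Console.WriteLine(searched);
        Console.WriteLine(stored);
        Console.WriteLine(Mark(searched, stored));
    }
}
```

For the two ids above, the carets land under the 17th and 18th characters, the “c5” versus “4e” pair that everyone kept missing.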

I have a cloud machine dedicated to generating and disposing Guids, I hope that in a few thousand years, I can kill them all.

Comments (27)

RavenDB 3.0 Status Update

We are nearly there. You can see that we are starting to expose all the cool stuff we have done through the past year.

Feature map is currently frozen, and we expect to have a release candidate within 2 weeks and a stable release 4 weeks after that.  There is one big ticket item that still remains, and the rest is just UI work. We’ll probably do the release candidate with an intentionally ugly UI, then roll the new theme for the UI for the actual release, since we don’t want to hold the schedule just for this release.

Things are very exciting here, and we can’t wait for you to take the fruits of our labor for a spin. I trust you’ll be impressed :-).

Comments (3)

Production analysis and trouble shooting with RavenDB

The annoying thing about software in production is that it is a black box. It just sits there, doing something, and you have very little insight into what. Oh, you can look at the CPU usage and memory consumption, you can try to figure out what is going on from the kind of things that the system will tell you this process is doing. But for the most part, this is a black box. And not even one that is designed to let you figure out what just happened.

With RavenDB, we have made a very conscious effort to avoid being a black box. There are a lot of end points that you can query to figure out exactly what is going on. And you can use different endpoints to figure out different problems.  But in the end, while that was very easy for us to use, those aren’t really meant for end users. They are meant for our support engineers, mostly. 

We got tired of the whole “give me the output of the following endpoints” deal. We wanted a better story, something that would be easier and more convenient all around. So we sat down and thought about this, and came up with the idea of the Debug Info Package.

image

This deceptively simple tool will capture all of the relevant information from RavenDB into a single zip file that you can mail support. It will also give you a lot of details about the internals of RavenDB at the moment this was produced:

  • Recent HTTP requests
  • Recent logs
  • The database configuration
  • What is currently being indexed?
  • What are the current queries?
  • What tasks are being run?
  • All the database metrics
  • Current status of the pre-fetch queue
  • The database live stats

And if that wasn’t enough, we have the following feature as well:

image

 

We get the full stack of the currently running process!

You can see how this looks in full here:

stacks

 

But the idea is that we have cracked open the black box, and it is now so much easier to figure out what is going on!

Comments (9)

RavenDB On Azure

It took a while, but it is here. The most requested feature on the Azure Store is here:

Embedded image permalink

This is currently only available on the East US region. That is going to change, but it will take a bit of time. You can vote on which regions you want RavenHQ on Azure to expand to.

RavenHQ on Azure can be used in one of two ways. You can purchase it via the Azure Marketplace, in which case you have to deal only with a single invoice, and you can manage everything through the Azure site. However, the Azure Marketplace doesn’t currently support prorated and tiered billing, which means that the plans that you purchase in the marketplace have hard limits on data. You could also purchase those same plans directly from RavenHQ and take advantage of usage based billing, which allows you to use more storage than what’s included in the plan at a prorated cost.

RavenHQ is now offering a lower price point for replicated plans, so you don’t have to think twice before jumping into the high availability option.

Comments (3)

What is my query doing?

Recently we had to deal with several customer support requests about slow queries in RavenDB. Now, just to give you some idea about the scope: we consider a query slow if it takes more than 50ms to execute (excluding client side caching).

In this case, we had gotten reports about queries that took multiple seconds to run. That was strange, and we were able to reproduce this locally, at which point we were hit with a “Duh!” moment. In all cases, the underlying issue wasn’t that the query took a long time to execute, it was that the result of the query was very large. Typical documents were in the multi megabyte ranges, and the query returned scores of those. That means that the actual cost of the query was just transporting the data to the client.

Let us imagine that you have this query:

session.Query<User>()
    .Where(x => x.Age >= 21)
    .ToList();

And for some reason it is slower than you would like. The first thing to do would probably be to see what the raw execution times are on the server side:

RavenQueryStatistics queryStats;
session.Query<User>()
    .Customize(x => x.ShowTimings())
    .Statistics(out queryStats)
    .Where(x => x.Age >= 21)
    .ToList();

Now you have the following information:
  • queryStats.DurationMilliseconds – the server side total query execution time
  • queryStats.TimingsInMilliseconds – the server side query execution time, per each distinct operation
    • Lucene search – the time to query the Lucene index
    • Loading documents – the time to load the relevant documents from disk
    • Transforming results – the time to execute the result transformer, if any
  • queryStats.ResultSize – the uncompressed size of the response from the server

This should give you a good indication on the relative costs.

In most cases, the issue was resolved by the customer by specifying a transformer and selecting just the properties they needed for their use case. That transformed (pun intended) a query that returned 50+ MB to one that returned 67 KB.
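
As a sketch of what such a fix can look like, the snippet below defines a transformer that returns only two properties and applies it to the query from above. The Users_Summary and UserSummary names, and the shape of the User class, are assumptions made for the example, and the namespaces are those of the RavenDB 3.0 client as I understand them; it needs a running RavenDB server and the client library to execute.

```csharp
using System.Linq;
using Raven.Client;
using Raven.Client.Indexes;
using Raven.Client.Linq;

public class User
{
    public string Id { get; set; }
    public string Name { get; set; }
    public int Age { get; set; }
    // plus whatever large properties make the document multiple megabytes in size
}

public class UserSummary
{
    public string Name { get; set; }
    public int Age { get; set; }
}

// Runs on the server, so only Name and Age travel over the wire.
public class Users_Summary : AbstractTransformerCreationTask<User>
{
    public Users_Summary()
    {
        TransformResults = users => from user in users
                                    select new { user.Name, user.Age };
    }
}

public static class UserQueries
{
    public static void Run(IDocumentStore store)
    {
        // Register the transformer with the server (typically once, at startup).
        new Users_Summary().Execute(store);

        using (IDocumentSession session = store.OpenSession())
        {
            var adults = session.Query<User>()
                .Where(x => x.Age >= 21)
                .TransformWith<Users_Summary, UserSummary>()
                .ToList();
        }
    }
}
```

The query itself is unchanged; the only difference is that the server projects each document down to the summary before sending it, which is where the drop in response size comes from.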
