Ayende @ Rahien

Oren Eini, aka Ayende Rahien, is the CEO of Hibernating Rhinos LTD, which develops RavenDB, a NoSQL Open Source Document Database.

time to read 6 min | 1093 words

I mentioned that we are currently hiring for a junior dev position and we have been absolutely swamped with candidates. Leaving aside the divorce lawyer who tried to apply to the position and the several accountants (I don’t really get it either), we typically get people with very little experience.

In fact, this position is explicitly open to people with no experience whatsoever. Given that most junior positions require a minimum of two years, I think that got us a lot of candidates.

The fact that we don’t require prior experience doesn’t mean that we don’t have prerequisites, of course. We are a database company and the fundamentals are important to us. A typical task in RavenDB involves a lot of taxes, from ACID compliance and distributed computing to strict performance requirements, visibility into the actions of the database, readability of the code, and so on.

I talked before about the cost of a bad hire, and in the nearly a decade that has passed since I wrote that post, I haven’t changed my mind. I would rather end up with no one than hire someone who isn’t a match for our needs.

Our interview process is composed of a phone call, a few coding questions and then an in-person interview. At this point, given that I have been doing this for over a decade, I think that I have interviewed well over 5,000 people. A job interview stresses some people out a lot. Yesterday I had to listen to a candidate speak so fast that I could barely understand the words, and I had to stop a candidate and tell them that they were currently in the 95th percentile of people I have spoken to, so they wouldn’t freeze because of a flubbed question.

I tweeted (anonymously) about the ups and downs of the process and seem to have created quite a lot of noise. A typical phone call with a potential candidate takes about 15 – 30 minutes and is mostly there to serve as an explicit filter. If they don’t meet the minimum requirements that we have, there is no point in wasting either of our time.

One of the questions that I ask is: Build a phone book application that stores the data in memory and outputs the records in lexical order. This can stump some people, so we have an extra question to help. Instead of trying to output the data in lexical order, how would you ensure that you don’t have a duplicate phone number in such a system? Scanning through the entire list of records each time is obviously not the way to go. If they still can’t think of a way to do that, the next hint is to think about O(1) and what data structure would fit this requirement. On the Twitter thread, quite a few people were up in arms about that.
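For illustration only, here is a minimal sketch of the kind of answer the question is looking for (the class and member names are mine, not part of the question): a SortedDictionary keyed by name gives the lexical-order listing, and a HashSet of numbers gives the O(1) duplicate check.

    using System;
    using System.Collections.Generic;

    // Minimal sketch, illustrative names only; not the "official" answer to the question.
    public class PhoneBook
    {
        private readonly SortedDictionary<string, string> _byName = new();
        private readonly HashSet<string> _numbers = new();

        public bool TryAdd(string name, string number)
        {
            if (!_numbers.Add(number))   // O(1) duplicate check on the phone number
                return false;
            _byName[name] = number;      // kept sorted by name
            return true;
        }

        public void Print()
        {
            // SortedDictionary iterates in key (lexical) order
            foreach (var (name, number) in _byName)
                Console.WriteLine($"{name}: {number}");
        }
    }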

Building a phone book is the kind of task that I remember doing in high school programming class as a teenager. Admittedly, that was in Pascal, but I just checked six different computer science degrees, and for all of them data structures was a compulsory course. Moreover, things like “what is the complexity of this operation” are things that we do multiple times a day here. We are building a database here, so operations on data are literally our bread and butter. But even for “normal” operations, that is crucial. A common example: we need to display some information to the user about their database. The information actually comes from two sources internally. One is the database list, which contains various metadata, and the other is the active database instance, which can give us the running stats such as the number of requests for this database in the past minute.

Let’s take a look at this code:
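The code in the original post is embedded and doesn’t reproduce here, so the following is a hypothetical reconstruction of the shape of the problem (the type and property names are mine): for every database record we scan the entire list of active instances to find its live stats, which puts an O(N) search inside an O(N) loop.

    using System.Collections.Generic;
    using System.Linq;

    // Hypothetical sketch, not the original code; all names here are illustrative only.
    public record DatabaseRecord(string Name);
    public record ActiveInstance(string Name, int RequestsPerMinute);

    public static class DatabasesReport
    {
        public static List<string> BuildSlow(List<DatabaseRecord> records, List<ActiveInstance> instances)
        {
            var report = new List<string>();
            foreach (var record in records)   // O(N) over the records
            {
                // A linear scan over the instances for every single record: O(N) * O(N) = O(N^2)
                var live = instances.FirstOrDefault(i => i.Name == record.Name);
                report.Add($"{record.Name}: {live?.RequestsPerMinute ?? 0} req/min");
            }
            return report;
        }
    }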

The complexity of this code is O(N^2). In other words, for ten databases, it would cost us a hundred. But for 250 databases it would cost 62,500 and for 500 it would be 250,000. Almost the same code, but without the needless cost:
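Again, this is a hypothetical sketch rather than the original snippet, reusing the illustrative types (and usings) from above: one O(N) pass builds a dictionary keyed by name, so the lookup inside the loop becomes O(1) and the whole thing is O(N).

    // Same illustrative types as in the sketch above; only the lookup strategy changes.
    public static class DatabasesReportFixed
    {
        public static List<string> Build(List<DatabaseRecord> records, List<ActiveInstance> instances)
        {
            // One O(N) pass to index the instances by name...
            var byName = instances.ToDictionary(i => i.Name);
            var report = new List<string>();
            foreach (var record in records)
            {
                // ...then an O(1) dictionary lookup per record, O(N) overall.
                byName.TryGetValue(record.Name, out var live);
                report.Add($"{record.Name}: {live?.RequestsPerMinute ?? 0} req/min");
            }
            return report;
        }
    }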

This is neither theoretical nor rare, and it is part of every programming curriculum that you’ll find. Furthermore, I’m not even asking about the theoretical properties of various algorithms, or asking a candidate to compute the complexity of a particular piece of code. I’m presenting a fairly simple and common task (make sure that a piece of data is unique) and asking how to perform it efficiently.

From a lot of the reactions, it seems that plenty of people believe that data structures aren’t part of the fundamentals and shouldn’t be something that is deeply embedded in the mindset of developers. To me, that is like saying that a carpenter shouldn’t be aware of the difference between a nail and a screw.

Rob has an interesting thread on the topic, which I wanted to address specifically:

It does not matter what language, actually. In JavaScript you won’t use a “new Hashtable()”, you’ll use an object, sure, but that is a meaningless detail. In fact, arrays implemented as hashes in JavaScript maintain most of their important properties, O(1) access time by index being the key one. If you want to go deeper, in practice the JS runtime will usually use an actual array if it detects that this is how you use the variable.

And I’m sorry, but that is actually really important, for many reasons. The underlying structure of our data has a huge impact on what you can do with that data and how you can operate on it. That is why you have ArrayBuffer and friends in JavaScript now, because it matters, a lot.

And to the people whose 30 years of experience never included ever needing to know those details, I have two things to say:

  • Either your code is full of O(N^2) traps (which is sadly common), or you do know those details, because choosing between a list and a hash is not something that you can really get away from.
  • In this company, implementing basic data structures is literally part of the day to day tasks. Over the years, we have needed customized versions of arrays, lists, dictionaries, trees and many more. This isn’t pie in the sky stuff, that is Monday morning.
time to read 3 min | 418 words

We are hiring again (this time for Junior C# Dev positions in Israel). That means that I go through CVs (quite a few, actually). I like going over the resumes directly, to get a feel not just for a particular candidate but for what is, for lack of a better term, the state of the market.

This time, I noticed a much higher percentage of resumes with a GitHub repository link. Anytime that I see such a link, I go and look at what they have there. That is often really interesting. Then again, you run into things like this:

[image: code screenshot from a candidate’s repository]

On the one hand, this is non-production code; it is obviously a teaching project, which is awesome. On the other hand, I find such code painful to look at.

In the past, I would rate highly anyone who showed a GitHub account in the CV, since I could expect to see some of their projects there, usually unique ones. This time? I’m seeing a lot of basically homework assignments, and those aren’t really that interesting to review or look at. Especially since a lot of the candidates apparently had the same courses, so I saw the same 5 projects repeated over and over again.

In other words, just a GitHub account with some repositories is no longer that interesting or meaningful.

Another thing that I noticed was that a lot of those candidates had profiles with profile pictures like:

[image: candidate profile picture]

A small tip: if you expect people to visit your profile (and I assume you do, since you provided the link in the resume), it is worth it to put a (professional) picture of yourself there. The profile readme on GitHub is also surprisingly attractive when looking at a candidate.

Another tip: if you see a position for a C# Junior Developer, it is acceptable to apply if you don’t have all the requirements, or if you exceed them. But if you are trying to find a new job as a lawyer specializing in family law, maybe don’t try to apply to a tech company.

And yes, I’m using this post as a way to vent while going over so many CVs.

Most CVs are dry, but one candidate just got bumped to the next stage based solely on the fact that they had “Making awesome pancakes” in their CV, which made me laugh.

time to read 4 min | 705 words

I want to comment on the following tweet:

When I read it, I had an immediate and visceral reaction. Because this is one of those things that sound nice, but is actually a horrible dystopian trap. It confuses two very important concepts and puts them in the wrong order, resulting in utter chaos.

The two ideas are “writing tests” and “producing high quality code”. And they are usually expressed in something like this:

We write tests in order to produce high quality code.

Proper tests ensure that you can make forward progress without having regressions. They are a tool you use to ensure a certain level of quality as you move forward. If you assume that the target is the tests, and that you’ll have high quality code because of that, however, you end up in weird places. For example, take a look at the following set of stairs. They aren’t going anywhere and, aside from being decorative, serve no purpose.

[image: decorative stairs that lead nowhere]

When you start considering tests themselves to be the goal, instead of a means to achieve it, you end up with decorative tests. They add to your budget and make it harder to change things, but don’t benefit the project.

There are a lot of things that you are looking for in code review that shouldn’t be in the tests. For example, consider the error handling strategy for the code. We have an invariant that says that exceptions may not escape our code when running in a background thread, because that will kill the process. How would you express something like that in a test? The failure you are guarding against is an error raised from a write to a file when the disk is full, which then kills the server.

Another example is a critical piece of code that needs to safely handle out of memory exceptions. You can test for that, sure, but it is hard and expensive. It also tends to freeze your design and implementation, because you are now testing implementation concerns, and that makes it very hard to change your code.

Reviewing code for performance pitfalls is another major consideration. How do you police allocations using a test? And do that without killing your productivity? For fun, the following code allocates:

Console.WriteLine(1_000_000); // formatting the integer for output produces a string, which allocates

There are ways to monitor and track these kinds of things, for sure, but they are very badly suited for repeatable tests.
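As a rough illustration of why this is awkward to turn into a repeatable test (my own sketch, not something we ship), you can observe allocations around a call with GC.GetAllocatedBytesForCurrentThread() on recent .NET runtimes, but the exact number shifts with runtime version, JIT tiering and console buffering, which is exactly what makes a hard assertion on it so brittle:

    using System;

    // Rough sketch only: observing allocations around a single call.
    long before = GC.GetAllocatedBytesForCurrentThread();
    Console.WriteLine(1_000_000);   // formatting the integer produces a string, so this allocates
    long after = GC.GetAllocatedBytesForCurrentThread();
    Console.WriteLine($"allocated roughly {after - before} bytes");
    // Asserting on that exact number in a test would break across runtime versions and JIT behavior.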

Then there are other things that you’ll cover in the review, more tangible items. For example, the quality of error messages you raise, or the logging output.

I’m not saying that you can’t write tests for those. I’m saying that you shouldn’t. That is something that you want to be able to change and modify quickly, because you may realize that you want to add more information in a certain scenario. Freezing the behavior using tests just means that you have more work to do when you need to make the changes. And reviewing just the test code is an even bigger problem when you consider that you need to account for interactions between features and their impact on one another. Not in terms of correctness, you absolutely need to test that, but in terms of behavior.

The interleaving of internal tasks inside of RavenDB was carefully designed to ensure that we’ll be biased in favor of user facing operations, starving background operations if needed. At the same time, it means that we need to ensure that we’ll give the background tasks time to run. I can’t really think of how you would write a test for something like that without basically setting in stone the manner in which we make that determination. That is something that I explicitly don’t want to do; it would make changing how and what we do harder. But that is something that I absolutely consider in code reviews.

time to read 3 min | 559 words

I’m very happy to announce that we have recently released version 6.0 of Entity Framework Profiler and NHibernate Profiler. Here are some of the highlights:

[image: list of highlights]

This new version brings quite a lot to the table. We applied a lot of lessons to optimize the performance of the profiler, so it can process events faster and in a more efficient manner. We also fully integrated it with the async/await model that has become so popular. For Entity Framework users, we now support all versions of EF, running on .NET 4.7 all the way to .NET 5.0 and anything in between.

What I think is the crown jewel of this release, however, is the new Azure integration feature. We initially built this feature to allow you to profile serverless code and Azure Functions, but it turned out to be really useful well beyond that.

The Profiler Azure Integration allows you to set up a storage container on Azure that will accept the profiled output from your application. Pretty simple, right? The profiler application will monitor the container and show you the events as they are written, so even if you don’t have direct access to the profiled code (very common in serverless / Azure scenarios), you can still profile your code. That helps a lot as well when talking about running inside containers, Kubernetes, etc. The usual way the profiler and the application communicate is over a TCP channel, but given the myriad of network topologies that are now in common use, this can get complex. By utilizing the Azure integration, we can sidestep the whole issue and give you a seamless way to profile your code.

On demand profiling of your code is just one part of the new feature. You can now do continuous profiling of your system. Because we throw the events into a blob container in Azure, and given that the data is compressed and cheap to store, you can now afford to simply record all the queries in your application and come back to it a few days or weeks later to look into interesting behaviors.

That is also a large part of why we worked on improving our performance: we expect to be dealing with a lot of data.

You are also able to specify a certain timeframe for this kind of profiling, as well as specify whether we should remove the data after looking at it or retain it. The new profiling feature gives you a way to answer: “What queries ran in production on Monday between 9:17 and 10:22 AM?” This comes with all of the usual benefits of the profiler, meaning that:

  • You can correlate each query to the exact line of code that triggered it.
  • The profiler analyzes the database interaction and raises alerts on bad behaviors.

The last part is done not on a single query but with access to the full state of the system. It can allow you to find hot spots in your program and patterns of data access that lead to much higher latency and cost a lot more money.

To celebrate the new release, we are offering the profilers at a 20% discount for the new quarter, as well as offering a bundling option. You can get the Entity Framework Profiler or the NHibernate Profiler bundled with our Cosmos DB Profiler.

time to read 2 min | 204 words

RavenDB Cloud is now offering HIPAA Compliant Accounts.

HIPAA stands for the Health Insurance Portability and Accountability Act, a set of rules and regulations that health care providers and their business associates need to apply.

That refers to strictly limiting access to Personal Health Information (PHI) and Personally Identifying Information (PII), as well as audit and security requirements. In short, if you deal with medical information in the States, this is something that you need to deal with. In the rest of the world, there are similar standards and requirements.

With HIPAA compliant accounts, RavenDB Cloud takes on itself a lot of the details around ensuring that your data is stored in a safe environment and in a manner that matches the HIPAA requirements. For example, the audit logs are maintained for a minimum of six years. In addition, there are further protections on accessing your cluster, and we enforce a set of rules to ensure that you don’t accidentally expose private data.

This feature ensures that you can easily run HIPAA compliant systems on top of RavenDB Cloud with a minimum of hassle.

time to read 3 min | 414 words

The very first product of Hibernating Rhinos was a profiler for NHibernate, to allow you to figure out exactly what is going on between your database and application. Now I’m proud to present our latest product: the Cosmos DB Profiler.

If you are using Azure, you are likely familiar with Cosmos DB. Cosmos DB is not a traditional relational database. It is marketed by Microsoft as a multi-model database and it is widely known in the world of distributed databases. The first part is important enough to bear repeating: Cosmos DB is not a relational database, even if there is a tendency to treat it as such.

We have gathered everything we know about optimal database usage, mixed in all the experience we gained watching users bump into issues working with distributed systems, and then looked into all the best practices published about successful Cosmos DB applications. After we had all of that, we looked into patterns, things that we can do for you, automatically, that would prevent you from messing up. Thus, the Cosmos DB Profiler was born.

Here is what it looks like, profiling an application locally:

[image: profiler showing a locally profiled application]

As you can see, it gives you context for the interaction between your application and the database. It allows you to see exactly what is going on behind the scenes. This is important, since most Cosmos DB applications aren’t trivial; we are usually talking about big applications with a lot of data and moving pieces. It can be hard to understand what is actually going on when you run a particular action.

Furthermore, the profiler is able to give you concrete suggestions that will improve your performance and reduce your cloud bills.

[image: profiler suggestions]

The pricing model for Cosmos DB is based on provisioned capacity, and it is very easy to get into a state where you need to provision a lot more than what you expected to need. The profiler is able to detect such issues, provide you with concrete recommendations on how to fix them and show you the savings, immediately.

I’m doing a webinar on the Cosmos DB profiler on Tuesday and I would love to see you there.

time to read 4 min | 628 words

I posted about the @refresh feature in RavenDB, explaining why it is useful and how it can work. Now, I want to discuss a possible extension to this feature. It might be easier to show than to explain, so let’s take a look at the following document:
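The document from the original post is embedded and doesn’t show up here, so below is a hypothetical sketch of the kind of shape being discussed (the field names and values are mine, not the actual proposed syntax): a lease payment document whose @refresh metadata carries a list of timed scripts.

    // Hypothetical sketch only; field names and values are illustrative, not the proposed syntax.
    {
        "Tenant": "tenants/4621",
        "Amount": 1800,
        "Status": "Pending",
        "@metadata": {
            "@collection": "LeasePayments",
            "@refresh": [
                { "At": "2021-02-04T00:00:00.000Z", "Script": "this.Amount += 150;" },
                { "At": "2021-02-15T00:00:00.000Z", "Script": "this.Status = 'PastDue';" }
            ]
        }
    }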

The idea is that in addition to the data inside the document, we also specify behaviors that will run at specified times. In this case, if the user is three days late in paying the rent, they’ll have a late fee tacked on. If enough time has passed, we’ll mark this payment as past due.

The basic idea is that in addition to just having a @refresh timer, you can also apply actions. And you may want to apply a set of actions, at different times. I think that the lease payment processing is a great example of the kind of use cases we envision for this feature. Note that when a payment is made, the code will need to clear the @refresh array, to avoid it being run on a completed payment.

The idea is that you can apply operations to the documents at a future time, automatically. This is a way to enhance your documents with behaviors and policies with ease. The point is that you don’t need to set up your own code to execute this; you can simply let RavenDB handle it for you.

Some technical details:

  • RavenDB will take the time from the first item in the @refresh array. At the specified time, it will execute the script, passing it the document to be modified. The @refresh item we just executed will be removed from the array, and if there are additional items, the next one will be scheduled for execution.
  • Only the first element in the @refresh array is ever considered. So if the items aren’t sorted by date, the first one will be executed and the document persisted again. The next one (which was earlier than the first one) is then already due for execution, so it will be run on the next tick.
  • Once all the items in the @refresh array have been processed, RavenDB will remove the @refresh metadata property.
  • Modifications to the document because of the execution of @refresh scripts are going to be handled as normal writes. It is just that they are executed by RavenDB directly. In other words, features such as optimistic concurrency, revisions and conflicts are all going to apply normally.
  • If any of the scripts cause an error to be raised, the following will happen:
    • RavenDB will not process any future scripts for this document.
    • The full error information will be saved into the document with the @error property on the failing script.
    • An alert will be raised for the operations team to investigate.
  • The scripts can do anything that a patch script can do. In other words, you can put(), load(), del() documents in here.
  • We’ll also provide a debugger experience for this in the Studio, naturally.
  • Amusingly enough, the script is able to modify the document, which obviously include the @refresh metadata property. I’m sure you can imagine some interesting possibilities for this.

We also considered another option (look at the Script property):
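The embedded example doesn’t reproduce here either; the following is a hypothetical sketch of what such a reference might look like (the names are mine, not the original), with the Script property pointing at a property on another document instead of holding the code inline:

    // Hypothetical sketch only; the reference format is my guess, not the original example.
    "@refresh": [
        { "At": "2021-02-04T00:00:00.000Z", "Script": "scripts/lease-payments#ApplyLateFee" }
    ]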

The idea is that instead of specifying the script to run inline, we can reference a property on another document. The advantage is that we can apply changes globally much more easily; we can fix a bug in the script once. The disadvantage here is that you may be modifying a script for new values without accounting for the old documents that may be referencing it. I’m still of two minds about whether we should allow a script reference like this.

This is still an idea, but I would like to solicit your feedback on it, because I think that this can add quite a bit of power to RavenDB.

time to read 4 min | 618 words

Exactly 9 years ago, Hibernating Rhinos had a major breakthrough. We moved to our own offices for the first time. Before that, I was mostly working from a home office or from clients’ locations. Well, I say we, but I mean I. At the time, the change mostly involved me having to put on some shoes and go out of the house to work alone in a big empty office. The rest of the team at the time was completely remote.

I got the office because I needed to. Some people can manage a proper life / work balance while working from home. I find it very hard. I’m the kind of person that would get up at 2 AM to get something to drink, see a new mail notification on the monitor, and start working until 8 AM. Having a separate office was hugely beneficial for me.  The other reason was that it allowed me to hire more people locally. The first real employee I had was hired within three months of moving to the new office.

That first office was great, but small. Just 5 rooms, about 120 m² (1300 ft²). We stayed in that office until we got to about 12 people. At that point, we really didn’t have enough room to swing a cat (to be fair, we didn’t have an office cat, nor a really good reason to want to swing one). We moved offices in 2015, from the center of the industrial zone of the city to the periphery of the business zone. The new offices were 250 m² (2700 ft²) and gave us a lot of room to expand. They also had two major advantages: it was nice to be able to walk downstairs and get to pretty much anywhere we needed to go, and we no longer had to deal with having a garage next door.

When we moved to the 2nd office, it felt like we had a huge amount of room, but it filled up quite quickly. It was certain that we would outgrow the new place in short order, so we started looking for a permanent home that would suffice for the next 10 years or so. We got one, smack in the center of the business zone of the city. Next door to city hall, actually. Well, I say “got one”. What we actually got was a piece of paper and a hole in the ground. Before we could move into the new offices, they had to be built first.

We stayed in the second office space for 3 years, but we ran out of room before the new offices were ready. So we moved for the third time. Because our new offices weren’t ready, we moved to a shared working space (like WeWork). We planned on being there for a short while, but it ended up being over a year. The plus side: we were able to expand much more easily. We hired quite a few people this year and were able to simply add more offices as we grew. The downside was that this was very much not our office, so we really wanted to move.

This week, however, we are finally going to move. The new offices have more than enough space, 415 m² (4500 ft²), for the next five to ten years of growth. They cover two floors in a brand new location, centrally located and beautifully done. I’m not posting any pictures because the vast majority of our own team haven’t seen it yet (we have an unveiling party tomorrow), but I’m super happy that we got to this point and just had to share it on the blog.
