Oren Eini

CEO of RavenDB

a NoSQL Open Source Document Database

time to read 10 min | 1880 words

In my previous post, we dealt with how to model Auctions and Products. This time, we are going to look at how to model Bids.

Before we can do that, we need to figure out how we are going to use them. As I mentioned, I am going to use eBay as the source for “application mockups”. So I went to eBay and took a couple of screenshots.

Here is the actual auction page:

[Screenshot: eBay auction page]

And here is the actual bids page.

[Screenshot: eBay bids page]

This tells us several things:

  • Bids aren’t really accessed for the main page.
  • There is a strong likelihood that the number of bids is going to be small for most items (less than a thousand).
  • Even for items with a lot of bids, we only care about the most recent ones for the most part.

This is the Auction document as we have last seen it:

{
   "Quantity":15,
   "Product":{
      "Name":"Flying Monkey Doll",
      "Colors":[
         "Blue & Green"
      ],
      "Price":29,
      "Weight":0.23
   },
   "StartsAt":"2011-09-01",
   "EndsAt":"2011-09-15"
}

The question is, where do we put the Bids? One easy option would be to put all the Bids inside the Auction document, like so:

{
   "Quantity":15,
   "Product":{
      "Name":"Flying Monkey Doll",
      "Colors":[
         "Blue & Green"
      ],
      "Price":29,
      "Weight":0.23
   },
   "StartsAt":"2011-09-01",
   "EndsAt":"2011-09-15",
   "Bids": [
     {"Bidder": "bidders/123", "Amount": 0.1, "At": "2011-09-08T12:20" }
   ]
}

The problem with such an approach is that we are now forced to load the Bids whenever we want to load the Auction, but the main scenario is that we just need the Auction details, not all of the Bids details. In fact, we only need the count of Bids and the Winning Bid. This approach will also fail to properly handle the scenario of a High Interest Auction, one that has a lot of Bids.

That leaves us with a few options. One of them starts from the observation that we don’t really care about Bids and Auctions as a time-sensitive matter. As long as we are accepting Bids, we don’t really need to give you immediate feedback. Indeed, this is how most Auction sites work. They give you a cached view of the data, refreshing it every 30 seconds or so. The idea is to reduce the cost of actually accepting a new Bid to the minimum necessary. Once the Auction is closed, we can figure out who actually won and notify them.

A good design for this scenario would be a separate Bid document for each Bid, and a map/reduce index to get the Winning Bid Amount and Bid Count. Something like this:

     {"Bidder": "bidders/123", "Amount": 0.1, "At": "2011-09-08T12:20", "Auction": "auctions/1234"}
     {"Bidder": "bidders/234", "Amount": 0.15, "At": "2011-09-08T12:21", "Auction": "auctions/1234" }
     {"Bidder": "bidders/123", "Amount": 0.2, "At": "2011-09-08T12:22", "Auction": "auctions/1234" }

And the index:

// Map: emit one entry per Bid
from bid in docs.Bids
select new { Count = 1, bid.Amount, bid.Auction }

// Reduce: aggregate the entries per Auction
from result in results
group result by result.Auction into g
select new 
{
   Count = g.Sum(x=>x.Count),
   Amount = g.Max(x=>x.Amount),
   Auction = g.Key
}

As you can imagine, due to the nature of RavenDB’s indexes, we can cheaply insert new Bids, without having to wait for the indexing to complete. And we can always display the last calculated value of the Auction, including the time as of which that value is stable.
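To make the map and reduce steps concrete, here is a minimal in-memory sketch of the same aggregation in plain LINQ. It is not a RavenDB index definition, just an illustration of what the index computes. The type names (BidDocument, AuctionStats, BidStats) are made up for this example.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical in-memory shapes; the properties mirror the Bid documents.
public record BidDocument(string Bidder, decimal Amount, string Auction);
public record AuctionStats(string Auction, int Count, decimal Amount);

public static class BidStats
{
    // Map: project every bid to { Count = 1, Amount, Auction }.
    // Reduce: group by auction, sum the counts, keep the highest amount.
    public static List<AuctionStats> Compute(IEnumerable<BidDocument> bids) =>
        bids.Select(bid => new { Count = 1, bid.Amount, bid.Auction })
            .GroupBy(result => result.Auction)
            .Select(g => new AuctionStats(
                g.Key,
                g.Sum(x => x.Count),
                g.Max(x => x.Amount)))
            .ToList();
}
```

Running Compute over the three sample Bids for auctions/1234 gives a single entry with a Count of 3 and an Amount of 0.2.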

That is one model for an Auction site, but another one would be a much stricter scenario, where you can’t just accept any Bid. It might be a system where you are charged per bid, so accepting a known invalid bid is not allowed (if you were outbid in the meantime). How would we build such a system? We can still use the previous design, and just defer the actual billing to a later stage, but let us assume that this is a strong constraint on the system.

In this case, we can’t rely on the indexes, because we need immediately consistent information, and we need it to be cheap. With RavenDB, we have the document store, which is fully ACID. So we can store all of the Bids for an Auction in a single document:

{
   "Auction": "auctions/1234",
   "Bids": [
     {"Bidder": "bidders/123", "Amount": 0.1, "At": "2011-09-08T12:20", "Auction": "auctions/1234"},
     {"Bidder": "bidders/234", "Amount": 0.15, "At": "2011-09-08T12:21", "Auction": "auctions/1234" },
     {"Bidder": "bidders/123", "Amount": 0.2, "At": "2011-09-08T12:22", "Auction": "auctions/1234" }
    ]
}

And we modify the Auction document to be:

{
   "Quantity":15,
   "Product":{
      "Name":"Flying Monkey Doll",
      "Colors":[
         "Blue & Green"
      ],
      "Price":29,
      "Weight":0.23
   },
   "StartsAt":"2011-09-01",
   "EndsAt":"2011-09-15",
   "WinningBidAmount": 0.2,
   "BidsCount": 3
}

Adding the BidsCount and WinningBidAmount to the Auction means that we can very cheaply show them to the users. Because RavenDB is transactional, we can actually do it like this:

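using(var session = store.OpenSession())
{
  // Make SaveChanges() throw if either document was changed by someone else
  session.Advanced.UseOptimisticConcurrency = true;
  
  var auction = session.Load<Auction>("auctions/1234");
  var bids = session.Load<Bids>("auctions/1234/bids");
  
  // Throws if the amount is not higher than the current winning bid
  bids.AddNewBid(bidder, amount);
  
  // Copy BidsCount and WinningBidAmount onto the Auction document
  auction.UpdateStatsFrom(bids);
  
  // Both documents are written in a single transaction
  session.SaveChanges();
}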

We are now guaranteed that this will either succeed completely (and we have a new winning bid), or it will fail utterly, leaving no trace. Note that AddNewBid will reject a bid that isn’t the highest (by throwing an exception), and if we have two concurrent modifications, RavenDB will throw on that. Both the Auction and its Bids are treated as a single transactional unit, just the way they should be.
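For illustration, here is one way the two domain methods used above might look. This is only a sketch under my own assumptions: the post names AddNewBid, UpdateStatsFrom, BidsCount and WinningBidAmount, but the bodies, the Items property name and the exception type are hypothetical. It also ignores document splitting, so BidsCount is simply the number of Bids currently held.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Linq;

public class Bid
{
    public string Bidder { get; set; } = "";
    public decimal Amount { get; set; }
    public DateTime At { get; set; }
}

public class Bids
{
    public string Auction { get; set; } = "";

    // Hypothetical name: C# does not allow a member called Bids inside class Bids.
    public List<Bid> Items { get; } = new List<Bid>();

    // Rejects any bid that does not beat the current highest one.
    public void AddNewBid(string bidder, decimal amount)
    {
        if (Items.Count > 0 && amount <= Items.Max(b => b.Amount))
            throw new InvalidOperationException(
                "Bid must be higher than the current winning bid.");

        Items.Add(new Bid { Bidder = bidder, Amount = amount, At = DateTime.UtcNow });
    }
}

public class Auction
{
    public decimal WinningBidAmount { get; set; }
    public int BidsCount { get; set; }

    // Copies the two denormalized values that the auction page displays.
    public void UpdateStatsFrom(Bids bids)
    {
        BidsCount = bids.Items.Count;
        WinningBidAmount = bids.Items.Count == 0 ? 0m : bids.Items.Max(b => b.Amount);
    }
}
```

The validation lives in the Bids document itself, so a losing bid fails before SaveChanges is ever called, and nothing is written.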

The final question is how to handle a High Interest Auction, one that gathers a lot of bids. We didn’t worry about it in the previous model, because that was left for RavenDB to handle. In this case, since we are using a single document for the Bids, we need to take care of that ourselves. There are a few things that we need to consider here:

  • Bids that lost are usually of little interest.
  • We probably need to keep them around, just in case, nevertheless.

Therefore, we will implement splitting for the Bids document. What does this mean?

Whenever the number of Bids in the Bids document reaches 500, we split the document. We take the oldest 250 Bids and move them to a Historical Bids document, and then we save.

That way, we have a set of historical documents with 250 Bids each that no one is ever likely to read, but we need to keep, and we have the main Bids document, which contains the most recent (and relevant) Bids. A High Interest Auction might end up looking like:

  • auctions/1234 <- Auction document
  • auctions/1234/bids <- Bids document
  • auctions/1234/bids/1 <- historical bids #1
  • auctions/1234/bids/2 <- historical bids #2
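The splitting rule can be sketched roughly as follows. This is a hypothetical helper, not RavenDB API: BidEntry, BidsDocument, BidsSplitter and the historicalCount parameter are names I am assuming for the example. The caller would store the returned document in the same session as the trimmed Bids document, so both are saved together.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Linq;

public class BidEntry
{
    public string Bidder { get; set; } = "";
    public decimal Amount { get; set; }
    public DateTime At { get; set; }
}

public class BidsDocument
{
    public string Id { get; set; } = "";
    public List<BidEntry> Bids { get; set; } = new List<BidEntry>();
}

public static class BidsSplitter
{
    public const int SplitThreshold = 500; // split once we reach this many Bids
    public const int BidsToMove = 250;     // the oldest Bids go to a historical document

    // Returns the new historical document, or null when no split is needed.
    // historicalCount is the number of historical documents that already exist.
    public static BidsDocument? SplitIfNeeded(BidsDocument current, int historicalCount)
    {
        if (current.Bids.Count < SplitThreshold)
            return null;

        var oldestFirst = current.Bids.OrderBy(b => b.At).ToList();

        var historical = new BidsDocument
        {
            Id = current.Id + "/" + (historicalCount + 1),
            Bids = oldestFirst.Take(BidsToMove).ToList()
        };

        // Keep only the most recent Bids in the main document.
        current.Bids = oldestFirst.Skip(BidsToMove).ToList();
        return historical;
    }
}
```

With a main document of auctions/1234/bids and no historical documents yet, the first split produces auctions/1234/bids/1 and leaves the 250 most recent Bids in place.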

And that is enough for now, I think. This post went on a little longer than I intended, but hopefully I was able to explain to you both the final design decisions and the process used to reach them.

Thoughts?

time to read 2 min | 216 words

One of the things that we ask some of our interviewees is to give us a project that would answer the following:

We need a reusable library to manage phone books for users. User interface is not required, but we do need an API to create, delete and edit phone book entries. An entry contains a Name (first and last), type (Work, Cellphone or Home) and number. Multiple entries under the same name are allowed. The persistence format of the phone book library is a file, and text-based formats such as XML or JSON have been ruled out.

In addition to creating / editing / deleting, the library also needs to support iterating over the list in alphabetical order or by the first or last name of each entry.

The fun part with this question is that it is testing so many things at the same time. It gives me a lot of details about the kind of candidate that I have in front of me: their actual ability to solve a non-trivial problem, the way they design and organize code, the way they can understand and implement a set of requirements, etc.

The actual problem is something that I remember doing as an exercise during high school (in Pascal, IIRC).

time to read 2 min | 307 words

A chicken, in this case, is the same chicken from the Pig & Chicken story, the one who wanted to open the eggs & ham place. This is a term used in agile a lot.

There are many teams who feel that being responsive to client demands is a Good Thing. In general, they are usually right, but you have to be very aware of who is asking, and what stakes they have in the game. If they don’t own the budget for your team, they don’t get to ask for features and get a “sure thing” automatically.

Case in point, I was asked by another team in a totally different company what direction they should go for a decision that directly impacts my software. I am using their stuff, and as such, they sought my feedback. The problem is that my recommendation was based on what I actually needed. They had two options, one of which would take a week or two, and would provide the basic set of services. The other would take several months to develop, but would allow me to create much better options for my users.

I think that you can guess what I ended up recommending, since from my point of view, there is absolutely no downside whatsoever. If they end up implementing the basic stuff, that is okay. If they implement the advanced stuff, that is great. In any case, my cost ends up being zero.

I am a chicken in this game, and I want the biggest piece of meat (but make it non pig) that I can get, since I am eating on the house.

Whenever you let customer feedback into the loop, you have to take that piece into account. Customers are going to favor whatever it is that benefits them, and that isn’t the same as whatever benefits you.

time to read 4 min | 686 words

This article thinks so, and I was asked to comment on that. I have to say that I agree with a lot in this article. It starts by laying out what an anti pattern is:

  1. It initially appears to be beneficial, but in the long term has more bad consequences than good ones
  2. An alternative solution exists that is proven and repeatable

And then goes on to list some of the problems with OR/M:

  • Inadequate abstraction - The most obvious problem with ORM as an abstraction is that it does not adequately abstract away the implementation details. The documentation of all the major ORM libraries is rife with references to SQL concepts.
  • Incorrect abstraction – …if your data is not relational, then you are adding a huge and unnecessary overhead by using SQL in the first place and then compounding the problem by adding a further abstraction layer on top of that.
    On the other hand, if your data is relational, then your object mapping will eventually break down. SQL is about relational algebra: the output of SQL is not an object but an answer to a question.
  • Death by a thousand queries – …when you are fetching a thousand records at a time, fetching 30 columns when you only need 3 becomes a pernicious source of inefficiency. Many ORM layers are also notably bad at deducing joins, and will fall back to dozens of individual queries for related objects.

If the article were only about pointing out the problems in OR/M, I would have no issue endorsing it unreservedly. Many of the problems it points out are real. They can be mitigated quite nicely by someone who knows what they are doing, but that is beside the point.

I think that I am in a pretty unique position to answer this question. I have over 7 years of being heavily involved in the NHibernate project, and I have been living & breathing OR/M for all of that time. I have also created RavenDB, a NoSQL database, that gives me a good perspective about what it means to work with a non relational store.

And like most criticisms of OR/M that I have heard over the years, this article does only half the job. It tells you what is good & bad (mostly bad) in OR/M, but it fails to point out something quite important.

To misquote Churchill, Object Relational Mapping is the worst form of accessing a relational database, except for all of the other options, when used for OLTP.

When I see people railing against the problems in OR/M, they usually point out quite correctly problems that are truly painful. But they never seem to remember all of the other problems that OR/M usually shields you from.

One alternative is to move away from Relational Databases. RavenDB and the RavenDB Client API have been specifically designed by us to overcome a lot of the limitations and pitfalls inherent to OR/M. We have been able to take advantage of all of our experience in the area and create what I consider to be a truly awesome experience.

But if you can’t move away from Relational Databases, what are the alternatives? Ad hoc SQL or Stored Procedures? You want to call that better?

A better alternative might be something like Massive, which is a very thin layer over SQL. But that suffers from a whole host of other issues (no unit of work means aliasing issues, no support for eager load means a better chance for SELECT N+1, no easy way to handle migrations, etc). There is a reason why OR/Ms have reached where they have. There are a lot of design decisions that simply cannot be made any other way without unacceptable tradeoffs.

From my perspective, that means that if you are using Relational Databases for OLTP, you are most likely best served with an OR/M. Now, if you want to move away from Relational Databases for OLTP, I would be quite happy to agree with you that this is the right move to make.

time to read 1 min | 98 words

We are working on the new version of RavenDB Studio, and it has become clear very quickly that while we might be good at producing software, we are most certainly not good at making it look good.

Therefore, I would like to get some help from someone who can actually take an ugly duckling and make it into a beautiful swan.

If you are interested, I would be very happy if you can contact me.

time to read 1 min | 65 words

Something that we have recently started to do is to record some of our customer interactions*, and then post them to our YouTube account.

The following is a discussion with Nick VanMatre, Solutions Architect at Archstone, about how to scale their RavenDB usage. I think you’ll find it interesting.

* Nit picker corner: Obviously, with their permission.

time to read 1 min | 196 words

People seem to be more interested in answering the question than in the code that solved it. Actually, people seemed to be more interested in outdoing one another in creating answers to that. What I found most interesting is that a large percentage of the answers (both in the blog post and in the interviews) got a lot of that wrong.

So here is the question in full. The following table shows the current tax rates in Israel:

Salary                     Tax Rate
Up to 5,070                10%
5,071 up to 8,660          14%
8,661 up to 14,070         23%
14,071 up to 21,240        30%
21,241 up to 40,230        33%
Higher than 40,230         45%

Here are some example answers:

  • 5,000 –> 500
  • 5,800 –> 609.2
  • 9,000 –> 1087.8
  • 15,000 –> 2532.9
  • 50,000 –> 15,068.1

This problem is a bit tricky because each tax rate doesn’t apply to the whole sum, only to the part of it that falls within that rate’s range.
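As a worked illustration of that rule, here is a minimal sketch of a bracket-based calculation. The class and member names are my own, and it is just one of many ways to write this. It treats the ranges as continuous (the 14% rate starts right after 5,070), which is what the example answers above imply, and it uses decimal to avoid floating point rounding noise.

```csharp
#nullable enable
using System;

public static class TaxCalculator
{
    // Upper bound of each range and the rate charged on the slice inside it.
    private static readonly (decimal UpTo, decimal Rate)[] Brackets =
    {
        (5070m, 0.10m),
        (8660m, 0.14m),
        (14070m, 0.23m),
        (21240m, 0.30m),
        (40230m, 0.33m),
        (decimal.MaxValue, 0.45m),
    };

    public static decimal Calculate(decimal salary)
    {
        decimal tax = 0m;
        decimal lowerBound = 0m;

        foreach (var (upTo, rate) in Brackets)
        {
            if (salary <= lowerBound)
                break; // nothing left to tax in this or any higher range

            // Only the slice of the salary inside this range is taxed at its rate.
            var taxableInRange = Math.Min(salary, upTo) - lowerBound;
            tax += taxableInRange * rate;
            lowerBound = upTo;
        }

        return tax;
    }
}
```

For 5,800, this charges 10% on the first 5,070 (507) and 14% on the remaining 730 (102.2), for a total of 609.2, matching the example above.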

time to read 5 min | 853 words

One of the things that I really hate is to be reminded anew how stupid some people are. Or maybe it is how stupid they think I am.  One of the things that we are doing during interviews is to ask candidates to do some fairly simple code tasks. Usually, I give them an hour or two to complete that (using VS and a laptop), and if they don’t complete everything, they can do that at home and send me the results.

This is a piece of code that one such candidate has sent. To be clear, this is something that the candidate has worked on at home and had as much time for as she wanted:

public int GetTaxs(int salary)
{
    double  net, tax;

    switch (salary)
    {
        case < 5070:
            tax = salary  * 0.1;
            net=  salary  - tax ;
            break;

        case < 8660:
        case > 5071:
            tax = (salary - 5071)*0.14;
            tax+= 5070 * 0.1;
            net = salary-tax;   
            break;
        case < 14070:
        case > 8661:
            tax=  (salary - 8661)*0.23;
            tax+= (8661 - 5071 )*0.14;
            tax+= 5070 *0.1;
            net=  salary - tax;
            break;
        case <21240:
        case >14071:
            tax=  (salary- 14071)*0.3;
            tax+= (14070 - 8661)*0.23;
            tax+= (8661 - 5071 )*0.14;
            tax+= 5070 *0.1;
            net= salary - tax;
            break;
        case <40230:
        case >21241:
            tax=  (salary- 21241)*0.33;
            tax+= (21240 - 14071)*0.3;
            tax+= (14070 - 8661)*0.23;
            tax+= (8661 - 5071 )*0.14;
            tax+= 5070 *0.1;
            net= salary - tax;
            break;
        case > 40230:
            tax= (salary - 40230)*0.45;
            tax+=  (40230- 21241)*0.33;
            tax+= (21240 - 14071)*0.3;
            tax+= (14070 - 8661)*0.23;
            tax+= (8661 - 5071 )*0.14;
            tax+= 5070 *0.1;
            net= salary - tax;
            break;
        default:
            break;
    }
}

Submitting code that doesn’t actually compile is a great way to pretty much ensure that I won’t hire you.

time to read 2 min | 271 words

I got some requests to make RavenMQ an OSS project. And I thought that I might explain the thinking behind why I don’t want to do that.

Put simply, I have never thrown a significant amount of code over the wall for other people to deal with. Oh, I have done it with a lot of small projects ( < ~2,000 LOC ) which I assume that most people can figure out in an hour or less, but a significant, non trivial amount of software? Never done that.

It doesn’t feel right. More than that, it isn’t likely to actually work. Even mature, multi-contributor projects have a hard time changing leaders if they were structured as a single person effort. To do so on a brand new codebase which no one really knows? That is a recipe for either tying me up with support or creating a bad impression if someone doesn’t get the code to work. One of the things that I learned from many years of working with Open Source software is that the maturity level of the project counts, and that just throwing code over the wall is a pretty bad way of ensuring that a project will survive and thrive.

And then there is another issue. I don’t believe that RavenMQ is as valuable now that SignalR is out there. You can do pretty much whatever you could do with RavenMQ with SignalR, and that means that as far as everyone is concerned, this is a pure win. There isn’t a need to create a separate project simply to have a separate project.
