JAOO: More on Evolving the Key/Value Programming Model to a Higher Level from Billy Newport


As I already mentioned, this presentation had me thinking. Billy presented a system called Redis, a key/value store intended for attribute-based storage.

That means that storing something like User { Id = 123, Name = “billy”, Email = “billy@example.org”} is actually stored as:

{ "uid:123:name": "billy" } 
{ "uid:123:email": "billy@example.org" }
{ "uname:billy": "123" } 

Each of those lines represents a different key/value pair in the Redis store. According to Billy, this has a lot of implications. On the advantage side, you get no schema and very easy support for just adding stuff as you go along. On the other hand, Redis does not support transactions, and it is easy to “corrupt” the database during development (usually as a result of a programming bug).
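To make the attribute layout concrete, here is a minimal in-memory sketch (a plain Map standing in for Redis; the class and method names are mine, not a real client API) showing how saving one user turns into three independent writes, and how a lookup by name costs two reads:

```java
import java.util.HashMap;
import java.util.Map;

// In-memory stand-in for an attribute-based key/value store.
public class AttributeStore {
    private final Map<String, String> kv = new HashMap<>();

    // Saving one user is three independent writes. Nothing ties them
    // together transactionally, so a bug or crash between writes
    // leaves the "row" half written - the corruption risk noted above.
    public void saveUser(long id, String name, String email) {
        kv.put("uid:" + id + ":name", name);
        kv.put("uid:" + id + ":email", email);
        kv.put("uname:" + name, Long.toString(id)); // secondary index by name
    }

    // Lookup by name is two reads: name -> id, then id -> attribute.
    public String emailByName(String name) {
        String id = kv.get("uname:" + name);
        return id == null ? null : kv.get("uid:" + id + ":email");
    }
}
```

Against a real Redis server each of those puts and gets would be a separate network round trip, which is exactly the problem discussed next.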

What actually bothered me the most was the implications on the number of remote calls being made. The problem shows itself very well in this code sample (from a Twitter clone), which shows inserting a new twit:

long postid = R.str_long.incr("nextPostId");
long userId = PageUtils.getUserID(request);
long time = System.currentTimeMillis();
String post = Long.toString(userId)+"|"+Long.toString(time)+"|"+status;
R.c_str_str.set("p:"+Long.toString(postid), post);
List<Long> followersList = R.str_long.smembers(Long.toString(userId)+":followers");
if(followersList == null)
   followersList = new ArrayList<Long>();
HashSet<Long> followerSet = new HashSet<Long>(followersList);
long replyId = PageUtils.isReply(status);
if(replyId != -1)
    followerSet.add(replyId);
for(long i : followerSet)
    R.str_long.lpush(Long.toString(i)+":posts", postid);
// -1 uid is global timeline
String globalKey = Long.toString(-1)+":posts";
R.str_long.lpush(globalKey, postid);
R.str_long.ltrim(globalKey, 200);

I really don’t like the API, mostly because it reminds me of C, but the conventions are pretty easy to figure out.

  • R – the static gateway into the Redis API
  • str_long – store longs (keyed by strings)
  • c_str_str – store strings and keep them in a nearby cache

The problem with this type of code is the number of remote calls and the locality of those calls. With a typical sharded set of servers, you are going to have lots of calls going all over the place. And when you get to people who have thousands or millions of followers, the application is simply going to die.

A better solution is required. Billy suggested using async programming or sending code to the data store to execute there.

I have a different variant on the solution.

We will start from the assumption that we really want to reduce remote calls, and that system performance in the face of a large volume of writes (without impacting reads) is important. The benefit of using something like Redis is that it is very easy to get started, very easy to change things around, and great for rapid development. We want to keep that for now, so I am going to focus on a solution based on the same premise.

The first thing to go is the notion that a key can sit anywhere that it wants. In a key/value store, it is important to be able to control locality of reference. We change the key format so it is now: [server key]@[local key]. What does this mean? It means that for the previously mentioned user, this is the format it will be stored as:

{ "uid:123@name": "billy" } 
{ "uid:123@email": "billy@example.org" }
{ "uname@billy": "123" } 

We use the first part of the key (before the @) to find the appropriate server. This means that everything with a prefix of “uid:123” is known to reside on the same server, which allows things like transactions over a single operation that sets multiple keys.
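A minimal sketch of that routing rule (class and method names are mine, purely illustrative): split the key at the @, and hash only the server part to pick a shard, so every key sharing a server part lands on the same server.

```java
// Route "[server key]@[local key]" keys to shards. Illustrative only.
public class KeyRouter {
    // Everything before the '@' is the server key.
    public static String serverKey(String key) {
        int at = key.indexOf('@');
        if (at < 0)
            throw new IllegalArgumentException("key has no server part: " + key);
        return key.substring(0, at);
    }

    // Hash only the server part, so "uid:123@name" and "uid:123@email"
    // always resolve to the same shard and can be written together.
    public static int shardFor(String key, int shardCount) {
        return Math.floorMod(serverKey(key).hashCode(), shardCount);
    }
}
```

Because the local key never participates in the hash, a multi-key write for one user can be a single local operation on one server.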

Once we have that, we can start adding to the API. Instead of getting a single key at a time, we can get a set of values in one remote call. That has the potential of significantly reducing the number of remote calls we will make.
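The client side of such a multi-get might look like this sketch (a hypothetical helper, not a real Redis API): group the requested keys by their server part, so each shard receives one batched call instead of one call per key.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch of a client-side multi-get. Assumes every key is in
// "[server key]@[local key]" form as described above.
public class MultiGet {
    // One resulting group = one batched network call to one shard.
    public static Map<String, List<String>> groupByServer(List<String> keys) {
        Map<String, List<String>> groups = new LinkedHashMap<>();
        for (String key : keys) {
            String server = key.substring(0, key.indexOf('@'));
            groups.computeIfAbsent(server, k -> new ArrayList<>()).add(key);
        }
        return groups;
    }
}
```

For the user example above, fetching name and email plus the name index collapses from three round trips to two, and all of one user's attributes always arrive in a single call.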

Next, we need to consider repeated operations. By that I mean anything where we have a loop in which we call out to the store. That is a killer when you are talking about data of any significant size. We need to find a good solution for this.

Billy suggested sending a JRuby script (or similar) to the server and executing it there, saving the network roundtrips. While this is certainly possible, I think it would be a mistake. I have a much simpler solution: teach the data store about repeated operations. Let us take as a good example the copying of a new twit to all of your followers. Instead of reading the entire list of followers into memory and then writing the status to every single one of them, let us do something different:

Redis.PushToAllListFoundIn("uid:"+user_id+"@followers", status, "{0}@posts");

I am using .NET conventions here because otherwise I would go mad. As you can see, we instruct Redis to go to a particular list and copy the status we pass it to all the keys found in that list (after formatting each key with the pattern). This gives the data store enough information to optimize this operation considerably.
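A sketch of what the data store could execute locally for such a call (this is not a real Redis command; the class, method, and key names are all illustrative): read the source list once in-process, then fan the value out to every destination list, so the network sees one call instead of N+1.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Server-side sketch of the fan-out operation. Everything here runs
// inside the data store process - no network round trips per follower.
public class ServerSideFanout {
    private final Map<String, Deque<String>> lists = new HashMap<>();

    public void lpush(String key, String value) {
        lists.computeIfAbsent(key, k -> new ArrayDeque<>()).addFirst(value);
    }

    public String head(String key) {
        Deque<String> d = lists.get(key);
        return d == null ? null : d.peekFirst();
    }

    // The repeated operation, executed where the data lives:
    // for each member of sourceList, push value onto the list whose
    // key is the pattern with "{0}" replaced by that member.
    public void pushToAllListsFoundIn(String sourceList, String value, String keyPattern) {
        Deque<String> members = lists.getOrDefault(sourceList, new ArrayDeque<>());
        for (String member : members)
            lpush(keyPattern.replace("{0}", member), value);
    }
}
```

The key design point is that the loop moves from the client to the server, which can also choose to parallelize or batch it internally, something the client-side loop in the twit-posting sample can never do.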

With just these few changes, I think that you gain enormously, and you retain the very simple model of using a key/value store.
