Ayende @ Rahien

Oren Eini, aka Ayende Rahien, is the CEO of Hibernating Rhinos LTD, which develops RavenDB, a NoSQL Open Source Document Database.

time to read 5 min | 845 words

Currently sitting in the Go language keynote in JAOO.

Go was created to handle Google's needs:

  • Compilation speed
  • Efficiency
  • A good fit for the current set of challenges Google faces

Go is meant to be Simple, Orthogonal, Succinct. It uses C syntax, but leaves a lot of complexity behind.

This makes sense, because C/C++ is horrible for a lot of reasons, mostly because of the complexity added to it over the years. Actually, the major problem is that the C/C++ mindset was literally designed for a different era. Take into account how long it can take to build a C++ application (I have seen not-too-big apps that had half-hour build times!), and you can figure out why Google wanted a better model.

Nice quotes:

  • Programming shouldn’t be a game of Simon Says
  • Difference between seat belts & training wheels – I am going to use that a lot.

I find Go’s types annoying:

  • [variable name] [variable type] vs. [variable type] [variable name]
  • And []int vs. int[]

I like the fact that it uses garbage collection in native code, especially since if you reference a variable (including a local one), it will live for as long as there is a reference to it. Getting that wrong is a common mistake in C.
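To illustrate (a minimal sketch of mine, not from the keynote): in Go it is safe to return a pointer to a local variable, where the equivalent C code would hand back a dangling pointer.

```go
package main

import "fmt"

type user struct {
	name string
}

// newUser returns a pointer to a local variable. In C this would be
// a dangling pointer; in Go the garbage collector keeps the value
// alive for as long as anything references it.
func newUser(name string) *user {
	u := user{name: name}
	return &u
}

func main() {
	u := newUser("ayende")
	fmt.Println(u.name)
}
```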

It also has the notion of deferring code. This is similar to how you can use RAII in C++, but much more explicit, which is good, because RAII is always tricky. It deals very cleanly with the need to dispose of things.

All methods look like extension methods. Data is held in structures, not in classes. There is support for embedding structures inside one another, but Go isn’t an OO language.
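For example (my own sketch): methods are declared outside the struct with an explicit receiver, which is why they look like extension methods, and an embedded struct’s methods are promoted to the outer one.

```go
package main

import "fmt"

type Point struct{ X, Y int }

// Methods are declared outside the struct, with an explicit
// receiver - much like an extension method.
func (p Point) SquaredDist() int { return p.X*p.X + p.Y*p.Y }

// Embedding: NamedPoint gets Point's fields and methods
// without inheritance.
type NamedPoint struct {
	Point
	Name string
}

func main() {
	np := NamedPoint{Point{3, 4}, "origin-ish"}
	fmt.Println(np.SquaredDist()) // 25 - promoted from the embedded struct
}
```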

Interfaces work for all types, and interfaces are satisfied implicitly! Much easier to work with; think about them like super dynamic interfaces, but strongly typed. This implies that we can retrofit things afterward.
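A small sketch of mine showing the implicit satisfaction:

```go
package main

import "fmt"

// Stringer can be defined after the fact - any type with a matching
// String() method satisfies it, with no "implements" declaration.
type Stringer interface {
	String() string
}

type user struct{ name string }

func (u user) String() string { return "user: " + u.name }

func describe(s Stringer) string { return s.String() }

func main() {
	fmt.Println(describe(user{name: "ayende"})) // user: ayende
}
```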

Error handling is important, and one of the major reasons that I moved from C++ to .NET. Go has two modes: the first is error codes, which they use for most errors. This is especially nice since Go has multiple return values, so it is very simple to use. But it also has the notion of panics, which are similar to exceptions. Let us take a look at the following test code:

package main

func test() {
	panic("ayende")
}

func main() {
	test()
}

Which generates the following panic:

panic: ayende

panic PC=0x2ab34dd47040
runtime.panic+0xb2 /sandbox/go/src/pkg/runtime/proc.c:1020
    runtime.panic(0x2ab300000000, 0x2ab34dd470a0)
main.test+0x47 /tmp/gosandbox-a9aaff6c_68e6b411_26fb5255_aa397bb1_6299a954/prog.go:5
    main.test()
main.main+0x18 /tmp/gosandbox-a9aaff6c_68e6b411_26fb5255_aa397bb1_6299a954/prog.go:8
    main.main()
mainstart+0xf /sandbox/go/src/pkg/runtime/amd64/asm.s:78
    mainstart()
goexit /sandbox/go/src/pkg/runtime/proc.c:145
    goexit()

I got my stack trace, so everything is good. Error handling is trickier. There isn’t a notion of try/catch, because panics aren’t really exceptions, but you can recover from panics:

package main

import (
	"fmt"
)

func badCall() {
	panic("ayende")
}

func test() {
	defer func() {
		if e := recover(); e != nil {
			fmt.Printf("Panicing %s\r\n", e)
		}
	}()
	badCall()
	fmt.Printf("After bad call\r\n")
}

func main() {
	fmt.Printf("Calling test\r\n")
	test()
	fmt.Printf("Test completed\r\n")
}

Which will result in the following output:

Calling test
Panicing ayende
Test completed

All in all, if I need to write native code, I’ll probably go with Go instead of C now, especially since its ideas about multithreading are so compelling. I just wish that the Windows port would be completed soon.
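For the record, that multithreading story is built on goroutines and channels; a minimal sketch (mine, not from the keynote):

```go
package main

import "fmt"

// Goroutines communicate over channels instead of sharing
// memory under locks.
func sum(nums []int, out chan<- int) {
	total := 0
	for _, n := range nums {
		total += n
	}
	out <- total
}

func main() {
	out := make(chan int)
	go sum([]int{1, 2, 3}, out)
	go sum([]int{4, 5, 6}, out)
	// Receives block until each goroutine sends its result.
	fmt.Println(<-out + <-out) // 21, regardless of which finishes first
}
```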

time to read 6 min | 1099 words

As I already mentioned, this presentation had me thinking. Billy presented a system called Redis, which is a key/value store intended for attribute-based storage.

That means that storing something like User { Id = 123, Name = “billy”, Email = “billy@example.org”} is actually stored as:

{ "uid:123:name": "billy" } 
{ "uid:123:email": "billy@example.org" }
{ "uname:billy": "123" } 

Each of those lines represents a different key/value pair in the Redis store. According to Billy, this has a lot of implications. On the advantage side, you get no schema and very easy support for just adding stuff as you go along. On the other hand, Redis does not support transactions, and it is easy to “corrupt” the database during development (usually as a result of a programming bug).

What actually bothered me the most was the implications on the number of remote calls that are being made. The problem shows itself very well in this code sample (a Twitter clone), which shows inserting a new tweet:

long postid = R.str_long.incr("nextPostId");
long userId = PageUtils.getUserID(request);
long time = System.currentTimeMillis();
String post = Long.toString(userId) + "|" + Long.toString(time) + "|" + status;
R.c_str_str.set("p:" + Long.toString(postid), post);
List<Long> followersList = R.str_long.smembers(Long.toString(userId) + ":followers");
if (followersList == null)
   followersList = new ArrayList<Long>();
HashSet<Long> followerSet = new HashSet<Long>(followersList);
followerSet.add(userId);
long replyId = PageUtils.isReply(status);
if (replyId != -1)
   followerSet.add(replyId);
for (long i : followerSet)
    R.str_long.lpush(Long.toString(i) + ":posts", postid);
// -1 uid is global timeline
String globalKey = Long.toString(-1) + ":posts";
R.str_long.lpush(globalKey, postid);
R.str_long.ltrim(globalKey, 200);

I really don’t like the API, mostly because it reminds me of C, but the conventions are pretty easy to figure out.

  • R is the static gateway into the Redis API
  • str_long – store long
  • c_str_str – store string and keep it in a nearby cache

The problem with this type of code is the number of remote calls and the locality of those calls. With a typical sharded set of servers, you are going to have lots of calls going all over the place. And when you get to people that have thousands or millions of followers, the application is simply going to die.

A better solution is required. Billy suggested using async programming or sending code to the data store to execute there.

I have a different variant on the solution.

We will start from the assumption that we really want to reduce remote calls, and that the system’s performance in the face of a large number of writes (without impacting reads) is important. The benefit of using something like Redis is that it is very easy to get started with, very easy to change things around, and great for rapid development. We want to keep that for now, so I am going to focus on a solution based on the same premise.

The first thing to go is the notion that a key can sit anywhere that it wants. In a key/value store, it is important to be able to control locality of reference. We change the key format so it is now: [server key]@[local key]. What does this mean? It means that for the previously mentioned user, this is the format it will be stored as:

{ "uid:123@name": "billy" } 
{ "uid:123@email": "billy@example.org" }
{ "uname@billy": "123" } 

We use the first part of the key (before the @) to find the appropriate server. This means that everything with a prefix of “uid:123” is known to reside on the same server. This allows you to do things like transactions on a single operation of setting multiple keys.
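As a sketch (my own illustration; the hash function and server count are arbitrary choices, not part of the proposal), routing by the server key could look like:

```go
package main

import (
	"fmt"
	"hash/fnv"
	"strings"
)

// serverFor routes a key by hashing only the part before the '@',
// so all keys sharing a server-key prefix land on the same server.
func serverFor(key string, servers int) int {
	serverKey := key
	if i := strings.Index(key, "@"); i >= 0 {
		serverKey = key[:i]
	}
	h := fnv.New32a()
	h.Write([]byte(serverKey))
	return int(h.Sum32()) % servers
}

func main() {
	same := serverFor("uid:123@name", 4) == serverFor("uid:123@email", 4)
	fmt.Println(same) // true: same prefix, same server
}
```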

Once we have that, we can start adding to the API. Instead of getting a single key at a time, we can get a set of values in one remote call. That has the potential of significantly reducing the number of remote calls we will make.

Next, we need to consider repeated operations. By that I mean anything where we have a loop in which we call the store. That is a killer when you are talking about any data of significant size. We need to find a good solution for this.

Billy suggested sending a JRuby script to the server (or similar) and executing it there, saving the network roundtrips. While this is certainly possible, I think it would be a mistake. I have a much simpler solution: teach the data store about repeated operations. Let us take as a good example the copying that we are doing of a new tweet to all your followers. Instead of reading the entire list of followers into memory, and then writing the status to every single one of them, let us do something different:

Redis.PushToAllListFoundIn("uid:"+user_id+"@followers", status, "{0}@posts");

I am using the .NET conventions here because otherwise I would go mad. As you can see, we instruct Redis to go to a particular list and copy the status that we pass it to all the keys found in the list (after formatting the key with the pattern). This gives the data store enough information to be able to optimize this operation considerably.
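To make the semantics concrete, here is a hypothetical in-memory sketch of mine; the store type and method name mirror the proposal above and are not a real Redis API:

```go
package main

import (
	"fmt"
	"strings"
)

// store is a toy stand-in for the data store, holding named lists.
type store struct {
	lists map[string][]string
}

// PushToAllListFoundIn walks the members of one list and pushes the
// value onto a list whose key is derived from the pattern - one
// server-side pass instead of N remote calls from the client.
func (s *store) PushToAllListFoundIn(listKey, value, pattern string) {
	for _, member := range s.lists[listKey] {
		target := strings.Replace(pattern, "{0}", member, 1)
		s.lists[target] = append(s.lists[target], value)
	}
}

func main() {
	s := &store{lists: map[string][]string{
		"uid:123@followers": {"uid:456", "uid:789"},
	}}
	s.PushToAllListFoundIn("uid:123@followers", "new status", "{0}@posts")
	fmt.Println(s.lists["uid:456@posts"]) // [new status]
}
```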

With just these few changes, I think that you gain enormously, and you retain the very simple model of using a key/value store.

JAOO: OR/M += 2

time to read 1 min | 69 words

Just finished doing this presentation. I think it went very well, although I planned to do a 45-minute session + 15 minutes of questions, but I ended up hitting the session time limit without covering everything that I wanted.

You can get the source code that I have shown in the presentation here: http://github.com/ayende/Advanced.NHibernate

You can find the PDF of the presentation here: http://ayende.com/presentations.aspx

time to read 2 min | 376 words

Billy Newport is talking about Redis, showing some of the special APIs that Redis offers.

  • Redis gives us first-class List/Set operations, simplifying many tasks involving collections. It is easy to get into big problems afterward.
  • Can do 100,000 operations per second.
  • Redis encourages a column-oriented view; you use things like:
R.set("user:123@firstname", "billy")
R.set("user:123@surname", "newport")
R.set("uid:newport", 123)

Ayende’s comment: I really don’t like that. No transactions or consistency, and this requires lots of remote calls. 

  • Bugs in your code can corrupt the entire data store, causing severe issues during development.
  • There is a sample Twitter-like implementation, and the code is pretty interesting; it is a work-on-write implementation.
  • List/set operations are a problem. What happens when you have a big set? Case in point: Ashton has 4 million followers, and work-on-write doesn’t work in this case.
  • 100,000 operations per second doesn’t mean much when a routine scenario results in millions of operations.
  • This is basically the usual SELECT N+1 issue.
  • An async approach is required, processing large operations in chunks.
  • Changing the way we work: instead of getting the data and working on it, send the code to the data store and execute it there (execute near the data).
    • Ayende’s note: That is still dangerous. What happens if you send a piece of code to the data store and it hangs?
  • Usual problems with column-oriented stores: no reports, need export tools.
  • Maybe use closures as a way to send the code to the server?

Ayende’s thoughts:

I need to think about this a bit more, I have some ideas based on this presentation that I would really like to explore more.

time to read 3 min | 425 words

I have a tremendous amount of respect for Michael Feathers, so it is a no-brainer to see his presentation.

Michael is talking about why global variables are not evil. We already have global state in the application; removing it is bad/impossible. Avoiding global variables leads to very deep argument passing chains, where something needs an object and it is passed through dozens of objects that just pass it down. We already have notions of how to test systems using globals (singletons). He also talks about Repository Hubs & Factory Hubs, which provide the scope for the usage of a global variable.

  • Refactor toward explicit seams, do not rely on accidental seams, make them explicit.
  • Test Setup == Coupling, excessive setup == excessive coupling.
  • Slow tests indicate insufficient granularity of coupling <- I am not sure that I agree with that; see my previous posts about testing for why.
  • It is often easier to mock outward interfaces than inward interfaces (try to avoid mocking stuff that return data)
  • One of the hardest things in legacy code is making a change and not knowing what it is affecting. Functional programming makes it easier, because of immutability.
  • Seams in functional languages are harder. You parameterize functions in order to get those seams.
  • TUF – Test Unfriendly Feature – IO, database, long computation
  • TUC – Test Unfriendly Construct – static method, ctor, singleton
  • Never Hide a TUF within a TUC
  • No Lie principle – Code should never lie to you. Ways that code can lie:
    • Dynamically replacing code in the source
    • Addition isn’t a problem
    • System behavior should be “what I see in the code + something else”, never “what I see minus something else”
    • Weaving & aspects
    • Impact on inheritance
  • The Fallacy of Restricted Languages
  • You want to rewrite if the architecture itself is bad; if you have issues making changes rapidly, it is time to refactor the rough edges out.
