Ayende @ Rahien

Refunds available at head office

Designing RavenFS

What is Raven FS? Raven FS is a distributed file system designed to handle large file replication across WAN networks reliably.

What does it actually means? The scenario that we have is actually quite simple. Given that we have a file in location A and we need to have that file in location B (geo distributed) how do we move the file across the WAN? Let me make the problem slightly more interesting:

  • The file is large, we are talking about hundreds of megabytes at the low range and tens of gigabytes at the high end.
  • The two locations might be connected over WAN.
  • The connection is assumed to be flakey.

Let us consider the a few scenarios where this can be useful:

  • I have a set of videos that I would like to be edited in some fashion (say, putting Bang! and Wham! callouts in some places). Since I have zero ability in editing videos, I hire a firm in India to do that for me. The problem is that each video file is large, and just sending the files to India and back is a challenge. (Large file distributed collaboration problem)
  • I have a set of webservers where users can upload images. We need to send those images to background servers for processing, and then they need to be made available to the web servers again. The image sizes are too large to be sent over traditional queuing technologies. (Udi Dahan calls the problem the Data Bus).
  • I have a set of geo-distributed locations where I have a common set of files (think about something like scene information for rendering a game) that needs to be kept in sync. (Distributed file replication).

I have run into each of those problems (and others that fall into similar categories) several times in recent months. Enough to convince me that:

  • There is a need here that people would be willing to pay for.
  • It is something that we can provide a solution for.
  • There is a host of other considerations related to those set of problems that we can also provide a solution for. A simple example might be simple backup procedures.

The actual implementation will probably vary, but this is the initial design for the problem.

A RavenFS node is going to be running as an HTTP Web Server. That removes a lot of complexity from our life, since we can utilize a lot of pre-existing protocols and behaviors. HTTP already supports the notion of partial downloads / parallel uploads, (Range, If-Range, Content-Range), so we can re-use a lot of that.

From an external implementation perspective, RavenFS node exposes the following endpoints:

  • GET /static/path/to/file <- get the file contents, optionally just a range
  • PUT /static/path/to/file <- put file contents, optionally just a range
  • DELETE /static/path/to/file  <- delete the file
  • GET /metadata/path/to/file <- get the metadata about a file
  • GET /browse/path/to/directory <- browse the content of a directory
  • GET /stats <- number of files, current replication efforts, statistics on replication, etc

A file in RavenFS consists of:

  • The file name
  • The file length
  • A sequence of bytes that makes up the file contents
  • A set of key/value properties that contains file metadata

Internally, files are stored in a transactional store. Each file is composed of pages, each page is a maximum of 4 MB in size and is identified by its signature. (Actually, a pair of hash signatures, probably SHA256 & RIPEMD160, to avoid any potential for collision). The file contents are actually the list of pages that it is composed of.

The notion of pages is pretty important for several reasons:

  • It provides us with a standard way to identify pieces of the files.
  • Each page may be part of multiple files.
  • Pages are immutable, once they are written to storage, they cannot be modified (but they can be removed if no file is referencing this page).
  • It makes it easier to chunk data to send while replicating.
  • It drastically reduces the size taken by files that share much of the same information.

Let us try to analyze this further. Let us say that we have a 750MB video, we put this video file inside RavenFS. Internally, that file is chunked into 188 pages, each of them about 4 MB in size. Since we have setup replication to the RavenFS node in India, we start replicating each of those pages as soon as we are done saving it to the local RavenFS node. In other words, even while we are uploading the file to the local RavenFS node, it is being replicated to the remote RavenFS nodes, saving us the need to wait until the full file is loaded for replication to begin. Once the entire file has been replicated to the remote node, the team in India can start editing that file.

They make changes in three different places, then save the file again to RavenFS. In total, they have modified 24 MB, and in total modified 30 pages. That means that for the purpose of replicating back to the local RavenFS node, we need to send only 120 MB, instead of 750 MB.

This reduces both time and bandwidth required to handle replication. The same will happen, by the way, if we have a set of common files that have some common parts, we will not store the information twice. For that matter, the RavenFS client will be able to ask the RavenFS node about pages that are already stored, and so won’t need to even bother uploading pages that are already on the server.

Another important factor in the decision to use pages is that when replicating across unreliable medium, sending large files around in a single chunk is a bad idea, because it is pretty common for the connection to drop, and if you need a prefect connection for the duration of the transfer of a 1.5 GB file, you are going to be in a pretty bad place very soon.

Thoughts?

Comments

Sebastian Burgstaller
05/01/2011 09:26 AM by
Sebastian Burgstaller

Pretty interesting stuff! As you stated, this could also be interesting for (versioning) backup system of privat data. Keep up the good work!

Ken Egozi
05/01/2011 09:39 AM by
Ken Egozi

I'm not a digital video expert, but I'd think that if the Indian company edited the video, then it would be effecting the size of the encoded video of that part, so actually all of the pages that are after the first edit will be different.

Not only that, it is possible that the video envelope contains metadata in the beginning of the file (for fast-start, error corrections, what-not) and in that case, any change will require the whole edited file to be re-sent.

Apart from that - this is a very cool initiative and I'd be keeping a tab on the implementation.

Eugene Kirpichov
05/01/2011 09:41 AM by
Eugene Kirpichov

I'm not sure the "files with common parts" thing will work - the files might have common content, but it seems less likely that this content will be perfectly aligned (except in the beginning), so that chunking into 4mb or even 4kb pages will detect this commonality.

Ken Egozi
05/01/2011 09:52 AM by
Ken Egozi

nothing beats experience so if id did work well for a similar case then I guess my concern is void.

Ayende Rahien
05/01/2011 09:55 AM by
Ayende Rahien

Ken & Eugene,

You need to actually consider the way you are working with very large files.

You can't easily move things in a file. Considering a file of 100 MB in size.

If you want to insert a single byte at position 2, you are going to have to copy all of the bytes after that. Which, assuming your buffer size if 4096 bytes, would take over 25 K operations!

For small files, you typically don't even notice that, since you are going to be writing the entire file from scratch.

But if you are working with very large files, you really care about that. When working with large binaries, most software would either:

  • Modify a region of space which is fixed sized.

  • Append to the end of the file.

In either case, this works out pretty well for my purposes.

Rik Hemsley
05/01/2011 10:21 AM by
Rik Hemsley

You just invented rsync. Of course, rsync doesn't work too well on Windows - and it's not the easiest thing to work with from within .NET, so I think you did the right thing.

Did you look at how rsync works? I'd be interested to know if you used similar algorithms.

Ayende Rahien
05/01/2011 10:36 AM by
Ayende Rahien

Rik,

I am aware of rsync, sure.

Actually reading up on it now to see if I can get it working better :-)

There are some additional benefits that I see (the notion of multi master, for example), but the basic idea is the same, yes.

The implementation would be nicer, at least I hope so :-)

Ayende Rahien
05/01/2011 10:37 AM by
Ayende Rahien

Joshka,

Single byte transmissions would merely necessitate re-sending the page, that is acceptable.

The actual problem is what happen if I add a single byte at the beginning of the file.

Joshka
05/01/2011 10:50 AM by
Joshka

Sorry, I meant single byte additions. If I'm not mistaken, the rolling checksum algorithm used by rsync handles this.

Your proposed algorithm matches Tridge's first attempt at an algorithm. See p51 of his thesis paper on the rsync algorithm at http://samba.org/~tridge/phd_thesis.pdf

Ayende Rahien
05/01/2011 10:56 AM by
Ayende Rahien

Joshka,

That is for a reason. The Rsync protocol requires the destination to read the entire file in order to compute the differences.

That is going to be unreasonably costly with very large files, the very reason that RavenFS exists.

configurator
05/01/2011 10:56 AM by
configurator

From what I've seen lots (not all or even most, but lots) of software loads the entire file even when it's large. What you should do is make each page have its own size which is limited by 4MB. Then when a byte (or a kilobyte) is added, you add another, smaller, page.

Even if you use hashes, collisions are still possible; using a generated id as well as the hash will fix the bug which you'll only ever run into once but will waste two weeks of your life because you wouldn't be able to reproduce it...

Your file transfer methods sound suspiciously like BitTorrent. Any chance you could use that to ease your work?

Ayende Rahien
05/01/2011 11:00 AM by
Ayende Rahien

Configurator,

It doesn't really happen for the large files. I am talking about hundreds of MB as a low point.

Considering the fact that most file structures have been around for a while, trying to load a 500 MB file to memory would usually force system paging and unacceptable performance in most systems even as little as a 7 years ago.

For large files, there are special considerations in place that takes care of that.

I don't understand the notion of separate size for each page, can you explain?

As for collisions, that is why I specified the usage of two separate hash functions. I accept the potential for collisions in one hash function, but the chance of identical collisions in separate hash functions is... remote.

I considered using BitTorrent, but right now I am probably going to just implement this myself.

Duarte Nunes
05/01/2011 11:24 AM by
Duarte Nunes

You have that problem every time you add data, right? If you add some bytes to the middle of a full page, you have to get btree-y and create two new pages (or one and invalidate some of the contents of the original). I guess you have to account for partially filled pages and store its actual size, zeroing the rest (on most file systems, zeroed blocks don't take up actual physical blocks).

Ori
05/01/2011 11:27 AM by
Ori

This idea would be perfect for todays online games, dont you think? sometimes updates reach into the GB's, which can be a real issue for people not living in places with exta fast connections.

Itay
05/01/2011 11:41 AM by
Itay

Another thing to consider, relating to common pages: If you have a page that is common to 2 files. Now you change one of the files, which result in a change in the common page. You can't just change the common page (because that would result in the 2nd file changing as well) - you have to create a new page, and point the modified file to the new page, instead of using the common page.

Things might be more complex, if you then change the file back to it's original page - you now have 2 similar pages, instead of one common page.

tobi
05/01/2011 11:48 AM by
tobi

This is a great idea because it fits nicely into your brand: Reliable ans simple infrastructure for real applications.

Deduplication: I would go with smaller pages because the overhead with say 512kb pages will still be nill.

Hash: There is no chance of collision to worry about (need 2^80 pages for SHA160 to habe 50% chance). Also malicious attacks are currently out of reach.

Deduplication 2: There is a very fast yet very effective deduplication app called http://www.exdupe.com/, tested it with huge SQL Server backups. Maybe you can get the algorithm and improve on a simple one.

Ayende Rahien
05/01/2011 11:53 AM by
Ayende Rahien

Duarte,

You would generally not write directly into RavenFS, we are usually talking about upload processes, not in-place edits.

Ayende Rahien
05/01/2011 11:54 AM by
Ayende Rahien

Itay,

Pages are immutable once created, if there are no references to them, they 'll be deleted.

Ayende Rahien
05/01/2011 11:58 AM by
Ayende Rahien

Tobi,

Thanks.

It is probably going to take a few tries to figure out the optimal page size.

With collisions, I agree that the chance is minimal, by it is also an issue of marketing rather than just technical perspective.

As for exdupe, they seem to be commercial, and as such I can't really look too deeply into what they are doing.

Ryan Heath
05/01/2011 12:04 PM by
Ryan Heath

Since you store the filecontent inside ravenfs how are you going to give access to the files to the video editing software, transparently?

Do you need to write a (Windows)FS driver?

// Ryan

John Simons
05/01/2011 12:06 PM by
John Simons

This sounds a lot like DFSR in Windows!

Ayende Rahien
05/01/2011 12:08 PM by
Ayende Rahien

Ryan,

No, usually people download the file locally (fast, since they have a node nearby).

RavenFS is usually meant to be used alongside with content management systems or as a read / write only store, not as the actual edit store.

Ryan Heath
05/01/2011 12:24 PM by
Ryan Heath

Hmm, so after editing has finished the user needs to upload the file to ravenfs again by himself?

It would be rather nice to have that done transparently for the user.

But I can also imagine that could increase bandwidth since each and every update would be replicated ...

// Ryan

Ayende Rahien
05/01/2011 12:28 PM by
Ayende Rahien

Ryan,

Consider the case of editing a video file.

I download that from the local RavenFS system. Then I start working on it. Assume that we had transparent saves to RavenFS.

When would we upload? Every time the the file changes? That might cause us to catch the file while it is being updated.

Whenever the user saves? We have no real way of telling when the user have actually saved the file or when the video editing program is modifying it for whatever purposes it has.

It is the same reason why you have a separate Commit notion in source control.

Ayende Rahien
05/01/2011 12:36 PM by
Ayende Rahien

Jan,

This isn't a mistake. It is a valid way to save space. Please note that RavenFS has its own internal security measures, but it is meant as a shared storage for a single organization, not as a location for storing information for different users, where they shouldn't be able to share anything between them.

Huberto Kusters
05/01/2011 12:38 PM by
Huberto Kusters

Ayende,

It would be great if there were a client-part that would make RavenFS accessible via the Explorer (in Windows). To implement this feature, you could take a look at http://dokan-dev.net/en/about/

Ayende Rahien
05/01/2011 12:40 PM by
Ayende Rahien

Huberto,

That is very interesting, and I'll look into that in the future, thanks.

Uriel Katz
05/01/2011 01:12 PM by
Uriel Katz

I really like the idea of replication during upload instead of after upload.

Do you really need the option to alter files(i.e. write in some range of the file) or just upload a new version that might include similar data?(then you can do COW for the common pages).

it sounds very similar to S3/cloudfiles + metadata for files.

I did just that in Binfire.com,you upload to us(a python gevent daemon handles the upload) and we upload it to cloudfiles as we get the data from the users.

Ayende Rahien
05/01/2011 01:17 PM by
Ayende Rahien

Uriel,

Pages are immutable, if you want to update a page, you need to send a new version of it.

That makes keeping versions around very cheap.

Ryan Heath
05/01/2011 01:27 PM by
Ryan Heath

@ayende

Now you've mentioned 'commit' and Huberto's suggestion I think you may want a UI which shows which files are the 'latest version' or need to be 'committed'.

Is that an idea?

// Ryan

configurator
05/01/2011 01:32 PM by
configurator

Ayende, for different page sizes consider this example.

(Maximal) page size = 4.

File on server: [1,2,3,5], [6,7,8,9], [10,11,12,13]

We add a byte, 4.

File on server will now be: [1,2,3], [4,5], [6,7,8,9], [10,11,12,13]

Note that only the changed page needs to be sent - the last two pages are untouched. However, the pages need to be able to have different sizes.

Regarding your data-in-big-files-doesn't-move hypothesis: I've used video editing software more than once where a change (e.g. cut an ad out of the middle/beginning) caused the entire file to move by an odd (i.e. strange) number of bytes, although there were very few changes inside the file (I checked with a binary diff). What happened was the change was quick, but saving the file was excruciating...

Jeff Lewis
05/01/2011 02:27 PM by
Jeff Lewis

I've been looking for something like this for some time. I've said that what I want is S3 that I can host locally. My requirements are slightly different though, so I'll list them in hopes that some might get included:

-most of the files I deal with are less that a Meg, but can be much larger depending on the attachments.

-must allow immediate consistency. Reads often immediately follow writes and must be up to date. When unable to sync between repositories, consistency between repositories can/must be relaxed so that system remains active

-multiple copies: I'd like to have 2 copies of files locally and one, or more, thousands of miles away

-no single point of failure: if local repos is down, writes and reads go to remote repositories

-catch-up: if the connection to remote repositories goes down, when restored, changes should be replicated

Ayende Rahien
05/01/2011 02:43 PM by
Ayende Rahien

Configurator,

That would actually be pretty hard to do. How do you detect this change?

How do you manage the next change?

As I said, it is actually a lot of work to do, and would result in pages that are 1 byte long eventually.

As for video editing software, that really depend on the software and the format. As you noted, saves are painful, so most software moves to a way to append mode / fixed size.

Ayende Rahien
05/01/2011 02:46 PM by
Ayende Rahien

Jeff,

1) That would be handled, sure. We are thinking about large files, but small ones would be supported.

2) I assume you meant writes followed by reads, in which case, the local RavenFS is fully transactional.

3) Replication is more an issue of configuration & reliability, than anything else. Assuming that you have configured RavenFS to replicate to a remote node, it will do so as long as it is able.

4) Failover to a remote repository is possible, I guess. We will probably not handle that by default, because the cost of accessing very large files remotely can be very big. But I'll make sure to abstract that to a strategy that you can override.

5) Catch up - already handled.

Thomas Krause
05/01/2011 03:11 PM by
Thomas Krause

I would suggest looking at BitTorrent a bit more as a protocol option...

It already does a lot of what you want to achieve uses http and it supports multiple peers synchronizing with each other very efficiently.

There is certainly a business model packaging the protocol into an application which is more suitable to synchronize servers etc.

joshka
05/01/2011 05:29 PM by
joshka

Ayende, the rsync paper details algorithms to calculate the best block size. It doesn't require a full file read at destination on every upload, only the blocks where there is a change. Calculating the checksums can be done once, either client or server side. It also uses two "hashes" - md5 and a rolling checksum and handles the 1 byte change problem. In effect the Delta of that 1 byte addition situation is a transfered as that byte and 25k acks. Your method is 1byte plus 100mb xfer. The paper is worth reading regardless of whether you use the same method. It also mentions potential HTTP implementations and related tools e.g rsdiff

Ayende Rahien
05/01/2011 05:43 PM by
Ayende Rahien

Joshka,

In order to calculate the rolling hash, you have to read the entire file.

If you want to cache the rolling checksum, you would have to store a 32 bits for every bytes, which is.. unadvisable.

Without doing this on a byte boundary, you are vulnerable to missing the "one byte changed at beginning of file".

I read the rsync papers, they are facinating, but they also detail something with very high cost to solve the general problem in very large files.

Show me a solution for a single byte addition in a 500 MB file. I admit that I find the rsync approach both insightful and brilliant, but I don't really see a way to make it work for my case.

I would be very happy to be proven wrong, mistaken and stupid.

Markus
05/01/2011 09:43 PM by
Markus

Ayende,

Have you evaluated SynchronEX? I haven't looked at what kind of algorithm it uses, but wanted to mention it just in case you haven't yet stumbled upon it. http://www.xellsoft.com/SynchronEX.html

Thanks for a great blog!

joshka
05/01/2011 10:39 PM by
joshka

Re caching, I believe it's only necessary for the receiver to store the checksum per block, not per byte. The cache of those bytes only becomes invalid when byte insertion occurs.

joshka
05/02/2011 02:09 AM by
joshka

... becomes invalid for al blocks after the insertion point...

I don't understand the point about having to read the entire file in this context. In your original proposed solution, an insertion would require a full recalculation of all future block hashes. A store operation would require transmission of all blocks and hence the effective read of the entire file.

Ayende Rahien
05/02/2011 06:17 AM by
Ayende Rahien

Joshka,

No, because you need to do a per byte rolling hash to check for the insertion case.

Ayende Rahien
05/02/2011 06:18 AM by
Ayende Rahien

Joshka,

The difference is where this is happening, client side / server side.

I am actually thinking of doing a reverse mode, where we ask the server for what it has, then do the change calculation on the client side.

Still thinking on this.

Joshka
05/02/2011 07:58 AM by
Joshka

Here's a simplified example with a block size of 4 (obviously the overhead of the checksums is way too high).

receiver: ABCD EFGH IJKL

Example receiver rolling block checksums:

sum(ABCD)=11 -> Receiver sends 'sum(block 1) is 11'

sum(EFGH)=22 -> Receiver sends 'sum(block 2) is 22'

sum(IJKL)=33 -> Receiver sends 'sum(block 3) is 33'

sender: 0ABCDEFGHIJKL

sender rolling byte checksums:

sum(0ABC) = 00 -> sender sends '0'

sum(ABCD) = 11 -> sender sends ' block 1 match found'

sum(EFGH) = 22 -> sender sends ' block 2 match found'

sum(IJKL) = 33 -> sender sends ' block 3 match found'

This demonstrates that the receiver only needs to calculate / store / know the block running checksum, not a byte running checksum.

The checksums can be calculated during upload time - i.e. a point when you'll have a copy of the file streaming through your machines memory on its way to persistent storage. You'd need to do this anyway for your proposal.

Ayende Rahien
05/02/2011 08:06 AM by
Ayende Rahien

Joshka,

Yes, that is why I referred to this as reversed rsync, since this is exact opposite of how it works.

In rsync, it is the reciever who does all of the hashing and matching

Joshka
05/02/2011 08:22 AM by
Joshka

Ayende,

I'm missing something here. What I wrote above is a pictorial example of the algorithm on p52. In the article, 'A' is the sender and 'B' is the receiver.

Ayende Rahien
05/02/2011 08:25 AM by
Ayende Rahien

Joshka,

Then I might have got it wrong, because my understanding was that the sender sent the hashes per each fixed output, and the reciever than sent back the matches.

Jimmy Shimizu
05/02/2011 11:52 AM by
Jimmy Shimizu

Interesting project. I have been looking into similiar systems myself and just wanted to tip you about http://www.xtreemfs.org/ which might incorporate some of the features you are aiming for.

Jason Hurdlow
05/02/2011 11:32 PM by
Jason Hurdlow

Video files are actually a bad example. When editing video, you don't typically edit the original video files at all, but rather build a set of edits & effects (saved in a scene file of some sort) that are then rendered to a new file. So once you had a copy of the large video files they could be read-only without issue. No video editor in their right mind would overwrite or modify their original footage files. The rendered output might change though, and that could be applicable here.

Francisco A. Lozano
05/02/2011 11:39 PM by
Francisco A. Lozano

What about random access? both read/write, either for modifications or for appends... I've been looking for a solution to this without much success

Ayende Rahien
05/03/2011 04:47 AM by
Ayende Rahien

Jason,

That is pretty much my point.

Ayende Rahien
05/03/2011 04:47 AM by
Ayende Rahien

Francisco,

Random access will be supported., yes.

jalchr
05/04/2011 08:39 AM by
jalchr

I like the idea, but I do think it is already implemented in peer-to-peer applications like emule and bittorrent ...

http://sourceforge.net/projects/emule/

They utilize http and they handle GBs of files ... in very unreliable medium of communication.

I'm sure you can make it better ...

Ayende Rahien
05/04/2011 08:57 AM by
Ayende Rahien

Jalchr,

I am looking at those as well, but the way we are expecting to do things are quite different.

For example, eMule / BitTorrent assumes that the file doesn't exists in the other end, while I assume that it most probably does

jalchr
05/04/2011 08:58 AM by
jalchr

MonoTorrent (C#)

A cross platform open source .NET Framework based BitTorrent Client written in C#. MonoTorrent is a cross platform and open source implementation of the BitTorrent protocol. It supports many advanced features such as Encryption, DHT, Peer Exchange, Web Seeding and Magnet Links. Frontends: Curses TUI, Gtk GUI, WinForms GUI

Homepage: projects.qnetp.net/projects/show/monotorrent
Source: anonsvn.mono-project.com/viewvc/trunk/bitsharp/
WinForms GUI homepage: http://code.google.com/p/monotorrent/
Blog: http://monotorrent.blogspot.com/

Monsoon (C#)

Monsoon Project is a GTK+ BitTorrent client based on C# and MonoTorrent.

Aaron Olson
05/04/2011 12:57 PM by
Aaron Olson

What kind of streaming scenarios will RavenFS support? In particular, could one client be streaming a file into RavenFS while another client reads the same file simultaneously?

Thanks!

Ayende Rahien
05/04/2011 01:13 PM by
Ayende Rahien

Aaron,

Yes, that will be supported.

Torvin
05/05/2011 04:25 PM by
Torvin

I personally think

GET /path/to/file/metadata

is better than

GET /metadata/path/to/file

because '/path/to/file' can't be both a file and a folder on most filesystems, so we won't confuse 'metadata' with a file. And it looks nicer that way :3

Comments have been closed on this topic.