Ayende @ Rahien

Refunds available at head office

Career planning: Mine

I got some really good questions about my career, which caused me to reflect back and snort at my own attempts to make sense of what happened.

Here is a rough timeline:

  • 1999 – Started actually considering myself a professional developer. This is the start of a great one year course I took to learn C++, right out of high school.
  • 2001 – Joined the army, was sent to the Military Police, and spent 4 years in prison. Roles ranged from a prison guard, XO of big prison, teacher in officer training course and concluded with about a year as a small prison commander.
  • 2004 – Opened my blog and started writing about the kind of stuff that I was doing; first version of Rhino Mocks.
  • 2005 – Joined the Castle committers’ team, left the army, joined We!, worked as a consultant.
  • 2006 – My first international conference – DevTeach.
  • 2008 – Left We!, started working as an independent consultant.
  • 2009 – NHibernate Profiler beta release.
  • 2010 – DSLs in Boo book is published, Entity Framework Profiler, Linq to SQL Profiler, RavenDB.
  • 2011 – Hiring of first full employee.
  • 2014 – Writing this blog post.

A lot of my history doesn’t make sense without a deeper understanding. In the army, there was a long time in which I wasn’t actually able to do anything with computers. That meant that on vacations, I would do stuff voraciously. At that time, I had already read a lot of university level books (the dinosaurs book, Tanenbaum’s book, TCP/IP, the DDD book and a lot of other stuff like that). At some point, I got an actual office and had some free time, so I could play with things. I wrote code, a lot. Nothing that was actually very interesting. It was anything from mouse tracking software (that I never actually used) to writing a custom application to track medical data for inmates. Mostly I played around and learned how to do stuff.

When I got to be a prison commander, I also got myself a laptop, and started doing more serious stuff. I wasn’t able to afford any professional software, so it was mostly Open Source work. I started working on NHibernate Query Analyzer, just to see what I could do with it. That taught me a lot about reflection and NHibernate itself. I then got frustrated with the state of mocking tools in the .NET framework, and I decided to write my own. Around that time, I also started to blog.

What eventually became Rhino Mocks was a pretty big thing. Still one of the best pieces of software that I have written, it required a deep understanding of a lot of things. From IL generation to how classes are composed by the runtime to AppDomains to pretty much everything.

Looking back, Rhino Mocks was pretty huge in terms of pushing my career. It was very widely used, and it got me a lot of attention. After that, I was using NHibernate and talking about that a lot, so I got a lot of reputation points in that arena as well. But the first thing that I did after starting as an independent consultant was to actually work on SvnBridge. A component that would allow an SVN client to talk to a Team Foundation Server. That was something that I had very little experience with, but I think that I did a pretty good job there.

Following that, I mostly did consulting and training on NHibernate. I was pretty busy. So busy that at some point I had a six week business trip that took me to five countries and three continents. I came back home and didn’t leave my bed for about a week. For two weeks following that, I would feel physically ill if I sat in front of the computer for more than a few minutes.

That was a pretty big wakeup call for me. I knew that I had to do something about it. That is when I actually sat down and thought about what I wanted to do. I knew that I wanted to stay in development, and that I couldn’t continue being a road warrior without burning out again. I decided that my route would be to continue to do consulting, but on a much reduced frequency, and to start focusing on creating products. Stuff that I could work on while at home, and hopefully get paid for. That is how the NHibernate Profiler was born.

From there, it was a matter of working more on that and continuing to other areas, such as Entity Framework, Linq to SQL, etc. RavenDB came because I got tired of fixing the same old issues over and over again, even with the profilers to help me. And that actually had a business plan, we were going to invest so much money and time to get it out, and it far exceeded our expectations.

Looking back, there were several points that were of great influence. Writing my blog. Writing Rhino Mocks, joining open source projects such as Boo or Castle. Working and blogging with NHibernate. Going to conferences and speaking there. All of those gave me a lot of experience and got me out there, building reputation and getting to know people.

That was very helpful down the road, when I was looking for consultancy jobs, or doing training. Or when the time came to actually market our software.

In terms of the future, Hibernating Rhinos is growing at a modest rate. Which is the goal. I don’t want to double every six months, that is very disturbing to one’s peace of mind. I would much rather have a slow & steady approach. We are working on cool software, we are going home early and for the most part, there isn’t a “sky is falling” culture. The idea is that this is going to be a place that you can spend a decade or four in. I’ll get back to you on that when I retire.

Career planning: Disaster recovery

One of the more important things that you have to remember is that you should always be ready for failure. As developers, we are used to thinking about stuff like that in our code, but this is true for real life as well.

I’m going to leave aside things like personal disasters for this post (things like car accidents, getting seriously sick, etc), because there are some ways to mitigate those (insurance, family, etc) and there really isn’t anything special about development to say about them. Instead, I want to talk about professional disasters.

Those can be things like:

  • Company closing (nicely or otherwise).
  • Getting fired.
  • Product going under.
  • Product doing badly.
  • Reputation smear.
  • High profile failure.

Let me try to take them in turn. The easiest one to handle is probably a company closing down; there is very little blame attached here, so there shouldn’t be an issue with finding a new job. This is also the time to consider if you want to change tracks to being an independent or an entrepreneur. Getting fired is a bit harder, but assuming that you weren’t fired for cause (such as negligence or criminal behavior), the old “everyone is downsizing” line is going to work.

Even in a so-so economy, there are still a lot of jobs out there for software developers, but getting a good one might require you to polish your skills and get a good idea of what is marketable today. Note that there is a big difference between what is popular and what is marketable (as in, will land you a job). Node.JS seems to be the buzzword of the day, but knowing Java very well is probably a much better path for quick employment.

This comes back to what kind of approach you want to take. For now, I’m going to assume that the fallback position for a good developer is to get hired in some fashion; it can be a short term contract, or just being gainfully employed writing software. This is important when we consider the other things that can happen, the kind of disasters that strike when you are more than just an employee. If you are an entrepreneur and your product is just losing too much money, for example, what is your next path?

The easy case is when you know that you can’t go on. Maybe a competitor is pricing you out of the market, or the bank is closing the credit line, or you can’t get more clients, or any of a hundred reasons. You are done, and you are well aware of that. A much harder issue is when you are just doing badly. You do make sales, but not enough to cover expenses, or just enough to get by. Not enough to bankrupt you immediately, but you can see it happening, unless something changes… So you have the option of pulling the cord, or trying to make it work, with the chance of sliding into actual failure.

For a startup, you usually don’t have to deal with those details, but you might just show up one day and find that the company is closing down. In those cases, there is usually not much you can do (unless you are the founder, in which case there is a wealth of information on that issue out there).

The last issue that you need to take into account is how to deal with reputation damage or a high profile failure. That depends on what the actual issue is. If it is a high profile arrest for doing coke, it might be hard to get / retain clients. If it is a big failure that cost a customer a lot of money, you might be dealing with legal consequences as well as the actual damage with other customers.

We can simplify how we look at this if we treat it all as the same thing, just a basic setback to zero (or negative). The issue is how to recover and move on. At this point, the question is: what sort of future do you want? Setbacks like that are a great reason to do some thinking about where you want to go and what you want to actually do.

The conservative choice would be to find a job as a full time developer of some kind, since that at least gives you a steady paycheck for the duration. More complex is the decision to do contracting, either short term (at worst you can be a WordPress consultant and install it for people) or longer term projects (which require you to actually sell yourself). Hopefully you won’t be doing someone else’s homework, at least not for long.

Note that actually being able to recover from a disaster properly requires prior planning. Do you have the resources to survive a period with no money? Can you handle (mentally) being out of work? Are you running on the razor’s edge, where a single disruption in cash flow causes an utter collapse? If that is the case, your disaster planning is going to focus on just building reserves to handle any hiccups, rather than managing an actual disaster.

Oh, and of course, you need to consider the cost of disaster planning. It is all very well to build a bunker to survive atomic war and the zombie rampage, but it isn’t that good if it bankrupts you on its own.

The general recommendation is to stay current, so it will be easier to hire you, and to have some idea about what to do if you wake up one day and, for whatever reason, showing up for work is not going to happen.

Career planning: What is your path?

I got a lot of really great answers to my “Where do old developers go?” post; I’m feeling much better about this now.

Now let’s turn this question around: instead of asking what is going on in the industry, let’s check what is going on with you. In particular, do you have a career plan at all?

An easy way to check that is asking: “What are you going to do in 3 years, in 7 years and in 20 years from now?”

Of course, the best laid plans of mice and men often go awry, plans for the future are written in sand on a stormy beach, and so on. Any future planning has to include the caveat that these are just plans, with reality and life getting in the way.

For lawyers*, the career path might be: Trainee, associate, senior associate, junior partner, partner, named partner. (* This is based solely on seeing some legal TV shows, not actual knowledge.) Most lawyers don’t actually become named partners, obviously, but that is what you are planning for.

As discussed in the previous post, a lot of developers move to management positions at some point in their careers, mostly because salaries and benefits tend to flatline after about ten years or so for most people on the development track. Others decide that going independent and becoming consultants or contractors is a better way to increase their income. Another path is to rise on the technical track in a company that recognizes technical excellence; those are usually pure tech companies, and it is rare to have such positions in non-technical companies. Yet another track that seems to be available is the architect route; this one is available in non-tech companies, especially big ones. Then you have the startup route and the Get Rich Burning Your Twenties mode, but that is high risk / high reward, and people who think about career planning tend to avoid such things unless they are carefully considered.

It is advisable to actually consider those options, try to decide which options you’ll want to have available to you in the next 5 – 15 years, and take steps accordingly. For example, if you want to go down the management track, you’ll want to work on things like people skills, being able to fluently converse with the business in their own terms, and learning to play golf. You’ll want to try to hold leadership positions from a relatively early stage, so team lead is a stepping stone you’ll want to get to, for example. There is a lot of material on this path, so I’m not going to cover it in detail.

If you want to go with the Technical Expert mode, that means that you probably need to grow a beard (there is nothing like stroking a beard in quiet contemplation to impress people). More seriously, you’ll want to get a deep level of knowledge in several fields, preferably ones that you can tie together into a cohesive package. For example, a network expert would be able to understand how TCP/IP works and be able to actually make use of that when optimizing an HTML5 app. Crucial at this point is also the ability to actually transfer that knowledge to other people. If you are working within a company, that increases the overall value you have, but a lot of the time, technical experts end up being consultants. Focusing on a relatively narrow field gives you a lot more value, but narrows your utility. Remember that updating your knowledge is very important, but the good news is that if you have a good grasp of the basics, you can get to grips with new technology very easily.

The old timer mode fits people who work in big companies and who believe that they can carve out a niche in that company based on their knowledge of the company’s business and how things actually work. This isn’t necessarily one year of experience repeated 20 times, although in many cases that seems to be what happens. Instead, it is a steady job with reasonable hours, and you know the business and the environment you are working in well enough that you can just get things done, without a lot of fussing around. Change is hard, however, because those places tend to be very conservative. Then again, at a certain point you can do new systems in whatever technology you want (you tend to become the owner of certain systems, having been around longer than the people who are actually using them). That does carry a risk, however. You can be fired for whatever reason (merger, downsizing, etc.) and you’ll have a hard time finding an equivalent position.

The entrepreneur mode is for people who want to build something. That can be a tool or a platform, and they create a business selling it. A lot of the time it involves a lot of technical work, but there is a huge amount of non-technical stuff that needs to be done: marketing and sales, insurance and taxes, hiring people, etc. The good thing about this is that you usually don’t have to make a very big investment in your product before you can start selling it. We are talking about roughly 3 – 6 months for most things, for 1 – 3 people. That isn’t a big leap, and in many cases you can handle it by eating into some savings, or moonlighting. Note that this can completely swallow your life, but you are your own boss, and there is a great deal of satisfaction in building a product around your vision. Be aware that you need to have contingency plans for both failure and success. If your product becomes successful, you need to make sure that you can handle the load (hire more people, train them, etc.).

The startup mode is very different from the entrepreneur mode. In a startup, you are focused on getting investments, and the scope is usually much bigger. There is less financial risk (you usually have investors for that), but there is a much higher risk of failure, and there is usually a culture that considers throwing yourself on a hand grenade advisable. The idea is that you are going to burn the candle at both ends for two to four years, and in return you’ll have enough money to maybe stop working altogether. I consider this foolish, given the success rates, but there are a lot of people who consider it the only way worth doing. The benefits usually include a nice environment, both physically and professionally, but it comes with the expectation that you’ll spend so many hours there, it becomes your second home.

There are other modes and career paths, but now I have to return to my job.

Career planning: Where do old devs go to?

We are pretty much always looking for new people; what is holding us back from expanding even more rapidly is the time that it takes to get to grips with our codebases and what we do here. That also means that we usually have at least one outstanding job opening available, because it takes a long time to fill. But that isn’t the topic for this post.

I started programming in school, I was around 14 / 15 at the time, and I picked up a copy of VB 3.0 and did some fairly unimpressive stuff with it. I count my time as a professional since around 1999 or so. That is the time when I started actually dedicating myself to learning programming as something beyond a hobby. That was 15 years ago.

When we started doing a lot of interviews, I noticed that we had the following pattern regarding developers’ availabilities:

[chart: developers’ availability, by years of experience]

That sort of made sense, some people got into software development for the money and left because it didn’t interest them. From the history of Hibernating Rhinos, one of our developers left and is now co-owner in a restaurant, another is a salesman for lasers and other lab stuff.

However, what doesn’t make sense is the ratio that I’m seeing. Where are the people who have been doing software development for decades?

Out of the hundreds of CVs that I have seen, there have been fewer than 10 from people over the age of 40. I don’t recall anyone over the age of 50. Note that I’m somewhat biased toward hiring people with longer experience, because that often means that they don’t need to learn what goes on under the hood, they already know.

In fact, looking at the salary tables, there actually isn’t a level higher than 5 years of experience. After that, you have a team leader position, then you move into middle management, and then you are basically gone as a developer, I’m guessing.

What career path do you have as a developer? And note that I’m explicitly throwing out management positions. It seems that such paths are very rare in our industry.

Microsoft has the notion of Distinguished Engineer and Technical Fellow, for people who actually have decades of experience. In my head, I have this image of a SWAT team that you throw at the hard problems.

Outside of very big companies, those seem to be very rare.  And that is kind of sad.

At Hibernating Rhinos, we plan to actually have those kinds of long career paths, but you’ll need to ask me in 10 – 20 years how that turned out.


Mono frustrations

I’m porting Voron to Mono (currently testing on Ubuntu). I’m using Mono 3.2.8, FWIW, and working in MonoDevelop.

So far, I have run into the following annoying tangles. Attempting to write to memory that is write protected results in a NullReferenceException, instead of an AccessViolationException. I suspect that this is because an NRE is generated on any SIGSEGV, but that led me down a very different path of discovery.
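Here is the kind of minimal repro I mean; this is only a sketch (using the usual Linux mmap constants via P/Invoke), not the actual Voron code:

using System;
using System.Runtime.InteropServices;

class WriteProtectedRepro
{
    // libc mmap, with constants for Linux x86-64
    [DllImport("libc", SetLastError = true)]
    static extern IntPtr mmap(IntPtr addr, UIntPtr length, int prot, int flags, int fd, IntPtr offset);

    const int PROT_READ = 0x1;
    const int MAP_PRIVATE = 0x2;
    const int MAP_ANONYMOUS = 0x20;

    static void Main()
    {
        // map a single read-only page
        var page = mmap(IntPtr.Zero, (UIntPtr)4096, PROT_READ, MAP_PRIVATE | MAP_ANONYMOUS, -1, IntPtr.Zero);

        try
        {
            Marshal.WriteByte(page, 42); // writing to write protected memory
        }
        catch (AccessViolationException)
        {
            Console.WriteLine("AccessViolationException - what I would expect");
        }
        catch (NullReferenceException)
        {
            Console.WriteLine("NullReferenceException - what Mono actually throws");
        }
    }
}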

Also, consider the following code:

using System.IO;
using System.IO.Compression;

namespace HelloWorld
{
    class MainClass
    {
        public static void Main (string[] args)
        {
            new ZipArchive (new MemoryStream ());
        }
    }
}

This results in the following error:

Unhandled Exception:
    System.NotImplementedException: The requested feature is not implemented.
        at HelloWorld.MainClass.Main (System.String[] args) [0x00006] in /home/ayende/HelloWorld/HelloWorld/Program.cs:11
        [ERROR] FATAL UNHANDLED EXCEPTION: System.NotImplementedException: The requested feature is not implemented.
        at HelloWorld.MainClass.Main (System.String[] args) [0x00006] in /home/ayende/HelloWorld/HelloWorld/Program.cs:11

The fact that it isn’t implemented is annoying, but worse from my point of view is that I don’t see ZipArchive anywhere in the stack trace. That made me think that it was my code that was throwing this.


Optimizing event processing

During the RavenDB Days conference, I got a lot of questions from customers. Here is one of them.

There is a migration process that deals with an event sourcing system. We have 10,000,000 commits with 5 – 50 events per commit. Each event results in a property update to an entity.

That gives us roughly 300,000,000 events to process. The trivial way to solve this would be:

foreach(var commit in YieldAllCommits())
{
    using(var session = docStore.OpenSession())
    {
        foreach(var evnt in commit.Events)
        {
            var entity = session.Load<Customer>(evnt.EntityId);
            evnt.Apply(entity);
        }
        session.SaveChanges();
    }
}

That works, but it tends to be slow. The worst case here would result in 310,000,000 requests to the server.

Note that this has the nice property that all the changes in a commit are saved in a single commit. We’re going to relax this behavior, and use something better here.

We’ll take the implementation of this LRU cache and add an eviction event and support for iteration.
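Something along these lines; this is only a sketch of the shape I have in mind (the OnEvict callback and the enumeration support are the additions), not the production implementation:

using System;
using System.Collections;
using System.Collections.Generic;

public class LeastRecentlyUsedCache<TKey, TValue> : IEnumerable<KeyValuePair<TKey, TValue>>
{
    private readonly int capacity;
    private readonly Dictionary<TKey, LinkedListNode<KeyValuePair<TKey, TValue>>> items =
        new Dictionary<TKey, LinkedListNode<KeyValuePair<TKey, TValue>>>();
    private readonly LinkedList<KeyValuePair<TKey, TValue>> order =
        new LinkedList<KeyValuePair<TKey, TValue>>();

    // called whenever an item is dropped from the cache, so the caller can flush it
    public Action<TValue> OnEvict = delegate { };

    public LeastRecentlyUsedCache(int capacity)
    {
        this.capacity = capacity;
    }

    public bool TryGetValue(TKey key, out TValue value)
    {
        LinkedListNode<KeyValuePair<TKey, TValue>> node;
        if (items.TryGetValue(key, out node) == false)
        {
            value = default(TValue);
            return false;
        }
        order.Remove(node);   // mark as most recently used
        order.AddFirst(node);
        value = node.Value.Value;
        return true;
    }

    public void Set(TKey key, TValue value)
    {
        LinkedListNode<KeyValuePair<TKey, TValue>> existing;
        if (items.TryGetValue(key, out existing))
            order.Remove(existing);

        items[key] = order.AddFirst(new KeyValuePair<TKey, TValue>(key, value));

        if (items.Count <= capacity)
            return;

        var last = order.Last; // least recently used
        order.RemoveLast();
        items.Remove(last.Value.Key);
        OnEvict(last.Value.Value);
    }

    public IEnumerator<KeyValuePair<TKey, TValue>> GetEnumerator()
    {
        return order.GetEnumerator();
    }

    IEnumerator IEnumerable.GetEnumerator()
    {
        return GetEnumerator();
    }
}

With that in place, the migration loop becomes: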

using(var bulk = docStore.BulkInsert(allowUpdates: true))
{
    var cache = new LeastRecentlyUsedCache<string, Customer>(capacity: 10 * 1000);
    cache.OnEvict = c => bulk.Store(c);
    foreach(var commit in YieldAllCommits())
    {
        foreach(var evnt in commit.Events)
        {
            Customer entity;
            if(cache.TryGetValue(evnt.EntityId, out entity) == false)
            {
                using(var session = docStore.OpenSession())
                {
                    entity = session.Load<Customer>(evnt.EntityId);
                    cache.Set(evnt.EntityId, entity);
                }
            }
            evnt.Apply(entity);
        }
    }
    foreach(var kvp in cache)
    {
        bulk.Store(kvp.Value);
    }
}

Here we are using a cache of 10,000 items, with the assumption that events cluster on entities, so a lot of changes to an entity will happen at roughly the same time. We take advantage of that to try to load each document only once. We use bulk insert to flush those changes to the server when needed. This code will handle the case where we flushed a document out of the cache and then get events for it again, but the assumption is that this scenario is much less common.

Accidental code review

I’m trying to get better insight into a set of log files sent by a customer. So I went looking for a tool that can do that, and I found Indihiang. There was an x86 vs x64 issue that I had to work through, but then it just sat there trying to parse a 34MB log file.

I got annoyed enough that I actually checked, and this is the reason why:

[screenshot of the offending parsing code]

Sigh…

I gave up on this and wrote my own stuff.


Where do buzzwords retire to?

At lunch today at the office, we had an interesting discussion on the kinds of must-have technologies. Among the things that were thrown out were:

  • Service Oriented Architecture
  • Single Page Application
  • Cloud
  • Agile
  • TDD
  • Web 2.0
  • REST
  • AJAX
  • Data driven applications

All of those things were stuff that you had to do, and everyone was doing it. And a few years later… they are no longer hot and fancy, but they are probably still in heavy use.

But they are out of fashion, as much as yellow Crocs.


Playing with Roslyn

We do a lot of compiler work in RavenDB. Indexes are one such core example, where we take the C# language and beat both it and our heads against the wall until it agrees to do what we want it to.

A lot of that is happening using the excellent NRefactory library as well as the not so excellent CodeDOM API. Basically, we take a source string, convert it into something that can run, then compile it on the fly and execute it.

I decided to check the implications of switching, using a very trivial benchmark:

private static void CompileCodeDome(int i)
{
    var src = @"
class Greeter
{
static void Greet()
{
System.Console.WriteLine(""Hello, World"" + " + i + @");
}
}";
    CodeDomProvider codeDomProvider = new CSharpCodeProvider();
    var compilerParameters = new CompilerParameters
    {
        OutputAssembly= "Greeter.dll",
        GenerateExecutable = false,
        GenerateInMemory = true,
        IncludeDebugInformation = false,
        ReferencedAssemblies =
        {
            typeof (object).Assembly.Location,
            typeof (Enumerable).Assembly.Location
        }
    };
    CompilerResults compileAssemblyFromSource = codeDomProvider.CompileAssemblyFromSource(compilerParameters, src);
    Assembly compiledAssembly = compileAssemblyFromSource.CompiledAssembly;
}

private static void CompileRoslyn(int i)
{
    var syntaxTree = CSharpSyntaxTree.ParseText(@"
class Greeter
{
static void Greet()
{
System.Console.WriteLine(""Hello, World"" + " +i +@");
}
}");

    var compilation = CSharpCompilation.Create("Greeter.dll",
        syntaxTrees: new[] {syntaxTree},
        references: new MetadataReference[]
        {
            new MetadataFileReference(typeof (object).Assembly.Location),
            new MetadataFileReference(typeof (Enumerable).Assembly.Location),
        });

    using (var file = new MemoryStream())
    {
        compilation.Emit(file);
    }
}
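The driver isn’t shown above; I assume something along these lines runs both methods and records the timings (a sketch, not the actual harness; Stopwatch comes from System.Diagnostics):

static void Main(string[] args)
{
    for (int run = 0; run < 5; run++)
    {
        var sp = Stopwatch.StartNew();
        CompileCodeDome(run);
        Console.WriteLine("CodeDOM run {0}: {1:N0} ms", run, sp.ElapsedMilliseconds);

        sp.Restart();
        CompileRoslyn(run);
        Console.WriteLine("Roslyn run {0}: {1:N0} ms", run, sp.ElapsedMilliseconds);
    }
}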

 

I ran it several times, and I got (run # on the X axis, milliseconds on the Y axis):

[chart: compilation time per run, CodeDOM vs Roslyn]

The very first Roslyn invocation is very costly. The next ones cost pretty much nothing. Granted, this is a trivial example, but CodeDOM (which invokes csc) is much more consistent, yet much more expensive in general.


Interview questions: Large text viewer

I mentioned that we are looking for more people to work for us. This time, we are looking for WPF people to work on the profiler, as well as another hush hush project that we’ll hopefully be able to reveal in December.

Because we are now looking for mostly UI people, that gives us a different set of challenges to deal with. How do I get a good candidate when my own WPF knowledge is limited to “Um.. dependency properties, man, that’s the bomb, man, yeah!”.

Add to that the fact that by the time people get an interview here, we want to be sure that they can code, and that presents an interesting problem. So we came up with questions like this one. Another question we have is the large text viewer.

We need a tool that can work with text files (logs) of huge size (1 GB – 10 GB). We want to be able to open and search through such a file.

Nitpicker corner: I usually use this tool for that; the purpose of the question isn’t actually to get such a tool, it is to see what kind of code the candidate writes.

We are looking for someone with a lot of skill in the UI side of things, so the large text file stuff is somewhat of a red herring, except that we want to see what they can do beyond just slapping a few text boxes around.
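For context only (this is not the model answer we expect), the non-UI core of such a tool boils down to seeking around the file instead of loading it; a minimal sketch:

using System;
using System.IO;
using System.Text;

class LargeTextFile
{
    // read a window of bytes from anywhere in a huge file, without loading the rest
    public static string ReadWindow(string path, long offset, int length)
    {
        using (var file = new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
        {
            file.Seek(offset, SeekOrigin.Begin);
            var buffer = new byte[length];
            int read = file.Read(buffer, 0, buffer.Length);
            return Encoding.UTF8.GetString(buffer, 0, read);
        }
    }

    // scan forward in fixed size chunks for a search term; a real implementation
    // would also handle matches that span chunk boundaries
    public static long FindNext(string path, long startOffset, string term)
    {
        const int chunkSize = 4 * 1024 * 1024;
        using (var file = new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
        {
            file.Seek(startOffset, SeekOrigin.Begin);
            var buffer = new byte[chunkSize];
            long position = startOffset;
            int read;
            while ((read = file.Read(buffer, 0, buffer.Length)) > 0)
            {
                var text = Encoding.ASCII.GetString(buffer, 0, read);
                int index = text.IndexOf(term, StringComparison.Ordinal);
                if (index != -1)
                    return position + index; // byte offset of the match (single byte encoding assumed)
                position += read;
            }
        }
        return -1;
    }
}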


Interview question: That annoying 3rd party service

We are still in hiring mode, and today we completed a new question for the candidates. The task itself is pretty simple: create a form for logging in or creating a new user.

Seems simple enough, I think. But there is a small catch. We’ll provide you the “backend” for this task, which you have to work with. The interface looks like this:

public interface IUserService
{
    void CreateNewUser(User u);

    User GetUser(string userId);
}

public class User
{
    public string Name { get; set; }
    public string Email { get; set; }
    public byte[] Sha1HashedPassword { get; set; }
}

The catch here is that we provide that as a dll that includes the implementation, and since this is supposed to represent a 3rd party service, we made it behave like one. Sometimes the service will take a long time to run. Sometimes it will throw an error (ThisIsTuesdayException), sometimes it will take a long time to run and then throw an error, etc.
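To make the flakiness concrete, this is one direction a candidate might take: push the call off the UI thread, bound it with a timeout, and retry once (a sketch of a possible approach, not the answer we are looking for):

using System;
using System.Threading.Tasks;

// wraps the flaky 3rd party service
public class ResilientUserService
{
    private readonly IUserService inner;

    public ResilientUserService(IUserService inner)
    {
        this.inner = inner;
    }

    public async Task<User> GetUserAsync(string userId, TimeSpan timeout, int retries = 1)
    {
        for (int attempt = 0; ; attempt++)
        {
            var call = Task.Run(() => inner.GetUser(userId));
            var finished = await Task.WhenAny(call, Task.Delay(timeout));
            if (finished == call)
            {
                try
                {
                    return await call; // propagates the service's exception, if any
                }
                catch (Exception)
                {
                    if (attempt >= retries)
                        throw;
                    // otherwise swallow and retry (ThisIsTuesdayException and friends)
                }
            }
            else if (attempt >= retries)
            {
                throw new TimeoutException("The user service did not respond in time");
            }
            // note: a timed out call keeps running in the background; we just stop waiting for it
        }
    }
}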

Now, the question is, what is it that I’m looking to learn from the candidate’s code?


Question 6 is a trap, a very useful one

In my interview questions, I give candidates a list of 6 questions. They can either solve 3 questions from 1 to 5, or they can solve question 6.

Stop for a moment and ponder that. What do you assume is the relative complexity of those questions?

 

 

 

 

 

 

Questions 1 – 5 should take anything between 10 – 15 minutes to an hour and a half, max. Question 6 took me about 8 hours to do, although that included some blogging time about it.

Question 6 requires that you create an index for a 15 TB CSV file, and allow efficient searching on it.

Questions 1 – 5 are basically gatekeeper questions: if you answer them correctly, we have a high opinion of you and you get an interview. Answering question 6 correctly pretty much says that we are past the “do we want you?” stage and into the “how do we get you?” stage.

But people don’t answer question 6 correctly. In fact, by this time, if you answer question 6, you have pretty much ruled yourself out, because you are going to show that you don’t understand something pretty fundamental.

Here are a couple of examples from the current crop of candidates. Remember, we are talking about a 15 TB CSV file here, containing about 400 billion records.

Candidate 1’s code looked something like this:

foreach(var line in File.ReadLines("big_data.csv"))
{
    var fields = line.Split(',');
    var email = fields[2];
    File.WriteAllText(Md5(email), line);
}

On the plus side, this doesn’t load the entire data set into memory, and you can sort of do quick searches. Of course, it does generate 400 billion files, and takes more than 100% of the space of the original file. Also, on NTFS you have a maximum of 4 billion files per volume, and other file systems have similar limitations.

So that isn’t going to work, but at least he had some idea about what is going on.

Candidate 2’s code, however, was:

// prepare
string[] allData = File.ReadAllLines("big_data.csv");
var users = new List<User>();
foreach(var line in allData)
{
    users.Add(User.Parse(line));
}
new XmlSerializer(typeof(List<User>)).Serialize(File.Create("big_data.xml"), users);

// search by:

var users = new XmlSerializer(typeof(List<User>)).Deserialize(File.OpenRead("big_data.xml")) as List<User>;
users.AsParallel().Where(x => x.Email == "the@email.wtf");

So take the 15 TB file, load it all to memory (fail #1), convert all 400 billion records to entity instances (fail #2), write it back as xml (fail #3,#4,#5). Read the entire (greater than) 15 TB XML file to memory (fail #6), try to do a parallel brute force search on a dataset of 400 billion records (fail #7 – #400,000,000,000).

So, dear candidates 1 & 2, thank you for making it easy to say, thanks, but no thanks.


Troubleshooting, when F5 debugging can’t help you

You might have noticed that we have been doing a lot of work on the operational side of things, to make sure that we give you as good a story as possible with regards to the care & feeding of RavenDB. This post isn’t about that. This post is about your applications and systems, and how you are going to react when !@)(*#!@(* happens.

In particular, the question is what do you do when this happens?

This situation can crop up in many disguises. For example, you might be seeing high memory usage in production, experiencing growing CPU usage over time, seeing request times go up, or any of a hundred and one different production issues that make for a hell of a night (somehow, they almost always happen at nighttime).

Here is how people usually think about it.

The first thing to do is to understand what is going on. About the hardest thing to handle in these situations is when we have an issue (high memory, high CPU, etc.) and no idea why. Usually all the effort is spent just figuring out the what and the why. The problem with this process for troubleshooting issues is that it is very easy to jump to conclusions and form an utterly wrong hypothesis, and then you have to go through the rest of the steps to realize it isn’t right.

So the first thing that we need to do is gather information. And this post is primarily about the various ways that you can do that. In RavenDB, we have actually spent a lot of time exposing information to the outside world, so we’ll have an easier time figuring out what is going on. But I’m going to assume that you don’t have that.

The end-all tool for this kind of error is WinDBG. This is the low level tool that gives you access to pretty much anything you could want. It is also very archaic and not friendly at all. The good thing about it is that you can load a dump into it. A dump is a capture of the process state at a particular point in time. It gives you the ability to see the entire memory contents and all the threads. It is an essential tool, but also the last one I want to use, because it is pretty hard to do so. Dump files can be very big; multiple GB are very common. That is because they contain the full memory dump of the process. There are also mini dumps, which are easier to work with, but don’t contain the memory contents, so you can look at the threads, but not the data.

The .NET Memory Profiler is another great tool for figuring things out. It isn’t usually so good for production analysis, because it uses the Profiler API to figure things out, but it has a wonderful feature of loading dump files (ironically, it can’t handle very large dump files because of memory issues) and gives you a much nicer view of what is going on there.

For high CPU situations, I like to know what is actually going on. And looking at the stack traces is a great way to do that. WinDBG can help here (take a few mini dumps a few seconds apart), but again, that isn’t so nice to use.

Stack Dump is a tool that takes away a lot of the pain of having to deal with that, because it just outputs all the thread information, and we have used it successfully in the past to figure out what is going on.

For general performance stuff (“requests are slow”), we need to figure out where the slowness actually is. We have had reports that run the gamut from “things are slow, the client machine is loaded” to “things are slow, the network QoS settings throttle us”. I like to start with Fiddler to figure those things out. In particular, the statistics window is very helpful:

[screenshot: the Fiddler statistics window]

The obvious things are the bytes sent & bytes received. We have had a few cases where a customer was actually sending 100s of MB in either or both directions, and was surprised it took some time. If those values are fine, you want to look at the actual performance listing. In particular, look at things like TCP/IP connect time, the time from the client sending the request to the server starting to receive it, etc.

If you found that the problem is actually at the network layer, you might not be able to handle it immediately. You might need to go a level or two lower and look at the actual TCP traffic. This is where something like Wireshark comes into play, and it is useful for figuring out if you have specific errors at that level (for example, a bad connection that causes a lot of packet loss will impact performance, but things will still work).

Other tools that are very important include Resource Monitor, Process Explorer and Process Monitor. Those give you a lot of information about what your application is actually doing.

Once you have all of that information, you can form a hypothesis and try to test it.

If you own the application in question, the best way to improve your chances of figuring out what is going on is to add logging. Lots & lots of logging. In production, having the logs to support what is going on is crucial. I usually have several levels of logging: the traffic in/out of my system, then the actual system operations, especially anything that happens in the background, and finally debug/trace endpoints that expose internal state and allow you to tweak various things at runtime.
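As an illustration of that last kind, a debug endpoint can be as simple as this (a sketch, not how RavenDB actually does it):

using System;
using System.Net;
using System.Text;
using System.Threading;

class DebugEndpoint
{
    // counters the rest of the application updates as it works
    public static long RequestsIn;
    public static long RequestsOut;

    public static void Start()
    {
        var listener = new HttpListener();
        listener.Prefixes.Add("http://localhost:9999/debug/");
        listener.Start();

        new Thread(() =>
        {
            while (true)
            {
                var context = listener.GetContext();
                var stats = "{ \"requestsIn\": " + Interlocked.Read(ref RequestsIn) +
                            ", \"requestsOut\": " + Interlocked.Read(ref RequestsOut) + " }";
                var bytes = Encoding.UTF8.GetBytes(stats);
                context.Response.ContentType = "application/json";
                context.Response.OutputStream.Write(bytes, 0, bytes.Length);
                context.Response.Close();
            }
        }) { IsBackground = true }.Start();
    }
}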

Having a good working knowledge of how to properly utilize the above mentioned tools is very important, and should be considered much more useful than learning a new API or a language feature.

There is no WE in a Job Interview

This is a pet peeve of mine. When interviewing candidates, I usually ask some variant of “tell me about a feature you developed that you are proud of”. I use this question to gauge several things: what the candidate is actually proud of, what they were working on, and whether they are actually proud of what they did.

One of the more annoying tendencies is for a candidate to give a reply in the form of “what we did was…”, in particular if s/he goes on to never mention something that s/he specifically did. And no, “led the Xyz team in…” is a really bad example. I’m not hiring your team (in which case I might actually be interested in that); I’m interested in the candidate, personally. And if the candidate won’t tell me what it was that s/he did, I’m going to wonder if they played Solitaire all day.


A tale of two interviews

We’ve been trying to find more people recently, and that means sifting through people. Once that process is done, we ask them to come to our offices for an interview. We recently had two interviews with people who were diametrically opposed to one another. Just to steal my own thunder, we decided not to go forward with either one of them. Before inviting them to an interview, I have them do a few coding questions at home. Those are things like:

  • Given a big CSV file (that fits in memory), allow speedy querying by name or email. The application will run for a long period of time, and startup time isn’t very important.
  • Given a very large file (multiple TB), detect which 4MB ranges have changed in the file between consecutive runs of your program.

We’ll call the first one Joe. Joe has a lot of experience, he has been doing software for a long time, and has already had the chance to be a team lead in a couple of previous positions. He sent us some really interesting code. Usually I get a class or three in those answers. In this case, we got something that looked like this:

The main problem I had with his code was just finding where something actually happens. I dismissed the over-architecture as someone trying to impress in an interview (“See all my beautiful ARCHITECTURE!”), and looked at the actual code doing the task at hand, which wasn’t bad.

Joe was full of confidence, articulate and well spoken, and appeared to have a real passion for the architecture he sent us. “I’ve learned that it is advisable to put proper architecture first” and “That is now my default setting”. I disagree with those statements, but I can live with that. What bothered me was that we asked a question along the lines of “how would you deal with a high memory situation in an application?”. What followed was several minutes of very smooth talk about motivating people, giving them the kind of support they need to do the job, etc. Basically, about the only thing it was missing was a part about “the Good of the People”, and I would have considered voting for him. What was glaringly missing from my point of view was anything concrete and actionable.

On the other hand, we have Moe. He is a bit younger, but he has already worked with NoSQL databases, which was a major plus. Admittedly, that was as a user, rather than a developer, of one, but you can’t have it all. Moe’s code made me sit up and whistle. I set up an interview for the very next day, because looking at the code, there was someone there I wanted to talk to. It was very much to the point, and while it had idiosyncrasies, it showed a lot of promise. Here is the architecture for Moe’s code:

So Moe shows up at the office, and we start the interview process. Right from the get go it is obvious that Moe is one of those people who don’t do too well in stressful situations like interviews. That is part of the reason why we ask candidates to write code at home: it drastically reduces the level of stress they have to deal with.

So I start talking, telling him about the company and what we do. The idea is that hopefully this gives him time to compose himself. Then I start asking questions, and he gives mostly the right answers, but the focus is lacking. I assume that this is probably nervousness, so I bring up his code and go over it with him. He is much more comfortable talking about that. He had an O(log N) solution at one point, and I had to drive him toward an O(1) solution for the same problem, but he got there fairly quickly.

I then asked him what I considered to be a fairly typical question: “What areas do you have complete mastery of?” This appeared to stump him, since he took several minutes to give an answer, which basically boiled down to “nothing”.

Okay… this guy is nervous, and he is probably underestimating himself, so let us try to focus the question. I asked whether he was any good with HTML5 (not at all), then whether he was good with server side work (has done some work there, but not an expert), and how he would deal with a high memory situation (look at logs, but after that he was stumped). When asked about the actual code he wrote for our test, he said that it was one of the hardest tasks he ever had to deal with.

That summed up to “promising”, but I have a policy of believing people when they tell me bad things about themselves. So this ended up being a no, which was very frustrating.

The search continues…


We’re hiring… come work for us


It seems that recently we have been going around in circles: get more people, train them, and by the time they are trained, we already need more people.

I guess that this is a good problem to have. At any rate, we are currently looking for an experienced, well rounded developer.

This job availability is for our offices in Hadera, Israel. If you aren’t from Israel, this isn’t for you.

This job is primarily for work on our Profilers line of products. Here is the laundry list:

  • Awesome .NET skills
  • Experience in UI development using WPF, MVVM style
  • Understanding how computers work and how to make them dance
  • History with concurrency & multi threading (concurrent work history not required)
  • Architecture / design abilities

I would like to see open source history, or projects that you can share (in other words, your own projects, not an employer’s code that you want to show us).

Please contact us at jobs@hibernatingrhinos.com if you are interested.


Digging into MSMQ

I got into a discussion online about MSMQ and its performance. So I decided to test things out.

What I want to do is check a few things, in particular how many messages I can push to and from MSMQ in various configurations.

I created a non-transactional queue, and then ran the following code:

var sp = Stopwatch.StartNew();
int count = 0;
while (sp.Elapsed.TotalSeconds < 10)
{
    var message = new Message
    {
        BodyStream = new MemoryStream(data)
    };
    queue.Send(message);
    count++;
}

Console.WriteLine(sp.Elapsed);
Console.WriteLine(count);

This gives me 181,832 messages in 10 seconds, or 18,183 messages per second. I tried doing the same in a multi threaded fashion, with 8 threads writing to MSMQ, and got an insufficient resources error, so we’ll do all of this in single threaded tests.

Next, the exact same code, except for the Send line, which now looks like this:

queue.Send(message, MessageQueueTransactionType.Single);

This gives me 43,967 messages in 10 seconds, or 4,396 messages per second.

Next I added DTC, which gave me a total of 8,700 messages in ten seconds, or 870 messages per second! Yeah, DTC is evil.
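The DTC version isn’t shown above; I assume it looked roughly like this, with each send wrapped in an ambient transaction (TransactionScope comes from System.Transactions):

using (var tx = new TransactionScope())
{
    var message = new Message
    {
        BodyStream = new MemoryStream(data)
    };
    // Automatic enlists the send in the ambient transaction, which escalates to DTC
    queue.Send(message, MessageQueueTransactionType.Automatic);
    tx.Complete();
}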

Now, how about reading from it? I used the following code for that:

while (true)
{
    try
    {
        Message receive = queue.Receive(TimeSpan.Zero);
        receive.BodyStream.Read(data, 0, data.Length);
    }
    catch (MessageQueueException e)
    {
        Console.WriteLine(e);
        break;
    }
}

Reading from a transactional queue, we get 5,955 messages per second for 100,000 messages. Using a non-transactional queue, it can read about 16,000 messages a second.

Note that those are pretty piss poor “benchmarks”; they are intended more to give you a feel for the numbers than anything else. I’ve mostly used MSMQ within the context of DTC, and that really hits the performance hard.


On site Architecture & RavenDB consulting availabilities: Malmo & New York City

I’m going to have availability for on site consulting in Malmo, Sweden  (17 Sep) and in New York City, NY (end of Sep – beginning of Oct).

If you want me to come by and discuss what you are doing (architecture, NHibernate or RavenDB), please drop me a line.

I’m especially interested in people who need to do “strange” things with data and data access. We are building a set of tailored database solutions for customers now, and we have seen customers show a 750x improvement in performance when we gave them a database that was designed to fit their exact needs, instead of having to contort their application and their database into a seven dimensional loop just to try to store and read what they needed.

When a race condition is what you want…

I have an interesting situation that I am not sure how to resolve. We need to record the last request time for a RavenDB database. Now, this last request time is mostly used to show the user, and to decide when a database is idle, and can be shut down.

As such, it doesn’t have to be really accurate (a skew of even a few seconds is just fine). However, under load there are many requests coming in (in the case presented to us, 4,000 concurrent requests), and they all need to update the last request time.

Obviously, in such a scenario, we don’t really care about the actual value. But the question is, how do we deal with that? In particular, I want to avoid a situation where we do a lot of writes to the same value in an unprotected manner, mostly because it is likely to cause contentions between cores.

Any ideas?

It is actually fine for us to go slightly back (so thread A at time T+1 and thread B at time T+2 running concurrently, and the end result is T+1), which is why I said that a race is fine for us. But what I want to avoid is any form of locking / contention.

I wrote the following test code:

class Program
{
    static void Main(string[] args)
    {
        var threads = new List<Thread>();

        var holder = new Holder();

        var mre = new ManualResetEvent(false);

        for (int i = 0; i < 2500; i++)
        {
            var thread = new Thread(delegate()
            {
                mre.WaitOne();
                for (long j = 0; j < 500*1000; j++)
                {
                    holder.Now = j;
                }
            });
            thread.Start();
            threads.Add(thread);
        }

        mre.Set();

        threads.ForEach(t => t.Join());


        Console.WriteLine(holder.Now);
    }
}

public class Holder
{
    public long Now;
}

And it looks like it is doing what I want it to. This creates a lot of contention on the same value, but it is also giving me the right value. And again, “right” here is very approximate. The problem is that while I know how to write thread safe code, I’m not sure if this is a good way to go about doing it.

Note that this code (yes, even with 2,500 threads) runs quite fast, in under a second. Trying to use Interlocked.Exchange is drastically more expensive, and Interlocked.CompareExchange is even worse.
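For reference, the Interlocked variants I compared against just change the body of the inner loop, roughly like so:

// Interlocked.Exchange version - an atomic write plus a full memory barrier per iteration
for (long j = 0; j < 500 * 1000; j++)
{
    Interlocked.Exchange(ref holder.Now, j);
}

// Interlocked.CompareExchange version - only ever moves the value forward
for (long j = 0; j < 500 * 1000; j++)
{
    long current;
    do
    {
        current = Volatile.Read(ref holder.Now);
        if (j <= current)
            break;
    } while (Interlocked.CompareExchange(ref holder.Now, j, current) != current);
}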

But it is just not sitting well with me.


Message passing, performance

I got some replies about the async event loop post, mentioning LMAX Disruptor and performance. I decided to see for myself what the fuss was all about.

You can read about the LMAX Disruptor, but basically, it is a very fast single process messaging library.

I wondered what that meant, so I wrote my own messaging library:

public class Bus<T>
{
    Queue<T> q = new Queue<T>();

    public void Enqueue(T msg)
    {
        lock (q)
        {
            q.Enqueue(msg);
        }
    }

    public bool TryDequeue(out T msg)
    {
        lock (q)
        {
            if (q.Count == 0)
            {
                msg = default(T);
                return false;
            }
            msg = q.Dequeue();
            return true;
        }
    }
}

I think that you’ll agree that this is a thing of beauty and elegant coding. I then tested this with the following code:

public static void Read(Bus<string> bus)
{
    int count = 0;
    var sp = Stopwatch.StartNew();
    while (sp.Elapsed.TotalSeconds < 10)
    {
        string msg;
        while (bus.TryDequeue(out msg))
        {
            count++;
        }
    }
    sp.Stop();

    Console.WriteLine("{0:#,#;;0} msgs in {1} for {2:#,#} ops/sec", count, sp.Elapsed, (count / sp.Elapsed.TotalSeconds));
}

public static void Send(Bus<string> bus)
{
    var sp = Stopwatch.StartNew();
    while (sp.Elapsed.TotalSeconds < 10)
    {
        for (int i = 0; i < 1000; i++)
        {
            bus.Enqueue("test");
        }
    }
}

var bus = new Bus<string>();

ThreadPool.QueueUserWorkItem(state => Send(bus));

ThreadPool.QueueUserWorkItem(state => Read(bus));

The result of this code?

145,271,000 msgs in 00:00:10.4597977 for 13,888,510 ops/sec

Now, what happens when we use the DataFlow’s BufferBlock as the bus?

public static async Task ReadAsync(BufferBlock<string> bus)
{
    int count = 0;
    var sp = Stopwatch.StartNew();
    while (sp.Elapsed.TotalSeconds < 10)
    {
        try
        {
            await bus.ReceiveAsync(TimeSpan.FromMilliseconds(5));
            count++;
        }
        catch (TaskCanceledException)
        {
        }
    }
    sp.Stop();

    Console.WriteLine("{0:#,#;;0} msgs in {1} for {2:#,#} ops/sec", count, sp.Elapsed, (count / sp.Elapsed.TotalSeconds));
}

public static async Task SendAsync(BufferBlock<string> bus)
{
    var sp = Stopwatch.StartNew();
    while (sp.Elapsed.TotalSeconds < 10)
    {
        for (int i = 0; i < 1000; i++)
        {
            await bus.SendAsync("test");
        }
    }
}

What we get is:

43,268,149 msgs in 00:00:10 for 4,326,815 ops/sec.

I then decided to check what happens with the .NET port of the LMAX Disruptor. Here is the code:

public class Holder
{
    public string Val;
}

internal class CounterHandler : IEventHandler<Holder>
{
    public int Count;
    public void OnNext(Holder data, long sequence, bool endOfBatch)
    {
        Count++;
    }
}

static void Main(string[] args)
{
    var disruptor = new Disruptor.Dsl.Disruptor<Holder>(() => new Holder(), 1024, TaskScheduler.Default);
    var counterHandler = new CounterHandler();
    disruptor.HandleEventsWith(counterHandler);

    var ringBuffer = disruptor.Start();

    var sp = Stopwatch.StartNew();
    while (sp.Elapsed.TotalSeconds < 10)
    {
        for (var i = 0; i < 1000; i++)
        {
            long sequenceNo = ringBuffer.Next();

            ringBuffer[sequenceNo].Val = "test";

            ringBuffer.Publish(sequenceNo);
        }
    }
    Console.WriteLine("{0:#,#;;0} msgs in {1} for {2:#,#} ops/sec", counterHandler.Count, sp.Elapsed, (counterHandler.Count / sp.Elapsed.TotalSeconds));
}

And the resulting performance is:

29,791,996 msgs in 00:00:10.0003334 for 2,979,100 ops/sec

Now, I’ll be the first to agree that this is really and absolutely not even close to being a fair benchmark. It is testing wildly different things. Disruptor is using a ring buffer, the BufferBlock didn’t, and the original Bus implementation just used an unbounded queue.

But that is a very telling benchmark as well, pretty much because it doesn’t matter. What I need this for is network protocol handling. As such, even assuming that every single byte is a message, we would have to go far beyond what any reasonable pipe can be expected to handle.


Async event loops in C#

I’m designing a new component, and I want to reduce the amount of complexity involved in dealing with it. This is a networked component, and after designing several such, I wanted to remove one area of complexity, which is the use of explicitly concurrent code. Because of that, I decided to go with the following architecture:

[diagram: network readers push messages into an in-memory queue, which a single threaded event loop consumes]

 

The network code is just reading messages from the network, and putting them in an in memory queue. Then we have a single threaded event loop that simply goes over the queue and processes those messages.

All of the code that is actually processing messages is single threaded, which makes it oh so much easier to work with.

Now, I can do this quite easily with a BlockingCollection<T>, which is how I have usually done this sort of thing so far. It is simple, robust and easy to understand. It also ties down a full thread for the event loop, which can be a shame if you don’t get a lot of messages.
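For reference, the BlockingCollection<T> version of that event loop is roughly this (a sketch of the usual pattern, not the component’s actual code):

var queue = new BlockingCollection<object>();

// the event loop owns a dedicated thread for its entire lifetime
var eventLoop = new Thread(() =>
{
    foreach (var msg in queue.GetConsumingEnumerable())
    {
        ProcessMessage(msg);
    }
}) { IsBackground = true };
eventLoop.Start();

// the network code just calls queue.Add(msg) from any thread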

So I decided to experiment with async approaches. In particular, using the BufferBlock<T> from the DataFlow assemblies.

I came up with the following code:

var q = new BufferBlock<int>(new DataflowBlockOptions
{
    CancellationToken = cts.Token,
});

This just creates the buffer block, but the nice thing here is that I can set up a “global” cancellation token for all operations on it. The problem is that this actually generates bad exceptions (InvalidOperationException, instead of TaskCanceledException). Well, I’m not sure if bad is the right term, but it isn’t the one I would expect here, at least. If you pass a cancellation token directly to the method, you get the behavior I expected.

At any rate, the code for the event loop now looks like this:

private static async Task EventLoop(BufferBlock<object> bufferBlock, CancellationToken cancellationToken)
{
    while (true)
    {
        object msg;
        try
        {
            msg = await bufferBlock.ReceiveAsync(TimeSpan.FromSeconds(3), cancellationToken);
        }
        catch (TimeoutException)
        {
            NoMessagesInTimeout();
            continue;
        }
        catch (Exception)
        {
            break;
        }
        ProcessMessage(msg);
    }
}

And that is pretty much it. We have a good way to handle timeouts and process messages, and we don’t take up a thread. We can also be easily cancelled. I still need to run this through a lot more testing, in particular to verify that this doesn’t cause issues when we need to debug this sort of system, but it looks promising.

SQL Injection & Optimizations Path

It is silly, but I just had a conversation with one of our developers on SQL Injection. In RavenDB we support replicating to a relational database, which obviously requires using SQL. We are doing things properly, with parameters and everything.

No chance of SQL Injection from there. Great, and that would be the end of a very short blog post if that was everything.

As it turned out, there is a significant performance difference between:

@p1 = 'users/1'
@p2 = 'users/2'

DELETE FROM Users WHERE Id IN (@p1,@p2)

And:

DELETE FROM Users WHERE Id IN ('users/1', 'users/2')

Enough that we added that as an option. The reason why is related to the vagaries of the database query optimizer, and isn’t really relevant.

This is off by default, obviously. And we use parameters by choice & preference. But we still added a minimal “protection” by adding:

sqlValue.Replace("'", "''")

Considering that this isn’t meant for user input (it is for document ids), that is something that is annoying.
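In other words, with the option turned on, the statement gets built roughly like this (a sketch of the idea using a hypothetical helper, not the actual RavenDB replication code; requires System.Linq):

// Inline the document ids as literals instead of parameters,
// doubling quotes so a quote inside an id can't break out of the literal.
static string BuildDelete(string tableName, string idColumn, IEnumerable<string> ids)
{
    var literals = ids.Select(id => "'" + id.Replace("'", "''") + "'");
    return "DELETE FROM " + tableName +
           " WHERE " + idColumn + " IN (" + string.Join(", ", literals) + ")";
}

// BuildDelete("Users", "Id", new[] { "users/1", "users/2" })
// -> DELETE FROM Users WHERE Id IN ('users/1', 'users/2')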

Any suggestions on how to improve this?


Compression finale

After a fairly long road, we are done. We have all the pieces: generating a shared dictionary, writing using Huffman encoding, and getting the results back out.

Hopefully by now the theory behind it is fairly clear to you, and it is time to actually put this into practice.

I have 100,000 random user documents in this file, and I want to see what kind of compression I can get from a shared dictionary approach. The project with the code for all of this can be found here: Rhea Compression (most of it is basically a port of FemtoZip to .NET).

The actual file size is 8.49 MB, and when compressing it with Zip (Windows’ send to compressed folder) it turns into a 1.93 MB file.

The original file size in bytes is: 8,809,353.

I then tried to compress each document individually using GZipStream, resulting in a total of 10,004,614 bytes used. Or 9.5 MB! In other words, and to no one’s surprise (I hope), we see an increase in the file size.

However, when using Rhea’s compression, we do the following:

var trainer = new CompressionTrainer();

for (int i = 0; i < json.Length/100; i++)
{
    trainer.TrainOn(json[i*100]);
}

var compressionHandler = trainer.CreateHandler();

This creates a shared dictionary from every 100th document. So we have 1,000 documents as our sampling data. Then, I compressed all the individual documents one at a time.

The result took 2,593,235 bytes, or just 2.47 MB. We got a 29% compression ratio! Note that we did this with a 34KB shared dictionary.

Here is the actual compression code:

foreach (var doc in docs)
{
    size += doc.Length;
    ms.SetLength(0);
    compressedSize += compressionHandler.Compress(doc, ms);
}

And that is pretty much it. Rhea Compression is on GitHub, and that concludes my spike into compression. In general, Rhea (and FemtoZip, obviously) are meant for very specific scenarios. I have high hopes of being able to use it in the future for doing great things.


Huffman coding and encoding compressed data

So far, we have dealt with relatively straightforward topics. Given a corpus, find a shared dictionary that we can then use to extract repeated patterns. This is the classic view of compression, even if real world compression schemes are actually a lot more convoluted. Now that we have compressed text like the following, what comes next?

<-43,6>11,'n<-60,6>Anna Nepal<-40,13><-18,8><-68,8>awest@twinte.gov'}

Obviously, it is pretty important to encode this properly. Here is an example of a bad encoding scheme:

[prefix – byte – 1 for compressed, 0 for literal]

[length – int – length of compressed / literal]

[offset – int – back reference if the prefix was set to 1]

The problem here is that we need 6 bytes to encode a one letter literal, and 9 bytes to encode any back reference. Using this inefficient method, encoding the compressed string above actually takes more space than the original one.

As you can imagine, there has been a lot of research into this. The RFC 1951 specification, for example, sets out how to do that in detail, although I find this explanation much easier to go through.

Let us see a simple Huffman encoding scheme. We will use “this is an example of a huffman tree” as our text, and first, we’ll build the table.

[Huffman code table for "this is an example of a huffman tree"]

And now we can encode the above sentence in 135 bits, instead of 288.

For decoding, we just run over the table again, and stop the first time that we hit a leaf.

For example, let us see how we would decode ‘ ‘ and ‘r’.

Space is very common, appearing 7 times in the text. As such, it is encoded as 111. So, start from the root, and move right three times. We are in a leaf node, and we output space.

The letter R only appears in the text 4 times, and its encoding is 10111. So move right from the root node, then left, then three times right, ending in the leaf node R.

So the result of Huffman encoding using the above table is:

[t] = 0001
[h] = 1010
[i] = 1011
[s] = 0000
[ ] = 111
[i] = 1011
[s] = 0000
[ ] = 111
[a] = 010
[n] = 0011
[ ] = 111
[e] = 011
[x] = 10001
[a] = 010
[m] = 0010
[p] = 10010
[l] = 110010
[e] = 011
[ ] = 111
[o] = 110011
[f] = 1101
[ ] = 111
[a] = 010
[ ] = 111
[h] = 1010
[u] = 10000
[f] = 1101
[f] = 1101
[m] = 0010
[a] = 010
[n] = 0011
[ ] = 111
[t] = 0001
[r] = 10011
[e] = 011
[e] = 011

However, we have a problem here: this results in 135 bits, or 17 bytes (vs 36 bytes for the original text). However, 17 bytes contain 136 bits. How do we avoid corruption in this case? The answer is that we have to include an EOF marker in there. We do that by just adding a new item to our Huffman table and encoding it normally.

So, that is all set, and we are ready to go, right? Almost; we still need to decide how to actually encode the literals and compressed data. GZip (and FemtoZip) use a really nice way of handling that.

The use of a Huffman table means that it contains the frequencies of individual byte values, but instead of having just 0 – 255 entries in the Huffman table, they use 513 entries.

256 entries are for the actual byte entries.

256 entries are just lengths.

1 entry for EOF.

What do length entries mean? Basically, we use the values 256 – 511 for lengths. When we read a Huffman value, it will give us a value in the range of 0 – 512.

If it is 512, this is the EOF marker, and we are done.

If it is 0 – 255, it is a byte literal, and we can treat it as such.

But if it is in the range of 256 – 511, it means that this is a length. Note that because lengths also cluster around common values, it is very likely that we’ll be able to store most lengths in under a byte.

Following a length, we store the actual back reference offset. And again, FemtoZip is using Huffman encoding to do that. This is actually done by encoding the offset as multiple 4 bit entries. The idea is to gain as much as possible from commonalities in the actual byte patterns of the offsets.
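Putting the symbol ranges together, the decoder’s dispatch loop ends up looking roughly like this (a sketch of the scheme described above, not FemtoZip’s actual code; ReadSymbol, ReadOffset and CopyFromOutput are hypothetical helpers):

while (true)
{
    int symbol = ReadSymbol(); // Huffman decoded, in the range 0 - 512

    if (symbol == 512)
        break; // EOF marker, we are done

    if (symbol < 256)
    {
        output.WriteByte((byte)symbol); // plain byte literal
        continue;
    }

    // 256 - 511: a back reference; the length is encoded in the symbol itself,
    // and the offset follows, encoded as multiple 4 bit entries
    int length = symbol - 256;
    int offset = ReadOffset();
    CopyFromOutput(offset, length);
}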

Yes, it is confusing, and it took me a while to figure it out. But it is also very sleek to see this in action.

In my next post, I’ll bring it all together and attempt to actually generate something that produces compressed data and can decompress it back successfully…
