Ayende @ Rahien

Refunds available at head office

Career planning: Where do old devs go to?

We are pretty much always looking for new people; what is holding us back from expanding even more rapidly is the time it takes to get to grips with our codebases and what we do here. That also means that we usually have at least one outstanding job offer available, because it takes a long time to fill it. But that isn’t the topic for this post.

I started programming in school, when I was around 14 or 15; I picked up a copy of VB 3.0 and did some fairly unimpressive stuff with it. I count my time as a professional from around 1999 or so, when I started actually dedicating myself to learning programming as something beyond a hobby. That was 15 years ago.

When we started doing a lot of interviews, I noticed the following pattern in the candidates’ years of experience:

[Image: chart of candidates’ years of experience]

That sort of made sense; some people got into software development for the money and left because it didn’t interest them. From the history of Hibernating Rhinos, one of our developers left and is now a co-owner of a restaurant, and another is a salesman for lasers and other lab equipment.

However, what doesn’t make sense is the ratio that I’m seeing. Where are the people who have been doing software development for decades?

Out of the hundreds of CVs that I have seen, fewer than 10 were from people over the age of 40. I don’t recall anyone over the age of 50. Note that I’m somewhat biased toward hiring people with longer experience, because that often means they don’t need to learn what goes on under the hood; they already know.

In fact, looking at the salary tables, there actually isn’t a level higher than 5 years of experience. After that, you have a team leader position, then you move into middle management, and then you are basically gone as a developer, I’m guessing.

What is the career path you have as a developer? Note that I’m explicitly excluding management positions. Long term developer career paths seem to be very rare in our industry.

Microsoft has the notion of Distinguished Engineer and Technical Fellow for people who actually have decades of experience. In my head, I have this image of a SWAT team that you throw at the hard problems.

Outside of very big companies, those roles seem to be very rare. And that is kind of sad.

At Hibernating Rhinos, we plan to actually have those kinds of long career paths, but you’ll need to ask me in 10 – 20 years how that turned out.


Mono frustrations

I’m porting Voron to Mono (currently testing on Ubuntu). I’m using Mono 3.2.8, FWIW, and working in MonoDevelop.

So far, I have run into the following annoying tangles. Attempting to write to memory that is write protected results in a NullReferenceException, instead of an AccessViolationException. I suspect that this is because an NRE is generated on any SIGSEGV, but it led me down a very different path of discovery.

Also, consider the following code:

using System.IO;
using System.IO.Compression;

namespace HelloWorld
{
    class MainClass
    {
        public static void Main (string[] args)
        {
            new ZipArchive (new MemoryStream ());
        }
    }
}

This results in the following error:

Unhandled Exception:
    System.NotImplementedException: The requested feature is not implemented.
        at HelloWorld.MainClass.Main (System.String[] args) [0x00006] in /home/ayende/HelloWorld/HelloWorld/Program.cs:11
        [ERROR] FATAL UNHANDLED EXCEPTION: System.NotImplementedException: The requested feature is not implemented.
        at HelloWorld.MainClass.Main (System.String[] args) [0x00006] in /home/ayende/HelloWorld/HelloWorld/Program.cs:11

This is annoying in that the feature isn’t implemented, but worse from my point of view is that I don’t see ZipArchive anywhere in the stack trace. That made me think that it was my code that was throwing this.


Optimizing event processing

During the RavenDB Days conference, I got a lot of questions from customers. Here is one of them.

There is a migration process that deals with an event sourcing system. We have 10,000,000 commits with 5 – 50 events per commit, and each event results in a property update to an entity.

That gives us roughly 300,000,000 events to process. The trivial way to solve this would be:

foreach(var commit in YieldAllCommits())
{
    using(var session = docStore.OpenSession())
    {
        foreach(var evnt in commit.Events)
        {
            var entity = session.Load&lt;Customer&gt;(evnt.EntityId);
            evnt.Apply(entity);
        }
        session.SaveChanges();
    }
}

That works, but it tends to be slow. The worst case here would result in 310,000,000 requests to the server.

Note that this has the nice property that all the changes in a commit are saved in a single transaction. We’re going to relax this behavior and use something better here.

We’ll take the implementation of this LRU cache and add an eviction event and support for iterating over its contents (a sketch of such a cache is at the end of this post).

using(var bulk = docStore.BulkInsert(allowUpdates: true))
{
    var cache = new LeastRecentlyUsedCache&lt;string, Customer&gt;(capacity: 10 * 1000);
    cache.OnEvict = c =&gt; bulk.Store(c);

    foreach(var commit in YieldAllCommits())
    {
        foreach(var evnt in commit.Events)
        {
            Customer entity;
            if(cache.TryGetValue(evnt.EntityId, out entity) == false)
            {
                using(var session = docStore.OpenSession())
                {
                    entity = session.Load&lt;Customer&gt;(evnt.EntityId);
                    cache.Set(evnt.EntityId, entity);
                }
            }
            evnt.Apply(entity);
        }
    }

    foreach(var kvp in cache)
    {
        bulk.Store(kvp.Value);
    }
}

Here we are using a cache of 10,000 items, with the assumption that events for an entity cluster together, so a lot of changes to an entity will happen at roughly the same time. We take advantage of that to try to load each document only once, and we use bulk insert to flush changes to the server when needed. This code handles the case where we flushed a document out of the cache and then get more events for it, but the assumption is that this scenario is much less common.
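
By the way, the post links to an existing LRU cache implementation; for completeness, here is a minimal sketch of what such a cache can look like with the eviction callback and iteration that the code above relies on (my own simplified version, not the linked one):

using System;
using System.Collections;
using System.Collections.Generic;

public class LeastRecentlyUsedCache&lt;TKey, TValue&gt; : IEnumerable&lt;KeyValuePair&lt;TKey, TValue&gt;&gt;
{
    private readonly int capacity;
    private readonly Dictionary&lt;TKey, LinkedListNode&lt;KeyValuePair&lt;TKey, TValue&gt;&gt;&gt; items;
    private readonly LinkedList&lt;KeyValuePair&lt;TKey, TValue&gt;&gt; order = new LinkedList&lt;KeyValuePair&lt;TKey, TValue&gt;&gt;();

    // called when an entry is dropped because the cache is over capacity
    public Action&lt;TValue&gt; OnEvict = delegate { };

    public LeastRecentlyUsedCache(int capacity)
    {
        this.capacity = capacity;
        items = new Dictionary&lt;TKey, LinkedListNode&lt;KeyValuePair&lt;TKey, TValue&gt;&gt;&gt;(capacity);
    }

    public bool TryGetValue(TKey key, out TValue value)
    {
        LinkedListNode&lt;KeyValuePair&lt;TKey, TValue&gt;&gt; node;
        if (items.TryGetValue(key, out node) == false)
        {
            value = default(TValue);
            return false;
        }
        order.Remove(node);      // mark as most recently used
        order.AddFirst(node);
        value = node.Value.Value;
        return true;
    }

    public void Set(TKey key, TValue value)
    {
        LinkedListNode&lt;KeyValuePair&lt;TKey, TValue&gt;&gt; existing;
        if (items.TryGetValue(key, out existing))
            order.Remove(existing);

        var node = new LinkedListNode&lt;KeyValuePair&lt;TKey, TValue&gt;&gt;(new KeyValuePair&lt;TKey, TValue&gt;(key, value));
        order.AddFirst(node);
        items[key] = node;

        if (items.Count &lt;= capacity)
            return;

        var oldest = order.Last; // least recently used entry
        order.RemoveLast();
        items.Remove(oldest.Value.Key);
        OnEvict(oldest.Value.Value);
    }

    public IEnumerator&lt;KeyValuePair&lt;TKey, TValue&gt;&gt; GetEnumerator()
    {
        return order.GetEnumerator();
    }

    IEnumerator IEnumerable.GetEnumerator()
    {
        return GetEnumerator();
    }
}

The part the migration code above cares about is OnEvict, which lets the bulk insert pick up documents as they fall out of the cache.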

Accidental code review

I’m trying to get better insight into a set of log files sent by a customer. So I went looking for a tool that could do that, and I found Inidihiang. There is an x86 vs x64 issue that I had to work through, but then it just sat there trying to parse a 34MB log file.

I got annoyed enough that I actually checked, and this is the reason why:

[Image: the offending parsing code]

Sigh…

I gave up on this and wrote my own stuff.


Where do buzzwords retire to?

At lunch today at the office, we had an interesting discussion about must-have technologies. Among the things that were thrown out were:

  • Service Oriented Architecture
  • Single Page Application
  • Cloud
  • Agile
  • TDD
  • Web 2.0
  • REST
  • AJAX
  • Data driven applications

All of those things were stuff that you had to do, and everyone was doing it. And a few years later… they are no longer hot and fancy, but they are probably still in heavy use.

By now they are as much out of fashion as yellow Crocs.


Playing with Roslyn

We do a lot of compiler work in RavenDB. Indexes are one such core example, where we take the C# language and beat both it and our heads against the wall until it agrees to do what we want it to.

A lot of that happens using the excellent NRefactory library, as well as the not-so-excellent CodeDOM API. Basically, we take a source string, convert it into something that can run, compile it on the fly, and execute it.

I decided to check the implications of using Roslyn with a very trivial benchmark:

private static void CompileCodeDome(int i)
{
    var src = @"
class Greeter
{
static void Greet()
{
System.Console.WriteLine(""Hello, World"" + " + i + @");
}
}";
    CodeDomProvider codeDomProvider = new CSharpCodeProvider();
    var compilerParameters = new CompilerParameters
    {
        OutputAssembly= "Greeter.dll",
        GenerateExecutable = false,
        GenerateInMemory = true,
        IncludeDebugInformation = false,
        ReferencedAssemblies =
        {
            typeof (object).Assembly.Location,
            typeof (Enumerable).Assembly.Location
        }
    };
    CompilerResults compileAssemblyFromSource = codeDomProvider.CompileAssemblyFromSource(compilerParameters, src);
    Assembly compiledAssembly = compileAssemblyFromSource.CompiledAssembly;
}

private static void CompileRoslyn(int i)
{
    var syntaxTree = CSharpSyntaxTree.ParseText(@"
class Greeter
{
static void Greet()
{
System.Console.WriteLine(""Hello, World"" + " +i +@");
}
}");

    var compilation = CSharpCompilation.Create("Greeter.dll",
        syntaxTrees: new[] {syntaxTree},
        references: new MetadataReference[]
        {
            new MetadataFileReference(typeof (object).Assembly.Location),
            new MetadataFileReference(typeof (Enumerable).Assembly.Location),
        });

    Assembly assembly;
    using (var file = new MemoryStream())
    {
        var result = compilation.Emit(file);
    }
}
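
As an aside, the CodeDOM version gets the compiled assembly back via CompiledAssembly; with Roslyn you load the emitted bytes yourself. Inside the using block above, right after the Emit call, that would look something like this (a sketch, and not part of the timings below):

if (result.Success)
{
    assembly = Assembly.Load(file.ToArray()); // load the emitted IL into the current AppDomain
}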

I ran it several times, and I got (run # on the X axis, milliseconds on the Y axis):

[Image: chart of compilation time in milliseconds per run, CodeDOM vs Roslyn]

The very first Roslyn invocation is very costly; the following ones cost pretty much nothing. Granted, this is a trivial example, but the CodeDOM route (which invokes csc) is much more consistent, yet much more expensive in general.


Interview questions: Large text viewer

I mentioned that we are looking for more people to work for us. This time, we are looking for WPF people to work on the profiler, as well as on another hush-hush project that we’ll hopefully be able to reveal in December.

Because we are now looking mostly for UI people, that gives us a different set of challenges to deal with. How do I get a good candidate when my own WPF knowledge is limited to “Um.. dependency properties, man, that’s the bomb, man, yeah!”?

Add to that the fact that by the time people get an interview here, we want to be sure that they can code, and that presents an interesting problem. So we come up with questions like this one. Another question we use is the large text viewer.

We need a tool that can work with text files (logs) of huge size (1 GB – 10 GB). We want to be able to open such a file and search through it.

Nitpicker corner: I usually use this tool for that. The purpose of the question isn’t actually to get such a tool; it is to see what kind of code the candidate writes.

We are looking for someone with a lot of skill on the UI side of things, so the large text file stuff is somewhat of a red herring, except that we want to see what they can do beyond just slapping a few text boxes around.
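
Not the point of the question, but for scale: the only hard constraint is that you can never hold the whole file in memory. A minimal sketch of a streaming search (my own illustration, not the tool mentioned above and not a model answer) looks like this:

using System;
using System.Collections.Generic;
using System.IO;

class HugeFileSearch
{
    // Stream through the file line by line; memory use stays flat no matter the file size.
    public static IEnumerable&lt;int&gt; FindMatchingLines(string path, string term)
    {
        int lineNumber = 0;
        foreach (var line in File.ReadLines(path))
        {
            lineNumber++;
            if (line.IndexOf(term, StringComparison.OrdinalIgnoreCase) &gt;= 0)
                yield return lineNumber;
        }
    }
}

The interesting part of the exercise is everything around this: a responsive, virtualized UI that stays usable while the scan is running.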


Interview question: That annoying 3rd party service

We are still in hiring mode, and today we completed a new question for the candidates. The task itself is pretty simple: create a form for logging in or creating a new user.

Seems simple enough, I think. But there is a small catch. We’ll provide you with the “backend” for this task, which you have to work with. The interface looks like this:

public interface IUserService
{
    void CreateNewUser(User u);

    User GetUser(string userId);
}

public class User
{
    public string Name { get; set; }
    public string Email { get; set; }
    public byte[] Sha1HashedPassword { get; set; }
}

The catch here is that we provide this as a dll that includes the implementation, and since it is supposed to represent a 3rd party service, we made it behave like one. Sometimes the service will take a long time to run, sometimes it will throw an error (ThisIsTuesdayException), sometimes it will take a long time to run and then throw an error, etc.
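
For illustration only, one direction a candidate might take (my sketch, not the “expected” answer) is to isolate the UI from the service’s latency and failures, so the form stays responsive no matter what the dll decides to do today:

using System;
using System.Threading.Tasks;

public class ResilientUserService
{
    private readonly IUserService inner;

    public ResilientUserService(IUserService inner)
    {
        this.inner = inner;
    }

    // Never let the UI thread wait on the service; time out and surface failures as null.
    public async Task&lt;User&gt; TryGetUserAsync(string userId, TimeSpan timeout)
    {
        var work = Task.Run(() =&gt; inner.GetUser(userId));
        var completed = await Task.WhenAny(work, Task.Delay(timeout));
        if (completed != work)
            return null; // timed out, let the UI offer a retry

        try
        {
            return await work; // may still throw, e.g. ThisIsTuesdayException
        }
        catch (Exception)
        {
            return null;
        }
    }
}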

Now, the question is, what is it that I’m looking to learn from the candidate’s code?


Question 6 is a trap, a very useful one

In my interview questions, I give candidates a list of 6 questions. They can either solve 3 questions from 1 to 5, or they can solve question 6.

Stop for a moment and ponder that. What do you assume is the relative complexity of those questions?

Questions 1 – 5 should take anything from 10 – 15 minutes to an hour and a half, max. Question 6 took me about 8 hours to do, although that included some blogging time about it.

Question 6 requires that you create an index for a 15 TB CSV file and allow efficient searching in it.

Questions 1 – 5 are basically gatekeeper questions. If you answer them correctly, we have a high view of you and you get an interview. Answering question 6 correctly pretty much says that we are past the “do we want you?” stage and into the “how do we get you?” stage.

But people don’t answer question 6 correctly. In fact, by this time, if you answer question 6, you have pretty much ruled yourself out, because you are going to show that you don’t understand something pretty fundamental.

Here are a couple of examples from the current crop of candidates. Remember, we are talking about a 15 TB CSV file here, containing about 400 billion records.

Candidate 1’s code looked something like this:

foreach(var line in File.ReadLines("big_data.csv"))
{
    var fields = line.Split(',');
    var email = fields[2];
    File.WriteAllText(Md5(email), line);
}

On the plus side, this doesn’t load the entire data set into memory, and you can sort of do quick searches. Of course, it also generates 400 billion files and takes at least as much space as the original file all over again. Also, on NTFS you have a maximum of about 4 billion files per volume, and other file systems have similar limitations.

So that isn’t going to work, but at least he had some idea about what is going on.

Candidate 2’s code, however, was:

// prepare
string[] allData = File.ReadAllLines("big_data.csv");
var users = new List&lt;User&gt;();
foreach (var line in allData)
{
    users.Add(User.Parse(line));
}
new XmlSerializer(typeof(List&lt;User&gt;)).Serialize(File.Create("big_data.xml"), users);

// search by:

var users = (List&lt;User&gt;)new XmlSerializer(typeof(List&lt;User&gt;)).Deserialize(File.OpenRead("big_data.xml"));
users.AsParallel().Where(x =&gt; x.Email == "the@email.wtf");

So take the 15 TB file, load it all into memory (fail #1), convert all 400 billion records to entity instances (fail #2), write it back as XML (fails #3, #4, #5), read the entire (greater than) 15 TB XML file into memory (fail #6), then try to do a parallel brute force search on a dataset of 400 billion records (fails #7 – #400,000,000,000).

So, dear candidates 1 & 2, thank you for making it easy to say, thanks, but no thanks.


Troubleshooting, when F5 debugging can’t help you

You might have noticed that we have been doing a lot of work on the operational side of things, to make sure that we give you as good a story as possible with regard to the care &amp; feeding of RavenDB. This post isn’t about that, though. This post is about your applications and systems, and how you are going to react when !@)(*#!@(* happens.

In particular, the question is what do you do when this happens?

This situation can crop up in many disguises. For example, you might be seeing high memory usage in production, or experiencing growing CPU usage over time, or watching request times go up, or any of a hundred and one different production issues that make for a hell of a night (somehow, they almost always happen at night).

Here is how people usually think about it.

The first thing to do is to understand what is going on. About the hardest thing to handle in these situations is having an issue (high memory, high CPU, etc.) and no idea why; usually all the effort is spent just figuring out the what and the why. The problem with this troubleshooting process is that it is very easy to jump to conclusions and form an utterly wrong hypothesis, and then you have to go through the rest of the steps just to realize it isn’t right.

So the first thing that we need to do is gather information. And this post is primarily about the various ways that you can do that. In RavenDB, we have actually spent a lot of time exposing information to the outside world, so we’ll have an easier time figuring out what is going on. But I’m going to assume that you don’t have that.

The be-all and end-all tool for this kind of error is WinDBG. This is the low level tool that gives you access to pretty much anything you could want. It is also very archaic and not very friendly at all. The good thing about it is that you can load a dump into it. A dump is a capture of the process state at a particular point in time. It gives you the ability to see the entire memory contents and all the threads. It is an essential tool, but also the last one I want to use, because it is pretty hard to work with. Dump files can be very big; multiple GB is very common, because they contain the full memory dump of the process. There are also mini dumps, which are easier to work with but don’t contain the memory contents, so you can look at the threads, but not the data.

The .NET Memory Profiler is another great tool for figuring things out. It isn’t usually so good for production analysis, because it uses the Profiler API to do its work, but it has a wonderful feature for loading dump files (ironically, it can’t handle very large dump files because of memory issues) and gives you a much nicer view of what is going on there.

For high CPU situations, I like to know what is actually going on, and looking at the stack traces is a great way to do that. WinDBG can help here (take a few mini dumps a few seconds apart), but again, it isn’t so nice to use.

Stack Dump is a tool that takes a lot of the pain out of having to deal with that, because it just outputs all the thread information, and we have used it successfully in the past to figure out what is going on.

For general performance issues (“requests are slow”), we need to figure out where the slowness actually is. We have had reports that run the gamut from “things are slow, the client machine is loaded” to “things are slow, the network QoS settings throttle us”. I like to start by using Fiddler to figure those things out. In particular, the statistics window is very helpful:

[Image: Fiddler statistics window]

The obvious things are the bytes sent &amp; bytes received. We have had a few cases where a customer was actually sending hundreds of MB in either or both directions, and was surprised it took some time. If those values are fine, you want to look at the actual performance listing. In particular, look at things like TCP/IP connect time, the time from the client sending the request to the server starting to receive it, etc.

If you find that the problem is actually at the network layer, you might not be able to handle it immediately. You might need to go a level or two lower and look at the actual TCP traffic. This is where something like Wireshark comes into play, and it is useful for figuring out whether you have specific errors at that level (for example, a bad connection that causes a lot of packet loss will impact performance, but things will still work).

Other tools that are very important include Resource Monitor, Process Explorer and Process Monitor. Those give you a lot of information about what your application is actually doing.

Once you have all of that information, you can form a hypothesis and try to test it.

If you own the application in question, the best way to improve your chances of figuring out what is going on is to add logging. Lots &amp; lots of logging. In production, having the logs to support what is going on is crucial. I usually have several levels of logging: the traffic in/out of my system; then the actual system operations, especially anything that happens in the background; and finally, the debug/trace endpoints that expose internal state and allow you to tweak various things at runtime.

Having good working knowledge of how to properly utilize the above-mentioned tools is very important, and should be considered much more useful than learning a new API or language feature.

There is no WE in a Job Interview

This is a pet peeve of mine. When interviewing candidates, I usually ask some variant of “tell me about a feature you developed that you are proud of”. I use this question to gauge several things: what the candidate is actually proud of, what they were working on, and whether they are genuinely proud of what they did.

One of the more annoying tendencies is for a candidate to give a reply in the form of “what we did was…”, particularly if they go on to never mention something that they specifically did. And no, “led the Xyz team in…” is a really bad example. I’m not hiring your team; if I were, I might actually be interested in that. I’m interested in the candidate, personally. And if the candidate won’t tell me what it was that they did, I’m going to wonder if they played Solitaire all day.


A tale of two interviews

We’ve been trying to find more people recently, and that means sifting through a lot of candidates. Once that process is done, we ask them to come to our offices for an interview. We recently had two interviews with people who were diametrically opposed to one another. Just to steal my own thunder: we decided not to go forward with either of them. Before inviting them to an interview, I had them do a few coding questions at home. Those are things like:

  • Given a big CSV file (that fits in memory), allow speedy querying by name or email. The application will run for a long period of time, and startup time isn’t very important.
  • Given a very large file (multiple TB), detect which 4MB ranges have changed in the file between consecutive runs of your program.

We’ll call the first one Joe. Joe has a lot of experience; he has been doing software for a long time, and has already had the chance to be a team lead in a couple of previous positions. He sent us some really interesting code. Usually I get a class or three in those answers. In this case, we got something that looked like this:

The main problem I had with his code was just finding where anything actually happens. I wrote off the over-architecture as someone trying to impress in an interview (“See all my beautiful ARCHITECTURE!”), and looked at the actual code for the task at hand, which wasn’t bad.

Joe was full of confidence; he was articulate and well spoken, and appeared to have a real passion for the architecture he sent us. “I’ve learned that it is advisable to put proper architecture first” and “That is now my default setting”. I disagree with those statements, but I can live with that. What bothered me was a question we asked along the way: “how would you deal with a high memory situation in an application?” What followed was several minutes of very smooth talk about motivating people, giving them the kind of support they need to do the job, etc. Basically, about the only thing missing was a part about “the Good of the People”, and I would have considered voting for him. What was glaringly missing, in my point of view, was anything concrete and actionable.

On the other hand, we have Moe. He is a bit younger, but he has already worked with NoSQL databases, which was a major plus. Admittedly, that was as a user rather than a developer of one, but you can’t have it all. Moe’s code made me sit up and whistle. I set up an interview for the very next day, because looking at the code, there was someone there I wanted to talk to. It was very much to the point, and while it had idiosyncrasies, it showed a lot of promise. Here is the architecture for Moe’s code:

So Moe shows up at the office, and we start the interview process. Right from the get go it is obvious that Moe is one of those people who don’t do too well in stressful situations like interviews. That is part of the reason why we ask candidates to write code at home: it drastically reduces the level of stress they have to deal with.

So I start talking, telling him about the company and what we do, in the hope that this gives him time to compose himself. Then I start asking questions, and he gives mostly the right answers, but the answers lack focus. I assume that this is probably nervousness, so I bring up his code and go over it with him. He is much more comfortable talking about that. He had an O(log N) solution at one point, and I had to steer him toward an O(1) solution for the same problem, but he got there fairly quickly.

I then asked him what I consider to be a fairly typical question: “What areas do you have complete mastery of?” This appeared to stump him, since he took several minutes to give an answer that basically boiled down to “nothing”.

Okay… this guy is nervous, and he is probably underestimating himself, so let us try to focus the question. I asked whether he was any good with HTML5 (not at all), then whether he was good with server side work (has done some work there, but not an expert), and how he would deal with a high memory situation (look at logs, but after that he was stumped). When asked about the actual code he wrote for our test, he said that this was one of the hardest tasks he had ever had to deal with.

All of that summed up to “promising, but”, and I have a policy of believing people when they tell me bad things about themselves. So this ended up being a negative, which was very frustrating.

The search continues…


We’re hiring… come work for us


It seems that recently we have been going in rounds: get more people, train them, and by the time they are trained, we already need more people.

I guess this is a good problem to have. At any rate, we are currently looking for an experienced, well-rounded developer.

This job availability is for our offices in Hadera, Israel. If you aren’t from Israel, this isn’t for you.

This job is primarily for work on our Profilers line of products. Here is the laundry list:

  • Awesome .NET skills
  • Experience in UI development using WPF, MVVM style
  • Understanding how computers work and how to make them dance
  • History with concurrency & multi threading (concurrent work history not required)
  • Architecture / design abilities

I would like to see an open source history, or projects that you can share (in other words, your own projects, not an employer’s code that you show us).

Please contact us at jobs@hibernatingrhinos.com if you are interested.


Digging into MSMQ

I got into a discussion online about MSMQ and its performance. So I decided to test things out.

What I want to do is check a few things, in particular how many messages I can push to and from MSMQ in various configurations.
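
The snippets below use a queue object and a data buffer that the post doesn’t show; here is a sketch of that setup (my reconstruction, and the 1KB payload size is a guess):

using System.Messaging;

class QueueSetup
{
    public static MessageQueue CreateTestQueue(out byte[] data)
    {
        const string path = @".\private$\msmq-perf-test";
        var queue = MessageQueue.Exists(path)
            ? new MessageQueue(path)
            : MessageQueue.Create(path, transactional: false); // non-transactional queue
        data = new byte[1024]; // payload size is an assumption
        return queue;
    }
}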

I created a non-transactional queue, and then I ran the following code:

var sp = Stopwatch.StartNew();
int count = 0;
while (sp.Elapsed.TotalSeconds &lt; 10)
{
    var message = new Message
    {
        BodyStream = new MemoryStream(data)
    };
    queue.Send(message);
    count++;
}

Console.WriteLine(sp.Elapsed);
Console.WriteLine(count);

This gives me 181,832 messages in 10 seconds, or 18,183 messages per second. I tried doing the same in a multi-threaded fashion, with 8 threads writing to MSMQ, and got an insufficient resources error, so we’ll do all of this in single-threaded tests.

Next, the exact same code, except for the Send line, which now looks like this:

queue.Send(message, MessageQueueTransactionType.Single);

This gives me 43,967 messages in 10 seconds, or 4,396 messages per second.

Next I added DTC, which gave me a total of 8,700 messages in ten seconds, or 870 messages per second! Yeah, DTC is evil.

Now, how about reading from it? I used the following code for that:

while (true)
{
    try
    {
        Message receive = queue.Receive(TimeSpan.Zero);
        receive.BodyStream.Read(data, 0, data.Length);
    }
    catch (MessageQueueException e)
    {
        Console.WriteLine(e);
        break;
    }
}

Reading from a transactional queue, we get 5,955 messages per second for 100,000 messages. Using a non-transactional queue, it can read about 16,000 messages a second.

Note that these are pretty piss poor “benchmarks”; they are intended more to give you a feel for the numbers than anything else. I’ve mostly used MSMQ within the context of DTC, and that really hits the performance hard.


On site Architecture & RavenDB consulting availabilities: Malmo & New York City

I’m going to have availability for on site consulting in Malmo, Sweden  (17 Sep) and in New York City, NY (end of Sep – beginning of Oct).

If you want me to come by and discuss what you are doing (architecture, NHibernate or RavenDB), please drop me a line.

I’m especially interested in people who need to do “strange” things with data and data access. We are building a set of tailored database solutions for customers now, and we have seen customers show a 750x improvement in performance when we gave them a database that was designed to fit their exact needs, instead of having to contort their application and their database into a seven dimensional loop just to try to store and read what they needed.

When a race condition is what you want…

I have an interesting situation that I am not sure how to resolve. We need to record the last request time for a RavenDB database. Now, this last request time is mostly used to show to the user, and to decide when a database is idle and can be shut down.

As such, it doesn’t have to be really accurate (a skew of even a few seconds is just fine). However, under load there are many requests coming in (in the case presented to us, 4,000 concurrent requests), and they all need to update the last request time.

Obviously, in such a scenario, we don’t really care about the actual value. But the question is, how do we deal with that? In particular, I want to avoid a situation where we do a lot of writes to the same value in an unprotected manner, mostly because it is likely to cause contentions between cores.

Any ideas?

It is actually fine for us to go slightly back (so thread A at time T+1 and thread B at time T+2 running concurrently, and the end result is T+1), which is why I said that a race is fine for us. But what I want to avoid is any form of locking / contention.

I wrote the following test code:

class Program
{
    static void Main(string[] args)
    {
        var threads = new List<Thread>();

        var holder = new Holder();

        var mre = new ManualResetEvent(false);

        for (int i = 0; i < 2500; i++)
        {
            var thread = new Thread(delegate()
            {
                mre.WaitOne();
                for (long j = 0; j < 500*1000; j++)
                {
                    holder.Now = j;
                }
            });
            thread.Start();
            threads.Add(thread);
        }

        mre.Set();

        threads.ForEach(t => t.Join());


        Console.WriteLine(holder.Now);
    }
}

public class Holder
{
    public long Now;
}

And it looks like it is doing what I want it to. This creates a lot of contention on the same value, but it also gives me the right value, and again, “right” here is very approximate. The problem is that while I know how to write thread safe code, I’m not sure that this is a good way to go about it.

Note that this code (yes, even with 2,500 threads) runs quite fast, in under a second. Trying to use Interlocked.Exchange is drastically more expensive, and Interlocked.CompareExchange is even worse.
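
For comparison, the Interlocked.Exchange variant mentioned above amounts to replacing the body of the writer loop with this (a sketch of the change only; the relative costs quoted are from the original runs, I did not re-measure):

for (long j = 0; j &lt; 500 * 1000; j++)
{
    Interlocked.Exchange(ref holder.Now, j); // full memory barrier on every write
}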

But it is just not sitting well with me.


Message passing, performance

I got some replies about the async event loop post, mentioning LMAX Disruptor and performance. I decided to see for myself what the fuss was all about.

You can read about the LMAX Disruptor, but basically, it is a very fast single process messaging library.

I wondered what that meant, so I wrote my own messaging library:

public class Bus&lt;T&gt;
{
    Queue&lt;T&gt; q = new Queue&lt;T&gt;();

    public void Enqueue(T msg)
    {
        lock (q)
        {
            q.Enqueue(msg);
        }
    }

    public bool TryDequeue(out T msg)
    {
        lock (q)
        {
            if (q.Count == 0)
            {
                msg = default(T);
                return false;
            }
            msg = q.Dequeue();
            return true;
        }
    }
}

I think that you’ll agree that this is a thing of beauty and elegant coding. I then tested this with the following code:

public static void Read(Bus&lt;string&gt; bus)
{
    int count = 0;
    var sp = Stopwatch.StartNew();
    while (sp.Elapsed.TotalSeconds &lt; 10)
    {
        string msg;
        while (bus.TryDequeue(out msg))
        {
            count++;
        }
    }
    sp.Stop();

    Console.WriteLine("{0:#,#;;0} msgs in {1} for {2:#,#} ops/sec", count, sp.Elapsed, (count / sp.Elapsed.TotalSeconds));
}

public static void Send(Bus&lt;string&gt; bus)
{
    var sp = Stopwatch.StartNew();
    while (sp.Elapsed.TotalSeconds &lt; 10)
    {
        for (int i = 0; i &lt; 1000; i++)
        {
            bus.Enqueue("test");
        }
    }
}

var bus = new Bus&lt;string&gt;();

ThreadPool.QueueUserWorkItem(state =&gt; Send(bus));

ThreadPool.QueueUserWorkItem(state =&gt; Read(bus));

The result of this code?

145,271,000 msgs in 00:00:10.4597977 for 13,888,510 ops/sec

Now, what happens when we use the DataFlow’s BufferBlock as the bus?

public static async Task ReadAsync(BufferBlock&lt;string&gt; bus)
{
    int count = 0;
    var sp = Stopwatch.StartNew();
    while (sp.Elapsed.TotalSeconds &lt; 10)
    {
        try
        {
            await bus.ReceiveAsync(TimeSpan.FromMilliseconds(5));
            count++;
        }
        catch (TaskCanceledException)
        {
        }
    }
    sp.Stop();

    Console.WriteLine("{0:#,#;;0} msgs in {1} for {2:#,#} ops/sec", count, sp.Elapsed, (count / sp.Elapsed.TotalSeconds));
}

public static async Task SendAsync(BufferBlock&lt;string&gt; bus)
{
    var sp = Stopwatch.StartNew();
    while (sp.Elapsed.TotalSeconds &lt; 10)
    {
        for (int i = 0; i &lt; 1000; i++)
        {
            await bus.SendAsync("test");
        }
    }
}

What we get is:

43,268,149 msgs in 00:00:10 for 4,326,815 ops/sec.

I then decided to check what happens with the .NET port of the LMAX Disruptor. Here is the code:

public class Holder
{
    public string Val;
}

internal class CounterHandler : IEventHandler&lt;Holder&gt;
{
    public int Count;

    public void OnNext(Holder data, long sequence, bool endOfBatch)
    {
        Count++;
    }
}

static void Main(string[] args)
{
    var disruptor = new Disruptor.Dsl.Disruptor&lt;Holder&gt;(() =&gt; new Holder(), 1024, TaskScheduler.Default);
    var counterHandler = new CounterHandler();
    disruptor.HandleEventsWith(counterHandler);

    var ringBuffer = disruptor.Start();

    var sp = Stopwatch.StartNew();
    while (sp.Elapsed.TotalSeconds &lt; 10)
    {
        for (var i = 0; i &lt; 1000; i++)
        {
            long sequenceNo = ringBuffer.Next();

            ringBuffer[sequenceNo].Val = "test";

            ringBuffer.Publish(sequenceNo);
        }
    }
    Console.WriteLine("{0:#,#;;0} msgs in {1} for {2:#,#} ops/sec", counterHandler.Count, sp.Elapsed, (counterHandler.Count / sp.Elapsed.TotalSeconds));
}

And the resulting performance is:

29,791,996 msgs in 00:00:10.0003334 for 2,979,100 ops/sec

Now, I’ll be the first to agree that this is really and absolutely not even close to being a fair benchmark. It is testing wildly different things. Disruptor is using a ring buffer, the BufferBlock isn’t, and the original Bus implementation just used an unbounded queue.

But it is a very telling benchmark as well, pretty much because the differences don’t matter. What I need this for is network protocol handling. As such, even assuming that every single byte is a message, we would have to go far beyond what any reasonable pipe can be expected to handle.


Async event loops in C#

I’m designing a new component, and I want to reduce the amount of complexity involved in dealing with it. This is a networked component, and after designing several such, I wanted to remove one area of complexity, which is the use of explicitly concurrent code. Because of that, I decided to go with the following architecture:

[Image: architecture diagram - the network reader feeds an in-memory queue, which a single threaded event loop processes]

The network code just reads messages from the network and puts them in an in-memory queue. Then we have a single threaded event loop that simply goes over the queue and processes those messages.

All of the code that actually processes messages is single threaded, which makes it oh so much easier to work with.

Now, I could do this quite easily with a BlockingCollection&lt;T&gt;, which is how I have usually done this sort of thing so far. It is simple, robust and easy to understand. It also ties down a full thread for the event loop, which can be a shame if you don’t get a lot of messages.

So I decided to experiment with async approaches. In particular, using the BufferBlock<T> from the DataFlow assemblies.

I came up with the following code:

var cts = new CancellationTokenSource();

var q = new BufferBlock&lt;int&gt;(new DataflowBlockOptions
{
    CancellationToken = cts.Token,
});

This just creates the buffer block, but the nice thing here is that I can set up a “global” cancellation token for all operations on it. The problem is that this actually generates bad exceptions (InvalidOperationException, instead of TaskCanceledException). Well, I’m not sure if “bad” is the right term, but it isn’t the one I would expect here, at least. If you pass a cancellation token directly to the method, you get the behavior I expected.

At any rate, the code for the event loop now looks like this:

private static async Task EventLoop(BufferBlock&lt;object&gt; bufferBlock, CancellationToken cancellationToken)
{
    while (true)
    {
        object msg;
        try
        {
            msg = await bufferBlock.ReceiveAsync(TimeSpan.FromSeconds(3), cancellationToken);
        }
        catch (TimeoutException)
        {
            NoMessagesInTimeout();
            continue;
        }
        catch (Exception)
        {
            break;
        }
        ProcessMessage(msg);
    }
}

And that is pretty much it. We have a good way to handle timeouts and process messages, and we don’t take up a thread. We can also be easily cancelled. I still need to run this through a lot more testing, in particular to verify that it doesn’t cause issues when we need to debug this sort of system, but it looks promising.
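
The producer side is the mirror image: the network read loop just posts into the block. A sketch (ReadMessageAsync is a stand-in for whatever message framing the network layer actually uses):

private static async Task NetworkReader(Stream stream, BufferBlock&lt;object&gt; bufferBlock)
{
    while (true)
    {
        object msg = await ReadMessageAsync(stream); // hypothetical framing/deserialization
        if (msg == null)
            break; // connection closed
        bufferBlock.Post(msg);
    }
}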

SQL Injection & Optimizations Path

It is silly, but I just had a conversation with one of our developers about SQL injection. In RavenDB we support replicating to a relational database, which obviously requires using SQL. We are doing things properly, with parameters and everything.

No chance of SQL injection there. Great, and that would be the end of a very short blog post if that were everything.

As it turned out, there is a significant performance difference between:

@p1 = 'users/1'
@p2 = 'users/2'

DELETE FROM Users WHERE Id IN (@p1, @p2)

And:

DELETE FROM Users WHERE Id IN ('users/1', 'users/2')

Enough that we added this as an option. The reason why is related to the vagaries of the database query optimizer, and isn’t really relevant here.

This is off by default, obviously. And we use parameters by choice & preference. But we still added a minimal “protection” by adding:

sqlValue.Replace("'", "''")

Considering that this isn’t meant for user input (it is for document ids), having to do even that is annoying.
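
For clarity, here is roughly what the inlined variant looks like when it is generated, with the quote doubling applied (my illustration, not RavenDB’s actual replication code):

using System.Collections.Generic;
using System.Linq;

static class SqlStatementBuilder
{
    // Document ids are embedded directly in the statement, with single quotes doubled.
    public static string BuildDelete(IEnumerable&lt;string&gt; documentIds)
    {
        var values = documentIds.Select(id =&gt; "'" + id.Replace("'", "''") + "'");
        return "DELETE FROM Users WHERE Id IN (" + string.Join(", ", values) + ")";
    }
}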

Any suggestions on how to improve this?


Compression finale

After a fairly long road, we are done. We have all the pieces, generating a shared dictionary, writing using Huffman encoding and getting the results back out.

Hopefully by now the theory behind it is fairly clear to you, and it is time to actually put this into practice.

I have 100,000 random user documents in this file, and I want to see what kind of compression I can get from a shared dictionary approach. The project with the code for all of this can be found here: Rhea Compression (most of it is basically a port of FemtoZip to .NET).

The actual file size is 8.49 MB, and when compressing it with Zip (Windows’ Send to compressed folder), it turns into a 1.93 MB file.

The original file size in bytes is: 8,809,353.

I then tried to compress each document individually using GZipStream, resulting in a total of 10,004,614 bytes used, or 9.5 MB! In other words, and to no one’s surprise (I hope), we see an increase in the file size.

However, when using Rhea’s compression, we do the following:

var trainer = new CompressionTrainer();

for (int i = 0; i < json.Length/100; i++)
{
    trainer.TrainOn(json[i*100]);
}

var compressionHandler = trainer.CreateHandler();

This creates a shared dictionary from every 100th document. So we have 1,000 documents as our sampling data. Then, I compressed all the individual documents one at a time.

The result took 2,593,235 bytes, or just 2.47 MB; the compressed output is about 29% of the original size! Note that we did this with a 34KB shared dictionary.

Here is the actual compression code:

foreach (var doc in docs)
{
    size += doc.Length;
    ms.SetLength(0);
    compressedSize += compressionHandler.Compress(doc, ms);
}

And that is pretty much it. Rhea Compression is on GitHub, and that concludes my spike into compression. In general, Rhea (and FemtoZip, obviously) are meant for very specific scenarios. I have high hopes of being able to use it in the future for doing great things.


Huffman coding and encoding compressed data

So far, we have dealt with relatively straightforward topics: given a corpus, find a shared dictionary that we can then use to extract repeated patterns. This is the classic view of compression, even if real world compression schemes are actually a lot more convoluted. Now that we have compressed text like the following, what comes next?

<-43,6>11,'n<-60,6>Anna Nepal<-40,13><-18,8><-68,8>awest@twinte.gov'}

Obviously, it is pretty important to encode this properly. Here is an example of a bad encoding scheme:

[prefix – byte – 1 for compressed, 0 for literal]

[length – int – length of compressed / literal]

[offset – int – back reference if the prefix was set to 1]

The problem here is that we need 6 bytes to encode a one letter literal, and 9 bytes to encode any back reference. Using this inefficient method, encoding the compressed string above actually takes more space than the original one.

As you can imagine, there has been a lot of research into this. The RFC 1951 specification, for example, sets out how to do that in detail, although I find this explanation much easier to go through.

Let us see a simple Huffman encoding scheme. We will use “this is an example of a huffman tree” as our text, and first, we’ll build the table.

[Image: Huffman frequency table and tree for the example text]

And now we can encode the above sentence in 135 bits, instead of 288.

For decoding, we just walk the tree again, and stop the first time that we hit a leaf.

For example, let us see how we would decode ‘ ‘ and ‘r’.

Space is very common, appearing 7 times in the text. As such, it is encoded as 111. So, start from the root, and move right three times. We are in a leaf node, and we output space.

The letter R appears in the text far less often, and its encoding is 10111. So move right from the root node, then left, then right three times, ending in the leaf node for R.

So the result of Huffman encoding using the above table is:

[t] = 0001
[h] = 1010
[i] = 1011
[s] = 0000
[ ] = 111
[i] = 1011
[s] = 0000
[ ] = 111
[a] = 010
[n] = 0011
[ ] = 111
[e] = 011
[x] = 10001
[a] = 010
[m] = 0010
[p] = 10010
[l] = 110010
[e] = 011
[ ] = 111
[o] = 110011
[f] = 1101
[ ] = 111
[a] = 010
[ ] = 111
[h] = 1010
[u] = 10000
[f] = 1101
[f] = 1101
[m] = 0010
[a] = 010
[n] = 0011
[ ] = 111
[t] = 0001
[r] = 10011
[e] = 011
[e] = 011

However, we have a problem here: this results in 135 bits, or 17 bytes (vs. 36 bytes for the original text). But 17 bytes contain 136 bits. How do we avoid corruption in this case? The answer is that we have to include an EOF marker. We do that by just adding a new item to our Huffman table and encoding it normally.

So, that is all set, and we are ready to go, right? Almost. We still need to decide how to actually encode the literals and the compressed data. GZip (and FemtoZip) use a really nice way of handling that.

The Huffman table contains the frequencies of individual byte values, but instead of having just 256 entries (for the values 0 – 255), they use 513 entries.

256 entries are for the actual byte entries.

256 entries are just lengths.

1 entry for EOF.

What do length entries mean? Basically, we store the values 256 – 511 for lengths. When we read a Huffman value, it gives us a value in the range 0 – 512.

If it is 512, this is the EOF marker, and we are done.

If it is 0 – 255, it is a byte literal, and we can treat it as such.

But if it is in the range of 256 – 511, it means that this is a length. Note that because lengths also cluster around common values, it is very likely that we’ll be able to store most lengths in under a byte.

Following a length, we store the actual back reference offset. And again, FemtoZip is using Huffman encoding to do that. This is done by encoding the offset as multiple 4-bit entries. The idea is to gain as much as possible from commonalities in the actual byte patterns of the offsets.
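
Putting the pieces together, consuming that 513-symbol stream looks roughly like this (my own sketch, not FemtoZip’s actual API; readSymbol and readOffset stand in for the Huffman decoding steps):

static List&lt;byte&gt; Decode(Func&lt;int&gt; readSymbol, Func&lt;int&gt; readOffset, byte[] dictionary)
{
    var output = new List&lt;byte&gt;(dictionary);       // back references may reach into the dictionary
    while (true)
    {
        int symbol = readSymbol();                 // Huffman decoded value in the range 0 - 512
        if (symbol == 512)
            break;                                 // EOF marker
        if (symbol &lt; 256)
        {
            output.Add((byte)symbol);              // plain byte literal
            continue;
        }
        int length = symbol - 256;                 // 256 - 511 encode match lengths
        int offset = readOffset();                 // the offset is itself Huffman coded in 4-bit pieces
        int start = output.Count - offset;         // copy 'length' bytes starting 'offset' bytes back
        for (int i = 0; i &lt; length; i++)
            output.Add(output[start + i]);
    }
    output.RemoveRange(0, dictionary.Length);      // strip the prepended dictionary
    return output;
}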

Yes, it is confusing, and it took me a while to figure it out. But it is also very sleek to see this in action.

In my next post, I’ll bring it all together and attempt to actually generate something that produces compressed data and can decompress it back successfully…


Using shared dictionary during compression

After the process of creating the actual dictionary, it is time for us to actually make use of it, isn’t it?

Again, I’m following the code from FemtoZip here, and I’m going to explain how it actually uses the dictionary for compression. The magic is happening in the PrefixHash class. Let us see the calling code:

var dic = Encoding.UTF8.GetBytes("asonerryson@eterson','.mil'ame':'{'id':','country':'P','email':','country':'");

var text = Encoding.UTF8.GetBytes("{'id':11,'name':'Anna West','country':'Nepal','email':'awest@twinte.gov'}");

var prefixHash = new PrefixHash(dic, true);

var bestMatch = prefixHash.GetBestMatch(0, text);
Console.WriteLine(Encoding.UTF8.GetString(dic, bestMatch.BestMatchIndex, bestMatch.BestMatchLength));

The output of this code is: {‘id’:

How does this work?

When we create the prefixHash, it generates the following table by hashing every 4 bytes and storing the relevant positions.

hash[  5] =  58;
hash[ 11] =  71;
hash[ 13] =  22;
hash[ 14] =  11;
hash[ 17] =  23;
hash[ 30] =  16;
hash[ 34] =  61;
hash[ 35] =   5;
hash[ 37] =  70;
hash[ 41] =  65;
hash[ 45] =  54;
hash[ 49] =  66;
hash[ 57] =  36;
hash[ 58] =  29;
hash[ 63] =  57;
hash[ 65] =  60;
hash[ 66] =  56;
hash[ 67] =   2;
hash[ 72] =   0;
hash[ 73] =  28;
hash[ 78] =  62;
hash[ 79] =  35;
hash[ 80] =  51;
hash[ 87] =  55;
hash[ 89] =  15;
hash[ 91] =   9;
hash[ 96] =   7;
hash[ 99] =  67;
hash[100] =  52;
hash[105] =  21;
hash[108] =  25;
hash[109] =  69;
hash[110] =  68;
hash[111] =  64;
hash[118] =  17;
hash[120] =   4;
hash[125] =  33;
hash[126] =   3;
hash[127] =  26;
hash[130] =  18;
hash[131] =  31;
hash[132] =  59;

The hash of {‘id (the first 4 bytes) is 125. And as you can see, that maps to position 33. That means that there is a good chance that at position 33 we’ll find the value {‘id. What we do then is check, and keep going as long as we have a match.

That is how we can figure out that there is a 6 character match starting at position 33. The actual code is more involved, of course, and we need to check whether there might be another match, elsewhere in the dictionary, that would serve better. Another issue when we actually compress is that beyond using the dictionary, it is also possible to use the plain text we have already compressed as another dictionary, which is what FemtoZip is doing.

Basically, the logic goes like this. Check the current position for an entry in the dictionary, then check if we already had this value in the plain text we have seen so far. Select the largest match, then output that.
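
Here is a sketch of that matching idea (my own simplified version, not the actual PrefixHash code): map every 4-byte window of the dictionary to its position, then verify and extend matches byte by byte.

static Dictionary&lt;int, int&gt; BuildPrefixHash(byte[] dic)
{
    var hash = new Dictionary&lt;int, int&gt;();
    for (int i = 0; i + 4 &lt;= dic.Length; i++)
        hash[PackWindow(dic, i)] = i;              // remember where each 4-byte window occurs
    return hash;
}

static int PackWindow(byte[] buffer, int pos)
{
    // pack the 4 bytes into an int key; FemtoZip hashes into a small table instead,
    // which is why the values in the dump above are so small
    return buffer[pos] | (buffer[pos + 1] &lt;&lt; 8) | (buffer[pos + 2] &lt;&lt; 16) | (buffer[pos + 3] &lt;&lt; 24);
}

static int MatchLengthAt(byte[] dic, int dicPos, byte[] text, int textPos)
{
    int len = 0;
    while (dicPos + len &lt; dic.Length &amp;&amp;
           textPos + len &lt; text.Length &amp;&amp;
           dic[dicPos + len] == text[textPos + len])
        len++;
    return len;                                    // a match of 4 or more is worth a back reference
}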

Here is actually using it all:

var dic = Encoding.UTF8.GetBytes("asonerryson@eterson','.mil'ame':'{'id':','country':'P','email':','country':'");

var text = Encoding.UTF8.GetBytes("{'id':11,'name':'Anna Nepal','country':'Nepal','email':'awest@twinte.gov'}");

var substringPacker = new SubstringPacker(dic);
substringPacker.Pack(text, new DebugPackerOutput(), Console.Out);

The output is meant to be human readable, and the compressed text is:

<-43,6>11,'n<-60,6>Anna Nepal<-40,13><-18,8><-68,8>awest@twinte.gov'}

Note that we saved a total of 41 characters due to compression (assuming we don’t count the cost of actually encoding this).

Now, what about those references? They look very strange with those negative numbers. The reason those numbers are negative is actually quite simple. They aren’t dictionary entries, like you would think. Instead, they are back references. In other words, the first <-43,6> call is actually saying: go backward 43 bytes, then copy 6 bytes.

But we just started reading the compressed text, where do we go backward to? The answer is that we go backward into the dictionary. So all the references in the text are always relative to our current position. Let us resolve this compressed string one step at a time.

<-43,6> means go back 43 bytes into the dictionary and copy 6 bytes, giving us a string of:

{‘id’:

Then we have the literal “11,’n”, which we append to the string:

{‘id’:11,’n

Now we need to go 60 bytes back (from the current end of the string) and copy 6 bytes giving us:

{‘id’:11,’name’:’

The literal “Anna Nepal” gives us:

{‘id’:11,’name’:’Anna Nepal

Then we have to go 40 characters back, and copy 13 bytes:

{‘id’:11,’name’:’Anna Nepal’,’country’:’

Now this is fun, we have to go 18 chars back, and for the first time, we aren’t hitting the dictionary, we are using the actual string that we uncompressed to generate the rest of the string:

{‘id’:11,’name’:’Anna Nepal’,’country’:’Nepal’,

Another backward reference, 68 steps and copying 8 bytes (again to the dictionary):

{‘id’:11,’name’:’Anna Nepal’,’country’:’Nepal’,’email’:’

The literal awest@twinte.gov’} completes the picture, giving us the full text:

{‘id’:11,’name’:’Anna Nepal’,’country’:’Nepal’,’email’:’awest@twinte.gov’}

And that is how FemtoZip works. And that is pretty neat.

The actual implementation is doing Huffman compression as well, but I’ll touch on that in a later post.


Shared dictionary generation

As I said, generating a shared dictionary turned out to be a bit more complex than I thought it would be. I hoped to be able to just use a prefix tree and get the highest scoring entries, but that doesn’t fly. I turned to femtozip to see how they do that, and it became both easier and harder at the same time.

They are doing this using a suffix array and LCP. I decided to port this to C# so I can play with this more easily. We start with the following code:

 var dic = new DictionaryOptimizer();

 dic.Add("{'id':1,'name':'Ryan Peterson','country':'Northern Mariana Islands','email':'rpeterson@youspan.mil'");
 dic.Add("{'id':2,'name':'Judith Mason','country':'Puerto Rico','email':'jmason@quatz.com'");
 dic.Add("{'id':3,'name':'Kenneth Berry','country':'Pakistan','email':'kberry@wordtune.mil'");

 var optimize = dic.Optimize(512);

This gives me an initial corpus to work with. Let us dig in and figure out how it works. Note that I use a very small sample to reduce the amount of stuff we have to go through.

The first thing that FemtoZip does is concatenate all of those entries together and generate a suffix array. A suffix array is all the suffixes of the combined string; part of it, for the string above, is:

ariana Islands','email':'rpeterson@yousp
ason','country':'Puerto Rico','email':'j
ason@quatz.com'{'id':3,'name':'Kenneth B
atz.com'{'id':3,'name':'Kenneth Berry','
berry@wordtune.mil'
co','email':'jmason@quatz.com'{'id':3,'n
com'{'id':3,'name':'Kenneth Berry','coun
country':'Northern Mariana Islands','ema
country':'Pakistan','email':'kberry@word
country':'Puerto Rico','email':'jmason@q
d':1,'name':'Ryan Peterson','country':'N
d':2,'name':'Judith Mason','country':'Pu
d':3,'name':'Kenneth Berry','country':'P
dith Mason','country':'Puerto Rico','ema
ds','email':'rpeterson@youspan.mil'{'id'
dtune.mil'
e':'Judith Mason','country':'Puerto Rico
e':'Kenneth Berry','country':'Pakistan',
e':'Ryan Peterson','country':'Northern M
e.mil'
email':'jmason@quatz.com'{'id':3,'name':
email':'kberry@wordtune.mil'
email':'rpeterson@youspan.mil'{'id':2,'n
enneth Berry','country':'Pakistan','emai

The idea is to generate all the suffixes from the string, then sort them, and then use the LCP (longest common prefix) to see what prefix is shared between any two consecutive entries.
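
A naive sketch of that step (FemtoZip builds the suffix array far more efficiently; the Substring calls here are wasteful, but they keep the idea obvious):

static void DumpCommonSubstrings(string combined)
{
    int[] suffixes = Enumerable.Range(0, combined.Length).ToArray();
    Array.Sort(suffixes, (a, b) =&gt; string.CompareOrdinal(combined.Substring(a), combined.Substring(b)));

    for (int i = 1; i &lt; suffixes.Length; i++)
    {
        // longest common prefix of two consecutive suffixes in sorted order
        int lcp = 0;
        while (suffixes[i - 1] + lcp &lt; combined.Length &amp;&amp;
               suffixes[i] + lcp &lt; combined.Length &amp;&amp;
               combined[suffixes[i - 1] + lcp] == combined[suffixes[i] + lcp])
            lcp++;

        if (lcp &gt;= 4) // anything shorter isn't worth a dictionary entry
            Console.WriteLine("{0,3} : {1}", lcp, combined.Substring(suffixes[i], lcp));
    }
}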

Together, we can use that to generate a list of all the common substrings. Then we start ranking them by how often they appear. Afterward, it is a matter of selecting the most frequent items that are also the largest, so our dictionary entries will be as useful as possible.

That gives us a list of potential entries:

','country':'
,'country':'
'country':'
','email':'
country':'
,'email':'
ountry':'
,'name':'
'email':'
untry':'
email':'
'name':'
ntry':'
name':'
mail':'
son','country':'
on','country':'
n','country':'
','country':'P
,'country':'P
{'id':
try':'
ame':'
ail':'
'country':'P
country':'P
ountry':'P
untry':'P
ntry':'P
ry':'
me':'
il':'
'id':
try':'P
ry':'P
y':'P
.mil'
y':'
n','
l':'
id':
e':'
eterson
terson
son@
mil'
':'P
erson
rson
erry
ason

One thing you can note here is that there are a lot of repeated strings. country appears in a lot of permutations, so we need to clear this up as well, remove all the entries that are overlapping, and then pack this into a final dictionary.

The dictionary resulting from the code above is:

asonerryson@eterson','.mil'ame':'{'id':','country':'P','email':','country':'

This contains all the repeated strings that have been deemed valuable enough to put into the dictionary.

In my next post, I’ll talk about how to make use of this dictionary to actually handle compression.


Building a shared dictionary

This turned out to be a pretty hard problem. I wanted to do my own thing, but for reference, femtozip is considered to be the master source for such things.

The idea of a shared dictionary system is that you have a training corpus from which you extract common elements, which you then use to build a dictionary, which you’ll then be able to use to compress all the other data.

In order to test this, I generated 100,000 users using Mockaroo. You can find the sample data here: RandomUsers.

The data looks like this:

{"id":1,"name":"Ryan Peterson","country":"Northern Mariana Islands","email":"rpeterson@youspan.mil"},
{"id":2,"name":"Judith Mason","country":"Puerto Rico","email":"jmason@quatz.com"},
{"id":3,"name":"Kenneth Berry","country":"Pakistan","email":"kberry@wordtune.mil"},
{"id":4,"name":"Judith Ortiz","country":"Cuba","email":"jortiz@snaptags.edu"},
{"id":5,"name":"Adam Lewis","country":"Poland","email":"alewis@muxo.mil"},
{"id":6,"name":"Angela Spencer","country":"Poland","email":"aspencer@jabbersphere.info"},
{"id":7,"name":"Jason Snyder","country":"Cambodia","email":"jsnyder@voomm.net"},
{"id":8,"name":"Pamela Palmer","country":"Guinea-Bissau","email":"ppalmer@rooxo.name"},
{"id":9,"name":"Mary Graham","country":"Niger","email":"mgraham@fivespan.mil"},
{"id":10,"name":"Christopher Brooks","country":"Trinidad and Tobago","email":"cbrooks@blogtag.name"},
{"id":11,"name":"Anna West","country":"Nepal","email":"awest@twinte.gov"},
{"id":12,"name":"Angela Watkins","country":"Iceland","email":"awatkins@izio.com"},
{"id":13,"name":"Gregory Coleman","country":"Oman","email":"gcoleman@browsebug.net"},
{"id":14,"name":"Andrew Hamilton","country":"Ukraine","email":"ahamilton@rhyzio.info"},
{"id":15,"name":"James Patterson","country":"Poland","email":"jpatterson@skippad.net"},
{"id":16,"name":"Patricia Kelley","country":"Papua New Guinea","email":"pkelley@meetz.biz"},
{"id":17,"name":"Annie Burton","country":"Germany","email":"aburton@linktype.com"},
{"id":18,"name":"Margaret Wilson","country":"Saudia Arabia","email":"mwilson@brainverse.mil"},
{"id":19,"name":"Louise Harper","country":"Poland","email":"lharper@skinder.info"},
{"id":20,"name":"Henry Hunt","country":"Martinique","email":"hhunt@thoughtstorm.org"}

And what I want to do is to run over the first 1,000 records and extract a shared dictionary. Actually generating the dictionary is surprisingly hard. The first thing I tried is a prefix tree of all the suffixes. That is, given the following entries:

banana
lemon
orange

You would have the following tree:

  • b
    • ba
      • ban
        • bana
          • banan
            • banana
  • a
    • an
      • ana
        • anan
          • anana
      • ang
        • ange
  • n
    • na
      • nan
        • nana
      • nag
        • nage
  • l
    • le
      • lem
        • lemo
          • lemon
  • o
    • or
      • ora
        • oran
          • orang
            • orange
  • r
    • ra
      • ran
        • rang
          • range
  • g
    • ge
  • e

My idea was that this would allow me to easily find all the common substrings and then rank them. But the problem is: how do I select the entries that are actually useful? That is the part where I gave up on my simple-to-follow-and-explain code and dived into the real science behind it. More on that in my next entry, but in the meantime, I would love it if someone could show me simple code to find the proper terms for the dictionary.
