Ayende @ Rahien

Refunds available at head office

String Interning: The Garbage Collectible way

Since I know people will want the actual implementation, here is a simple way of handling string interning in a way that will allow you to GC the results at some point. The issue is simple, I want to intern strings (so a string value is only held once through my entire app), but I don’t want to be stuck with them if the profiler state has been clear, for example.

public class GarbageCollectibleStringInterning
{
    private static IDictionary<string,string> strings = new Dictionary<string,string>();

    private static ReaderWriterLockSlim locker = new ReaderWriterLockSlim();
    
    public static void Clear()
    {
        locker.EnterWriteLock();
        try
        {
            strings.Clear();
        }
        finally
        {
            locker.ExitWriteLock();
        }
    }
    
    public static string Intern(string str)
    {
        string val;
        
        locker.EnterReadLock();
        try
        {
            if(strings.TryGetValue(str, out val))
                return val;
        }
        finally
        {
            locker.ExitReadLock();
        }
        
        locker.EnterWriteLock();
        try
        {
            if(strings.TryGetValue(str, out val))
                return val;
                
            strings.Add(str,str);
            return str;
        }
        finally
        {
            locker.ExitWriteLock();
        }
    }
}

This is a fairly simple implementation, a more complex one may try to dynamically respond to GC notification, but I think that this would be useful enough on its own.

Using this approach, I was able to reduce used memory in the profiler by over 50%. I gave up on that approach, however, because while it may reduce the memory footprint, it doesn't actually solve the problem, only delay it.

Comments

Robert
12/27/2009 10:38 AM by
Robert

Any reason for which you don't use EnterUpgradeableReadLock in Intern method?

Ayende Rahien
12/27/2009 11:05 AM by
Ayende Rahien

Robert,

I'll answer that in a separate post.

Ayende Rahien
12/27/2009 11:16 AM by
Ayende Rahien

Robert,

The short answer is that EnterUpgradeableReadLock would produce a far higher locking rate than read/write.

Imran
12/27/2009 11:25 AM by
Imran

Another thing you can add to reduce the memory footprint further is utf8 encoding. Store the strings as a byte[] instead by calling Encoding.UTF8.GetBytes(string) to circumvent the default .net utf16 encoding. You can use Encoding.UTF8.GetString to convert back to string as needed. Some characters will still be encoded with 2 bytes but for most ascii strings this will give a memory benefit. As you said before though, it doesn't solve your problem it just delays it.

Ken Egozi
12/27/2009 01:50 PM by
Ken Egozi

why Dict <string,string> and not HashSet <string?

I know, I know, delay vs. solve.

Ayende Rahien
12/27/2009 02:42 PM by
Ayende Rahien

Imran,

That would keep allocating new strings every time I do the conversion back. I might use less memory in total, but I would allocate a lot more (more GC work)

Ayende Rahien
12/27/2009 02:43 PM by
Ayende Rahien

Ken,

Because I there is no way with a HashSet to get the value matching, and I care about the particular reference that I use.

Jon
12/27/2009 04:46 PM by
Jon

Why not leave in the string interning code? Even if you do only delay the eventual a 50% decrease in memory sounds like a great thing.

Ayende Rahien
12/27/2009 05:10 PM by
Ayende Rahien

Jon,

Because when you do perf testing, you want to make as few changes as you possibly could, fixing one problem at a time.

The worst thing you can do is to try to patch over the root problem

Dmitry
12/27/2009 06:59 PM by
Dmitry

Have you decided to persist strings to a temp storage instead of keeping them in memory?

Mike Chaliy
12/27/2009 09:42 PM by
Mike Chaliy

+1 to Dmitry, for example Sqlite

Leon Breedt
12/27/2009 09:43 PM by
Leon Breedt

Have you considered a fixed-size buffer of strings, backed by disk storage? e.g. circular buffer of some sort.

Since I'm assuming your massive collection is append only (e.g. no-one comes and removes a string from the middle).

Your disk storage would then be indexed by line number, as would entries in your buffer....Someone tries to scroll outside of the starting and finishing bounds of your buffer, and you load in as many strings as appropriate from storage.

You'll want to optimize the disk storage for reading obviously, if you're using some form of list virtualization, you would probably want to have a mapping between line number and offset in disk file so you can do a Seek() to starting line and start reading.

Constrained memory usage, at the expense of a bit of a lag when scrolling far outside the current bounds.

Rafal
12/27/2009 10:26 PM by
Rafal

@Leon & others

It doesn't make sense to 'scroll' through the buffer of text messages if the buffer contains a gigabyte of data or more. You won't find anything by just scrolling, you need a search engine... So, it means keeping the raw text in memory doesn't make sense, much better to keep just the index in memory and retrieve raw text only when drilling down into details.

@Ayende,

My colleague has recently built an OLAP cube on IIS log files and uses that cube to analyze the application behavior. It was astonishing how much information he retrieved - he was able to create application performance graphs, analyze activity of users, identify application URLS that have too long execution time or transfer too much data, find errors in integration interfaces, identify several bottlenecks, everything by just slicing & dicing the cube..

Maybe something similar would be useful in your profiler? Statistical analysis is a very effective profiling method especially when you have to analyze large sets of data and today we are talking about amounts like millions of rows of data per day. We are using the OLAP as a capacity management tool - we process large amounts of information (think months of iis logs) and analyze the patterns and trends that emerge over time so we have a chance to identify performance problems before they are noticed by customers.

Henrique
01/06/2010 10:00 PM by
Henrique

There is a problem in this case, if you intend to have only one string reference for a given string, you will fail.

if you do something like this, for example:

lock(String.Intern(product.Id))

{

}

The "Clear" method will break this kind of trick.

Perhaps, WeakReference would be useful here

Ayende Rahien
01/06/2010 10:03 PM by
Ayende Rahien

Hernique,

I would NEVER do something like that

Comments have been closed on this topic.