Ayende @ Rahien

Hi!
My name is Oren Eini
Founder of Hibernating Rhinos LTD and RavenDB.
You can reach me by phone or email:

ayende@ayende.com

+972 52-548-6969

, @ Q c

Posts: 5,953 | Comments: 44,406

filter by tags archive

String Interning: The Garbage Collectible way


Since I know people will want the actual implementation, here is a simple way of handling string interning in a way that will allow you to GC the results at some point. The issue is simple, I want to intern strings (so a string value is only held once through my entire app), but I don’t want to be stuck with them if the profiler state has been clear, for example.

public class GarbageCollectibleStringInterning
{
    private static IDictionary<string,string> strings = new Dictionary<string,string>();

    private static ReaderWriterLockSlim locker = new ReaderWriterLockSlim();
    
    public static void Clear()
    {
        locker.EnterWriteLock();
        try
        {
            strings.Clear();
        }
        finally
        {
            locker.ExitWriteLock();
        }
    }
    
    public static string Intern(string str)
    {
        string val;
        
        locker.EnterReadLock();
        try
        {
            if(strings.TryGetValue(str, out val))
                return val;
        }
        finally
        {
            locker.ExitReadLock();
        }
        
        locker.EnterWriteLock();
        try
        {
            if(strings.TryGetValue(str, out val))
                return val;
                
            strings.Add(str,str);
            return str;
        }
        finally
        {
            locker.ExitWriteLock();
        }
    }
}

This is a fairly simple implementation, a more complex one may try to dynamically respond to GC notification, but I think that this would be useful enough on its own.

Using this approach, I was able to reduce used memory in the profiler by over 50%. I gave up on that approach, however, because while it may reduce the memory footprint, it doesn't actually solve the problem, only delay it.


Comments

Robert

Any reason for which you don't use EnterUpgradeableReadLock in Intern method?

Ayende Rahien

Robert,

I'll answer that in a separate post.

Ayende Rahien

Robert,

The short answer is that EnterUpgradeableReadLock would produce a far higher locking rate than read/write.

Imran

Another thing you can add to reduce the memory footprint further is utf8 encoding. Store the strings as a byte[] instead by calling Encoding.UTF8.GetBytes(string) to circumvent the default .net utf16 encoding. You can use Encoding.UTF8.GetString to convert back to string as needed. Some characters will still be encoded with 2 bytes but for most ascii strings this will give a memory benefit. As you said before though, it doesn't solve your problem it just delays it.

Ken Egozi

why Dict <string,string> and not HashSet <string?

I know, I know, delay vs. solve.

Ayende Rahien

Imran,

That would keep allocating new strings every time I do the conversion back. I might use less memory in total, but I would allocate a lot more (more GC work)

Ayende Rahien

Ken,

Because I there is no way with a HashSet to get the value matching, and I care about the particular reference that I use.

Jon
Jon

Why not leave in the string interning code? Even if you do only delay the eventual a 50% decrease in memory sounds like a great thing.

Ayende Rahien

Jon,

Because when you do perf testing, you want to make as few changes as you possibly could, fixing one problem at a time.

The worst thing you can do is to try to patch over the root problem

Dmitry

Have you decided to persist strings to a temp storage instead of keeping them in memory?

Mike Chaliy

+1 to Dmitry, for example Sqlite

Leon Breedt

Have you considered a fixed-size buffer of strings, backed by disk storage? e.g. circular buffer of some sort.

Since I'm assuming your massive collection is append only (e.g. no-one comes and removes a string from the middle).

Your disk storage would then be indexed by line number, as would entries in your buffer....Someone tries to scroll outside of the starting and finishing bounds of your buffer, and you load in as many strings as appropriate from storage.

You'll want to optimize the disk storage for reading obviously, if you're using some form of list virtualization, you would probably want to have a mapping between line number and offset in disk file so you can do a Seek() to starting line and start reading.

Constrained memory usage, at the expense of a bit of a lag when scrolling far outside the current bounds.

Rafal

@Leon & others

It doesn't make sense to 'scroll' through the buffer of text messages if the buffer contains a gigabyte of data or more. You won't find anything by just scrolling, you need a search engine... So, it means keeping the raw text in memory doesn't make sense, much better to keep just the index in memory and retrieve raw text only when drilling down into details.

@Ayende,

My colleague has recently built an OLAP cube on IIS log files and uses that cube to analyze the application behavior. It was astonishing how much information he retrieved - he was able to create application performance graphs, analyze activity of users, identify application URLS that have too long execution time or transfer too much data, find errors in integration interfaces, identify several bottlenecks, everything by just slicing & dicing the cube..

Maybe something similar would be useful in your profiler? Statistical analysis is a very effective profiling method especially when you have to analyze large sets of data and today we are talking about amounts like millions of rows of data per day. We are using the OLAP as a capacity management tool - we process large amounts of information (think months of iis logs) and analyze the patterns and trends that emerge over time so we have a chance to identify performance problems before they are noticed by customers.

Henrique

There is a problem in this case, if you intend to have only one string reference for a given string, you will fail.

if you do something like this, for example:

lock(String.Intern(product.Id))

{

}

The "Clear" method will break this kind of trick.

Perhaps, WeakReference would be useful here

Ayende Rahien

Hernique,

I would NEVER do something like that

Comment preview

Comments have been closed on this topic.

FUTURE POSTS

No future posts left, oh my!

RECENT SERIES

  1. The RavenDB Comic Strip (3):
    28 May 2015 - Part III – High availability & sleeping soundly
  2. Special Offer (2):
    27 May 2015 - 29% discount for all our products
  3. RavenDB Sharding (3):
    22 May 2015 - Adding a new shard to an existing cluster, splitting the shard
  4. Challenge (45):
    28 Apr 2015 - What is the meaning of this change?
  5. Interview question (2):
    30 Mar 2015 - fix the index
View all series

Syndication

Main feed Feed Stats
Comments feed   Comments Feed Stats