Ayende @ Rahien

Refunds available at head office

Regex vs. string.IndexOf

I send a piece of code to Justin, which dealt with doing some simple text parsing. His comment was:

text.Substring(lastIndex, currentIndex - lastIndex);

Dude, Regex, dude!

This code reminds me of when I wrote an XML parser in ASP3

The reason that I used IndexOf there was performance, this piece of code is in the critical path, and I don't think that Regex would give me much there. But Justin said that compiled Regex is more efficient than IndexOf, so I decided to check it.

Here is my quick perf test:

static void Main(string[] args)
{
	string testStr = "select foo, bar, x, y, z, 5 from Items";
	int count = 500000;
	DateTime start = DateTime.Now;
	for (int i = 0; i < count; i++)
	{
		int last = 0, current = 0;
		while ((current = testStr.IndexOf(',', current)) != -1)
		{
			string x = testStr.Substring(last, current - last);
			current = last = current + 1;
		}
	}
	Console.WriteLine(DateTime.Now -start);
	start = DateTime.Now;
	Regex r = new Regex(",", RegexOptions.Compiled);
	for (int i = 0; i < count; i++)
	{
		int last = 0, current = 0;
		Match match = r.Match(testStr, current);
		while (match.Success)
		{
			current = match.Index;
			string y = testStr.Substring(last, current - last);
			current = last = current + 1;
			match = r.Match(testStr, current);
		}
	}
	Console.WriteLine(DateTime.Now - start);
}

The results, on my machine:

String.IndexOf:   00.2343750
Compiled Regex: 01.4687500

Comments

Marcin Seredynski
11/03/2007 08:45 PM by
Marcin Seredynski

Try running it for longer input string and longer pattern to be found. Regular expressions are faster. Of course everything depends on the scenario for which you're optimizing.

Data I used:

string testStr = "THISISJUSTATESTabcdefghijklmnopqrstuvwxyz0123456789!987654321zyxwvutsrqpomnlkjihgfedcbaTSETATSUJSISIHT";

pattern: "TSET_"

Results:

String.IndexOf: 00:00:00.4200000

Compiled Regex: 00:00:00.2970000

Ayende Rahien
11/03/2007 08:53 PM by
Ayende Rahien

Can you share the full source code?

When I added this:

StringBuilder sb = new StringBuilder(testStr);

for (int i = 0; i < 50; i++)

{

sb.Append(testStr);

}

testStr = sb.ToString();

The results where:

00:00:12.3750000

00:01:11.2500000

After changing the pattern to "from", with the same change as above:

00:00:11.4843750

00:00:15.9687500

Chad Myers
11/03/2007 08:59 PM by
Chad Myers

Yeah, Regex for a simple split is overkill. It only starts showing value when things get complex. I'm also thinking it might be better for really-large-strings too. I wonder if IndexOf starts breaking down if the string is large.

One question: Why didn't you use testStr.Split()? I added that to your test and found that it was slightly slower (0.301 vs. 0.177) but it's much clearer in the code.

Marcos
11/03/2007 09:05 PM by
Marcos

Using in the IndexOf:

StringComparison.Ordinal or StringComparison.OrdinalIgnoreCase

you can speed up even more the IndexOf operation, that way you avoid all the Culture specific code =)

Cheers

Ayende Rahien
11/03/2007 09:12 PM by
Ayende Rahien

Chad,

Because the code that I really have there is doing string parsing, not splitting, for instance, I need to handle ',' in a different way.

Marcin Seredynski
11/03/2007 09:25 PM by
Marcin Seredynski

Ayende: sorry, I skimmed over your code without actually reading it :). I didn't notice the string joining part, and my "_TSET" pattern was matched only once, so the benchmark wasn't very fair. Updated version of the code:

    static void Main(string[] args)

    {

        string testStr = "THIS_IS_JUST_A_TEST_abcdefghijklmnopqrstuvwxyz0123456789!987654321zyxwvutsrqpomnlkjihgfedcba_TSET_A_TSUJ_SI_SIHT";

        for (int q = 0; q < 3; q++)

            testStr += testStr;


        int count = 500000;

        DateTime start = DateTime.Now;

        for (int i = 0; i < count; i++)

        {

            int last = 0, current = 0;

            while ((current = testStr.IndexOf("TSET_", current)) != -1)

            {

                string x = testStr.Substring(last, current - last);

                current = last = current + 1;

            }

        }

        Console.WriteLine(DateTime.Now - start);

        Regex r = new Regex("TSET_", RegexOptions.Compiled);

        start = DateTime.Now;

        for (int i = 0; i < count; i++)

        {

            int last = 0, current = 0;

            Match match = r.Match(testStr, current);

            while (match.Success)

            {

                current = match.Index;

                string y = testStr.Substring(last, current - last);

                current = last = current + 1;

                match = r.Match(testStr, current);

            }

        }

        Console.WriteLine(DateTime.Now - start);

    }

00:00:02.9800000 (IndexOf) vs 00:00:02.0430000 (Regex)

Ayende Rahien
11/03/2007 09:35 PM by
Ayende Rahien

On my machine:

00:00:03.1250000 - string.IndexOf

00:00:03.8750000 - RegEx

Marcin Seredynski
11/03/2007 09:45 PM by
Marcin Seredynski

Interesting... What CPU / architecture / OS / .NET runtime / build config? Mine is: Intel Core 2 CPU / T7400 @ 2.16GHz / Windows Vista / .NET 2.0.50727.312 / default VS 2005 Debug build config.

Bob Grommes
11/03/2007 09:47 PM by
Bob Grommes

@Marcos, you are correct, using the StringComparison.Ordinal or OrdinalIgnoreCase options for IndexOf(), StartsWith(), etc. can be quite a savings. There are a LOT of strings in most programs that don't require culturally-sensitive handling. Often, virtually all of the strings in the critical path will be straight ASCII. It's a great optimization and much under-used.

@Ayende, I would maintain that for most real-world applications outside of large scale text searching or transformations, Regex tends to add complexity without giving back all that much. Regex tends to add a "WTF?" factor into quickly reading and understanding code. Avoiding that is often worth the modest real-world performance penalty of not using regex, even when it exists.

Ayende Rahien
11/03/2007 09:55 PM by
Ayende Rahien

Perntium D Dual Core, 3.00 Ghz, Windows 2003, .Net 2.0.50727, debug build

Niki
11/03/2007 11:00 PM by
Niki

"But Justin said that compiled Regex is more efficient than IndexOf, so I decided to check it."

I think the problem is that the term "efficient" is ambiguous: In this context, it can either mean:

a) raw machine speed, as in "takes 7.538 ms on machine A, with input data B"

b) computational complexity, as in "O(n^2)"

System.Text.RegularExpressions (AFAIK) implements the Boyer-Moore-Algorithm, which is of course better in terms of computational complexity, but can be slower for short search patterns. So if speed is your only concern, you need to know how long your search patterns will probably be to make a decision. Everything else is voodoo.

@Bob: I'd argue that Regex's are better for more complex parsing jobs because they're declarative: You don't tell the computer how to parse a string, but what you're looking for. Of course there's a learning curve, but once you understood how they work they're a blessing (IMO). If I had to choose between a 20-line parsing function written in C#, with nested loops and hundreds of places to hide programming errors, and a 1-line-regex, I'd take the regex (almost) every time. It might take you five seconds longer to understand it for the first time, but it might take hours to find that filthy little bug in the 20-line parsing code...

Evan
11/03/2007 11:10 PM by
Evan

It'll be nice to put a regex extension on the string class in c# 3.0...

Ayende Rahien
11/03/2007 11:15 PM by
Ayende Rahien

Niki,

The string to search is a SQL string, which can be up to 1Kb in size (very rarely).

The string to search is ","

In this context, efficient means that I can't see a way for a regex impl to do something faster than simple string search.

Marcin Seredynski
11/03/2007 11:47 PM by
Marcin Seredynski

Ayende: I have no idea what's causing the difference in the results we're getting... We're using pretty much the same architecture, if I'm not mistaken.

That regex performance observation is pretty interesting, though.

Eamon Nerbonne
11/04/2007 02:48 PM by
Eamon Nerbonne

On an Athlon 64 3000+, using the updated code and "Release" mode outside VS.NET:

IndexOf: around 6 sec. (5.6 to 6.2 spread)

Regex: around 4 sec (3.8 to 4.5 spread)

same executable run under mono:

IndexOf: 13.641

Regex: 36.297

Running with a debugger attached can cause odd performance quirks, esp. if the code has a managed/unmanaged boundary (i.e. generally not with such simple code), as can running a debug build.

I would generally prefer regular expressions, were it not that they're not integrated into C# particularly intuitively. Using explicit capture groups as so:

new Regex("^((?[^,]),)([^,]*)$", RegexOptions.Compiled | RegexOptions.CultureInvariant|RegexOptions.ExplicitCapture);

is in my opinion the most maintainable usage pattern, which you'ld extract information from using Groups:

            foreach(Capture cap in r2.Match(testStr).Groups["group"].Captures) {

                string z = cap.Value;

            }

Of course, it's not exactly always obvious how to modify such regular expression, but modifying complex parsers written in hand-coded procedural style isn't ever easier either, irrespective of the language.

I'd use string.Split if that's suitable, which it is in this example, and otherwise a Regex for maintainability.

Paul Cowan
11/05/2007 12:56 PM by
Paul Cowan

I ran into a performance problem with Cuyahoga and Regexes.

You should always use static methods of the Regex like so:

Regex.IsMatch

The test and I seem to remember the RegexOptions.Compiled flag is bad for performance.....

Ramon
11/05/2007 02:02 PM by
Ramon

I want to add this one to be complete. I think that it is much more readable but it is slow :-)

    start = DateTime.Now;

    r = new Regex("([^,]+),", RegexOptions.Compiled);

    for (int i = 0; i < count; i++){

        MatchCollection matches = r.Matches(testStr);

        for(int j=0;j<matches.Count; ++j){

            string z = matches[j].Value;

        }

    }

    Console.WriteLine(DateTime.Now - start);
Derick Bailey
11/05/2007 10:54 PM by
Derick Bailey

i think you accounted for the overhead of compiling the regex incorrectly.

you have "start = DateTime.Now" before the regex compile. A better test would be to compile the regex before setting the start time.

the reason i suggest this, is to show that you only have to compile the regex once, during the initializing of your application (during a static constructor or something on the object that needs it). after that, the regex will likely be much faster.

but as many previous comments said - it all depends on your scenario and how large the string is.

also - be careful about constantly regenerating compiled regex. i have heard that there is a memory leak caused by this (same problem as using codedom to compile code on the fly). not sure about this, though.

Derick Bailey
11/05/2007 11:26 PM by
Derick Bailey

I've posted an optimized version of the regex so that it properly uses regex functionality instead of trying to mimic what IndexOf does. see my blog:

http://www.avocadosoftware.com/csblogs/dredge/archive/2007/11/05/776.aspx

Derick Bailey
11/05/2007 11:32 PM by
Derick Bailey

looks like the IndexOf is faster in this case... just found a huge bug in the code i posted on my blog, and have corrected it...

Comments have been closed on this topic.