Timing the time it takes to parse time: Part II

architecture (624) rss
bugs (451) rss
community (383) rss
databases (481) rss
design (899) rss
development (654) rss
hibernating-practices (73) rss
miscellaneous (592) rss
performance (397) rss
programming (1109) rss
raven (1478) rss
ravendb.net (565) rss
reviews (184) rss

2025
- December (3)
- November (4)
- October (4)
- September (10)
- August (6)
- July (7)
- June (7)
- May (10)
- April (10)
- March (10)
- February (7)
- January (12)
2024
- December (3)
- November (2)
- October (1)
- September (3)
- August (5)
- July (10)
- June (4)
- May (6)
- April (2)
- March (8)
- February (2)
- January (14)
2023
- December (4)
- October (4)
- September (6)
- August (12)
- July (5)
- June (15)
- May (3)
- April (11)
- March (5)
- February (5)
- January (8)
2022
- December (5)
- November (7)
- October (7)
- September (9)
- August (10)
- July (15)
- June (12)
- May (9)
- April (14)
- March (15)
- February (13)
- January (16)
2021
- December (23)
- November (20)
- October (16)
- September (6)
- August (16)
- July (11)
- June (16)
- May (4)
- April (10)
- March (11)
- February (15)
- January (14)
2020
- December (10)
- November (13)
- October (15)
- September (6)
- August (9)
- July (9)
- June (17)
- May (15)
- April (14)
- March (21)
- February (16)
- January (13)
2019
- December (17)
- November (14)
- October (16)
- September (10)
- August (8)
- July (16)
- June (11)
- May (13)
- April (18)
- March (12)
- February (19)
- January (23)
2018
- December (15)
- November (14)
- October (19)
- September (18)
- August (23)
- July (20)
- June (20)
- May (23)
- April (15)
- March (23)
- February (19)
- January (23)
2017
- December (21)
- November (24)
- October (22)
- September (21)
- August (23)
- July (21)
- June (24)
- May (21)
- April (21)
- March (23)
- February (20)
- January (23)
2016
- December (17)
- November (18)
- October (22)
- September (18)
- August (23)
- July (22)
- June (17)
- May (24)
- April (16)
- March (16)
- February (21)
- January (21)
2015
- December (5)
- November (10)
- October (9)
- September (17)
- August (20)
- July (17)
- June (4)
- May (12)
- April (9)
- March (8)
- February (25)
- January (17)
2014
- December (22)
- November (19)
- October (21)
- September (37)
- August (24)
- July (23)
- June (13)
- May (19)
- April (24)
- March (23)
- February (21)
- January (24)
2013
- December (23)
- November (29)
- October (27)
- September (26)
- August (24)
- July (24)
- June (23)
- May (25)
- April (26)
- March (24)
- February (24)
- January (21)
2012
- December (19)
- November (22)
- October (27)
- September (24)
- August (30)
- July (23)
- June (25)
- May (23)
- April (25)
- March (25)
- February (28)
- January (24)
2011
- December (17)
- November (14)
- October (24)
- September (28)
- August (27)
- July (30)
- June (19)
- May (16)
- April (30)
- March (23)
- February (11)
- January (26)
2010
- December (29)
- November (28)
- October (35)
- September (33)
- August (44)
- July (17)
- June (20)
- May (53)
- April (29)
- March (35)
- February (33)
- January (36)
2009
- December (37)
- November (35)
- October (53)
- September (60)
- August (66)
- July (29)
- June (24)
- May (52)
- April (63)
- March (35)
- February (53)
- January (50)
2008
- December (58)
- November (65)
- October (46)
- September (48)
- August (96)
- July (87)
- June (45)
- May (51)
- April (52)
- March (70)
- February (43)
- January (49)
2007
- December (100)
- November (52)
- October (109)
- September (68)
- August (80)
- July (56)
- June (150)
- May (115)
- April (73)
- March (124)
- February (102)
- January (68)
2006
- December (95)
- November (53)
- October (120)
- September (57)
- August (88)
- July (54)
- June (103)
- May (89)
- April (84)
- March (143)
- February (78)
- January (64)
2005
- December (70)
- November (97)
- October (91)
- September (61)
- August (74)
- July (92)
- June (100)
- May (53)
- April (42)
- March (41)
- February (84)
- January (31)
2004
- December (49)
- November (26)
- October (26)
- September (6)
- April (10)

RavenDB Workshops - Deep dive into practical use of Document Data Modeling

Oct 11 2016

Timing the time it takes to parse timePart II

time to read 3 min | 536 words

There are times when you write clean, easily to understand code, and there are times when you see 50% of your performance goes into DateTime parsing, at which point you’ll need to throw nice code out the window, put on some protective gear and seek out that performance hit that you need so much.

Note to the readers: This isn’t something that I recommend you’ll do unless you have considered it carefully, you have gathered evidence in the form of actual profiler results that show that this is justified, and you covered it with good enough tests. The only reason that I was actually able to do anything is that I know so much about the situation. The dates are strictly formatted, the values are stored as UTF8 and there are no cultures to consider.

With that said, it means that we are back to C’s style number parsing and processing:

Note that the code is pretty strange, we do upfront validation into the string, then parse all those numbers, then plug this in together.

The tests we run are:

Note that I’ve actually realized that I’ve been forcing the standard CLR parsing to go through conversion from byte array to string on each call. This is actually what we need to do in RavenDB to support this scenario, but I decided to test it out without the allocations as well.

All the timings here are in nanoseconds.

Note that the StdDev for those test is around 70 ns. And this usually takes about 2,400 ns to run.

Without allocations, things are better, but not by much. StdDev goes does to 50 ns, and the performance is around 2,340 ns, so there is a small gain from not doing allocations.

Here are the final results of the three methods:

Note that my method is about as fast as the StdDev on the alternative. With an average of 90 ns or so, and StdDev of 4 ns. Surprisingly, LegacyJit on X64 was the faster of them all, coming in at almost 60% of the LegacyJit on X86, and 20% faster than RyuJit on X64. Not sure why, and dumping the assembly at this point is quibbling, honestly. Our perf cost just went down from 2400 ns to 90 ns. In other words, we are now going to be able to do the same work at 3.66% of the cost. Figuring out how to push it further down to 2.95% seems to insult the 96% perf that we gained.

And anyway, that does leave us with some spare performance on the table if this ever become a hotspot again*.

* Actually, the guys on the performance teams are going to read this post, and I’m sure they wouldn’t be able to resist improving it further Smile .

Tweet Share Share 28 comments

Tags:

Comments

11 Oct 2016
09:38 AM

njy

@Oren, I assume the big perf hit in the framework implementation is that first it parse the format string, then it create a parser and only then it parse the actual input string. And it will do this for each invocation of DateTime.ParseXYZ. A better approach would be to have a DateTimeParser object or something like that, that should be possible to create only once upfront and then reuse it all the time. In fact the same approach has been taken in Objective-C and their Foundation framework, specifically the NSDateFormatter class (see https://developer.apple.com/reference/foundation/nsdateformatter) which is used to to string->date and date->string conversions.

The typical old-style approach of the BCL of having very quick to use methods is nice to use, but for perf intensive scenarios they should consider this othe approach: they should take into account this parser/formatter based design, and than re-implement the standard methods using this underneath.

I suspect that by just using this approach and instancing only one parser/formatter would speed things up a lot.

What do you think?

11 Oct 2016
09:51 AM

Oren Eini

njy, That is my thought as well. Actually, the code for this is here: https://github.com/dotnet/coreclr/blob/master/src/mscorlib/src/System/Globalization/DateTimeParse.cs#L3488

And it isn't actually generating a parser, they just have a big switch statement to handle all the possibilities. A dedicated formatter / parser would be much faster, yes. This is the second time that we had to write our own code to get something much faster with dates.

Then again, we are talking about something that can process about 400 K per second. We just need it to be much faster (our impl get about 12 million / sec)

11 Oct 2016
10:50 AM

tobi

That first paragraph is so nice :) I love it when we need to get dirty like that.

For RyuJIT code quality increase was a non-goal (puzzlingly) so it's not surprising to see it not faster. It's faster in certain special cases, and legacy JIT is faster in other special cases. The .NET JITs are garbage basically. I do not understand the severe lack of investment in them.

11 Oct 2016
10:56 AM

Oren Eini

Tobi, I would strongly disagree with you here.

We opened several performance issues with the JIT, and we had always gotten enthusiastic support behind "let get this faster". This particular issue is logged here: https://github.com/dotnet/coreclr/issues/7564

And I expect to have a reason why this is happening and a fix at some point.

I think that a large part of RyuJIT is that it needed to allow more platforms and likely better architecture.

11 Oct 2016
11:47 AM

Oleg Mihailik

I say split DateTime and DateTimeOffset into two separate methods.

Better chance for locality (especially for large 'boring' jobs of repetitive data).
JIT might be happier with smaller methods.
No custom enum -- again JIT could be happier with builtins (it might even be clever enough to use CPU flags for returning boolean returns as grownup compilers do).

11 Oct 2016
11:55 AM

Oren Eini

Oleg, I would love to see benchmark results that shows a difference here.

11 Oct 2016
12:49 PM

Andrew Murphy

My understand of the thought behind RyuJIT was that it wanted to focus more on upfront JITing time rather than code execution time.

Or, more accurately, the legacy 64bit JITer was meant for use on servers not desktops, and so time was spent making sure that the code it generated ran the fastest possible but not much time was spent making sure the JITer ran fast. RyuJIT was created, in part, to deal with that problem.

They do seem to be spending more time, now, on improving the generated code, which is great.

11 Oct 2016
13:46 PM

Stuart Turner

Minor code review point - on line 22 of your first code sample, you are checking buffer[16] twice.

11 Oct 2016
18:20 PM

Daniel

Is TryParseNumber getting inlined into the caller? If not, the optimizer cannot really generate good code for this function.

The optimizer must consider the possibility that ptr points to the same memory as the val reference. Thus every change of val must be written to memory, the compiler cannot use a CPU register. You might get slightly better performance by computing the value in a local variable and storing the result in the out-parameter only once at the end of the function.

Also, there's a lot of callers that pass size==2 -- you might want to create a separate function that parses these short integers without using a loop.

Of course if TryParseNumber is getting inlined, the compiler is probably already doing these optimizations. But I'm not sure how likely the JIT is to inline this function. I think the classic JIT never inlines functions containing loops.

11 Oct 2016
20:20 PM

Federico Lois

@Daniel, probably we could do far better at the micro-instruction level, however, dont forget we are talking about a more than 1 order of magnitude improvement. Right now TryParseDate doesnt even show-up in the top 10 functions to optimize. Usually optimizing at that level involve writting pretty disgusting code, at this point this ugliness is far more than enough for any practical purpose.

11 Oct 2016
20:48 PM

tobi

RyuJIT does not have good internal architecture. Andy Ayers said in a public talk that the codebase is not fresh, the architecture is baroque and nobody feels quite comfortable with it. Whenever I read the code or read the commit logs it felt like it. It does not even use SSA which is known for a long time to be the correct internal compiler data structure. That's further evidence the code is not fresh.

RyuJIT not being about perf is explicitly stated on the teams blog.

I don't understand why this JIT was created. They should have created an interpreter plus a strong 2nd tier JIT that works unconstrainedly. RyuJIT feels like the result of a project or management failure. If you write a new JIT and all you achieve is 50% faster JIT times plus open source, is that really the point of the project?!

11 Oct 2016
20:52 PM

David Turner

Line 22 looks mildly suspicious:

buffer[16] != ':' || buffer[16] != ':'

11 Oct 2016
20:59 PM

tobi

Just to underline the point I'll quote Mikedn from the linked issue:

One obvious difference is that RyuJIT loads ptr[i] from TryParseNumber 3 times instead of loading it once and keep it in a register.

RyuJIT does not (here and in general) extract multiple loads from the same location into a register. s.x + s.x loads x twice. RyuJIT does not perform the most basics optimizations. I do not know how it's possible to argue that it is a capable JIT. The JVM guys are laughing at this.

12 Oct 2016
03:37 AM

Federico Lois

@tobi are you sure you are not talking about LegacyJIT? Because as far as I know RyuJIT uses (at least in some parts of its data flow) a SSA form. Among the optimization are dead code optimization via liveness analysis, loop hoisting, copy propagation and a few others.

Source: https://github.com/dotnet/coreclr/blob/master/Documentation/botr/ryujit-overview.md#ssa-vn

12 Oct 2016
07:33 AM

Pop Catalin

@tobi it does extract multiple loads into a register.

            for (int i = 0; i < size; i++)
00CF0BD5  xor         ebx,ebx  
00CF0BD7  test        edx,edx  
00CF0BD9  jle         00CF0C08  
            {
                val *= 10;
00CF0BDB  mov         eax,dword ptr [edi]  
00CF0BDD  lea         eax,[eax+eax*4]  
00CF0BE0  add         eax,eax  
00CF0BE2  mov         dword ptr [edi],eax  
                if (ptr[i] < '0' || ptr[i] > '9')
00CF0BE4  movzx       esi,byte ptr [ecx+ebx]   // Here the loads are grouped into a register, they are semantically a byte load
00CF0BE8  cmp         esi,30h  
00CF0BEB  jl          00CF0BF2  
00CF0BED  cmp         esi,39h  
00CF0BF0  jle         00CF0BFB  
                    return false;
00CF0BF2  xor         eax,eax  
00CF0BF4  pop         ebx  
00CF0BF5  pop         esi  
00CF0BF6  pop         edi  
00CF0BF7  pop         ebp  
00CF0BF8  ret         4  

                val += ptr[i] - '0';
00CF0BFB  mov         eax,dword ptr [edi]     // Here the load is not grouped with previous load as it it semantically an int
00CF0BFD  lea         eax,[eax+esi-30h]  
00CF0C01  mov         dword ptr [edi],eax  
            for (int i = 0; i < size; i++)
00CF0C03  inc         ebx  
00CF0C04  cmp         ebx,edx  
00CF0C06  jl          00CF0BDB  
            }
            return true;

12 Oct 2016
07:52 AM

Pop Catalin

@Daniel, Tryparse does not get inlined and adding a local variable does indeed improve performance by about 15% on my machine

25 M dates parses: Original: 00:00:03.0097201 Local Var: 00:00:02.5762892

12 Oct 2016
18:47 PM

tobi

@PopCatlin, this is hardcoded to work in loops. Outside of loops it does not occur. Seems to be hard to make it work in their architecture. For that reason they only addressed the most impactful case which is loops.

@Frederico I did not say the JIT does not perform any optimizations. Clearly it does. It's just quite poor in the level of optimizations that are supported.

13 Oct 2016
08:04 AM

Pop Catalin

@tobi, I'm afraid that's not true, loads are grouped even outside of loops. This basic optimization is done everywhere.

The following code has load optimization without any loops (this is done strictly by the JIT, it is not present in IL):

        private static bool TryParseNumber2(byte* ptr, out int value)
        {
            value = 0;
            if (ptr[0] < '0' || ptr[0] > '9' || ptr[1] < '0' || ptr[1] > '9')
                return false;

            value = (ptr[0]- '0') * 10 + ptr[1] - '0';
            return true;
        }

            if (ptr[0] < '0' || ptr[0] > '9' || ptr[1] < '0' || ptr[1] > '9')
01260C18  movzx       esi,byte ptr [ecx]                          // single load for  ptr[0]
01260C1B  cmp         esi,30h  
01260C1E  jl          01260C33  
01260C20  cmp         esi,39h  
01260C23  jg          01260C33  
01260C25  movzx       eax,byte ptr [ecx+1]  // single load for  ptr[1]
01260C29  cmp         eax,30h  
01260C2C  jl          01260C33  
01260C2E  cmp         eax,39h  
01260C31  jle         01260C38  
                return false;
01260C33  xor         eax,eax  
01260C35  pop         esi  
01260C36  pop         ebp  
01260C37  ret  

            value = (ptr[0]- '0') * 10 + ptr[1] - '0';
01260C38  add         esi,0FFFFFFD0h  
01260C3B  lea         ecx,[esi+esi*4]  
01260C3E  lea         eax,[eax+ecx*2-30h]  
01260C42  mov         dword ptr [edx],eax  
            return true;

13 Oct 2016
10:28 AM

Oren Eini

David, Correct, that was a copy/paste issue

13 Oct 2016
10:41 AM

Oren Eini

Daniel. Very good suggestions. We started out at 97.4 ns, with local variable that cost goes to 86.9, and with dedicated methods for each size, it is down to 79.5

13 Oct 2016
12:23 PM

Theo

Interested article. I love exploring aspects of things within themselves. I recently wrote an article about efforts making it faster to tell time. http://www.circonus.com/time-but-faster/

I found the bar charts of time and stddev a bit awkward. Had you considered using a candlestick graph to visualize the trial runs? https://en.wikipedia.org/wiki/Candlestick_chart

13 Oct 2016
16:24 PM

tobi

That optimization seems to have been added since I last looked. Testing this on Desktop 4.6.2 x64 Release:

    static void GetValue()
    {
        S s = new S();
        s.X = 1;
        Console.WriteLine(s.X + s.X);
    }
}

struct S
{
    public int X;
}

This was simplified to cw(2) which is good news! I still have the test cases I sent to Microsoft, so I know for sure this was not optimized. But this does not appear to really fold the loads... Rather, it blasts the struct to its components.

    static void GetValue()
    {
        S s = new S();
        s.X = 1;
        Console.WriteLine(s.X + s.X);
    }
}

struct S
{
    public int X;
    public int X2;
    public int X3;
    public int X4;
    public int X5;
    public int X6;
    public int X7;
}

And now it's a desaster:

        S s = new S();

000007FE95504200 push rdi
000007FE95504201 sub rsp,40h
000007FE95504205 lea rdi,[rsp+20h]
000007FE9550420A mov ecx,8
000007FE9550420F xor eax,eax
000007FE95504211 rep stos dword ptr [rdi]
000007FE95504213 xor ecx,ecx
000007FE95504215 lea rax,[rsp+20h]
000007FE9550421A xorpd xmm0,xmm0
000007FE9550421E movdqu xmmword ptr [rax],xmm0
000007FE95504222 mov qword ptr [rax+10h],rcx
000007FE95504226 mov dword ptr [rax+18h],ecx
s.X = 1; 000007FE95504229 mov dword ptr [rsp+20h],1
Console.WriteLine(s.X + s.X); 000007FE95504231 mov ecx,dword ptr [rsp+20h]
000007FE95504235 add ecx,dword ptr [rsp+20h]
000007FE95504239 call 000007FEF423F6B0

Struct on the stack, it's a constant, most of it is dead. It moves 1 into X, then loads X twice and adds it. Can it possibly be worse and less optimized? This is what I mean.

13 Oct 2016
18:11 PM

Rik Hemsley

Just wondering: What's driving the need to parse? Would it be possible for whatever your source is to store in a binary format that can be loaded directly?

13 Oct 2016
19:52 PM

Wanderlei

TryParseNumber seems broken. It is not multiplying the positional numbers by its correct powers of 10.

13 Oct 2016
21:08 PM

Oren Eini

Wanderlei, Look at line 92

13 Oct 2016
21:10 PM

Oren Eini

Rik, We could do that, yes, but it make it easier to interchange. The problem with binary format is that then you need to start supporting all sort of various formats (DateTime, DateTimeOffset, TimeSpan), to name just a few. It is better if we have this as a well defined string, because that also helps us to avoid type detection on the fly, which has its own costs.

17 Oct 2016
06:39 AM

Damien

I know I'm late to the party (just back from a week of downtime), but does your code need to only work in half of the world? Or am I missing something else, where you only need to care about positive offset values?

I.e. I believe that 2016-10-05T21:07:32.2082285-03:00 would fail since there's only an explicit check for +.

18 Oct 2016
11:40 AM

Oren Eini

Damien, That is a bug that was corrected, yes.

Comment preview

Comments have been closed on this topic.

Markdown turns plain text formatting into fancy HTML formatting.

Phrase Emphasis

*italic*   **bold**
_italic_   __bold__

Links

Inline:

An [example](http://url.com/ "Title")

Reference-style labels (titles are optional):

An [example][id]. Then, anywhere
else in the doc, define the link:
  [id]: http://example.com/  "Title"

Images

Inline (titles are optional):

![alt text](/path/img.jpg "Title")

Reference-style:

![alt text][id]
[id]: /url/to/img.jpg "Title"

Headers

Setext-style:

Header 1
========
Header 2
--------

atx-style (closing #'s are optional):

# Header 1 #
## Header 2 ##
###### Header 6

Lists

Ordered, without paragraphs:

1.  Foo
2.  Bar

Unordered, with paragraphs:

*   A list item.
    With multiple paragraphs.
*   Bar

You can nest them:

*   Abacus
    * answer
*   Bubbles
    1.  bunk
    2.  bupkis
        * BELITTLER
    3. burper
*   Cunning

Blockquotes

> Email-style angle brackets
> are used for blockquotes.
> > And, they can be nested.
> #### Headers in blockquotes
> 
> * You can quote a list.
> * Etc.

Horizontal Rules

Three or more dashes or asterisks:

---
* * *
- - - -

Manual Line Breaks

End a line with two or more spaces:

Roses are red,   
Violets are blue.

Fenced Code Blocks

Code blocks delimited by 3 or more backticks or tildas:

```
This is a preformatted
code block
```

Header IDs

Set the id of headings with {#<id>} at end of heading line:

## My Heading {#myheading}

Tables

Fruit    |Color
---------|----------
Apples   |Red
Pears	 |Green
Bananas  |Yellow

Definition Lists

Term 1
: Definition 1
Term 2
: Definition 2

Footnotes

Body text with a footnote [^1]
[^1]: Footnote text here

Abbreviations

MDD <- will have title
*[MDD]: MarkdownDeep

Oren Eini

Oren Eini

CEO of RavenDB

Timing the time it takes to parse timePart II

More posts in "Timing the time it takes to parse time" series:

Comments

Comment preview

FUTURE POSTS

RECENT SERIES

RECENT COMMENTS

Syndication

Main feed
Comments feed

Oren Eini

CEO of RavenDB

Related posts that you may find interesting:

More posts in "Timing the time it takes to parse time" series:

Comments

Comment preview

Markdown formatting

Phrase Emphasis

Links

Images

Headers

Lists

Blockquotes

Horizontal Rules

Manual Line Breaks

Fenced Code Blocks

Header IDs

Tables

Definition Lists

Footnotes

Abbreviations

FUTURE POSTS

RECENT SERIES

RECENT COMMENTS

Syndication