Ayende @ Rahien

Refunds available at head office

Interview questions: Large text viewer

I mentioned that we are looking for more people to work for us. This time, we are looking for WPF people for working on the profiler, as well another hush hush project that we’ll hopefully be able to reveal in December.

Because we are now looking for mostly UI people, that gives us a different set of challenges to deal with. How do I get a good candidate when my own WPF knowledge is limited to “Um.. dependency properties, man, that the bomb, man, yeah!”.

Add that to the fact that by the time people got an interview here, we want to be sure that they can code, that present an interesting problem. So we come up with questions like this one. Another question we have is the large text viewer.

We need a tool that can work with text file (logs) of huge size (1GB – 10 GB). We want to be able to open and search through such a file.

Nitpicker corner: I usually use this tool for that, the purpose of the question isn’t to actually to get such a tool, it is to see what kind of code the candidate writes.

We are looking for someone with a lot of skill in the UI side of things, so the large text file stuff is somewhat of a red herring, except that we want to see what they can do beyond just slap a few text boxes around.

Comments

Daniel
08/29/2014 11:45 AM by
Daniel

Nitpicker corner: Interesting Tool you linked, doesn't display UTF-8 very well but it definitly beats using 'tail' from unixtools to cut out the interesting part from a long grown huge log file. I'll keep it.

Rik Hemsley
08/29/2014 01:01 PM by
Rik Hemsley

The 'V' file viewer works nicely here, though it does a lot more than just view big files (slightly unfortunate as sometimes that's all you need).

Rik Hemsley
08/29/2014 03:13 PM by
Rik Hemsley

This would be a fun exercise. I'd love some time to play with it.

Years ago I wrote a hex editor for my BBC Micro, which didn't load the whole file into RAM. This was trivial, though, as the size of a page on screen was a fixed number of bytes.

Doing this with variable length lines and variable length characters (UTF) is a more interesting challenge. The number of bytes you'd need to read in a block is easy - it's rowscolumnsmax-bytes-per-char.

The fun part comes when working out how to cache what you've loaded (and how to throw it away) and in dealing with a scrollbar. The 'V' file viewer does what I'd expect - it's scrollbar is a bit 'virtual' - it's not quite accurate but it does kind of what you want.

There's also trying to work with an existing UI component that makes this tricky. You can implement your own, but it'd be much easier if you could re-use. I did a log 'tail' program which used the Windows Forms textbox IIRC, and flushed out old rows. This worked - and didn't leak memory, but it was surprising that it did, and it didn't look too pretty!

Ivan Danilov
08/30/2014 01:40 AM by
Ivan Danilov

The first idea is to read the file searching for endlines and building index of their positions (or better each 100-th endline). On that basis I can build 'data provider' that is able to get me page of data consisting of 100 lines from the file.

And I've already built WPF list control that is able to work with arbitrary number of items at once (both data and UI virtualized). As long as you need only read&search huge files - it would be enough.

Editing would require correct updates to index build on the first stage... which seems doable, but not trivial.

To make tool more convenient it is possible to make data provider starts scanning the file asynchronously. While scanning is not finished it may work in 'unprecise mode', i.e. such that when user takes scroll bar to some percentage - data provider would take data since nearest endline from corresponding file position, not line number. And when background scanning is complete - it will turn to 'precise mode' as it is known precisely how many lines file has.

Estimating memory footprint... currently displayed text is O(1), so not interesting, and the only thing that is growing with file size growth is line-ending index. Assuming 10GB file has average line length of 10 characters (probably any sensible log will have much more), it is 10^9 line breaks, or 10^7 items in the index. These are longs 8 bytes each, so 80Mb for pure data. Dictionary overhead is significant though, but I don't think it will be more than twice-thrice as much. If it is not enough - we may trade some speed for memory, increasing page size thus saving only each 1000-th or 10 000-th line end character and correspondingly making memory consumption 10 or 100-fold less.

The obvious drawback it that enormously long lines (e.g. 10GB file with single line as a corner case) will bring this implementation to a halt. But for normal logs it should work just fine. If you need to work with extremely long lines - I should think how to defend against such cases... But only if it is really a requirement :)

Ivan Danilov
08/30/2014 01:48 AM by
Ivan Danilov

By the way, here is another challenge for UI developer: how to build tree control that is able to work with arbitrary number of items. List (and file viewer control is a list basically) is an easy part. But logs often have some structure that may be convenient for user (e.g. some process started and it is convenient to be able to fold every log message till the point when it has ended) - and adding structure to logs via making them to a tree-like structure will be an advantage. At least when I work with large logs - I often want to have such tool... ;)

Ivan Danilov
08/30/2014 01:53 AM by
Ivan Danilov

Answering myself...

Dictionary overhead is significant though Wow, I'm a dimwitI don't need Dictionary here. List (or even) array would be enough, as we already have data indexed by its position. So 80Mb is the worst case for the case discussed. Just need to try, is it better to allow this structure reside in LOH or divide it in chunks, so that it remains in Gen2... but these are minor details for now.

Ivan Danilov
08/30/2014 01:55 AM by
Ivan Danilov

Oh... totally forgot this blog eats greater-than signs. Previous post's seconds paragraph should be a quote from my first post here.

Sorry for so many posts in a row.

Ayende Rahien
08/30/2014 03:04 PM by
Ayende Rahien

Ivan, Just reading the file is your big issue, you cannot really do that in big files, it takes forever to do so. So you can't just skip every 100 lines, because in a 10 GB file, just doing that would takes close to two minutes.

Ivan Danilov
08/30/2014 05:22 PM by
Ivan Danilov

As soon as I'm doing it asynchronously in background and it does not affect UI responsiveness - it is not an issue. Here we have a trade-off between precise scroll bar (and it is impossible to implement it without scanning entire file) and approximate scroll bar but less load onto hard drive (by working in 'unprecise mode' all the time you don't need to scan more lines than one screen of text can show at once).

I may even implement some heuristics, which tries to scan the file and move to 'precise mode' if file is less than some limit size, or stay in 'unprecise mode' forever otherwise. And limit size - add it to options, so user may control it. For huge files with more-or-less evenly distributed line lengths user won't notice the difference anyway because scroll bar is far from being precise anyway.

Ivan Danilov
08/30/2014 05:32 PM by
Ivan Danilov

I will try to clarify what I mean by 'unprecise mode' - just in case.

Suppose user left scroll bar at 50%. Then I will seek the file to position 0.5*fileLength, read since this position till I find first end-of-line. Next character would be my first visible on screen. So next I will read the file in blocks until either I read enough lines to fill what user can see on the screen or EOF (maybe plus some number of lines back and forth so that scrolling by a line above or below work quick). And that's it - no reading huge amount of data, but sacrificing a bit of scroll bar's accuracy.

Again, if large percentage of file belong to single line and everything else divided between another 1000 of lines - this approach would give a bit strange UX: large section of scroll bar will bring user same position while other places will change lines quickly.

Michael Brown
09/01/2014 03:55 AM by
Michael Brown

Would completing such a task be considered a resume submission? I've been working with WPF since 2006 (beta release) and was CAD MVP for five years running. I've implemented custom look less controls and was a pioneer with WPF contributing to the original Prism release. My old WPF Blog is up at http://kharasoft.wordpress.com

I'd love to join you in some crazy adventures.

Michael Brown
09/01/2014 03:56 AM by
Michael Brown

Friggin autocorrect. I meant lookless controls. I.E. fully templatable controls.

Greg Young
09/01/2014 04:48 PM by
Greg Young

MapViewOfFile (say 1mb)

Set scroll bars as rough guestimate based on file size.

Greg

Sergey Shumov
09/03/2014 07:40 AM by
Sergey Shumov

Here is my approach to the problem https://github.com/SHSE/LogWatch

The idea is to perform an initial fast scan of the log file to create a map of records and extract some basic information for each record.

The scan operation executes on a background thread. Records are loaded into memory only if they are shown.

Thomas
10/02/2014 02:00 PM by
Thomas

This could be done easily with rx

//open a 4GB file for asynchronous reading var file = new FileStream(@"d:\4GB.txt", FileMode.Open, FileAccess.Read, FileShare.Read, 2 << 15, true); // blocks of 64K

file.AsyncRead(2 << 15) .WriteToStream(memoryOrSomething) .Subscribe( _=> Console.WriteLine("Reading file."), error=> Console.WriteLine( "An error occurred while reading the file: {0}", error.Message));

Ayende Rahien
10/02/2014 04:49 PM by
Ayende Rahien

Thomas, You are reading the entire file, but that isn't really required.