Interview question: Stackoverflow THAT
We are doing perf testing right now, and we are looking into real-world datasets to play around with. Luckily for us, Stackoverflow has a regular data dump that is significant enough in size to be useful for our experiments.
The file that I’m currently looking at is the Posts.xml file, which is about 45GB in size, and looks roughly like this (lots of stuff removed to make the point).
Since Stackoverflow is using a relational database, their output is also relational. You can see that each element is a single row, and that the ParentId in row #7 points back to row #4.
Basically, row #4 is the question, and row #7 is one of the answers.
What I want is to take all of this data and move it into a more document-oriented format. In other words, I want all the answers for a question to be contained within the question, something like this:
The fun part here is that this is a pretty big file, and we are writing the output into a GzipStream, so we don’t really have the option of saving / modifying midway through. Once we have written something out to the GzipStream, it cannot be changed.
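To make the constraint concrete, here is a minimal sketch of the kind of output pipeline involved (file name and compression settings are placeholders, not the actual code):

```csharp
using System.IO;
using System.IO.Compression;
using System.Xml;

class OutputPipeline
{
    public static XmlWriter Open(string path)
    {
        var file = File.Create(path);
        var gzip = new GZipStream(file, CompressionLevel.Optimal);
        // Compressed output is append-only: once a question element has been
        // written, there is no seeking back to add more answers to it.
        return XmlWriter.Create(gzip);
    }
}
```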
So we need to find a way to group all the answers under their questions, but at the same time the file is much bigger than the memory I have available, so we can't just keep it all in memory and write it out at the end.
How would you solve this issue? My attempt is currently sitting at roughly 10GB of RAM used after processing about 30GB of XML, but I have to admit that I have thrown it together rather quickly, since I just needed the data and a quick & dirty solution is just fine here.
More posts in "Interview question" series:
- (29 Sep 2016) Stackoverflow THAT
- (30 Mar 2015) fix the index
- (27 Aug 2014) That annoying 3rd party service
Comments
I have no idea about the data characteristics in the Stackoverflow dump, or whether more optimizations can be applied; however, for a generic solution, I would:
Great! I didn't know that you could run SQL queries online against the databases: https://data.stackexchange.com/stackoverflow/queries http://meta.stackexchange.com/questions/2677/database-schema-documentation-for-the-public-data-dump-and-sede
Since it's relational data and this is basically a sorting problem, my approach would be to load it into an SQL Server instance or similar and then write it out in a more appropriate order: ORDER BY COALESCE(ParentId, Id), PostTypeId. That should put each question with its answers and the set can then be processed with minimal RAM.
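A minimal sketch of that approach, assuming the dump has been bulk-loaded into a hypothetical dbo.Posts table (table name and connection string are placeholders):

```csharp
using System.Data.SqlClient;

class SortedExport
{
    public static void Run(string connectionString)
    {
        const string query = @"
SELECT Id, ParentId, PostTypeId, Body
FROM dbo.Posts
ORDER BY COALESCE(ParentId, Id), PostTypeId";

        using (var conn = new SqlConnection(connectionString))
        using (var cmd = new SqlCommand(query, conn))
        {
            conn.Open();
            using (var reader = cmd.ExecuteReader())
            {
                while (reader.Read())
                {
                    // Rows now arrive as a question immediately followed by
                    // its answers (PostTypeId 1 = question, 2 = answer), so
                    // each document can be emitted with O(1) state.
                }
            }
        }
    }
}
```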
If the intent is to write a custom tool, however:
This assumes that our XML source is a random-access stream rather than a forward-only stream, and that the XML parser exposes stream offset information for tags. Alternatively we could generate tabular temporary files and index those instead.
If our parser lets us identify start and end offsets of attribute values, things get even faster because we can pull out the body value directly in step 3, and since it's already XML-encoded it can be written directly to the output stream.
Since the data is relational, I would use a relational db to store it. I would use https://github.com/ren85/linqdb since it's super easy for a C# developer. So I would have two tables, Questions and Answers, and once the data is there, iterate the Questions table and for each question query for its answers. (Also, I would store text as byte arrays in linqdb to avoid the full-text index.)
Parse it twice with an XML SAX parser?
In the first pass, build a dictionary storing just the answer count for each ParentId, so you know how many answers there are for each question. In the second pass, start building your output in memory and check whether all the answers for a question have been read (hopefully they will be close to their questions). Once a complete question has been read and all its answers attached, write the question with its answers out to the stream, clearing the memory.
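A minimal sketch of the first pass, assuming each post is a <row .../> element with Id and ParentId attributes as in the dump:

```csharp
using System.Collections.Generic;
using System.Xml;

class TwoPassGrouping
{
    public static Dictionary<string, int> CountAnswers(string path)
    {
        // Pass 1: how many answers does each question have?
        var counts = new Dictionary<string, int>();
        using (var reader = XmlReader.Create(path))
        {
            while (reader.Read())
            {
                if (reader.NodeType != XmlNodeType.Element || reader.Name != "row")
                    continue;
                var parentId = reader.GetAttribute("ParentId");
                if (parentId == null)
                    continue; // no ParentId means this row is a question
                counts.TryGetValue(parentId, out var n);
                counts[parentId] = n + 1;
            }
        }
        return counts;
        // Pass 2 (not shown): buffer each question plus its answers, and
        // flush a question once its buffered answers reach the count. The
        // catch, per the replies below: answers can trail their question by
        // years, so buffers may stay alive for most of the file.
    }
}
```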
You can't do that in a single pass. You basically need questions and answers sorted by question id: questions by row id, and answers by ParentId. Then merge answers and questions using the basic merge algorithm. You can store questions in RavenDB using document ids of the form question/<row id> and answers of the form answer/<parentId>/<row id>, read them ordered by document id using the streaming API, and merge them. You could also store all the rows in a SQL Server table using row id as the primary key, create an index on ParentId, and read them using one single FOR XML select statement; SQL Server would likely perform a MERGE JOIN.
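A minimal sketch of that merge step, with hypothetical row types, assuming both inputs are already sorted as described:

```csharp
using System.Collections.Generic;

// Hypothetical row types; only the fields the merge needs.
class Question { public long Id; public string Xml; }
class Answer { public long ParentId; public string Xml; }

class SortedMerge
{
    // 'questions' must be sorted by Id, 'answers' by ParentId.
    public static IEnumerable<(Question, List<Answer>)> Merge(
        IEnumerator<Question> questions, IEnumerator<Answer> answers)
    {
        bool more = answers.MoveNext();
        while (questions.MoveNext())
        {
            var q = questions.Current;
            var group = new List<Answer>();
            // Classic merge step: drain every answer belonging to the
            // current question before moving on to the next question.
            while (more && answers.Current.ParentId == q.Id)
            {
                group.Add(answers.Current);
                more = answers.MoveNext();
            }
            yield return (q, group);
        }
    }
}
```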
Since you have the AnswerCount, plus presumably spatial locality in the data, the problem is no bigger than keeping the question entry in memory until all its answers have been read.
Here is the algorithm:
I implemented a more imaginative solution based on the file system. I made some assumptions: the input data set is ordered by row id, and answer row ids are greater than the corresponding question row id. Here is the code: https://gist.github.com/jesuslpm/07fb8121b1747bd69fb43581c4a576d9
That's a pretty big assumption, which I don't think is true. It's likelier that the data is in the natural database order, which depends on when the post was created.
So, every post has an answer count; shouldn't it be possible to create a dictionary of posts and answers, and flush a post to the output stream only once all its answers have been seen?
This code doesn't work if the hierarchy is deeper than 2 levels, but it could be adjusted for that.
Pop Catalin, That assumes that all the answers to a question would fit into your batch size. It is pretty common for questions to get answers years afterward, so they are very spread out.
Alex, Yes, throwing it through something that would allow sorted access makes things much easier. However, the intent is to not use an existing db.
And the major problem with your suggestion is that XmlReader does buffering, so you can't tell the exact position. Writing the data to a dedicated file and writing the positions to an index, then sorting the index, would work, though.
ren, Interesting. You would have to create a composite key there, and at the scale we are talking about, you'll be waiting a LONG while for this data to get into linqdb (it creates an index per value, which is _expensive_).
Ian, You already have the answer count in the data itself. The problem is that it is likely that there is a big gap between the question being asked and all its answers.
Ian, Also, an answer may appear before the question
Dennis, There ISN'T spatial locality.
Rémi, That would create millions of files, and would likely bork the file system.
Jesús, Answer id may be smaller than question id.
Remco, That was my first attempt, but an answer may show up for a question that was posted years earlier, so there is a huge gap, and a lot of stuff you still need to remember.
@Ayende, that's the reason for the document merge as a final step: merge documents with the same question id from multiple temporary files. The document will contain the question and maybe some answers; the documents with the same question id from other temporary files will only contain answers.
I would create an index storing the post id, the parent id, and the file position where the XML node starts. This list should be pretty small and can be sorted in memory. Then I can group by post id and parent and sort ascending. Once I have that, I can scan through the XML file mostly sequentially and stream out the resulting XML file as needed, taking advantage of the OS file system cache, which will do a little read-ahead and cache already-read sections of the file, so that although I am seeking around a lot, it will be pretty fast.
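A minimal sketch of that ordering, with a hypothetical index entry type (how the offsets get captured is the sticking point, as the XmlReader buffering discussion in this thread shows):

```csharp
using System.Collections.Generic;
using System.IO;
using System.Linq;

struct IndexEntry
{
    public long Id;        // post id
    public long ParentId;  // 0 for questions
    public long Offset;    // byte offset of the row in Posts.xml
}

class IndexedRewrite
{
    public static void Rewrite(List<IndexEntry> index, FileStream input)
    {
        // Group each question with its answers: order by the id of the
        // question the row belongs to; within a group the question itself
        // (ParentId == 0) sorts before its answers.
        var ordered = index
            .OrderBy(e => e.ParentId == 0 ? e.Id : e.ParentId)
            .ThenBy(e => e.ParentId);

        foreach (var entry in ordered)
        {
            input.Seek(entry.Offset, SeekOrigin.Begin);
            // read the <row .../> element at this offset and copy it to the
            // output stream, relying on the OS cache for read-ahead
        }
    }
}
```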
A creative way to solve the XmlReader buffering problem that obscures the position is to write the rows out into another XML file. That way you can record the positions in the new file. It's a costly but rather quick hack. I'd be concerned about the random IO that this causes while reading back, though.
My solution: If perf is not a concern, I'd put this into an indexed database; this optimizes for dev time. If this is supposed to be fast, I'd find (or write) an external sorting algorithm, sort the posts, and merge them together in a streaming fashion. It should be hard to do better than this, since fundamentally we need to shuffle all the data around anyway. I'd probably use Protobuf as the intermediate storage format. Various variants of this are possible.
It's probably best to move the text bodies along with the metadata. Not moving them around causes lots of random IO again.
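A minimal sketch of the run-generation half of that external sort, with a hypothetical Post type and the serialization (e.g. protobuf-net) elided:

```csharp
using System.Collections.Generic;
using System.IO;

// Hypothetical record type; only the fields the sort needs.
class Post { public long Id; public long? ParentId; public string Body; }

class ExternalSorter
{
    // Sort key: the id of the question the post belongs to, so questions
    // and their answers end up adjacent after the merge.
    static long KeyOf(Post p) => p.ParentId ?? p.Id;

    public static List<string> WriteSortedRuns(
        IEnumerable<Post> posts, int runSize, string tempDir)
    {
        var runs = new List<string>();
        var buffer = new List<Post>(runSize);
        foreach (var post in posts)
        {
            buffer.Add(post);
            if (buffer.Count == runSize)
                runs.Add(FlushRun(buffer, tempDir));
        }
        if (buffer.Count > 0)
            runs.Add(FlushRun(buffer, tempDir));
        return runs; // the sorted runs then get k-way merged, streaming
    }

    static string FlushRun(List<Post> buffer, string tempDir)
    {
        buffer.Sort((a, b) => KeyOf(a).CompareTo(KeyOf(b)));
        var path = Path.Combine(tempDir, Path.GetRandomFileName());
        // serialize 'buffer' to 'path' here
        buffer.Clear();
        return path;
    }
}
```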
XmlTextReader exposes line and character info. It's probably possible to provide it a TextReader implementation which tracks the stream offsets of the last X linebreaks and enables translation of line/character numbers to stream offsets within a limited range of history. Alternatively, if the XML format is simple enough it might be an option to write a quick'n'dirty stream-scraper which parses a limited yet sufficient subset of XML...
But unless we really want to squeeze it, intermediate temporary files would indeed be my preferred approach!
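A minimal sketch of such an offset-tracking TextReader (simplified: it remembers every line start rather than only the last X, and it tracks character offsets, which match byte offsets only for single-byte encodings):

```csharp
using System.Collections.Generic;
using System.IO;

class OffsetTrackingReader : TextReader
{
    readonly TextReader inner;
    readonly List<long> lineStarts = new List<long> { 0 }; // line 1 -> offset 0
    long position;

    public OffsetTrackingReader(TextReader inner) { this.inner = inner; }

    // TextReader's block-read methods fall back to Read() by default, so
    // overriding it is enough for a sketch (though it is slow).
    public override int Read()
    {
        int c = inner.Read();
        if (c == -1) return c;
        position++;
        if (c == '\n') lineStarts.Add(position);
        return c;
    }

    public override int Peek() => inner.Peek();

    // Map XmlTextReader's 1-based LineNumber/LinePosition (it implements
    // IXmlLineInfo) back to a character offset in the underlying stream.
    public long OffsetOf(int lineNumber, int linePosition)
        => lineStarts[lineNumber - 1] + (linePosition - 1);
}
```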
Have you tried asking Stackoverflow? :)
How can an answer id be smaller than the question id if the id is auto-increment? That would mean that you can answer a question that has not been posted yet.
What the code I posted previously really needs in order to work is that answers come later than their corresponding questions. But it is not difficult to make it work when an answer comes before its question; I added a modified version, program2.cs, to https://gist.github.com/jesuslpm/07fb8121b1747bd69fb43581c4a576d9 that works even when that happens. Unfortunately, it needs to read posts.xml twice.
@Jesús López, it can happen when using merge-replicated servers that each have their own identity ranges, for example: the question is inserted on one server and the answer on another.
I'd approach this problem as a trade-off between memory and disk. If you have 135GB (or so) of free disk space, I would...
I'd probably do the sorts using unix or powershell - but if you programmed it yourself, I'd expect memory usage well under 100MB, as the file reading is forward-only and can be streamed. An SSD is probably preferred - but since all access is sequential, even an HDD would perform reasonably well.
Jesus, I assume that this is related to edits: you don't update the row, you create a new one? Not really sure.
Jesus, You'll be writing a LOT of small files, which typically has a bad result for the FS in question.
@Pop Catalin. There are about 32 million posts on SO and only about 960 answers have Id < ParentId. It doesn't seem to be caused by replication.
@Ayende, you asked for a quick and dirty solution, not for an efficient solution. My solution is not efficient, but it does the work in a few lines of code.
Jesus, That might actually fail, hard, with a large number of files. See: http://stackoverflow.com/a/291292
I'm actually curious what the disk space usage is when using files to store questions, considering the minimum disk allocation per file is 4KB, without counting the file index. A quick calculation shows a minimum of 135 GB for 35M questions/answers.
@Ayende, it will work. I'm not storing all the files in a single folder; I'm storing up to 1000 files and up to 1000 subdirectories in a single directory. http://stackoverflow.com/a/26205776/4540020
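A minimal sketch of that sharding scheme (the layout and file extension are hypothetical):

```csharp
using System.IO;

static class ShardedPaths
{
    // e.g. question id 12345678 -> root/012/345/678.xml : at most 1000
    // entries per directory level.
    public static string For(string root, long questionId)
    {
        var s = questionId.ToString("D9");
        return Path.Combine(root, s.Substring(0, 3), s.Substring(3, 3),
                            s.Substring(6) + ".xml");
    }
}
```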
@Pop Cătălin, there are 32.5 million posts on SO, and 12.5 million of them are questions. I write files only for questions, so without counting index files, the minimum space would be 50 GB.
GOTO 2016 • Exploring StackOverflow Data • Evelina Gabasova https://www.youtube.com/watch?v=qlKZKN7il7c
Interesting analysis in F#
I would read X chars from the file at a time. If the input contains valid XML, then process the input and split the XML elements as described below; if there are no valid XML elements, then read another X chars from the stream, etc. If the input contained a fragment of additional XML, then reset the seek position to the end of the last valid XML element.
Since the number of answers to a question is known, and questions and their answers have some locality, it is probably safe to start populating an in-memory dictionary of question → list<answers>. As you encounter questions, add them to the dictionary.
As you encounter answers, find the question in the dictionary and add them. When a question reaches its full answer capacity, pop it from the dictionary, format it, and write it to the gzip stream.
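A minimal sketch of that bookkeeping, with hypothetical types (note the caveats raised above: there is little locality in practice, and an answer can appear before its question):

```csharp
using System.Collections.Generic;

class PendingQuestion
{
    public string QuestionXml;
    public int AnswersLeft;
    public List<string> AnswerXml = new List<string>();
}

class PopWhenFull
{
    readonly Dictionary<long, PendingQuestion> pending =
        new Dictionary<long, PendingQuestion>();

    public void OnQuestion(long id, string xml, int answerCount)
    {
        var q = new PendingQuestion { QuestionXml = xml, AnswersLeft = answerCount };
        if (answerCount == 0)
            Write(q);          // nothing to wait for
        else
            pending[id] = q;
    }

    public void OnAnswer(long parentId, string xml)
    {
        if (!pending.TryGetValue(parentId, out var q))
            return; // answer seen before its question: needs extra buffering
        q.AnswerXml.Add(xml);
        if (--q.AnswersLeft == 0)
        {
            pending.Remove(parentId);
            Write(q);
        }
    }

    void Write(PendingQuestion q)
    {
        // emit the question element with its answers nested inside it,
        // straight into the GzipStream
    }
}
```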