Importing the Stack Overflow dataset into RavenDB
Around 2017 we needed to test RavenDB with realistic datasets. That was the time that we were working hard on the 4.0 release, and we wanted to have some common dataset that was production quality (for all the benefits and complications that this brings) to play with.
A serious issue was that we needed that dataset to also be public, because we wanted to discuss its details. The default dataset people usually talk about in such a scenario is the Enron emails, but that is around half a million documents and quite small, all things considered.
Luckily for us, Stack Overflow has made their dataset publicly available in a machine readable format. That means that we could take that, adapt that to RavenDB and use that to test various aspects of our behaviors with realistic data.
The data is distributed as a set of XML files, so I quickly wrote something that would convert the data to a JSON format and adapt the model to a more relational one. The end result was a dataset with 18 million documents and with a hefty size of 52 GB. I remember that at the time, working with this data was a lengthy process. Importing the data took a long time and indexing even longer.
A few years later, this is still our go-to dataset for anything involving non-trivial amount of data, but we have gotten to the point where the full process of working with it has shrunk significantly. It used to take 45+ minutes to import the data, now it takes less than 10, for example. Basically, we made RavenDB good enough that it wasn’t that much of a challenge.
Of course… Stack Overflow continues to publish their dataset… so I decided it was time to update their data again. I no longer have the code that I used to do the initial import, but the entire process was fairly simple.
You can look at the code that is used to do the import here. This is meant to be quick & dirty code, mind you. It is about 500 lines of code and handles a surprisingly large number of edge cases.
You can find the actual data dump here.
And the explanation about the schema is here.
There is also a database diagram here.
In case you missed the point, the idea is that I want to remember how I did it for the next time I'll want to refresh our dataset.
So far, I imported a bunch of Stack Exchange communities:
- World Building – Just over 100K documents and 1 GB in size. Small enough to play with seamlessly.
- Super User – 1.85 million documents and weighing 4 GB in size. I think we’ll use that as the default database for showing things off on the Raspberry Pi edition.
- Stack Overflow – 40.5 million documents and exceeding 150 GB in size. This is a great scenario for working with a significant amount of data. That is likely to be our new default benchmarking database.
The other advantage is that everyone is familiar with Stack Overflow. It makes for a great demo when we can pull up realistic data on the fly.
It already gave me some interesting details to explore. For example, enabling documents compression mode for the Super User community reduced the disk utilization to under 2 GB. That is a great space-saving, and it means that we can easily fit the entire database on a small SD card and have a “RavenDB Server + Database in a box” as a Raspberry Pi.
The Stackoverflow dataset is 150GB without compression, with documents compression, it dropped to just 57GB, which is all kinds of amazing.
They make for great demos .
I'm very curious about what this process looks like with calculated values. Seems like these fields could be a map-reduce on Posts.
Yes, you can compute those, but given that they are already there in the source data, I included them.