Working with the Enron dataset
Every now and then I need to do some work with text, and the Enron dataset is one of the best-known corpora.
I have written the parsing code for it so many times that it isn’t even funny. Therefore, I decided to make my life easier and post it somewhere I can refer back to.
This code simply unpacks the Enron dataset into .NET objects, from which you can start processing the text in interesting ways.
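For reference, here is a minimal sketch of what that kind of parsing can look like as a dotnet-script (.csx) file. It is an illustration under stated assumptions, not necessarily the exact code this post refers to: it assumes the archive has already been extracted to a directory where each file holds one email, and the `EnronMessage` class and `ReadMessages` function are hypothetical names made up for this example. The MimeKit calls (`MimeMessage.Load`, `From`, `To`, `Subject`, `Date`, `TextBody`) are the library's own.

```csharp
#r "nuget: MimeKitLite, 2.11.0"

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using MimeKit;

// Hypothetical shape for a parsed email; the field names are illustrative.
public class EnronMessage
{
    public string Id;
    public string From;
    public List<string> To;
    public string Subject;
    public DateTimeOffset Date;
    public string Body;
}

// Assumption: rootDir is the already-extracted dataset, where every
// file under it (at any depth) contains a single email message.
IEnumerable<EnronMessage> ReadMessages(string rootDir)
{
    foreach (var path in Directory.EnumerateFiles(rootDir, "*", SearchOption.AllDirectories))
    {
        MimeMessage mime;
        try
        {
            mime = MimeMessage.Load(path);
        }
        catch (FormatException)
        {
            continue; // skip files that do not parse as email
        }

        yield return new EnronMessage
        {
            Id = mime.MessageId,
            From = mime.From.Mailboxes.FirstOrDefault()?.Address,
            To = mime.To.Mailboxes.Select(m => m.Address).ToList(),
            Subject = mime.Subject,
            Date = mime.Date,
            Body = mime.TextBody
        };
    }
}

// Usage: dotnet script enron.csx -- /path/to/extracted/dataset
if (Args.Count > 0)
{
    foreach (var msg in ReadMessages(Args[0]).Take(5))
        Console.WriteLine($"{msg.Date:u} {msg.From}: {msg.Subject}");
}
```

Because `ReadMessages` is a lazy iterator, messages are parsed one at a time as you enumerate them, so you can stream through the whole corpus without holding it all in memory.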
Comments
I love using dotnet-script for this (https://github.com/filipw/dotnet-script)
It lets you use the #r "nuget: MimeKitLite, 2.11.0" syntax to reference NuGet packages, so you can just run the .csx script without any extra manual setup.