The Enron Email Dataset

Email research took a large step forward as a positive side effect of the Enron debacle. Thanks to the foresight of a few researchers, we now have an email corpus from an actual, working corporation. The latest release of the dataset can be downloaded from http://www.cs.cmu.edu/~enron, thanks to William Cohen. Another page discussing the Enron corpus can be found at Ron Bekkerman's homepage.

Some useful files

MD5 Digest to Relative Filepath Mapping

The Enron distribution above (the March 2, 2004 version) contains 517,431 distinct files in 150 ''user folders''. Using the MD5 digest of each email body, one can identify 250,484 ''unique'' emails. A mapping file lists the MD5 digest and relative filepath for every file in the Enron corpus.

This file was constructed as follows. A first pass calculated the MD5 digest of each email message. Files with the same MD5 digest were then grouped by timestamp. Those within a day of each other were considered the same message, and a revised MD5 digest was calculated for each MD5-date grouping by appending the date of the earliest message in the grouping to the email body. This still left some messages with the same MD5 digest but multiple authors, so de-duplication by this ad hoc method is clearly not perfect. Caveat emptor!
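The two-pass procedure described above can be sketched roughly as follows. This is not the script that produced the mapping file, only a minimal illustration. The function names (`body_md5`, `revised_digests`), the `(path, body, timestamp)` tuple layout, the UTF-8 encoding, and the ISO date string appended to the body are all assumptions. "Within a day of each other" is interpreted here as within one day of the earliest message in the grouping, which is only one possible reading.

```python
import hashlib
from collections import defaultdict
from datetime import datetime, timedelta
from typing import Dict, Iterable, Tuple


def body_md5(text: str) -> str:
    """Hex MD5 digest of a string (UTF-8 encoding is an assumption)."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def revised_digests(
    messages: Iterable[Tuple[str, str, datetime]],
    window: timedelta = timedelta(days=1),
) -> Dict[str, str]:
    """Map each relative path to a revised digest.

    `messages` yields (relative_path, body, timestamp) tuples; this
    layout is hypothetical.
    """
    # First pass: group files by the MD5 digest of the body.
    by_digest = defaultdict(list)
    for path, body, sent in messages:
        by_digest[body_md5(body)].append((sent, path, body))

    revised = {}
    for group in by_digest.values():
        group.sort(key=lambda item: item[0])  # earliest message first
        start = None
        digest = None
        for sent, path, body in group:
            # Assumption: a file joins the current grouping while it is
            # within `window` of that grouping's earliest message.
            if start is None or sent - start > window:
                start = sent
                # Second pass: append the earliest date to the body.
                digest = body_md5(body + start.date().isoformat())
            revised[path] = digest
    return revised
```

Two files with identical bodies sent two hours apart receive the same revised digest under this sketch, while an identical body sent three days later receives a different one.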

MD5 to Authors Mapping

Using the MD5 to filepath mapping above, one can construct a mapping file showing the authors found in the headers of all the files with a given MD5 digest. Since the de-duplication process detailed above is not perfect, some ''unique'' emails have multiple authors. The format of the file is: <MD5 digest> <author email address> -%- <author email address> ... Note that the ''-%-'' symbol is used to separate email addresses because the addresses themselves can contain spaces.
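A line in this format could be parsed along the following lines. This is a minimal sketch: the function name `parse_mapping_line` is hypothetical, and it assumes the digest is separated from the first address by whitespace and that blank lines are filtered out by the caller.

```python
from typing import List, Tuple


def parse_mapping_line(line: str, sep: str = "-%-") -> Tuple[str, List[str]]:
    """Split '<digest> <addr> -%- <addr> ...' into (digest, [addr, ...])."""
    # The digest is the first whitespace-delimited token; everything
    # after it is the separator-delimited address list.
    parts = line.strip().split(None, 1)
    digest = parts[0]
    rest = parts[1] if len(parts) > 1 else ""
    # Addresses may contain spaces, so split only on the separator.
    addresses = [a.strip() for a in rest.split(sep) if a.strip()]
    return digest, addresses
```

Splitting on the separator rather than on whitespace is what keeps an address that contains a space in one piece.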

MD5 to Recipients Mapping

Likewise, we can construct a mapping file between the MD5 digest of a unique email and all the recipients that appear in the ''To'', ''CC'', and ''BCC'' fields in the header. The format of this file is the same as that of the MD5 to authors mapping: one line per unique MD5 digest, followed by a ''-%-'' separated list of the extracted recipient email addresses.
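As an illustration, recipients could be collected from a raw message with Python's standard `email` package. How the addresses in the actual mapping file were extracted is not described here, so this is only a sketch under assumptions: the function names `recipients` and `mapping_line` are hypothetical, and lowercasing and de-duplicating the addresses are choices made for this example.

```python
from email import message_from_string
from email.utils import getaddresses
from typing import List


def recipients(raw_message: str) -> List[str]:
    """Return the To, Cc and Bcc addresses of a raw message, in order."""
    msg = message_from_string(raw_message)
    values = []
    for field in ("To", "Cc", "Bcc"):  # header lookup is case-insensitive
        values.extend(msg.get_all(field, []))
    found = []
    for _name, address in getaddresses(values):
        address = address.lower()  # assumption: compare case-insensitively
        if address and address not in found:
            found.append(address)
    return found


def mapping_line(digest: str, addresses: List[str]) -> str:
    """Format one '<digest> <addr> -%- <addr> ...' line."""
    return digest + " " + " -%- ".join(addresses)
```

The standard parser may treat malformed addresses differently from a hand-written extractor, so results on the corpus may not match the published mapping file exactly.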

The Folder Users

Even though the corpus putatively contains 150 users, my exploration of the data shows that there are really 149. A mapping is available between the top-level folders in the corpus and my normalized email address for each folder's author.

The Enron corpus is a ''real'' dataset. It contains multiple inconsistencies; in particular, it has multiple email addresses for the same users. I have created a mapping between the ''raw'' email address and the normalized email address for the 149 Enron folder users.
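Once such a raw-to-normalized mapping is loaded, applying it is a simple lookup. The sketch below takes the mapping as in-memory pairs, since the layout of the mapping file is not specified here. The function name `build_normalizer`, the case-insensitive matching, and the fallback to the lowercased raw address for unmapped addresses are all assumptions.

```python
from typing import Callable, Iterable, Tuple


def build_normalizer(pairs: Iterable[Tuple[str, str]]) -> Callable[[str], str]:
    """Build a function mapping a raw address to its normalized form.

    `pairs` yields (raw_address, normalized_address) tuples.
    """
    table = {raw.strip().lower(): norm for raw, norm in pairs}

    def normalize(address: str) -> str:
        key = address.strip().lower()
        # Assumption: addresses missing from the mapping are returned
        # lowercased and otherwise unchanged.
        return table.get(key, key)

    return normalize
```

With a normalizer in hand, several raw aliases of the same folder user collapse to one address before counting authors or recipients.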

Making Word Lists of the Enron Email Corpus

These Python scripts show you how to extract word lists from the Enron corpus. They depend on the MD5 to filepath mapping discussed above.
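The scripts themselves are not reproduced here. As a rough idea of the approach, the sketch below picks one file per unique digest from the mapping and builds a sorted word list from message bodies. The names `one_path_per_digest` and `word_list`, the whitespace-separated `<digest> <path>` line layout, and the tokenization rule (lowercased runs of ASCII letters) are assumptions, not the conventions of the original scripts.

```python
import re
from collections import Counter
from typing import Dict, Iterable, List

WORD = re.compile(r"[a-z]+")  # assumption: a word is a run of ASCII letters


def one_path_per_digest(mapping_lines: Iterable[str]) -> Dict[str, str]:
    """Keep the first relative path seen for each MD5 digest."""
    chosen = {}
    for line in mapping_lines:
        parts = line.split(None, 1)
        if len(parts) == 2:  # skip blank or malformed lines
            chosen.setdefault(parts[0], parts[1].strip())
    return chosen


def word_list(bodies: Iterable[str], min_count: int = 1) -> List[str]:
    """Return the sorted words occurring at least `min_count` times."""
    counts = Counter()
    for body in bodies:
        counts.update(WORD.findall(body.lower()))
    return sorted(word for word, n in counts.items() if n >= min_count)
```

Reading only one file per digest avoids counting the words of a duplicated message several times.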