Parsing email threads in Python

Posted by xwbd5t1u on 2022-12-02 in Python


tl;dr questions:

  1. How can MIME content be parsed into threads (that is, lists of individual replies and forwards)?
  2. Are there any libraries that do that?
  3. Does Mime-Version: 1.0 standardize the way threads are represented?

I'm analyzing the Enron dataset ( https://www.cs.cmu.edu/~./enron/ ; the documents can also be browsed at http://www.enron-mail.com/email/ ). The dataset is a collection of ~500K emails. Each email is stored as a Mime-Version: 1.0 file, and there are no attachments.
This is a typical file:

Message-ID: <4250772.1075857358369.JavaMail.evans@thyme>
Date: Tue, 12 Dec 2000 09:19:00 -0800 (PST)
From: david.portz@enron.com
To: clint.dean@enron.com
Subject: City of Bryan Dec parking transactions
Cc: doug.gilbert-smith@enron.com, elizabeth.sager@enron.com, 
  melissa.murphy@enron.com
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Bcc: doug.gilbert-smith@enron.com, elizabeth.sager@enron.com, 
  melissa.murphy@enron.com
X-From: David Portz
X-To: Clint Dean
X-cc: Doug Gilbert-Smith, Elizabeth Sager, Melissa Ann Murphy
X-bcc: 
X-Folder: \Clint_Dean_Dec2000\Notes Folders\Notes inbox
X-Origin: Dean-C
X-FileName: cdean.nsf

Following discussions with you and Doug, attached is a draft parking
transaction agreement for your review and, if acceptable, for circualtion to
the counterparty.  Please call me with any questions.  --David

There is a handy, widely adopted Python library (the standard-library email package) that makes parsing these kinds of files easier:

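import email
import email.policy

# filename: path to one message file from the dataset
with open(filename, 'r') as f:
    parsed_email = email.message_from_file(f, policy=email.policy.default)
body = parsed_email.get_payload()  # the whole body as one string, quoted replies included
from_field = parsed_email['From']
...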

However, I didn't find a reliable way to further parse the email content into threads: sub_email_1 -> sub_email_2 -> ... -> sub_email_n. get_payload returns the entire body, with all replies and forwards together.
Here is an example of MIME with threads: https://justpaste.it/bf5zr (the file is 233 lines, so pasted separately). There is clearly a thread:

  1. Christi L Nicolay sent an email on 04/30/2001 02:20 PM
  2. later, Christi L Nicolay replied to her own email on 05/03/2001 09:23 PM
  3. Lloyd Will replied to that thread on 05/03/2001 09:26 PM
  4. Christi L Nicolay replied on 05/07/2001 11:47 AM
  5. Tom May forwarded the whole thread on Mon, 7 May 2001 06:58:00 -0700

Is there any library or existing solution that could do that? At a glance, the data seems to contain numerous small variations in how these threads are laid out: sometimes nested > > quote markers accompany the sub-emails, sometimes there is an ---Original Message--- line, etc. This seems far less well defined than the MIME header fields.
I can write a regex-backed Python script that parses one email or another, but it will not work universally across the whole Enron dataset. Some more examples of threads from the Enron dataset:
http://www.enron-mail.com/email/mann-k/discussion_threads/FW_Salmon_Energy_Turbine_Agreement_5.html
http://www.enron-mail.com/email/brawner-s/discussion_threads/Fw_Fw_TIGHT_SKIRTS_AND_TEXANS_2.html
http://www.enron-mail.com/email/brawner-s/_sent_mail/Fw_Time_Friends_3.html
That led me to question #3: whether the MIME format standardizes threads at all.
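
To make the regex-backed approach mentioned above concrete, here is a minimal sketch of what such a script might start from. The separator patterns (an "Original Message" line and a "Forwarded by ..." line wrapped in dashes) and the helper names split_thread and strip_quote_prefix are assumptions for illustration only. They cover just two layouts and will certainly miss other variants in the dataset.

```python
import re

# Assumed separator layouts (illustrative, not exhaustive):
#   -----Original Message-----
#   ---------------------- Forwarded by <someone> on <date> ----------------------
SEPARATOR_RE = re.compile(
    r"^[>\s]*-{2,}\s*(original message|forwarded by\b.*?)\s*-{2,}\s*$",
    re.IGNORECASE,
)
# Leading quote markers such as "> " or "> > ".
QUOTE_PREFIX_RE = re.compile(r"^(?:>\s?)+")


def strip_quote_prefix(line):
    """Remove leading '>' quote markers from a single line."""
    return QUOTE_PREFIX_RE.sub("", line)


def split_thread(body):
    """Split a plain-text body into segments in the order they appear.

    A new segment starts at every line that looks like a separator.
    The separator line itself is dropped, and empty segments are skipped.
    """
    segments = [[]]
    for line in body.splitlines():
        if SEPARATOR_RE.match(line):
            segments.append([])
            continue
        segments[-1].append(strip_quote_prefix(line))
    return [
        "\n".join(seg).strip()
        for seg in segments
        if any(part.strip() for part in seg)
    ]
```

Calling split_thread(body) on the string returned by get_payload gives one string per segment, with the top-level message first. Each segment would still need its own pseudo-header lines (From, Sent, To, ...) parsed, and those vary just as much as the separators do.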

pvabu6sv  #1

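Here are some options for parsing the Enron dataset into threads.
The O'Reilly book Mining the Social Web (2nd edition) by Matthew A. Russell has a chapter that covers parsing email threads in the Enron dataset and provides sample code for it:
Chapter 6. Mining Mailboxes: Analyzing Who’s Talking to Whom About What, How Often, and More
Here is a Python project on GitHub for parsing threads in a mailbox.
I'm still looking for other options for you.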
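
On question 3: as far as I know, Mime-Version: 1.0 only governs how the body is structured and encoded; it says nothing about threads. Where reply relationships are recorded at all, they live in the In-Reply-To and References header fields defined by RFC 5322. The sample file in the question has neither field, so for messages like that one the thread has to be recovered from the body text. For messages that do carry these headers, a minimal sketch of grouping replies by parent could look like this (build_reply_index is a hypothetical helper name, not part of any library):

```python
import email
import email.policy
from collections import defaultdict


def build_reply_index(raw_messages):
    """Map each parent Message-ID to the Message-IDs that reply to it.

    Assumes every reply carries an In-Reply-To header; messages without
    one (such as the sample in the question) are simply ignored here.
    """
    children = defaultdict(list)
    for raw in raw_messages:
        msg = email.message_from_string(raw, policy=email.policy.default)
        msg_id = msg["Message-ID"]
        parent = msg["In-Reply-To"]
        if msg_id and parent:
            children[str(parent).strip()].append(str(msg_id).strip())
    return dict(children)
```

This only links whole messages to each other. It does not split the quoted history inside a single body, which is the harder part of the question.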
