从Outlook中的HTML表提取文本

neskvpey  于 2022-12-16  发布在  其他
关注(0)|答案(1)|浏览(130)

我需要从一个HTML表中提取值,该表每周都会收到电子邮件。我的循环是检查表的指定文件夹中的每封电子邮件,因为有一些现有的电子邮件包含我想要捕获的信息。最终,这些数据将被推送到Pandas Dataframe 中进行额外的操作。
该表有4个标题列,但5个数据列,这是造成各种错误的Pandas。下面的代码让我接近最终结果,但仍然有问题。数据被填充为1x 24在XLSX文件。我需要的数据在5x 5,但不知道我如何才能完成这一点与标题行只有4列。

import win32com.client
import pandas as pd
from bs4 import BeautifulSoup

# email specs
outlook = win32com.client.Dispatch('outlook.application')
mapi = outlook.GetNamespace("MAPI")
inbox = mapi.GetDefaultFolder(6).Folders["TestFolder"]
messages = inbox.Items

# read email
for message in messages:
    body_content = message.HTMLBody
    bs = BeautifulSoup(body_content, features="lxml")
    table = bs.find_all("table")
    for row in table:
        r = row.find_all('span')
        df = pd.DataFrame(r)

    # save dataframe
    df.to_excel('df.xlsx')

HTML表格文件示例如下-

<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40"><head>

<meta name="Generator" content="Microsoft Word 15 (filtered medium)">
<style><!--
/* Font Definitions */
@font-face
    {font-family:"Cambria Math";
    panose-1:2 4 5 3 5 4 6 3 2 4;}
@font-face
    {font-family:Calibri;
    panose-1:2 15 5 2 2 2 4 3 2 4;}
@font-face
    {font-family:"Book Antiqua";
    panose-1:2 4 6 2 5 3 5 3 3 4;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
    {margin:0in;
    margin-bottom:.0001pt;
    font-size:12.0pt;
    font-family:"Times New Roman",serif;}
a:link, span.MsoHyperlink
    {mso-style-priority:99;
    color:#0563C1;
    text-decoration:underline;}
a:visited, span.MsoHyperlinkFollowed
    {mso-style-priority:99;
    color:#954F72;
    text-decoration:underline;}
p.msonormal0, li.msonormal0, div.msonormal0
    {mso-style-name:msonormal;
    mso-margin-top-alt:auto;
    margin-right:0in;
    mso-margin-bottom-alt:auto;
    margin-left:0in;
    font-size:12.0pt;
    font-family:"Times New Roman",serif;}
p.xl1519243, li.xl1519243, div.xl1519243
    {mso-style-name:xl1519243;
    mso-margin-top-alt:auto;
    margin-right:0in;
    mso-margin-bottom-alt:auto;
    margin-left:0in;
    font-size:11.0pt;
    font-family:"Calibri",sans-serif;
    color:black;}
p.xl7119243, li.xl7119243, div.xl7119243
    {mso-style-name:xl7119243;
    mso-margin-top-alt:auto;
    margin-right:0in;
    mso-margin-bottom-alt:auto;
    margin-left:0in;
    text-align:center;
    background:#C4BD97;
    border:none;
    padding:0in;
    font-size:8.0pt;
    font-family:"Book Antiqua",serif;
    color:black;
    font-weight:bold;}
p.xl7219243, li.xl7219243, div.xl7219243
    {mso-style-name:xl7219243;
    mso-margin-top-alt:auto;
    margin-right:0in;
    mso-margin-bottom-alt:auto;
    margin-left:0in;
    border:none;
    padding:0in;
    font-size:10.0pt;
    font-family:"Calibri",sans-serif;
    color:black;}
p.xl7319243, li.xl7319243, div.xl7319243
    {mso-style-name:xl7319243;
    mso-margin-top-alt:auto;
    margin-right:0in;
    mso-margin-bottom-alt:auto;
    margin-left:0in;
    border:none;
    padding:0in;
    font-size:10.0pt;
    font-family:"Calibri",sans-serif;
    color:black;}
p.xl7419243, li.xl7419243, div.xl7419243
    {mso-style-name:xl7419243;
    mso-margin-top-alt:auto;
    margin-right:0in;
    mso-margin-bottom-alt:auto;
    margin-left:0in;
    text-align:center;
    background:#C4BD97;
    border:none;
    padding:0in;
    font-size:8.0pt;
    font-family:"Book Antiqua",serif;
    color:black;
    font-weight:bold;}
p.xl7519243, li.xl7519243, div.xl7519243
    {mso-style-name:xl7519243;
    mso-margin-top-alt:auto;
    margin-right:0in;
    mso-margin-bottom-alt:auto;
    margin-left:0in;
    text-align:center;
    background:#C4BD97;
    border:none;
    padding:0in;
    font-size:8.0pt;
    font-family:"Book Antiqua",serif;
    color:black;
    font-weight:bold;}
p.xl1519240, li.xl1519240, div.xl1519240
    {mso-style-name:xl1519240;
    mso-margin-top-alt:auto;
    margin-right:0in;
    mso-margin-bottom-alt:auto;
    margin-left:0in;
    font-size:11.0pt;
    font-family:"Calibri",sans-serif;
    color:black;}
p.xl7119240, li.xl7119240, div.xl7119240
    {mso-style-name:xl7119240;
    mso-margin-top-alt:auto;
    margin-right:0in;
    mso-margin-bottom-alt:auto;
    margin-left:0in;
    text-align:center;
    background:#C4BD97;
    border:none;
    padding:0in;
    font-size:8.0pt;
    font-family:"Book Antiqua",serif;
    color:black;
    font-weight:bold;}
p.xl7219240, li.xl7219240, div.xl7219240
    {mso-style-name:xl7219240;
    mso-margin-top-alt:auto;
    margin-right:0in;
    mso-margin-bottom-alt:auto;
    margin-left:0in;
    border:none;
    padding:0in;
    font-size:10.0pt;
    font-family:"Calibri",sans-serif;
    color:black;}
p.xl7319240, li.xl7319240, div.xl7319240
    {mso-style-name:xl7319240;
    mso-margin-top-alt:auto;
    margin-right:0in;
    mso-margin-bottom-alt:auto;
    margin-left:0in;
    border:none;
    padding:0in;
    font-size:10.0pt;
    font-family:"Calibri",sans-serif;
    color:black;}
p.xl7419240, li.xl7419240, div.xl7419240
    {mso-style-name:xl7419240;
    mso-margin-top-alt:auto;
    margin-right:0in;
    mso-margin-bottom-alt:auto;
    margin-left:0in;
    text-align:center;
    background:#C4BD97;
    border:none;
    padding:0in;
    font-size:8.0pt;
    font-family:"Book Antiqua",serif;
    color:black;
    font-weight:bold;}
p.xl7519240, li.xl7519240, div.xl7519240
    {mso-style-name:xl7519240;
    mso-margin-top-alt:auto;
    margin-right:0in;
    mso-margin-bottom-alt:auto;
    margin-left:0in;
    text-align:center;
    background:#C4BD97;
    border:none;
    padding:0in;
    font-size:8.0pt;
    font-family:"Book Antiqua",serif;
    color:black;
    font-weight:bold;}
span.EmailStyle30
    {mso-style-type:personal-compose;
    font-family:"Calibri",sans-serif;
    font-weight:bold;}
.MsoChpDefault
    {mso-style-type:export-only;
    font-size:10.0pt;}
@page WordSection1
    {size:8.5in 11.0in;
    margin:1.0in 1.0in 1.0in 1.0in;}
div.WordSection1
    {page:WordSection1;}
--></style><!--[if gte mso 9]><xml>
<o:shapedefaults v:ext="edit" spidmax="1026" />
</xml><![endif]--><!--[if gte mso 9]><xml>
<o:shapelayout v:ext="edit">
<o:idmap v:ext="edit" data="1" />
</o:shapelayout></xml><![endif]-->
</head>
<body lang="EN-US" link="#0563C1" vlink="#954F72">
<div class="WordSection1">
<p class="MsoNormal" style="margin-bottom:12.0pt"><b><span style="font-family:&quot;Calibri&quot;,sans-serif"><o:p>&nbsp;</o:p></span></b></p>
<p class="MsoNormal" align="center" style="text-align:center"><b><span style="font-family:&quot;Calibri&quot;,sans-serif">Lorem<o:p></o:p></span></b></p>
<p class="MsoNormal" style="margin-bottom:12.0pt"><span style="font-family:&quot;Calibri&quot;,sans-serif"><br>
<br>
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Phasellus ullamcorper nec leo sed blandit. Etiam fringilla aliquam erat, et congue mi malesuada eget. Vestibulum elit magna, fermentum eu orci sit amet, eleifend finibus felis. Sed massa orci, dignissim vel ex sit amet, fringilla imperdiet lorem. Phasellus euismod cursus lorem. Morbi tristique sapien a feugiat rutrum. Nulla egestas fermentum lorem, eu bibendum libero tristique in. Vestibulum ante ipsum primis in faucibus orci luctus et ultrices posuere cubilia curae; Nulla magna ante, ultrices eu ipsum vitae, vulputate consectetur neque. Nam vel nisi euismod, pulvinar mauris non, porta felis. Nulla pellentesque libero ut semper iaculis..<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-family:&quot;Calibri&quot;,sans-serif">&nbsp;&nbsp; <o:p></o:p></span></p>
<table class="MsoNormalTable" border="0" cellspacing="0" cellpadding="0" width="0" style="width:700.0pt;border-collapse:collapse">
<tbody>
<tr style="height:15.75pt">
<td width="293" nowrap="" colspan="2" style="width:220.0pt;border:solid windowtext 1.0pt;border-right:solid black 1.0pt;background:#C4BD97;padding:.75pt .75pt 0in .75pt;height:15.75pt">
<p class="MsoNormal" align="center" style="text-align:center"><b><span style="font-size:8.0pt;font-family:&quot;Book Antiqua&quot;,serif;color:black">Asset ID &amp; Name<o:p></o:p></span></b></p>
</td>
<td width="231" nowrap="" style="width:173.0pt;border:solid windowtext 1.0pt;border-left:none;background:#C4BD97;padding:.75pt .75pt 0in .75pt;height:15.75pt">
<p class="MsoNormal" align="center" style="text-align:center"><b><span style="font-size:8.0pt;font-family:&quot;Book Antiqua&quot;,serif;color:black">Lead Participant<o:p></o:p></span></b></p>
</td>
<td width="117" nowrap="" style="width:88.0pt;border:solid windowtext 1.0pt;border-left:none;background:#C4BD97;padding:.75pt .75pt 0in .75pt;height:15.75pt">
<p class="MsoNormal" align="center" style="text-align:center"><b><span style="font-size:8.0pt;font-family:&quot;Book Antiqua&quot;,serif;color:black">ISO Determination<o:p></o:p></span></b></p>
</td>
<td width="292" nowrap="" style="width:219.0pt;border:solid windowtext 1.0pt;border-left:none;background:#C4BD97;padding:.75pt .75pt 0in .75pt;height:15.75pt">
<p class="MsoNormal" align="center" style="text-align:center"><b><span style="font-size:8.0pt;font-family:&quot;Book Antiqua&quot;,serif;color:black">ISO Comment<o:p></o:p></span></b></p>
</td>
</tr>
<tr style="height:15.75pt">
<td nowrap="" style="border:solid windowtext 1.0pt;border-top:none;padding:.75pt .75pt 0in .75pt;height:15.75pt">
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Calibri&quot;,sans-serif;color:black">65133<o:p></o:p></span></p>
</td>
<td nowrap="" style="border-top:none;border-left:none;border-bottom:solid windowtext 1.0pt;border-right:solid windowtext 1.0pt;padding:.75pt .75pt 0in .75pt;height:15.75pt">
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Calibri&quot;,sans-serif;color:black">XXXXXX<o:p></o:p></span></p>
</td>
<td nowrap="" style="border-top:none;border-left:none;border-bottom:solid windowtext 1.0pt;border-right:solid windowtext 1.0pt;padding:.75pt .75pt 0in .75pt;height:15.75pt">
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Calibri&quot;,sans-serif;color:black">XXXXXX<o:p></o:p></span></p>
</td>
<td nowrap="" style="border-top:none;border-left:none;border-bottom:solid windowtext 1.0pt;border-right:solid windowtext 1.0pt;padding:.75pt .75pt 0in .75pt;height:15.75pt">
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Calibri&quot;,sans-serif;color:black">XXXXXX<o:p></o:p></span></p>
</td>
<td nowrap="" style="border-top:none;border-left:none;border-bottom:solid windowtext 1.0pt;border-right:solid windowtext 1.0pt;padding:.75pt .75pt 0in .75pt;height:15.75pt">
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Calibri&quot;,sans-serif;color:black">XXXXXX<o:p></o:p></span></p>
</td>
</tr>
<tr style="height:15.75pt">
<td nowrap="" style="border:solid windowtext 1.0pt;border-top:none;padding:.75pt .75pt 0in .75pt;height:15.75pt">
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Calibri&quot;,sans-serif;color:black">69067<o:p></o:p></span></p>
</td>
<td nowrap="" style="border-top:none;border-left:none;border-bottom:solid windowtext 1.0pt;border-right:solid windowtext 1.0pt;padding:.75pt .75pt 0in .75pt;height:15.75pt">
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Calibri&quot;,sans-serif;color:black">XXXXXX<o:p></o:p></span></p>
</td>
<td nowrap="" style="border-top:none;border-left:none;border-bottom:solid windowtext 1.0pt;border-right:solid windowtext 1.0pt;padding:.75pt .75pt 0in .75pt;height:15.75pt">
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Calibri&quot;,sans-serif;color:black">XXXXXX<o:p></o:p></span></p>
</td>
<td nowrap="" style="border-top:none;border-left:none;border-bottom:solid windowtext 1.0pt;border-right:solid windowtext 1.0pt;padding:.75pt .75pt 0in .75pt;height:15.75pt">
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Calibri&quot;,sans-serif;color:black">XXXXXX<o:p></o:p></span></p>
</td>
<td nowrap="" style="border-top:none;border-left:none;border-bottom:solid windowtext 1.0pt;border-right:solid windowtext 1.0pt;padding:.75pt .75pt 0in .75pt;height:15.75pt">
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Calibri&quot;,sans-serif;color:black">XXXXXX<o:p></o:p></span></p>
</td>
</tr>
<tr style="height:15.75pt">
<td nowrap="" style="border:solid windowtext 1.0pt;border-top:none;padding:.75pt .75pt 0in .75pt;height:15.75pt">
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Calibri&quot;,sans-serif;color:black">69070<o:p></o:p></span></p>
</td>
<td nowrap="" style="border-top:none;border-left:none;border-bottom:solid windowtext 1.0pt;border-right:solid windowtext 1.0pt;padding:.75pt .75pt 0in .75pt;height:15.75pt">
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Calibri&quot;,sans-serif;color:black">XXXXXX<o:p></o:p></span></p>
</td>
<td nowrap="" style="border-top:none;border-left:none;border-bottom:solid windowtext 1.0pt;border-right:solid windowtext 1.0pt;padding:.75pt .75pt 0in .75pt;height:15.75pt">
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Calibri&quot;,sans-serif;color:black">XXXXXX<o:p></o:p></span></p>
</td>
<td nowrap="" style="border-top:none;border-left:none;border-bottom:solid windowtext 1.0pt;border-right:solid windowtext 1.0pt;padding:.75pt .75pt 0in .75pt;height:15.75pt">
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Calibri&quot;,sans-serif;color:black">XXXXXX<o:p></o:p></span></p>
</td>
<td nowrap="" style="border-top:none;border-left:none;border-bottom:solid windowtext 1.0pt;border-right:solid windowtext 1.0pt;padding:.75pt .75pt 0in .75pt;height:15.75pt">
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Calibri&quot;,sans-serif;color:black">XXXXXX<o:p></o:p></span></p>
</td>
</tr>
<tr>
<td width="67" style="width:50.0pt;padding:0in 0in 0in 0in"></td>
<td width="227" style="width:170.0pt;padding:0in 0in 0in 0in"></td>
<td width="231" style="width:173.0pt;padding:0in 0in 0in 0in"></td>
<td width="117" style="width:88.0pt;padding:0in 0in 0in 0in"></td>
<td width="292" style="width:219.0pt;padding:0in 0in 0in 0in"></td>
</tr>
</tbody>
</table>
<p class="MsoNormal" style="margin-bottom:12.0pt"><span style="font-family:&quot;Calibri&quot;,sans-serif"><br>
<br>

预期产出为:

7vhp5slm

7vhp5slm1#

假设body_content是样本数据,并且每个电子邮件消息只有一个表(或者目标表是任何给定消息中的第一个表):

# get table
bs = BeautifulSoup(body_content, features="html.parser")
tables = bs.find_all("table")
table = tables[0]  # assumes it is always the first table in any message

# get table data
records = []
rows = table.findAll("tr")
for row in rows:
    cells = row.findAll("td")
    row_data = []
    for cell in cells:
        val = cell.find("p", attrs={"class": "MsoNormal"})
        try:
            row_data.append(val.text.strip())
        except AttributeError:
            row_data.append(None)
    if not all(v is None for v in row_data):  # don't add empty row like end of sample data
        records.append(row_data)

# since there are only four column headers, we will extract these out and add an extra one
headers = records.pop(0)
headers.append("Unknown Extra Column")

# create pandas dataframe
df = pd.DataFrame(records, columns=headers)

对于示例数据,Dataframe应如下所示:

Asset ID & Name Lead Participant ISO Determination ISO Comment  \
0           65133           XXXXXX            XXXXXX      XXXXXX   
1           69067           XXXXXX            XXXXXX      XXXXXX   
2           69070           XXXXXX            XXXXXX      XXXXXX   

  Unknown Extra Column  
0               XXXXXX  
1               XXXXXX  
2               XXXXXX

您需要更改循环中某些内容的嵌套,但这适用于所提供的示例数据。

相关问题