pandas: How to split scraped text and create a DataFrame?

mzsu5hc0  posted on 2022-11-05  in Other

Here is my code.

import requests
import re
import pandas as pd
from bs4 import BeautifulSoup
r = requests.get("https://www.gutenberg.org/browse/scores/top")
soup = BeautifulSoup(r.content, "lxml")
List1 = soup.find_all('ol')
List1

newlist = []
for List in List1:
    ulList = List.find_all('li')
    extend_list = []
    for li in ulList:
        #extend_list = []
        for link in li.find_all('a'):
            a = link.get_text()
        print(a)

My output is

1. I want to convert the output into a list of lists

[['A Room with a View by E. M.  Forster (37480)'], ['Middlemarch by George Eliot (34900)'], ['Little Women; Or, Meg, Jo, Beth, and Amy by Louisa May Alcott (31929)']]

2. Split each inner list into two parts

[["A Room with a View by E. M.  Forster", "37480"], ["Middlemarch by George Eliot", "34900"], ["Little Women; Or, Meg, Jo, Beth, and Amy by Louisa May Alcott", "31929"]]

3. Load the data into a DataFrame

ep6jt1vc  #1

You can do this in one step with a short regular expression and str.extract:

df = (pd.Series([e.text for e in soup.select('ol a')])
        .str.extract(r'(.*) \((\d+)\)$')
        .set_axis(['Ebooks', 'Code'], axis=1)
     )
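
To check what the pattern captures without requesting the page, here is a minimal offline sketch. The `samples` list is a hypothetical stand-in for the scraped link texts (the strings are taken from the question), and the conversion of `Code` to integers is an optional extra step, not part of the answer.

```python
import pandas as pd

# Hypothetical stand-in for [e.text for e in soup.select('ol a')]
samples = [
    "A Room with a View by E. M.  Forster (37480)",
    "Middlemarch by George Eliot (34900)",
    "Little Women; Or, Meg, Jo, Beth, and Amy by Louisa May Alcott (31929)",
]

# Group 1: everything before the final " (digits)"; group 2: the digits.
pattern = r"(.*) \((\d+)\)$"

df = pd.Series(samples).str.extract(pattern)
df.columns = ["Ebooks", "Code"]

# str.extract returns strings; convert if a numeric column is wanted
df["Code"] = df["Code"].astype(int)

print(df)
```

Texts that do not match the pattern come back as NaN in both columns, in which case `astype(int)` would fail; dropping those rows first with `dropna()` is one way to handle that.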

If you need the intermediate list of lists:

import re

L = [list(m.groups()) for e in soup.select('ol a')
     if (m:=re.search(r'(.*) \((\d+)\)$', e.text))]

df = pd.DataFrame(L, columns=['Ebooks', 'Code'])
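
The same idea can be tried on plain strings. The sketch below assumes a hypothetical `texts` list in place of the scraped links and shows that entries without a trailing `(digits)` are dropped by the `if` filter rather than raising an error.

```python
import re

# Hypothetical link texts; the last one has no trailing "(digits)"
texts = [
    "A Room with a View by E. M.  Forster (37480)",
    "Middlemarch by George Eliot (34900)",
    "No download count here",
]

pattern = re.compile(r"(.*) \((\d+)\)$")

# Keep only the texts that match, as [title, code] pairs
pairs = [list(m.groups()) for t in texts if (m := pattern.search(t))]

print(pairs)
```

Note that the assignment expression (`:=`) needs Python 3.8 or later.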

Output:

                                                Ebooks   Code
0                 A Room with a View by E. M.  Forster  37480
1                          Middlemarch by George Eliot  34900
2    Little Women; Or, Meg, Jo, Beth, and Amy by Lo...  31929
3           The Enchanted April by Elizabeth Von Arnim  31648
4        The Blue Castle: a novel by L. M.  Montgomery  30646
..                                                 ...    ...
395                           Hapgood, Isabel Florence  12240
396                                  Mill, John Stuart  12223
397                               Marlowe, Christopher  11760
398                                     Wharton, Edith  11728
399                           Burnett, Frances Hodgson  11630

[400 rows x 2 columns]

zfycwa2u  #2

Simplify the code and select the elements more specifically:

for e in soup.select('ol a'):
    data.append({
        'Ebook': e.text.split('(')[0].strip(),
        'Code': e.text.split('(')[-1].strip(')')
    })

Example

import requests
import pandas as pd
from bs4 import BeautifulSoup
r = requests.get("https://www.gutenberg.org/browse/scores/top")
soup = BeautifulSoup(r.content, "lxml")

data = []

for e in soup.select('ol a'):
    data.append({
        'Ebook': e.text.split('(')[0].strip(),
        'Code': e.text.split('(')[-1].strip(')')
    })
pd.DataFrame(data)

Output

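| | Ebook | Code |
| --- | --- | --- |
| 0 | A Room with a View by E. M.  Forster | 37480 |
| 1 | Middlemarch by George Eliot | 34900 |
| 2 | Little Women; Or, Meg, Jo, Beth, and Amy by Louisa May Alcott | 31929 |
| 3 | The Enchanted April by Elizabeth Von Arnim | 31648 |
| 4 | The Blue Castle: a novel by L. M.  Montgomery | 30646 |
| 5 | Moby Dick; Or, The Whale by Herman Melville | 30426 |
| 6 | The Complete Works of William Shakespeare by William Shakespeare | 30266 |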
...
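
One caveat with splitting on `'('`: a title that itself contains parentheses would be cut at the first one. A hedged alternative is to split on the last `'('` only, for example with `str.rpartition`. The helper name `split_entry` and the sample strings below are illustrative and not part of the original answer.

```python
def split_entry(text):
    """Split 'Title by Author (12345)' into (title, code) at the last '('."""
    title, sep, code = text.rpartition("(")
    if not sep:
        # No parenthesis at all: keep the text as the title, no code
        return text.strip(), None
    return title.strip(), code.strip().rstrip(")")


# Illustrative inputs; the second title is made up to contain parentheses
entries = [
    "Middlemarch by George Eliot (34900)",
    "A Title (Volume 1) by Some Author (12345)",
]

data = [dict(zip(("Ebook", "Code"), split_entry(t))) for t in entries]
print(data)
```

The resulting `data` list can be passed to `pd.DataFrame(data)` in the same way as in the example above.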
