Scrapy: retrieving text data from <content:encoded> in an XML file

Posted by tjjdgumg on 2022-11-09 in Other

I have an XML file that looks like this:

<rss version="2.0"
    xmlns:excerpt="http://wordpress.org/export/1.2/excerpt/"
    xmlns:content="http://purl.org/rss/1.0/modules/content/"
    xmlns:wfw="http://wellformedweb.org/CommentAPI/"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:wp="http://wordpress.org/export/1.2/"
>

<channel>

<item>
        <title>Label: some_title&quot;</title>
        <link>some_link</link>
        <pubDate>some_date</pubDate>
        <dc:creator><![CDATA[University]]></dc:creator>
        <guid isPermaLink="false">https://link.link</guid>
        <description></description>
        <content:encoded><![CDATA[[vc_row][vc_column][vc_column_text]<strong>some text<a href="https://link.link" target="_blank" rel="noopener noreferrer">text</a> some more text</strong><!--more-->

[caption id="attachment_344" align="aligncenter" width="524"]<img class="-image-" src="link.link.png" alt="" width="524" height="316" /> <em>A <a href="link.link" target="_blank" rel="noopener noreferrer">screenshot</a> by the people</em>[/caption]

&nbsp;

<strong>some more text</strong>

&nbsp;
<div class="entry-content">

<em>Leave your comments</em>

</div>
<div class="post-meta wf-mobile-collapsed">
<div class="entry-meta"></div>
</div>
[/vc_column_text][/vc_column][/vc_row][vc_row][vc_column][/vc_column][/vc_row][vc_row][vc_column][dt_quote]<strong><b>RESEARCH | ARTICLE </b></strong>University[/dt_quote][/vc_column][/vc_row]]]></content:encoded>
        <excerpt:encoded><![CDATA[]]></excerpt:encoded>
</item>
some more <item> </item>s here
</channel>

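I want to extract the raw text inside the <content:encoded> section, without the tags and URLs. I have tried BeautifulSoup, Scrapy, and other lxml-based approaches. Most of them return an empty list.
Is there a way to retrieve this information without using regex?
Many thanks.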

Update

I opened the XML file with the following code:

content = []
with open(xml_file, "r") as file:
    content = file.readlines()
    content = "".join(content)
    xml = bs(content, "lxml")

Then I tried this with Scrapy:

response = HtmlResponse(url=xml_file, encoding='utf-8')

response.selector.register_namespace('content', 
                                     'http://purl.org/rss/1.0/modules/content/')
response.xpath('channel/item/content:encoded').getall()

This returns an empty list.
I also tried the code from the first answer:

soup = bs(xml.select_one("content:encoded").text, "html.parser")
text = "\n".join(
    s.get_text(strip=True, separator=" ") for s in soup.select("strong"))
print(text)

and got the following error: Only the following pseudo-classes are implemented: nth-of-type.
When I opened the file with lxml, I ran the following for loop:

data = {}
n = 0

for item in xml.findall('item'):
  id = 'claim_id_' + str(n)
  keys = {}
  title = item.find('title').text
  keys['label'] = title.split(': ')[0]
  keys['claim'] = title.split(': ')[1]
  if item.find('content:encoded'):
    keys['text'] = bs(html.unescape(item.encoded.text), 'lxml')
  data[id] = keys
  print(data)
  n += 1

It saves the label and the claim fine, but not the text. Now that I open the file with BeautifulSoup, the loop fails with the following error: 'NoneType' object is not callable
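Two details in the loop above are worth checking. An ElementTree-style `find` needs a prefix-to-URI map before it can resolve `content:encoded`, and `if item.find(...)` tests truthiness, which is false for an element that has no child elements even when it contains text. The second error has a different cause: a BeautifulSoup object has no `findall` method (its method is `find_all`), so `xml.findall` is treated as a lookup for a tag named `findall` and yields `None`. The following is a minimal sketch using only the standard library, not the asker's exact setup; `XML_DOC` and `NS` are illustrative names and the document is a trimmed-down stand-in for the real export file.

```python
import xml.etree.ElementTree as ET

# Trimmed-down stand-in for the export file in the question (assumption).
XML_DOC = """<rss version="2.0"
    xmlns:content="http://purl.org/rss/1.0/modules/content/">
<channel>
<item>
    <title>Label: some_title</title>
    <content:encoded><![CDATA[<strong>some text</strong>]]></content:encoded>
</item>
</channel>
</rss>"""

# Prefix-to-URI map; the prefix only has to match the one used in the path.
NS = {"content": "http://purl.org/rss/1.0/modules/content/"}

root = ET.fromstring(XML_DOC)
data = {}
for n, item in enumerate(root.iter("item")):
    label, claim = item.findtext("title").split(": ", 1)
    keys = {"label": label, "claim": claim}
    encoded = item.find("content:encoded", NS)
    # Compare with None: an element without child elements is falsy.
    if encoded is not None and encoded.text:
        keys["text"] = encoded.text  # still raw HTML plus shortcodes
    data[f"claim_id_{n}"] = keys

print(data)
```

The `text` value here is still the raw CDATA payload, so the markup inside it has to be stripped in a separate step.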

Answer 1 (piztneat)

If you only need the text inside the <strong> tags, you can use my example. Otherwise, regex looks like the only suitable approach here:

from bs4 import BeautifulSoup

xml_doc = """
<rss version="2.0"
    xmlns:excerpt="http://wordpress.org/export/1.2/excerpt/"
    xmlns:content="http://purl.org/rss/1.0/modules/content/"
    xmlns:wfw="http://wellformedweb.org/CommentAPI/"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:wp="http://wordpress.org/export/1.2/"
>

...the XML from the question...

</rss>
"""

soup = BeautifulSoup(xml_doc, "xml")

soup = BeautifulSoup(soup.select_one("content|encoded").text, "html.parser")

text = "\n".join(
    s.get_text(strip=True, separator=" ") for s in soup.select("strong")
)
print(text)

Prints:

some text text some more text
some more text
RESEARCH | ARTICLE
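
If the text outside the <strong> tags is needed as well, one regex-free option is to collect every text node from the HTML fragment. The following is a rough sketch with the standard library's `html.parser`; `TextExtractor`, `html_to_text` and `sample` are hypothetical names, and the sample is a shortened variant of the question's markup. WordPress shortcodes such as `[vc_row]` are not HTML, so this step leaves them in place as ordinary text.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes and ignores tags, attributes and comments."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &nbsp; into characters.
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def html_to_text(fragment):
    parser = TextExtractor()
    parser.feed(fragment)
    parser.close()  # flushes any text that is still buffered
    # split() also treats the non-breaking space as whitespace.
    return " ".join("".join(parser.chunks).split())


sample = (
    '<strong>some text <a href="https://link.link">text</a>'
    " some more text</strong><!--more-->\n&nbsp;\n"
    "<em>Leave your comments</em>"
)
print(html_to_text(sample))
```

Text nodes are joined without a separator, so words that touch in the markup stay joined in the result.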

Answer 2 (zvokhttg)

I ended up using regular expressions (regex) to get the text part.

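import re
from lxml import etree  # assumption: the answer does not show how `root` was built

root = etree.parse(xml_file).getroot()  # xml_file is the path from the question

for item in root.iter('item'):
  for child in item:  # iterate the children directly; getchildren() is deprecated
    # 'encoded' matches both content:encoded and excerpt:encoded; skip empty ones
    if 'encoded' in child.tag and child.text:
      text = child.text
      text = re.sub(r'\[.*?\]', "", text)   # gets rid of square brackets and their content
      text = re.sub(r'\<.*?\>', "", text)   # gets rid of <> signs and their content
      text = text.replace("&nbsp;", "")   # gets rid of &nbsp;
      text = " ".join(text.split())   # collapses runs of whitespace
      print(text)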
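
Since the question asked for a way without regex, the square-bracket step can also be written with plain string searches. This is only a sketch under stated assumptions: `strip_shortcodes` is a hypothetical helper, it treats every `[...]` span as a shortcode, and unlike the `\[.*?\]` pattern above it also removes spans that cross a line break.

```python
def strip_shortcodes(text):
    """Remove every [...] span; an unclosed '[' is left untouched."""
    out = []
    pos = 0
    while True:
        start = text.find("[", pos)
        end = text.find("]", start + 1) if start != -1 else -1
        if start == -1 or end == -1:
            out.append(text[pos:])  # no complete span left, keep the rest
            break
        out.append(text[pos:start])  # keep the text before the bracket
        pos = end + 1                # continue after the closing bracket
    return "".join(out)


sample = (
    "[vc_row][vc_column][dt_quote]RESEARCH | ARTICLE University"
    "[/dt_quote][/vc_column][/vc_row]"
)
print(strip_shortcodes(sample))
```

Ordinary prose that happens to contain square brackets would be removed too, which is the same trade-off the regex version makes.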
