Scrapy: the output from the site I'm scraping includes HTML elements

kxe2p93d  posted on 2023-08-05  in Other

I only need to scrape the table for the letter A. My code looks like this:

import scrapy


class ChallengeSpider(scrapy.Spider):
    name = "challenge"
    allowed_domains = ["laws.bahamas.gov.bs"]
    start_urls = ["http://laws.bahamas.gov.bs/cms/en/legislation/acts.html"]

The problem is that when I parse the page, HTML elements appear in the output. This is my parse function.

    def parse(self, response):
        css_selector = ".hasTip"

        rows = response.css(css_selector)
        for row in rows:
            title = row.css(".hasTip").get()
            source_url = row.css(".hasTip").get()
            date = row.css(".hasTip").get()
            yield {
                "title": title,
                "source_url": source_url,
                "date": date,
            }


The output is:

[
{"title": "<div id=\"alphabet\" class=\"hasTip\" title=\"Alphabetical Selection\" rel=\"\n\t\t    Click on one of the alphabetical buttons to select all Acts commencing with that letter. The selection will 'stick' even if you navigate to another page.\">\n            <input type=\"submit\" id=\"submitX\" name=\"submit4\" class=\"btn btn-primary\" value=\"A\"><input type=\"submit\" id=\"submit4\" name=\"submit4\" class=\"btn\" value=\"B\"><input type=\"submit\" id=\"submit4\" name=\"submit4\" class=\"btn\" value=\"C\"><input type=\"submit\" id=\"submit4\" name=\"submit4\" class=\"btn\" value=\"D\"><input type=\"submit\" id=\"submit4\" name=\"submit4\" class=\"btn\" value=\"E\"><input type=\"submit\" id=\"submit4\" name=\"submit4\" class=\"btn\" value=\"F\"><input type=\"submit\" id=\"submit4\" name=\"submit4\" class=\"btn\" value=\"G\"><input type=\"submit\" id=\"submit4\" name=\"submit4\" class=\"btn\" value=\"H\"><input type=\"submit\" id=\"submit4\" name=\"submit4\" class=\"btn\" value=\"I\"><input type=\"submit\" id=\"submit4\" name=\"submit4\" class=\"btn\" value=\"J\"><input type=\"submit\" id=\"submit4\" name=\"submit4\" class=\"btn\" value=\"K\"><input type=\"submit\" id=\"submit4\" name=\"submit4\" class=\"btn\" value=\"L\"><input type=\"submit\" id=\"submit4\" name=\"submit4\" class=\"btn\" value=\"M\"><input type=\"submit\" id=\"submit4\" name=\"submit4\" class=\"btn\" value=\"N\"><input type=\"submit\" id=\"submit4\" name=\"submit4\" class=\"btn\" value=\"O\"><input type=\"submit\" id=\"submit4\" name=\"submit4\" class=\"btn\" value=\"P\"><input type=\"submit\" id=\"submit4\" name=\"submit4\" class=\"btn\" value=\"Q\"><input type=\"submit\" id=\"submit4\" name=\"submit4\" class=\"btn\" value=\"R\"><input type=\"submit\" id=\"submit4\" name=\"submit4\" class=\"btn\" value=\"S\"><input type=\"submit\" id=\"submit4\" name=\"submit4\" class=\"btn\" value=\"T\"><input type=\"submit\" id=\"submit4\" name=\"submit4\" class=\"btn\" value=\"U\"><input 
type=\"submit\" id=\"submit4\" name=\"submit4\" class=\"btn\" value=\"V\"><input type=\"submit\" id=\"submit4\" name=\"submit4\" class=\"btn\" value=\"W\"><input type=\"submit\" id=\"submit4\" name=\"submit4\" class=\"btn\" value=\"X\"><input type=\"submit\" id=\"submit4\" name=\"submit4\" class=\"btn\" value=\"Y\"><input type=\"submit\" id=\"submit4\" name=\"submit4\" class=\"btn\" value=\"Z\">            <input type=\"hidden\" name=\"pointintime\" value=\"2023-07-26 00:00:00\">\n        </div>", "source_url": "<div id=\"alphabet\" class=\"hasTip\" title=\"Alphabetical Selection\" rel=\"\n\t\t    Click on one of the alphabetical buttons to select all Acts commencing with that letter. The selection will 'stick' even if you navigate to another page.\">\n            <input type=\"submit\" id=\"submitX\" name=\"submit4\" class=\"btn btn-primary\" value=\"A\"><input type=\"submit\" id=\"submit4\" name=\"submit4\" class=\"btn\" value=\"B\"><input type=\"submit\" id=\"submit4\" name=\"submit4\" class=\"btn\" value=\"C\"><input type=\"submit\" id=\"submit4\" name=\"submit4\" class=\"btn\" value=\"D\"><input type=\"submit\" id=\"submit4\" name=\"submit4\" class=\"btn\" value=\"E\"><input type=\"submit\" id=\"submit4\" name=\"submit4\" class=\"btn\" value=\"F\"><input type=\"submit\" id=\"submit4\" name=\"submit4\" class=\"btn\" value=\"G\"><input type=\"submit\" id=\"submit4\" name=\"submit4\" class=\"btn\" value=\"H\"><input type=\"submit\" id=\"submit4\" name=\"submit4\" class=\"btn\" value=\"I\"><input type=\"submit\" id=\"submit4\" name=\"submit4\" class=\"btn\" value=\"J\"><input type=\"submit\" id=\"submit4\" name=\"submit4\" class=\"btn\" value=\"K\"><input type=\"submit\" id=\"submit4\" name=\"submit4\" class=\"btn\" value=\"L\"><input type=\"submit\" id=\"submit4\" name=\"submit4\" class=\"btn\" value=\"M\"><input type=\"submit\" id=\"submit4\" name=\"submit4\" class=\"btn\" value=\"N\"><input type=\"submit\" id=\"submit4\" name=\"submit4\" class=\"btn\" 
value=\"O\"><input type=\"submit\" id=\"submit4\" name=\"submit4\" class=\"btn\" value=\"P\"><input type=\"submit\" id=\"submit4\" name=\"submit4\" class=\"btn\" value=\"Q\"><input type=\"submit\" id=\"submit4\" name=\"submit4\" class=\"btn\" value=\"R\"><input type=\"submit\" id=\"submit4\" name=\"submit4\" class=\"btn\" value=\"S\"><input type=\"submit\" id=\"submit4\" name=\"submit4\" class=\"btn\" value=\"T\"><input type=\"submit\" id=\"submit4\" name=\"submit4\" class=\"btn\" value=\"U\"><input type=\"submit\" id=\"submit4\" name=\"submit4\" class=\"btn\" value=\"V\"><input type=\"submit\" id=\"submit4\" name=\"submit4\" class=\"btn\" value=\"W\"><input type=\"submit\" id=\"submit4\" name=\"submit4\" class=\"btn\" value=\"X\"><input type=\"submit\" id=\"submit4\" name=\"submit4\" class=\"btn\" value=\"Y\"><input type=\"submit\" id=\"submit4\" name=\"submit4\" class=\"btn\" value=\"Z\">            <input type=\"hidden\" name=\"pointintime\" value=\"2023-07-26 00:00:00\">\n        </div>", "date": "<div id=\"alphabet\" class=\"hasTip\" title=\"Alphabetical Selection\" rel=\"\n\t\t    Click on one of the alphabetical buttons to select all Acts commencing with that letter. 
The selection will 'stick' even if you navigate to another page.\">\n            <input type=\"submit\" id=\"submitX\" name=\"submit4\" class=\"btn btn-primary\" value=\"A\"><input type=\"submit\" id=\"submit4\" name=\"submit4\" class=\"btn\" value=\"B\"><input type=\"submit\" id=\"submit4\" name=\"submit4\" class=\"btn\" value=\"C\"><input type=\"submit\" id=\"submit4\" name=\"submit4\" class=\"btn\" value=\"D\"><input type=\"submit\" id=\"submit4\" name=\"submit4\" class=\"btn\" value=\"E\"><input type=\"submit\" id=\"submit4\" name=\"submit4\" class=\"btn\" value=\"F\"><input type=\"submit\" id=\"submit4\" name=\"submit4\" class=\"btn\" value=\"G\"><input type=\"submit\" id=\"submit4\" name=\"submit4\" class=\"btn\" value=\"H\"><input type=\"submit\" id=\"submit4\" name=\"submit4\" class=\"btn\" value=\"I\"><input type=\"submit\" id=\"submit4\" name=\"submit4\" class=\"btn\" value=\"J\"><input type=\"submit\" id=\"submit4\" name=\"submit4\" class=\"btn\" value=\"K\"><input type=\"submit\" id=\"submit4\" name=\"submit4\" class=\"btn\" value=\"L\"><input type=\"submit\" id=\"submit4\" name=\"submit4\" class=\"btn\" value=\"M\"><input type=\"submit\" id=\"submit4\" name=\"submit4\" class=\"btn\" value=\"N\"><input type=\"submit\" id=\"submit4\" name=\"submit4\" class=\"btn\" value=\"O\"><input type=\"submit\" id=\"submit4\" name=\"submit4\" class=\"btn\" value=\"P\"><input type=\"submit\" id=\"submit4\" name=\"submit4\" class=\"btn\" value=\"Q\"><input type=\"submit\" id=\"submit4\" name=\"submit4\" class=\"btn\" value=\"R\"><input type=\"submit\" id=\"submit4\" name=\"submit4\" class=\"btn\" value=\"S\"><input type=\"submit\" id=\"submit4\" name=\"submit4\" class=\"btn\" value=\"T\"><input type=\"submit\" id=\"submit4\" name=\"submit4\" class=\"btn\" value=\"U\"><input type=\"submit\" id=\"submit4\" name=\"submit4\" class=\"btn\" value=\"V\"><input type=\"submit\" id=\"submit4\" name=\"submit4\" class=\"btn\" value=\"W\"><input type=\"submit\" id=\"submit4\" 
name=\"submit4\" class=\"btn\" value=\"X\"><input type=\"submit\" id=\"submit4\" name=\"submit4\" class=\"btn\" value=\"Y\"><input type=\"submit\" id=\"submit4\" name=\"submit4\" class=\"btn\" value=\"Z\">            <input type=\"hidden\" name=\"pointintime\" value=\"2023-07-26 00:00:00\">\n        </div>"},
{"title": "<td class=\"hasTip minColumn hidden-phone\" title=\"Notes Relating to this Statute\" rel=\"\n
]


What I need to do is prepend http://laws.bahamas.gov.bs to the URLs of the PDF files and clean up the data I scrape. What else do I need to do to get what I need?

hgc7kmma #1

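It looks like your CSS selector matches more than you intended. .hasTip is a class that is present on every cell of the table, so each element you loop over is a single cell rather than a whole row. Calling .get() on such an element returns its full HTML, which is why the markup shows up in your output.
I think you can do this to get all the rows of interest: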

response.css("table.table > tbody > tr.row0")

Then, while iterating over each row, you can get the information you need like this:

title = row.css("a::text").get()
source_url = row.css("a").attrib["href"]
...
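
Putting the two snippets together, the loop might look roughly like the sketch below. It is only a sketch under some assumptions: iter_acts is a hypothetical helper name, the row selector is the one suggested above and has not been checked against the live page, and the date field is left out because its column is not identified here. response.urljoin resolves a relative href against the page URL, which should take care of adding http://laws.bahamas.gov.bs to the PDF links.

```python
def iter_acts(response):
    """Yield one cleaned item per matching table row of a Scrapy response.

    `iter_acts` is a hypothetical helper; the row selector is the one
    suggested in the answer and is assumed to match the rows you want.
    """
    for row in response.css("table.table > tbody > tr.row0"):
        # ::text and ::attr(href) return plain strings instead of raw HTML.
        title = row.css("a::text").get()
        href = row.css("a::attr(href)").get()
        if title is None or href is None:
            continue  # skip rows that contain no link
        yield {
            # strip() removes the surrounding whitespace and newlines
            "title": title.strip(),
            # urljoin turns a relative href into an absolute URL
            "source_url": response.urljoin(href),
        }
```

Inside the spider, parse could then be reduced to a single line, yield from iter_acts(response).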


Hope this helps!
