URL list is not being made in Scrapy spider

ltqd579y · asked on 2022-11-09

I have created a Scrapy spider that must crawl an entire web page and extract the URLs. Now I have to remove the social media URLs, so I want to build a list of the URLs, but somehow it is not working: when I try to append each URL to the list, the same URL just keeps getting appended over and over.

import re
import scrapy
all_urls = []
class QuotesSpider(scrapy.Spider):
    name = 'quotes'

    start_urls = [
            'https://www.wireshark.org/docs/dfref/i/ip.html',
        ]
    def parse(self, response):
        # Save the page body to a local file for inspection.
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        for r in response.css('a'):
            url = r.css('::attr(href)').get()
            print('all the urls are here', url)
            # Bug: this inner loop iterates over the characters of the
            # href string, appending the full URL once per character.
            for i in url:
                all_urls.append(url)
                print(all_urls)
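
The inner for i in url: loop is the culprit: it walks the characters of the href string, so the whole URL lands in the list once per character. A minimal sketch of a corrected parse method (a drop-in replacement for the one above, keeping the question's module-level all_urls list):

    def parse(self, response):
        for r in response.css('a'):
            url = r.css('::attr(href)').get()
            if url is not None:  # anchors without an href return None
                all_urls.append(url)  # append each href exactly once
        print(all_urls)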

EDIT:
I have removed the social media links and the other entries from the list, and now I want to go on and scrape each of those links. Please take a look at whether my approach is sound.

import requests
import scrapy
all_urls = []
remove = ["twitter"]
class QuotesSpider(scrapy.Spider):
    name = 'quotess'
    start_urls = [
            'https://www.wireshark.org/docs/dfref/i/ip.html',
         ]
    def parse(self, response):
        for url in response.css('a::attr(href)').getall():
            if url.startswith("http"):
                if 'twitter' not in url:
                    all_urls.append(url)
        print(all_urls)
        print(all_urls[0])
        for i in all_urls:
            # Synchronous fetch with requests, outside Scrapy's scheduler.
            response = requests.get(i)
            # print(response)
            yield response
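
Note that yielding a requests.Response from a Scrapy callback is not a valid return value; callbacks are expected to yield items, dicts, or scrapy.Request objects, and calling requests.get() bypasses Scrapy's scheduler entirely. A minimal sketch of the idiomatic alternative, with a hypothetical parse_link callback:

import scrapy

class QuotesSpider(scrapy.Spider):
    name = 'quotess'
    start_urls = ['https://www.wireshark.org/docs/dfref/i/ip.html']

    def parse(self, response):
        for url in response.css('a::attr(href)').getall():
            if url.startswith("http") and 'twitter' not in url:
                # Let Scrapy schedule and fetch the follow-up request.
                yield scrapy.Request(url, callback=self.parse_link)

    def parse_link(self, response):
        # Hypothetical callback: emit each scraped page as an item.
        yield {'url': response.url, 'status': response.status}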

jjjwad0x answered:

A much simpler way to get all of the URLs on the page is to chain the CSS selectors and call getall().
For example:

import scrapy
all_urls = []

class QuotesSpider(scrapy.Spider):
    name = 'quotes'

    start_urls = [
            'https://www.wireshark.org/docs/dfref/i/ip.html',
        ]
    def parse(self, response):
        # A single chained selector returns every href on the page.
        for url in response.css('a::attr(href)').getall():
            all_urls.append(url)
        print(all_urls)

Output

['/', '/news/', '#', '/index.html#aboutWS', '/index.html#download', 'https://blog.wireshark.org/', '/code-of-conduct.html', '#', 'https://ask.wireshark.org/', '/faq.html', '/docs/', '/lists/', '/tools/', 'https://gitlab.com/wireshark/wireshark/-/wikis', 'https://gitlab.com/wireshark/wireshark/-/issues', '#', '/develop.html', 'https://www.wireshark.org/docs/wsdg_html_chunked/', 'https://gitlab.com/wireshark/wireshark/-/tree/master', 'https://www.wireshark.org/download/automated', '../', 'https://twitter.com/wiresharknews', 'https://sysdig.com/privacy-policy/', '#', '#']
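
The output above mixes fragments ('#'), relative paths ('/news/'), and absolute URLs, so before applying the twitter filter from the question's edit you could resolve each href with response.urljoin(). A sketch of that variant of the loop:

    def parse(self, response):
        for url in response.css('a::attr(href)').getall():
            absolute = response.urljoin(url)  # resolve relative hrefs
            if 'twitter' not in absolute:
                all_urls.append(absolute)
        print(all_urls)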
