I have written a Scrapy spider that crawls a whole web page and extracts the URLs. Now I need to remove the social-media URLs and build a list of the remaining ones, but somehow it is not working: when I try to append each URL to the list, it just keeps producing lists of URLs over and over.
import re
import scrapy

all_urls = []

class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = [
        'https://www.wireshark.org/docs/dfref/i/ip.html',
    ]

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        for r in response.css('a'):
            url = r.css('::attr(href)').get()
            print('all the urls are here', url)
            for i in url:
                all_urls.append(url)
                print(all_urls)
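For what it's worth, the inner loop is what multiplies the output: url is a string, so "for i in url" iterates over its characters, appending the same URL once per character and printing the ever-growing list on every pass. A minimal corrected sketch of the same spider (the file-saving part is dropped for brevity):

import scrapy

all_urls = []

class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = [
        'https://www.wireshark.org/docs/dfref/i/ip.html',
    ]

    def parse(self, response):
        for r in response.css('a'):
            url = r.css('::attr(href)').get()
            if url is not None:        # .get() returns None for <a> tags without an href
                all_urls.append(url)   # append each URL exactly once
        print(all_urls)                # print the finished list after the loop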
Edit:
I have now removed the social-media links and other noise from the list, and I want to go on and scrape each remaining link. Please have a look at whether my approach is sound.
import requests
import scrapy

all_urls = []
remove = ["twitter"]

class QuotesSpider(scrapy.Spider):
    name = 'quotess'
    start_urls = [
        'https://www.wireshark.org/docs/dfref/i/ip.html',
    ]

    def parse(self, response):
        for url in response.css('a::attr(href)').getall():
            if url.startswith("http"):
                if 'twitter' not in url:
                    all_urls.append(url)
        print(all_urls)
        print(all_urls[0])
        for i in all_urls:
            response = requests.get(i)
            # print(response)
            yield response
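One caveat about this approach: a Scrapy callback may only yield items, dicts, or scrapy.Request objects, so yielding the requests.Response returned by requests.get() will raise an error, and the blocking calls sidestep Scrapy's scheduler, throttling, and retries anyway. A minimal sketch of the idiomatic alternative, handing each surviving link back to Scrapy (the parse_link callback name is invented for illustration):

import scrapy

class QuotesSpider(scrapy.Spider):
    name = 'quotess'
    start_urls = [
        'https://www.wireshark.org/docs/dfref/i/ip.html',
    ]

    def parse(self, response):
        for url in response.css('a::attr(href)').getall():
            if url.startswith("http") and 'twitter' not in url:
                # let Scrapy's scheduler fetch the page instead of calling requests.get()
                yield scrapy.Request(url, callback=self.parse_link)

    def parse_link(self, response):
        # hypothetical follow-up callback: just record which page was fetched
        yield {'url': response.url, 'status': response.status}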
1 Answer
A simpler way to get all of the URLs on a page is to chain the CSS selectors and call getall(). For example:
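Something along these lines, sketched with the question's spider:

import scrapy

class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = [
        'https://www.wireshark.org/docs/dfref/i/ip.html',
    ]

    def parse(self, response):
        # chaining the selectors and calling getall() returns every href as a list of strings
        urls = response.css('a::attr(href)').getall()
        print(urls)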
Output: a flat Python list of every href string found on the page.