I am trying to use Scrapy to crawl all the URLs of a website. However, some pages on the site use infinite scroll, so the scraped data is incomplete. The code I am using is:
import re

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from w3lib.url import url_query_cleaner


def process_links(links):
    for link in links:
        link.url = url_query_cleaner(link.url)
        yield link


class myCrawler(CrawlSpider):
    name = 'symphony'
    allowed_domains = ['theshell.org']
    start_urls = ['https://www.theshell.org/']
    base_url = 'https://www.theshell.org/'

    custom_settings = {
        # in order to reduce the risk of getting blocked
        'DOWNLOADER_MIDDLEWARES': {'sitescrapper.middlewares.RotateUserAgentMiddleware': 400, },
        'COOKIES_ENABLED': False,
        'CONCURRENT_REQUESTS': 6,
        'DOWNLOAD_DELAY': 1,
        # Duplicates pipeline
        'ITEM_PIPELINES': {'sitescrapper.pipelines.DuplicatesPipeline': 300},
        # In order to create a CSV file:
        'FEEDS': {'csv_file.csv': {'format': 'csv'}},
    }

    rules = (
        Rule(
            LinkExtractor(
                allow_domains='theshell.org',
                deny=[
                    r'calendar',
                ],
            ),
            process_links=process_links,
            callback='parse_item',
            follow=True,
        ),
    )

    def parse_item(self, response):
        yield {
            'url': response.url,
            'html_data': response.text,
        }
These pages use an infinite scroll mechanism. How can I detect such infinite-scroll pages and crawl them with Scrapy?
1 Answer
Infinite scroll / "load more" content is usually fetched through AJAX requests, so you can scrape the underlying API URL directly. Here I use the API URL with Scrapy's default spider template instead of CrawlSpider.
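Below is a minimal sketch of that approach, assuming the load-more widget calls a JSON endpoint with a page-style query parameter. The endpoint URL and the 'items'/'url'/'content' field names are placeholders for illustration, not the site's real API; copy the actual request URL and inspect its response shape in the browser's Network tab (filter on XHR/Fetch while scrolling the page).

import scrapy


class SymphonyApiSpider(scrapy.Spider):
    """Plain scrapy.Spider that pages through the AJAX endpoint
    instead of following rendered links with a CrawlSpider."""
    name = 'symphony_api'
    allowed_domains = ['theshell.org']

    # Placeholder endpoint: replace with the real request URL captured
    # from the browser's Network tab while triggering the infinite scroll.
    api_url = 'https://www.theshell.org/example-api/events?page={page}'

    def start_requests(self):
        yield scrapy.Request(self.api_url.format(page=1),
                             callback=self.parse,
                             cb_kwargs={'page': 1})

    def parse(self, response, page):
        data = response.json()           # the endpoint is assumed to return JSON
        items = data.get('items', [])    # 'items' is a placeholder field name
        for item in items:
            yield {
                'url': item.get('url'),
                'html_data': item.get('content'),
            }

        # Request the next page until the API returns an empty batch.
        if items:
            next_page = page + 1
            yield scrapy.Request(self.api_url.format(page=next_page),
                                 callback=self.parse,
                                 cb_kwargs={'page': next_page})

The detection step happens in the browser rather than in Scrapy: trigger the infinite scroll with DevTools open and note which request returns the newly loaded content; paginating that request in the spider replaces the scrolling the browser would otherwise do.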