I am trying to scrape a large sample (100k+) of the books available at "https://www.goodreads.com/book/show/", but I keep getting blocked. So far I have tried implementing the following solutions in my code:
- Checking robots.txt to find out which pages/elements are off-limits
- Specifying a header, or several headers rotated at random
- Using multiple working proxies to avoid being blocked
- Setting a delay of up to 20 seconds between scraping iterations, with 10 concurrent threads
Below is a simplified version of the code, which still gets blocked when scraping only the book title and author, without using multiple concurrent threads (a sketch of the concurrent variant follows the code):
import requests
from lxml import html
import random
import time  # needed for time.sleep() below; missing from the original imports

proxies_list = ["http://89.71.193.86:8080", "http://178.77.206.21:59298", "http://79.106.37.70:48550",
                "http://41.190.128.82:47131", "http://159.224.109.140:38543", "http://94.28.90.214:37641",
                "http://46.10.241.140:53281", "http://82.147.120.30:56281", "http://41.215.32.86:55561"]
# pick one proxy at random for the whole run
proxies = {"http": random.choice(proxies_list)}

# real header
# headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36'}
# multiple headers
headers_list = ['Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.62 Safari/537.36',
                'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36',
                'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.71 Safari/537.36',
                'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/40.0.2214.38 Safari/537.36',
                'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.103 Safari/537.36',
                'Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.116 Safari/537.36',
                'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1623.0 Safari/537.36',
                'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.101 Safari/537.36']
# pick one user-agent at random for the whole run
headers = {"user-agent": random.choice(headers_list)}

first_url = 1
last_url = 10000  # Last book is 8,630,000
sleep_time = 20

for book_reference_number in range(first_url, last_url):
    try:
        goodreads_html = requests.get("https://www.goodreads.com/book/show/" + str(book_reference_number),
                                      timeout=5, headers=headers, proxies=proxies)
        doc = html.fromstring(goodreads_html.text)
        book_title = doc.xpath('//div[@id="topcol"]//h1[@id="bookTitle"]')[0].text.strip(", \t\n\r")
        try:
            author_name = doc.xpath('//div[@id="topcol"]//a[@class="authorName"]//span')[0].text.strip(", \t\n\r")
        except IndexError:  # some book pages have no author element
            author_name = ""
        time.sleep(sleep_time)
        print(str(book_reference_number), book_title, author_name)
    except Exception:
        print(str(book_reference_number) + " cannot be scraped.")
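For completeness, here is a minimal sketch of the concurrent variant described in the list above (per-request random user-agent and proxy, plus a 10-thread pool). It reuses headers_list, proxies_list, and the constants defined in the code above; scrape_one is an illustrative name, not part of the original script:

from concurrent.futures import ThreadPoolExecutor

def scrape_one(book_reference_number):
    # rotate the user-agent and proxy on every request instead of once per run
    headers = {"user-agent": random.choice(headers_list)}
    proxies = {"http": random.choice(proxies_list)}
    try:
        goodreads_html = requests.get("https://www.goodreads.com/book/show/" + str(book_reference_number),
                                      timeout=5, headers=headers, proxies=proxies)
        doc = html.fromstring(goodreads_html.text)
        book_title = doc.xpath('//div[@id="topcol"]//h1[@id="bookTitle"]')[0].text.strip(", \t\n\r")
        print(book_reference_number, book_title)
    except Exception:
        print(str(book_reference_number) + " cannot be scraped.")
    time.sleep(random.uniform(0, sleep_time))  # randomized delay of up to 20 seconds

with ThreadPoolExecutor(max_workers=10) as executor:
    executor.map(scrape_one, range(first_url, last_url))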
2 Answers

tquggr8v1#
If you really want to scrape a large database like this, then I would recommend Selenium: the chance of getting blocked is much lower, and it is stable. You do not need time.sleep() (a time delay, though you can add one to make it even more stable).
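A minimal sketch of the Selenium approach this answer suggests, assuming Chrome with a matching chromedriver is installed; the XPaths are carried over from the question and may need updating if Goodreads changes its markup:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException

driver = webdriver.Chrome()  # assumes chromedriver is on PATH

for book_reference_number in range(1, 10000):
    try:
        driver.get("https://www.goodreads.com/book/show/" + str(book_reference_number))
        # same XPaths as in the question, evaluated in a real browser session
        book_title = driver.find_element(By.XPATH, '//h1[@id="bookTitle"]').text.strip(", \t\n\r")
        try:
            author_name = driver.find_element(By.XPATH, '//a[@class="authorName"]//span').text.strip(", \t\n\r")
        except NoSuchElementException:
            author_name = ""
        print(book_reference_number, book_title, author_name)
    except Exception:
        print(str(book_reference_number) + " cannot be scraped.")

driver.quit()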
h6my8fg22#
What proxies are these? For a task of this scale, free proxies are dead on arrival, and if the site is at all vigilant I have not had much luck with datacenter proxies either. You may want to try residential proxies instead: they do a much better job of hiding/bypassing the device fingerprinting that can expose your script as a bot. Also consider setting up retries in case a request fails.
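A minimal sketch of the retry suggestion, using requests' built-in urllib3 Retry support; the residential proxy URL is a placeholder, not a real endpoint:

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()
retries = Retry(total=5,                                     # up to 5 retries per request
                backoff_factor=2,                            # exponential backoff between attempts
                status_forcelist=[403, 429, 500, 502, 503])  # retry on blocks and server errors
session.mount("https://", HTTPAdapter(max_retries=retries))
# placeholder residential proxy; substitute your provider's endpoint
session.proxies = {"https": "http://user:pass@residential.proxy.example:8080"}

response = session.get("https://www.goodreads.com/book/show/1", timeout=5)

The backoff_factor spaces the retries out exponentially, which also plays nicer with rate limiting than immediately hammering the same URL again.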