Until three days ago I was able to scrape the **target site**. Then it started returning the error posted below. When I look at the site's source code I can't see any change, and the response still comes back as a 200 in Scrapy. I'm using proxies and user agents, and I've rotated both, but the result is the same: I keep getting a JSON decode error.
Error:

```
File "/usr/lib/python3.8/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
```
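For reference, this is exactly the error `json.loads` raises when it is handed an empty or non-JSON string, which is a hint that whatever was extracted from the page before parsing was empty:

```python
import json

json.loads('')  # json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
```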
My code:
```python
import scrapy
import json
import datetime
import bs4
import re
import time
from requests.models import PreparedRequest
import logging
from hepsibura_spider.items import HepsiburaSpiderItem
from scrapy.crawler import CrawlerProcess


class HepsiburaSpider(scrapy.Spider):
    name = 'hepsibura'
    # allowed_domains = ['www.hepsibura.com']
    handle_httpstatus_list = [301]

    def start_requests(self):
        urls = [
            'https://www.hepsiburada.com/monitor-bilgisayarlar-c-116465?filtreler=satici:Hepsiburada;?_random_number={rn}#tabIndex=0',
        ]
        for url in urls:
            params = []
            # added a meta to provide the used url here
            main_url, parameters = url.split('&') if '&' in url else url, None
            parameters = parameters.split(':') if parameters else []
            for parameter in parameters:
                key, value = parameter.split('=')
                params.append((key.strip(), value.strip()))
            # params.append(('main_url', main_url))
            if 'sayfa' not in dict(params):
                params.append(('sayfa', '1'))
            yield scrapy.Request(
                url=url.format(rn=time.time()),
                callback=self.parse_json,
                meta={
                    'main_url': main_url,
                    'params': dict(params),
                },
                headers={
                    'Cache-Control': 'store, no-cache, must-revalidate, post-check=0, pre-check=0',
                    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 11_10) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.5134.152 Safari/537.36',
                }
            )

    def parse_json(self, response):
        if response.status == 301:
            logging.log(logging.INFO, 'Finished scraping')
            return
        current_url = response.request.url.split('&')[0].strip()
        parameters = response.meta.get('params')
        soup = bs4.BeautifulSoup(response.text, 'lxml')
        scripts = soup.select('script')
        data_script = ''
        for script in scripts:
            # print(script.text)
            if 'window.MORIA.PRODUCTLIST = {' in str(script):
                print('Found the data')
                data_script = str(script)
                break
        data_script = data_script.replace('<script type="text/javascript">', '').replace('window.MORIA = window.MORIA || {};', '').replace('window.MORIA.PRODUCTLIST = {', '').replace('\'STATE\': ', '').replace('</script>', '')[:-4]
        json_data = json.loads(data_script)
        products = json_data['data']['products']
        for product in products:
            item = HepsiburaSpiderItem()
            item['rowid'] = hash(str(datetime.datetime.now()) + str(product['productId']))
            item['date'] = str(datetime.datetime.now())
            item['listing_id'] = product['variantList'][0]["listing"]["listingId"]
            item['product_id'] = product['variantList'][0]["sku"].lower()
            item['product_name'] = product['variantList'][0]['name']
            item['price'] = float(product['variantList'][0]['listing']['priceInfo']['price'])
            item['url'] = 'https://www.hepsiburada.com' + product['variantList'][0]["url"]
            item['merchantName'] = product['variantList'][0]["listing"]["merchantName"].lower()
            yield item
        parameters['sayfa'] = int(parameters['sayfa']) + 1
        req = PreparedRequest()
        req.prepare_url(current_url, parameters)
        yield scrapy.Request(
            url=req.url,
            callback=self.parse_json,
            meta={
                'params': parameters,
            },
            headers={
                'Cache-Control': 'store, no-cache, must-revalidate, post-check=0, pre-check=0',
                'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 11_10) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.5134.152 Safari/537.36',
            }
        )


if __name__ == '__main__':
    process = CrawlerProcess()
    process.crawl(HepsiburaSpider)
    process.start()
```
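For context, the `PreparedRequest` used for pagination above does nothing scraping-specific; it just appends query parameters to a URL:

```python
from requests.models import PreparedRequest

req = PreparedRequest()
req.prepare_url('https://www.hepsiburada.com/monitor-bilgisayarlar-c-116465', {'sayfa': 2})
print(req.url)
# https://www.hepsiburada.com/monitor-bilgisayarlar-c-116465?sayfa=2
```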
I noticed something: the site changed their JSON format. Every request now generates a unique id:

```js
window.MORIA.PRODUCTLIST = Object.assign(window.MORIA.PRODUCTLIST || {}, {
    '60cada8e-57dd-466e-f7af-62efca4fa8a8': {
```

How can I work around this?
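(Since the top-level key is now a per-request UUID, any fix has to look it up dynamically rather than hard-code it. A minimal sketch, using a made-up blob that only mimics the shape shown above:)

```python
import json

# Hypothetical blob mimicking the new format: the only top-level key
# is a UUID that changes on every request.
blob = '{"60cada8e-57dd-466e-f7af-62efca4fa8a8": {"data": {"products": []}}}'

state = next(iter(json.loads(blob).values()))  # skip the unpredictable key
print(state['data'])  # -> {'products': []}
```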
Thanks!
1 Answer
There's really no need to use BeautifulSoup together with Scrapy. The problem is that `data_script` ends up empty. Get rid of the loop: use XPath to select the `script` tag that contains that text, then use `re_first()` to pull out the JSON string. You may also want to check that `data` is not empty before using it later; see the sketch below.
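A minimal sketch of that suggestion as a drop-in replacement for `parse_json`. The regex, and the assumption that the object under the random UUID key has the same `data`/`products` layout the original code read, are guesses about the page, not the answerer's verbatim code:

```python
import json
import re

def parse_json(self, response):
    # XPath picks the one <script> that assigns window.MORIA.PRODUCTLIST.
    script = response.xpath('//script[contains(., "window.MORIA.PRODUCTLIST")]')
    # re_first() returns the first capture group, or None if nothing matched.
    # The pattern skips the Object.assign wrapper and the single-quoted random
    # UUID key, and captures the object assigned under it (layout assumed from
    # the snippet in the question; adjust if an inner 'STATE' key remains).
    raw = script.re_first(re.compile(
        r"Object\.assign\(window\.MORIA\.PRODUCTLIST \|\| \{\},\s*\{\s*'[^']*':\s*(\{.*\})\s*\}\s*\);",
        re.DOTALL,
    ))
    if not raw:
        self.logger.warning('PRODUCTLIST script not found on %s', response.url)
        return
    state = json.loads(raw)
    data = state.get('data')
    if not data:  # check that data is not empty before using it
        return
    for product in data.get('products', []):
        yield product  # build HepsiburaSpiderItem here as in the original spider
```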