Scrapy Crawled (referer: None)

ldioqlga  posted on 2023-01-26  in Other

I am new to Scrapy and Python. I am scraping data from Aliexpress.com with the Playwright method, and it returns (referer: None). Here is my code:

import scrapy
from scrapy_playwright.page import PageMethod


class AliSpider(scrapy.Spider):
    name = "aliex"

    def start_requests(self):
        # GET request
        search_value = 'phones'
        yield scrapy.Request(
            f"https://www.aliexpress.com/premium/{search_value}.html?spm=a2g0o.productlist.1000002.0&initiative_id=SB_20230118063054&dida=y",
            meta=dict(
                playwright=True,
                playwright_include_page=True,
                playwright_page_coroutines=[
                    PageMethod('wait_for_selector', '.list--gallery--34TropR')
                ]
            ))

    async def parse(self, response):
        for data in response.xpath("//h1"):
            related_link = data.xpath(".//text()").get()
            yield {
                'related_link': related_link
            }

I am getting:

2023-01-18 19:56:55 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.aliexpress.com/wholesale?SearchText=phones&spm=a2g0o.productlist.1000002.0&initiative_id=SB_20230118063054&dida=y> (referer: None)
2023-01-18 19:56:55 [scrapy.core.engine] INFO: Closing spider (finished)

I tried both XPath and CSS selectors, but the result is the same. Can anyone help me?

Answer #1 (qf9go6mv)

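Here is a complete solution using standalone Playwright with Python, which works fine on Windows. The website loads its data dynamically via JavaScript, which is why I use the page.evaluate() method to execute JavaScript and scroll through the whole page; otherwise, it will not scrape the complete result set.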
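The scrolling logic can be tried without a browser. The following is only a minimal sketch: FakePage and scroll_to_bottom are hypothetical names, not part of Playwright, and FakePage merely assumes that the scroll position is clamped to the document height minus the viewport height, as window.scrollY is in a browser.

```python
class FakePage:
    """Hypothetical stand-in for a browser page (not a Playwright class)."""

    def __init__(self, document_height, viewport_height):
        self.document_height = document_height
        self.viewport_height = viewport_height
        self.scroll_y = 0

    def scroll_by_viewport(self):
        """Mimic window.scrollBy(0, window.innerHeight) and return window.scrollY."""
        # A browser never scrolls past document height minus viewport height.
        max_scroll = max(0, self.document_height - self.viewport_height)
        self.scroll_y = min(self.scroll_y + self.viewport_height, max_scroll)
        return self.scroll_y


def scroll_to_bottom(page, max_steps=1000):
    """Scroll one viewport at a time until the position stops advancing."""
    positions = []
    previous = -1
    for _ in range(max_steps):
        current = page.scroll_by_viewport()
        if current == previous:
            break  # no movement since the last step: bottom reached
        positions.append(current)
        previous = current
    return positions


if __name__ == "__main__":
    page = FakePage(document_height=2500, viewport_height=800)
    print(scroll_to_bottom(page))  # [800, 1600, 1700]
```

In this sketch the final position (1700) stays below the document height (2500), so on a page that does not grow, a loop that only compares the scroll position with the full document height would not terminate; checking for a stalled position avoids that.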

Script:
from playwright.sync_api import sync_playwright
from bs4 import BeautifulSoup
import pandas as pd
import time

data = []
with sync_playwright() as p:
    browser = p.chromium.launch(headless=False)
    context = browser.new_context()
    page = context.new_page()
    search_value = 'phones'
    for page_num in range(1, 4):
        page.goto(f"https://www.aliexpress.com/wholesale?SearchText={search_value}&catId=0&dida=y&g=y&initiative_id=SB_20230118063054&page={page_num}&spm=a2g0o.productlist.1000002.0&trafficChannel=main")
        # Wait until the first product cards have been rendered
        page.wait_for_selector('[class="manhattan--content--1KpBbUi"]', timeout=30000)
        # Total document height at load time
        scroll_height = page.evaluate("""() => {
                                return Math.max(
                                  document.body.scrollHeight, document.documentElement.scrollHeight,
                                  document.body.offsetHeight, document.documentElement.offsetHeight,
                                  document.body.clientHeight, document.documentElement.clientHeight
                                );
                            }""")
        # Scroll one viewport at a time so lazily loaded cards are rendered
        current_height = 0
        while current_height < scroll_height:
            new_height = page.evaluate("""() => {
                                window.scrollBy(0, window.innerHeight);
                                return window.scrollY;
                            }""")
            if new_height == current_height:
                break  # position no longer advances: bottom reached
            current_height = new_height
            time.sleep(2)
        soup = BeautifulSoup(page.content(), 'lxml')
        for card in soup.select('[class="manhattan--content--1KpBbUi"]'):
            if card.h1 is None:
                continue  # skip cards without a title element
            title = card.h1.text
            data.append({'title': title})
    browser.close()

df = pd.DataFrame(data)
print(df)
Output:
title
0    Unlock Samsung Galaxy S10 S10+ s10e G970U G973...
1    SERVO K07 Plus mini Mobile Phone Pen Dual SIM ...
2    BLACKVIEW OSCAL C80 Smartphone 6.5" Waterdrop ...
3    Original Apple iPhone 7 Unlocked 99% New Mobil...
4    [World Premiere] Blackview BV9200 Rugged Smart...
..                                                 ...
175  Motorola StarTAC Rainbow 500mAh Fashion 90% Ne...
176  Original International Version HuaWei P30 Pro ...
177  Unlocked Original Apple iPhone SE Dual Core 2G...
178  2022 Unihertz TANK Large Battery Rugged Smartp...
179  75W Car Wireless Charger Car Mount Phone Holde...

[180 rows x 1 columns]
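
The script requests result pages 1 to 3 by changing the page query parameter in the URL. As a rough sketch of how those URLs could be assembled from a search term, the snippet below uses urllib.parse.urlencode from the standard library. build_search_url is a hypothetical helper introduced only for illustration; the parameter names and values are copied from the URL in the script, and whether the site still accepts them is an assumption.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.aliexpress.com/wholesale"


def build_search_url(search_text, page_num):
    """Build a search URL for one result page (hypothetical helper)."""
    # Query parameters copied from the URL used in the script.
    params = {
        "SearchText": search_text,
        "catId": 0,
        "dida": "y",
        "g": "y",
        "initiative_id": "SB_20230118063054",
        "page": page_num,
        "spm": "a2g0o.productlist.1000002.0",
        "trafficChannel": "main",
    }
    # urlencode escapes the values, so search terms with spaces are safe.
    return f"{BASE_URL}?{urlencode(params)}"


if __name__ == "__main__":
    for page_num in range(1, 4):
        print(build_search_url("phones", page_num))
```

Each returned string can be passed to page.goto() in place of the hard-coded f-string, which also makes the search term easy to change.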
