Scrapy / Python: scraping website URLs and article numbers

vngu2lb8, asked on 2022-11-09 in Python

Actually, I want to scrape all of the child product links of this website, along with the child products.
The website I am scraping is: https://lappkorea.lappgroup.com/
My working code is:

from selenium import webdriver
from lxml import html

driver = webdriver.Chrome('./chromedriver')
driver.get('https://lappkorea.lappgroup.com/product-introduction/online-catalogue/power-and-control-cables/various-applications/pvc-outer-sheath-and-coloured-cores/oelflex-classic-100-300500-v.html')

# the divs inside the article table that carry the pop-over data
elems = driver.find_elements_by_xpath('//table[contains(@class, "setuArticles") and not(@data-search)]//td/div[@data-content]')

urls = []

content = driver.page_source
tree = html.fromstring(content)

all_links = tree.xpath('.//a/@href')

first_link = all_links[0]

writer = open('zain details.csv', 'w')   # the output file was never opened in the original snippet

for elem in elems:
    print(elem.text)
    urls.append(elem.get_attribute("href"))

for elem in elems:
    writer.write(f"{elem.get_attribute('href')}, {elem.text}\n")

writer.close()

driver.quit()

This is the data I want to scrape from the whole website:
[screenshot of the desired data omitted]
When we go to any product (one product's link is given in the code above), scroll down, and click on any article number, a pop-up data sheet appears; clicking it opens the PDF.
I just want the article numbers together with their PDF links.
I have a CSV of all the parent links that I scraped, because the script above only takes a single link, i.e. "https://lappkorea.lappgroup.com/product-introduction/online-catalogue/power-and-control-cables/various-applications/pvc-outer-sheath-and-coloured-cores/oelflex-classic-100-300500-v.html". I want to take all of the links from that CSV file and, as done above, scrape every product's article numbers and child product links, then save them in one CSV file split into separate columns: one column for the article numbers and one for the child product links, as sketched below.
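
To make the "separate columns" part concrete, here is a minimal sketch of that CSV round-trip using the csv module, assuming the parent link sits in the first column of products.csv; scrape_articles() is a hypothetical placeholder for whichever scraping code ends up being used:

import csv

def scrape_articles(parent_url):
    # Hypothetical placeholder: plug the actual scraping logic in here.
    # It should yield (article_number, child_product_link) tuples for one parent page.
    return []

with open("products.csv", newline="") as src, open("zain details.csv", "w", newline="") as dst:
    reader = csv.reader(src)
    writer = csv.writer(dst)
    writer.writerow(["article number", "child product link"])   # one value per column
    for row in reader:
        if not row:
            continue
        parent_url = row[0]                                      # assumes the link is in the first column
        for number, link in scrape_articles(parent_url):
            writer.writerow([number, link])                      # csv.writer handles quoting

csv.writer keeps the two values in genuinely separate columns even if a link happens to contain a comma, which a plain comma-joined f-string does not.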

import requests
from bs4 import BeautifulSoup

rows = open("products.csv", 'r').read().split('\n')
writer = open('zain details.csv', 'w')

for row in rows:
    cols = row.split(',')
    url = cols[0]

    response = requests.get(url)
    print(url)

    if response.status_code != 200:
        continue

    soup = BeautifulSoup(response.content, "html.parser")

    for article in soup.select('[id*="-article-"] [data-content]'):
        s = BeautifulSoup(article["data-content"], "html.parser")
        link = s.select_one("a:-soup-contains(Datasheet)")["href"]
        num = article.get_text(strip=True)
        print("{:<10} {}".format(num, link))

        record = f"{url}, {num}, {link}\n"   # parent link, article number, datasheet link
        writer.write(record)

writer.close()

Image of the CSV file: [screenshot omitted]

I get an error when running it on the server.

Please help me get this running on the server.
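
If the Selenium script is the one that fails on the server, one common cause is that the server has no display for Chrome to open. That is only an assumption about the error, but a headless setup sketch (keeping the chromedriver path from the original snippet) would look like this:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")              # no GUI available on a server
options.add_argument("--no-sandbox")            # often needed in containers / when running as root
options.add_argument("--disable-dev-shm-usage") # avoids /dev/shm size issues in containers
driver = webdriver.Chrome('./chromedriver', options=options)
driver.get('https://lappkorea.lappgroup.com/')
print(driver.title)                             # quick sanity check that the page loaded
driver.quit()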

wpx232ag (Answer #1)

Here is a Scrapy spider that does what you are looking for.
Steps to reproduce:
1. Install Scrapy: pip install scrapy
2. Start a project: scrapy startproject lappkorea
3. cd lappkorea
4. Open a new file in ./lappkorea/spiders and copy and paste the code below
5. Run scrapy crawl lappkorea -o "zain details.csv" <- this is your output file
import scrapy
import lxml.html as lhtml

class LappkoreaSpider(scrapy.Spider):
    name = 'lappkorea'
    allowed_domains = ['lappgroup.com']

    def start_requests(self):
        with open('products.csv') as file:   # <- url file
            for line in file:
                cols = line.split(',')
                url = cols[0]
                yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        # each article row embeds its pop-over HTML in a data-content attribute
        for row in response.xpath('//tr[@class="article"]'):
            div = row.xpath('.//div[contains(@class,"pointer jsLoadPopOver")]')
            idnum = div.xpath('./text()').get()
            html = div.xpath('./@data-content').get()
            tree = lhtml.fromstring(html)              # parse the embedded pop-over HTML
            link = tree.xpath("//ul/li/a/@href")[0]    # first link in the pop-over list
            yield {
                "id": idnum.strip(),
                "link": response.urljoin(link)         # make the relative PDF path absolute
            }

Update
If you want to run the spider as a script instead of from the command line, you can do it like this.

import scrapy
from scrapy.crawler import CrawlerProcess
import lxml.html as lhtml

class LappkoreaSpider(scrapy.Spider):
    name = 'lappkorea'
    allowed_domains = ['lappgroup.com']

    custom_settings = {
        'FEEDS': {
            'filename.csv': {   # <---- this will be the output file
                'format': 'csv',
                'encoding': 'utf8',
                'store_empty': False,
                'fields': ['id', 'link']
             }
         }
    }

    def start_requests(self):
        with open('products.csv') as file:   # <- url file
            for line in file:
                cols = line.split(',')
                url = cols[0]
                yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        for row in response.xpath('//tr[@class="article"]'):
            div = row.xpath('.//div[contains(@class,"pointer jsLoadPopOver")]')
            idnum = div.xpath('./text()').get()
            html = div.xpath('./@data-content').get()
            tree = lhtml.fromstring(html)
            link = tree.xpath("//ul/li/a/@href")[0]
            yield {
                "id": idnum.strip(),
                "link": response.urljoin(link)
            }

process = CrawlerProcess()
process.crawl(LappkoreaSpider)
process.start()

s1ag04yj (Answer #2)

Try this:

import requests
from bs4 import BeautifulSoup

url = "https://lappkorea.lappgroup.com/product-introduction/online-catalogue/power-and-control-cables/various-applications/pvc-outer-sheath-and-coloured-cores/oelflex-classic-100-300500-v.html"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

for article in soup.select('[id*="-article-"] [data-content]'):
    s = BeautifulSoup(article["data-content"], "html.parser")
    link = s.select_one("a:-soup-contains(Datasheet)")["href"]
    num = article.get_text(strip=True)
    print("{:<10} {}".format(num, link))

Prints:

...

1120824    /fileadmin/documents/technische_doku/datenblaetter/oelflex/DB00100004EN.pdf
1120825    /fileadmin/documents/technische_doku/datenblaetter/oelflex/DB00100004EN.pdf
1120826    /fileadmin/documents/technische_doku/datenblaetter/oelflex/DB00100004EN.pdf
1120827    /fileadmin/documents/technische_doku/datenblaetter/oelflex/DB00100004EN.pdf
1120828    /fileadmin/documents/technische_doku/datenblaetter/oelflex/DB00100004EN.pdf
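
The links printed above are relative to the site root. A short follow-up sketch (same selectors as above, plus urllib.parse.urljoin) that turns them into absolute URLs and writes the two columns to a CSV; the output file name here is just an example:

import csv
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

url = "https://lappkorea.lappgroup.com/product-introduction/online-catalogue/power-and-control-cables/various-applications/pvc-outer-sheath-and-coloured-cores/oelflex-classic-100-300500-v.html"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

with open("articles.csv", "w", newline="") as f:              # example output file name
    writer = csv.writer(f)
    writer.writerow(["article number", "datasheet link"])
    for article in soup.select('[id*="-article-"] [data-content]'):
        popup = BeautifulSoup(article["data-content"], "html.parser")
        link = popup.select_one("a:-soup-contains(Datasheet)")["href"]
        writer.writerow([article.get_text(strip=True), urljoin(url, link)])   # absolute PDF URL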
