Actually, I want to scrape all the child-product links from this website, together with the child products themselves.
The website I am scraping is: https://lappkorea.lappgroup.com/
My working code is:
from selenium import webdriver
from selenium.webdriver.common.by import By
from lxml import html

driver = webdriver.Chrome('./chromedriver')
driver.get('https://lappkorea.lappgroup.com/product-introduction/online-catalogue/power-and-control-cables/various-applications/pvc-outer-sheath-and-coloured-cores/oelflex-classic-100-300500-v.html')

# Each article cell in the "setuArticles" table carries its popup HTML
# in a data-content attribute.
elems = driver.find_elements(By.XPATH, '//table[contains(@class, "setuArticles") and not(@data-search)]//td/div[@data-content]')

urls = []
content = driver.page_source
tree = html.fromstring(content)
all_links = tree.xpath('.//a/@href')
first_link = all_links[0]

writer = open('output.csv', 'w')  # the original snippet used `writer` without opening a file
for elem in elems:
    print(elem.text)
    urls.append(elem.get_attribute("href"))
for elem in elems:
    writer.write(f"{elem.get_attribute('href')}, {elem.text}\n")
writer.close()
driver.quit()
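For reference, the same driver can be started headless, which is usually required on a server without a display. A minimal sketch, assuming the chromedriver binary sits next to the script as above:

from selenium import webdriver

# Headless setup, typically needed on servers without a display.
options = webdriver.ChromeOptions()
options.add_argument('--headless')
options.add_argument('--no-sandbox')             # commonly required in containers
options.add_argument('--disable-dev-shm-usage')  # avoids /dev/shm size issues
driver = webdriver.Chrome('./chromedriver', options=options)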
This is the data I want to scrape from the whole website:
[image]
When we open any product (the link for one product is given in the code above), scroll down, and click any article number, a datasheet popup appears; clicking it opens the PDF.
I just want the article numbers together with their PDF links.
I have a CSV of all the parent links that I scraped, since the script above takes a single link, i.e. "https://lappkorea.lappgroup.com/product-introduction/online-catalogue/power-and-control-cables/various-applications/pvc-outer-sheath-and-coloured-cores/oelflex-classic-100-300500-v.html". I want to take all the links from that CSV file and, just as above, scrape every product's article numbers and child-product links, then save them to one CSV file split into separate columns: one column for the article numbers and one for the child-product links.
import requests
from bs4 import BeautifulSoup

rows = open("products.csv", 'r').read().split('\n')
writer = open('zain details.csv', 'w')

for row in rows:
    if not row.strip():
        continue  # skip blank lines in the CSV
    cols = row.split(',')
    url = cols[0]
    print(url)
    response = requests.get(url)
    if response.status_code == 200:
        soup = BeautifulSoup(response.content, "html.parser")
        # Each article cell carries its popup table in a data-content attribute.
        for article in soup.select('[id*="-article-"] [data-content]'):
            s = BeautifulSoup(article["data-content"], "html.parser")
            link = s.select_one("a:-soup-contains(Datasheet)")["href"]
            num = article.get_text(strip=True)
            print("{:<10} {}".format(num, link))
            # one column for the article number, one for the child-product link
            writer.write(f"{url}, {num}, {link}\n")
writer.close()
Image of the CSV file:
[image]
I get an error when running this on the server.
Please help me get it running on the server.
2 Answers
wpx232ag #1
Here is a Scrapy spider that does what you want.
Steps to reproduce:
1. Install Scrapy:
pip install scrapy
2. Start a project:
scrapy startproject lappkorea
cd lappkorea
3. Open a new file in ./lappkorea/spiders and copy and paste the spider code below.
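This is a minimal sketch rather than a verbatim spider: it assumes the products.csv from the question holds one parent product URL per row, and it reuses the setuArticles table and the data-content / "Datasheet" selectors from the question's snippets.

import csv
import scrapy

class LappkoreaSpider(scrapy.Spider):
    name = "lappkorea"

    def start_requests(self):
        # Read the parent product links from the CSV mentioned in the question.
        with open("products.csv", newline="") as f:
            for row in csv.reader(f):
                if row and row[0].startswith("http"):
                    yield scrapy.Request(row[0], callback=self.parse)

    def parse(self, response):
        # Each article cell carries its popup table in a data-content
        # attribute; the datasheet PDF link sits inside that HTML fragment.
        for div in response.css('table.setuArticles td div[data-content]'):
            popup = scrapy.Selector(text=div.attrib["data-content"])
            link = popup.xpath('//a[contains(., "Datasheet")]/@href').get()
            yield {
                "article": div.xpath("normalize-space(.)").get(),
                "datasheet": response.urljoin(link) if link else None,
            }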
4. Run the spider; "zain details.csv" is your output file:
scrapy crawl lappkorea -o "zain details.csv"

Update:
If you want to run the spider as a script instead of from the command line, you can do it like this:
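A minimal sketch of that, assuming the spider class sketched above (the import path is hypothetical and depends on the file name chosen in step 3):

from scrapy.crawler import CrawlerProcess
from lappkorea.spiders.lappkorea_spider import LappkoreaSpider  # hypothetical module path

process = CrawlerProcess(settings={
    # Same effect as the -o flag on the command line.
    "FEEDS": {"zain details.csv": {"format": "csv"}},
})
process.crawl(LappkoreaSpider)
process.start()  # blocks until the crawl finishes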
s1ag04yj #2
Please try:
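A minimal sketch, assuming the same [id*="-article-"] and data-content markup that the question's own snippet targets:

import requests
from bs4 import BeautifulSoup

url = "https://lappkorea.lappgroup.com/product-introduction/online-catalogue/power-and-control-cables/various-applications/pvc-outer-sheath-and-coloured-cores/oelflex-classic-100-300500-v.html"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

for article in soup.select('[id*="-article-"] [data-content]'):
    # The popup table is stored as an HTML fragment in data-content.
    popup = BeautifulSoup(article["data-content"], "html.parser")
    a = popup.select_one("a:-soup-contains(Datasheet)")
    if a:
        print("{:<10} {}".format(article.get_text(strip=True), a["href"]))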
This prints each article number followed by its datasheet PDF link.