pandas: why doesn't my code cover all of the website's pages (web scraping)?

r7xajy2e · asked 11 months ago · 3 answers · 124 views

I want to scrape the "search: pc" part of a website called Jumia. I want to go through all the pages, but unfortunately it doesn't work, and I don't know why the file gets overwritten even though the write is outside the loop. Using:

with pd.ExcelWriter("output.xlsx", engine="openpyxl", mode="a", if_sheet_exists="replace") as writer:
    pop.to_excel(writer, sheet_name="sheet1")

instead of:

with open("output.xlsx", "a"):
    with pd.ExcelWriter("output.xlsx") as writer:
        pop.to_excel(writer, sheet_name="sheet2")


raises this error:

File "c:\Users\hp\Desktop\python_projects\test3.py", line 40, in <module>
    find_computers()
  File "c:\Users\hp\Desktop\python_projects\test3.py", line 33, in find_computers
    with pd.ExcelWriter("output.xlsx", engine="openpyxl", mode="a", if_sheet_exists="replace") as writer:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\hp\AppData\Local\Programs\Python\Python312\Lib\site-packages\pandas\io\excel\_openpyxl.py", line 61, in __init__
    super().__init__(
  File "C:\Users\hp\AppData\Local\Programs\Python\Python312\Lib\site-packages\pandas\io\excel\_base.py", line 1263, in __init__  
    self._handles = get_handle(
                    ^^^^^^^^^^^
  File "C:\Users\hp\AppData\Local\Programs\Python\Python312\Lib\site-packages\pandas\io\common.py", line 872, in get_handle
    handle = open(handle, ioargs.mode)
             ^^^^^^^^^^^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: 'output.xlsx'


Here is my actual code:

import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
import requests
import time
import openpyxl
from bs4 import Tag

def find_computers():
  n=1
  while n<=50:
    html_text=requests.get(f"https://www.jumia.ma/catalog/?q=pc&page={n}#catalog-listing").text
    soup=BeautifulSoup(html_text,"lxml")
    computers=soup.find_all("a",class_="core")
    df={"price": [],"original price": [],"promo":[]}
    computer_name_list=[]
    for computer in computers:
        computer_name=computer.find("h3",class_="name").text.strip()
        price=computer.find("div",class_="prc").text.strip()
        original_price_element=computer.find("div",class_="old")
        original_price=original_price_element.text.strip() if isinstance(original_price_element, Tag) else "N/A"
        promo_element = computer.find("div", class_="bdg _dsct _sm")
        promo = promo_element.text.strip() if isinstance(promo_element, Tag) else "N/A"
        df["price"].append(price)
        df["original price"].append(original_price)
        df["promo"].append(promo)
        computer_name_list.append(computer_name)
    n+=1
  pop=pd.DataFrame(df,index=computer_name_list)
  pd.set_option('colheader_justify', 'center')
  with pd.ExcelWriter("output.xlsx") as writer:
      pop.to_excel(writer,sheet_name="sheet2")
  

if __name__=="__main__":
  while True:
        find_computers()
        time_s = 10
        time.sleep(6 * time_s)

lfapxunr1#

Works fine for me too. That aside, avoid creating empty DataFrames and appending to them in a loop; here is an example of how to store the results and create the DataFrame after collecting everything:

import pandas as pd
from bs4 import BeautifulSoup
import requests, time

def find_computers():
    n = 1
    data = []  # one dict per product; the DataFrame is built once at the end
    while n<=5:
        html_text=requests.get(f"https://www.jumia.ma/catalog/?q=pc&page={n}#catalog-listing").text
        soup=BeautifulSoup(html_text,"lxml")
        computers=soup.find_all("a",class_="core")

        for computer in computers:

            data.append({
                'name':computer.find("h3",class_="name").text.strip(),
                'price':computer.find("div",class_="prc").text.strip(),
                'original_price':computer.find("div",class_="old").text.strip() if computer.find("div",class_="old") else None,
                'promo_element':computer.find("div", class_="bdg _dsct _sm").text.strip() if computer.find("div", class_="bdg _dsct _sm") else None

            })
        n+=1

        # pause 60 seconds between page requests
        time_s = 10
        time.sleep(6 * time_s)

    return data

if __name__=="__main__":

    data = find_computers()
    with pd.ExcelWriter("output.xlsx") as writer:
        pd.DataFrame(data).to_excel(writer,sheet_name="sheet1")


ajsxfq5m2#

You need to move the line that initializes df outside the loop. See the example:

def find_computers():
    n = 1
    # initialize the accumulators once, before the while loop
    df = {"price": [], "original price": [], "promo": []}
    computer_name_list = []


jgovgodb3#

One of the problems is that you cannot append a df/sheet to an Excel file that does *not* exist.
You should either have a pre-created file or create one before starting the whole process:

from pathlib import Path
import pandas as pd

OUTPUT_FP = "output.xlsx"
# create an empty workbook first so that mode="a" has a file to append to
if not Path(OUTPUT_FP).exists():
    pd.DataFrame().to_excel(OUTPUT_FP)
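
Alternatively (a variant, not part of the original answer), the writer mode can be chosen per call, so the first write creates the workbook and later writes append to it:

from pathlib import Path
import pandas as pd

OUTPUT_FP = "output.xlsx"

def write_sheet(df, sheet_name):
    # hypothetical helper: create the workbook on the first call,
    # append (replacing a same-named sheet) on subsequent calls
    exists = Path(OUTPUT_FP).exists()
    with pd.ExcelWriter(
        OUTPUT_FP,
        engine="openpyxl",
        mode="a" if exists else "w",
        # if_sheet_exists is only valid together with mode="a"
        if_sheet_exists="replace" if exists else None,
    ) as writer:
        df.to_excel(writer, sheet_name=sheet_name)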

Here is a suggestion that saves the computers' info (*one sheet per page*) to the spreadsheet in real time:

import pandas as pd
from bs4 import BeautifulSoup
import requests, time

KW = "pc"

URL = "https://www.jumia.ma/catalog/?q={kw}&page={pn}#catalog-listing"

def find_computers(sleep):
    kw_found = True
    n = 1
    
    while kw_found:
        soup = BeautifulSoup(requests.get(URL.format(kw=KW, pn=n)).text, "lxml")
                
        data = []
        for c in soup.find_all("a", class_="core"):
            data.append(
                {
                    "name": c.find("h3", class_="name").get_text(strip=True),
                    "price": c.find("div", class_="prc").get_text(strip=True),
                    "original price": op.get_text(strip=True) if (op:= c.find("div", class_="old")) else "N/A",
                    "promo": pr.get_text(strip=True) if (pr:=c.find("div", class_="bdg _dsct _sm")) else "N/A"
                }
            )
    
        # append this page's products as their own sheet (replace it on re-runs)
        with pd.ExcelWriter(
            OUTPUT_FP, engine="openpyxl", mode="a", if_sheet_exists="replace"
        ) as writer:
            pd.DataFrame(data).set_index("name").to_excel(writer, sheet_name=f"Sheet{n}")
            
        # stop once Jumia reports that there are no results for the keyword
        kw_found = soup.find("h2").get_text(strip=True) != f'Aucun résultat pour "{KW}".'
        
        n+=1
        
        time.sleep(sleep)

if __name__ == "__main__":
    find_computers(sleep=2)


Preview (*output.xlsx, after scraping 2 pages*):
[screenshot of the resulting workbook omitted]
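
One caveat with the stop condition in the code above: soup.find("h2") returns None on a page with no h2 tag, which would raise an AttributeError. A more defensive check, assuming an empty product list also marks the last page (my assumption, not part of the original answer), could look like:

def has_results(soup, kw):
    # hypothetical helper: True while the results page still lists products
    if not soup.find_all("a", class_="core"):
        return False
    h2 = soup.find("h2")
    return h2 is None or h2.get_text(strip=True) != f'Aucun résultat pour "{kw}".'

The loop body would then set kw_found = has_results(soup, KW).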
