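I wrote code that scrapes car information (title, make, model, transmission, year, price) from ebay.com and saves it in MySQL. I want to avoid inserting a row when all of its items (title, make, model, etc.) match those of a row that is already stored. The insert should be skipped only when every item matches, because some rows share just a title, or just a model, and so on.
Code: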
import re

import mysql.connector
import requests
from bs4 import BeautifulSoup

conn = mysql.connector.connect(user='root', password='******',
                               host='127.0.0.1', database='web_scraping')
cursor = conn.cursor()

url = 'https://www.ebay.com/b/Cars-Trucks/6001?_fsrp=0&_sacat=6001&LH_BIN=1&LH_ItemCondition=3000%7C1000%7C2500&rt=nc&_stpos=95125&Model%2520Year=2020%7C2019%7C2018%7C2017%7C2016%7C2015'
res = requests.get(url)
soup = BeautifulSoup(res.text, 'html.parser')
ebay_cars = soup.find_all('li', class_='s-item')

sql = ('INSERT INTO car_info(Title, Maker, Model, Transmission, Year, Price) '
       'VALUES (%s, %s, %s, %s, %s, %s)')

for car_info in ebay_cars:
    # Locate the tags that hold each field of one listing.
    title_div = car_info.find('div', class_='s-item__wrapper clearfix')
    title_sub_div = title_div.find('div', class_='s-item__info clearfix')
    title_p = title_sub_div.find('span', class_='s-item__price')
    title_tag = title_sub_div.find('a', class_='s-item__link')
    title_maker = title_sub_div.find('span', class_='s-item__dynamic s-item__dynamicAttributes1')
    title_model = title_sub_div.find('span', class_='s-item__dynamic s-item__dynamicAttributes2')
    title_trans = title_sub_div.find('span', class_='s-item__dynamic s-item__dynamicAttributes3')

    # Strip the field labels and the four-digit year from the raw text.
    name_of_car = re.sub(r'\d{4}', '', title_tag.text)
    maker_of_car = re.sub(r'Make: ', '', title_maker.text)
    model_of_car = re.sub(r'Model: ', '', title_model.text)
    try:
        if title_trans.text.startswith('Transmission: '):
            trans_of_car = re.sub(r'Transmission: ', '', title_trans.text)
        else:
            trans_of_car = ''
    except AttributeError:
        # The third attribute tag is missing for this listing.
        trans_of_car = ''
    year_of_car = re.findall(r'\d{4}', title_tag.text)
    year_of_car = ''.join(str(x) for x in year_of_car)
    price_of_car = title_p.text
    print(name_of_car, trans_of_car)

    cursor.execute(sql, (name_of_car, maker_of_car, model_of_car, trans_of_car,
                         year_of_car, price_of_car))

# Commit once after all rows are queued, then release the connection.
conn.commit()
conn.close()
3 answers
7gcisfzg1#
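One option is to use not exists in the insert statement. It would be simpler, however, to create a unique key on all columns of the table.
This prevents any process from inserting duplicates into the table. You can then gracefully ignore the error that would otherwise be raised by using the on duplicate key syntax.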
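The statements this answer refers to are not preserved in the thread, so the following is only a sketch of what the unique-key approach might look like for the question's car_info table. The key name uq_car_info and the helper insert_car are hypothetical, and the sketch assumes the column types are short enough for MySQL's index key length limit.

```python
COLUMNS = ("Title", "Maker", "Model", "Transmission", "Year", "Price")

# Run once. A unique key over every column makes MySQL reject a row only
# when all six values match an existing row. uq_car_info is a hypothetical
# name; long text columns may need prefix lengths such as Title(100),
# because MySQL caps the total length of an index key.
ADD_UNIQUE_KEY = (
    "ALTER TABLE car_info ADD UNIQUE KEY uq_car_info ("
    + ", ".join(COLUMNS) + ")"
)

# With the key in place, a duplicate row runs the no-op update
# (Title = Title) instead of raising a duplicate-key error.
INSERT_OR_SKIP = (
    "INSERT INTO car_info (" + ", ".join(COLUMNS) + ") "
    "VALUES (" + ", ".join(["%s"] * len(COLUMNS)) + ") "
    "ON DUPLICATE KEY UPDATE Title = Title"
)


def insert_car(cursor, car):
    """Insert one car; car is a 6-tuple in COLUMNS order.

    cursor is any DB-API cursor that uses %s placeholders,
    such as the one returned by mysql.connector.
    """
    cursor.execute(INSERT_OR_SKIP, tuple(car))
```

In the question's loop, the cursor.execute(sql, ...) call could be swapped for insert_car(cursor, (name_of_car, maker_of_car, model_of_car, trans_of_car, year_of_car, price_of_car)) after ADD_UNIQUE_KEY has been executed once.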
6rvt4ljy2#
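You can save the result of a query that checks for an existing matching row into a variable.
Then, if the value of the variable is:
1, skip the insert, because the table already contains this entry;
0, perform the insert, because the table does not contain this entry.
This is just one way to do it.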
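The query this answer has in mind is not preserved in the thread. One way the check could look, as a sketch only, is a SELECT EXISTS(...) that compares all six columns and returns 1 or 0. The helper name insert_if_new is hypothetical; the table and column names are taken from the question.

```python
COLUMNS = ("Title", "Maker", "Model", "Transmission", "Year", "Price")

# Returns a single row holding 1 if an identical row exists, else 0.
ROW_EXISTS = (
    "SELECT EXISTS(SELECT 1 FROM car_info WHERE "
    + " AND ".join(f"{col} = %s" for col in COLUMNS) + ")"
)

INSERT_ROW = (
    "INSERT INTO car_info (" + ", ".join(COLUMNS) + ") "
    "VALUES (" + ", ".join(["%s"] * len(COLUMNS)) + ")"
)


def insert_if_new(cursor, car):
    """Insert car (a 6-tuple in COLUMNS order) unless an identical row exists.

    Returns True when a row was inserted and False when it was skipped.
    cursor is any DB-API cursor that uses %s placeholders.
    """
    car = tuple(car)
    cursor.execute(ROW_EXISTS, car)
    (already_there,) = cursor.fetchone()
    if already_there:  # 1: the table already has this entry
        return False
    cursor.execute(INSERT_ROW, car)  # 0: no such entry yet
    return True
```

The check and the insert are two separate statements, so two processes that run at the same time could both see 0 and both insert. For a single scraper run that is usually acceptable; a unique key closes that gap.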
x0fgdtte3#
Declare all of the table's columns together as the primary key. See: https://www.mysqltutorial.org/mysql-primary-key/
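As a rough sketch of this suggestion, the composite primary key could be combined with INSERT IGNORE so that duplicates are skipped quietly. This assumes car_info has no primary key yet and that every column is NOT NULL and short enough to be indexed; the helper name insert_unless_duplicate is hypothetical.

```python
COLUMNS = ("Title", "Maker", "Model", "Transmission", "Year", "Price")

# Run once. Primary key columns must be NOT NULL, and MySQL caps the total
# length of an index key, so this assumes suitably short column types.
ADD_PRIMARY_KEY = (
    "ALTER TABLE car_info ADD PRIMARY KEY (" + ", ".join(COLUMNS) + ")"
)

# INSERT IGNORE turns the duplicate-key error into a warning and skips the
# row. It also downgrades some other errors, so use it with care.
INSERT_IGNORE = (
    "INSERT IGNORE INTO car_info (" + ", ".join(COLUMNS) + ") "
    "VALUES (" + ", ".join(["%s"] * len(COLUMNS)) + ")"
)


def insert_unless_duplicate(cursor, car):
    """Insert car (a 6-tuple in COLUMNS order); return True if a row was added.

    cursor is any DB-API cursor that uses %s placeholders and reports the
    number of affected rows through cursor.rowcount.
    """
    cursor.execute(INSERT_IGNORE, tuple(car))
    return cursor.rowcount == 1
```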