Scrapy: combining the output of several websites into a single dict

798qvoo8 · published 2022-11-09 · in: Other

I'm scraping several websites with Scrapy, and my output creates a list of dicts (one per website). I want the output to be a single dict instead. I tried using `meta`, but I don't understand it well enough to get it to work.
Here is my code:

class TransferSpider(scrapy.Spider):     
    # name of the spider
    name = 'transfers'
    # list of URLs to scrape
    start_urls = ['https://www.transfermarkt.es/transfers/transfertagedetail/statistik/top/land_id_zu/0/land_id_ab/0/leihe//datum/2022-07-10/sort//plus/1',
                 'https://www.transfermarkt.es/transfers/transfertagedetail/statistik/top/land_id_zu/0/land_id_ab/0/leihe//datum/2022-07-10/sort//plus/1/page/2']

    custom_settings = {"FEEDS": {"players.json": {"format": "json", "encoding": "utf-8", "indent": 4}}}

    def parse(self, response):
        # Extract all text from table
        data = response.xpath("//*[@id='yw1']/table/tbody//table//text()").extract()
        # Strip surrounding whitespace from every cell
        data = map(str.strip, data)
        # Drop empty strings
        data = list(filter(lambda x: x != '', data))
        yield {
            'names': data[0::6],
            'position': data[1::6],
            'origin_club': data[2::6],
            'leage_origin_club': data[3::6],
            'new_club': data[4::6],
            'leage_new_club': data[5::6]
        }

The fix is probably not difficult, but I can't figure it out.
The output I want is:

{
    Names: [list with names],
    Position: [list with positions],
...
}
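Scrapy yields one item per parsed response, which is why the feed ends up as a list of dicts (one per start URL). Independent of `meta`, the per-page dicts can be merged into the single dict shown above after the fact; a minimal sketch, using hypothetical page data shaped like the spider's items:

```python
# Hypothetical per-page dicts, shaped like the spider's yielded items
pages = [
    {'names': ['Player A'], 'position': ['Portero']},
    {'names': ['Player B'], 'position': ['Pivote']},
]

merged = {}
for page in pages:
    for key, values in page.items():
        # Concatenate each page's list under the shared key
        merged.setdefault(key, []).extend(values)

print(merged)
# {'names': ['Player A', 'Player B'], 'position': ['Portero', 'Pivote']}
```

The same loop works no matter how many pages are scraped, as long as every page yields a dict with the same keys.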

ctehm74n1#

You don't need to stick to the dict result you asked for... and nobody can stop you from using a complex solution. However, this job can be done in a simple way, using Python with requests, BeautifulSoup, and pandas:

import requests
from bs4 import BeautifulSoup
import pandas as pd

final_list = []

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.60 Safari/537.17'}

for x in range(1, 7):
    r = requests.get(f'https://www.transfermarkt.es/transfers/transfertagedetail/statistik/top/land_id_zu/0/land_id_ab/0/leihe//datum/2022-07-10/sort//plus/2/page/{x}', headers=headers)
    soup = BeautifulSoup(r.text, 'html.parser')
    big_table = soup.select('table.items>tbody>tr')
    for row in big_table:
        # Top-level cells only; the nested per-cell tables are queried below
        cells = row.find_all('td', recursive=False)
        name = cells[0].select('td')[1]
        position = cells[0].select('td')[2]
        age = cells[1]
        nationality = cells[2].select_one('img')['alt']
        origin_club = cells[3].select('td')[1]
        origin_club_league = cells[3].select('td')[2]
        new_club = cells[4].select('td')[1]
        new_club_league = cells[4].select('td')[2]
        value_when_transferred = cells[5]
        cost = cells[6]
        final_list.append((name.text.strip(), age.text.strip(),
                           position.text.strip(), nationality,
                           origin_club.text.strip(), origin_club_league.text.strip(),
                           new_club.text.strip(), new_club_league.text.strip(),
                           value_when_transferred.text.strip(), cost.text.strip()))
final_df = pd.DataFrame(final_list, columns = ['Name', 'Age', 'Position', 'Nationality', 
                        'Origin Club', 'Origin Club league', 'New Club', 'New Club League', 
                        'Value when transferred', 'Cost'])
final_df
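The answer leans on `find_all('td', recursive=False)`, which matches only the direct children of the row, so the `td` elements inside the nested per-cell tables are not flattened into the result. A toy illustration (not the real transfermarkt markup):

```python
from bs4 import BeautifulSoup

html = """
<table>
  <tr>
    <td>outer
      <table><tr><td>inner</td></tr></table>
    </td>
  </tr>
</table>
"""

row = BeautifulSoup(html, 'html.parser').find('tr')
print(len(row.find_all('td')))                   # 2: the outer td plus the nested one
print(len(row.find_all('td', recursive=False)))  # 1: direct children only
```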

This DataFrame can be turned into a dict:

final_dict = final_df.to_dict()
final_dict
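Note that `to_dict()` defaults to `orient='dict'`, which produces nested `{column: {index: value}}` mappings; `to_dict('list')` is what matches the `{column: [values]}` shape asked for in the question. A small comparison with made-up rows:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Neco Williams', 'Omar Richards'],
                   'Age': ['21', '24']})

print(df.to_dict())
# {'Name': {0: 'Neco Williams', 1: 'Omar Richards'}, 'Age': {0: '21', 1: '24'}}
print(df.to_dict('list'))
# {'Name': ['Neco Williams', 'Omar Richards'], 'Age': ['21', '24']}
```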

Edit: now that you have confirmed the final dict you want, you can do the following:

final_dict = {}
final_dict['Names'] = final_df['Name'][:2].tolist()
final_dict['Ages'] = final_df['Age'][:2].tolist()
final_dict['Positions'] = final_df['Position'][:2].tolist()
final_dict

which will return:

{'Names': ['Neco Williams', 'Omar Richards'],
 'Ages': ['21', '24'],
 'Positions': ['Lateral derecho', 'Lateral izquierdo']}

x6yk4ghg2#

Given the scrapy tag on your post and the dict output you want, you can try the following example:

import scrapy
from scrapy.crawler import CrawlerProcess

class TransferSpider(scrapy.Spider):     
    name = 'transfers'
    start_urls = [f'https://www.transfermarkt.es/transfers/transfertagedetail/statistik/top/land_id_zu/0/land_id_ab/0/leihe//datum/2022-07-10/sort//plus/1/page/{x}' for x in range(1, 3)]

    custom_settings = {
        'CONCURRENT_REQUESTS_PER_DOMAIN': 1,
        'DOWNLOAD_DELAY': 1,
        'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36',
        "FEEDS": {'data.json': {'format': 'json'}},
        "FEED_EXPORT_ENCODING": "utf-8",
        "INDENT":4
          }

    def parse(self, response):
        #for tr in response.xpath('//*[@class="items"]/tbody/tr'):

        yield {
            'names': response.xpath('//*[@class="items"]/tbody/tr/td[1]/table/tr[1]/td[2]/a/text()').getall(),
            'position': response.xpath('//*[@class="items"]/tbody/tr/td[1]/table/tr[2]/td/text()').getall(),
            'origin_club':response.xpath('//*[@class="items"]/tbody/tr/td[4]/table/tr/td[2]/a/text()').getall(),
            'leage_origin_club':response.xpath('//*[@class="items"]/tbody/tr/td[4]/table/tr[2]/td/a/text()').getall(),
            'new_club': response.xpath('//*[@class="items"]/tbody/tr/td[5]/table/tr[1]/td[2]/a/text()').getall(),
            'leage_new_club': response.xpath('//*[@class="items"]/tbody/tr/td[5]/table/tr[2]/td/a/text()').getall()
        }

if __name__ == '__main__':
    process = CrawlerProcess()
    process.crawl(TransferSpider)
    process.start()

Output:

{'names': ['Diego Vita', 'Julani Archibald', 'Alessio Benedetti', 'Santino Misale', 'Panagiotis Arnaoutoglou', 'Tauã', 'Igor Zeetti', 'Matías Nahuel', 'Hakob 
Hakobyan', 'Vojtech Brak', 'Jordi Calavera', 'Igor Kurylo', 'Aleksey Chubukin', 'Adrián Jiménez', 'Jesús del Amo', 'Giovanni Romano', 'Giuseppe Lopez', 'Sagas Tambi', 'Pedro Justiniano', 'Insar Salakhetdinov', 'Francesco Mele', 'Sina Moridi', 'Julen Monreal', 'Mahmoud Motlaghzadeh', 'Katriel Islamaj'], 'position': 
['Extremo derecho', 'Portero', 'Mediocentro', 'Lateral izquierdo', 'Lateral izquierdo', 'Extremo izquierdo', 'Defensa central', 'Extremo izquierdo', 'Lateral 
izquierdo', 'Defensa central', 'Lateral derecho', 'Defensa central', 'Defensa central', 'Defensa central', 'Defensa central', 'Extremo derecho', 'Delantero centro', 'Pivote', 'Defensa central', 'Pivote', 'Pivote', 'Pivote', 'Defensa central', 'Mediocentro', 'Mediocentro'], 'origin_club': ['Sanremese ', 'Santa Lucia FC', 'Arezzo', 'Gioiese', 'PASA Irodotos', 'PT Prachuap FC', 'Montespaccato', 
'CD Tenerife', 'FC Urartu ', 'Usti nad Labem', 'Girona FC', 'Agrobiznes V.', 'FK Saransk', 'CD Toledo', 'Ast. Vlachioti', 'Portici', 'Brindisi', 'Bnei Yehuda', 'Coimbra', 'Biolog', 'Chieti FC', 'Foolad', 'UE Costa Brava', 'Sanat Naft', 'Real Forte'], 'leage_origin_club': ['Serie D - A', 'Premier League

...and so on
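The spider's dict-of-lists keeps each attribute in a parallel list, so the i-th entry of every list belongs to the same player. If per-player records are easier to work with downstream, the lists can be zipped back together; a sketch using a small slice of the output above:

```python
page = {'names': ['Diego Vita', 'Julani Archibald'],
        'position': ['Extremo derecho', 'Portero']}

# Pair the i-th entry of every list into one record per player
players = [dict(zip(page.keys(), values)) for values in zip(*page.values())]

print(players)
# [{'names': 'Diego Vita', 'position': 'Extremo derecho'},
#  {'names': 'Julani Archibald', 'position': 'Portero'}]
```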
