How can I call URLs from a csv file, scrape them, and distribute the output into a second csv file?

Asked by x8goxv8g on 2023-03-27
  • Python 3.11.2; PyCharm 2022.3.3 (Community Edition) - Build PC-223.8836.43; OS: Windows 11 Pro, 22H2, 22621.1413; Chrome 111.0.5563.65 (Official Build) (64-bit)

The editing box behaved wonkily, so I have omitted a couple of the intermediate attempts that were unsuccessful.
Is there a way to (1) call the URLs in a one-column, 10-item list contained in a csv file (i.e., "caselist.csv"); and (2) execute a scraping script for each of those URLs (see below) and output all the data to a second csv file ("caselist_output.csv"), in which the output is distributed into columns (i.e., case_title, case_plaintiff, case_defendant, case_number, case_filed, court, case_nature_of_suit, case_cause_of_action, jury_demanded) and rows (one for each of the 10 cases contained in the csv file)?
The ten URLs contained in caselist.csv are:

https://dockets.justia.com/docket/alabama/alndce/6:2013cv01516/148887
https://dockets.justia.com/docket/arizona/azdce/2:2010cv02664/572428
https://dockets.justia.com/docket/arkansas/aredce/4:2003cv01507/20369
https://dockets.justia.com/docket/arkansas/aredce/4:2007cv00051/67198
https://dockets.justia.com/docket/arkansas/aredce/4:2007cv01067/69941
https://dockets.justia.com/docket/arkansas/aredce/4:2008cv00172/70993
https://dockets.justia.com/docket/arkansas/aredce/4:2008cv01288/73322
https://dockets.justia.com/docket/arkansas/aredce/4:2008cv01839/73965
https://dockets.justia.com/docket/arkansas/aredce/4:2008cv02513/74818
https://dockets.justia.com/docket/arkansas/aredce/4:2008cv02666/74976

After failing miserably with my own scripts, I tried @Driftr95's two suggestions:

from bs4 import BeautifulSoup
import requests
import csv

th_fields = { 'case_plaintiff': 'Plaintiff', 'case_defendant': 'Defendant', 'case_number': 'Case Number',
              'case_filed': 'Filed', 'court': 'Court', 'case_nature_of_suit': 'Nature of Suit',
              'case_cause_of_action': 'Cause of Action',  'jury_demanded': 'Jury Demanded By' }
fgtParams = [('div', {'class': 'title-wrapper'})] + [('td', {'data-th': f}) for f in th_fields.values()]

with open('caselist.csv') as f:
    links = [l.strip() for l in f.read().splitlines() if l.strip().startswith('https://dockets.justia.com/docket')]

def find_get_text(bsTag, tName='div', tAttrs=None):
    t = bsTag.find(tName, {} if tAttrs is None else tAttrs)
    if t: return t.get_text(' ',strip=True) # safer as a conditional

def scrape_docketsjustia(djUrl, paramsList=fgtParams):
    soup = BeautifulSoup((r:=requests.get(djUrl)).content, 'lxml')
    cases_class = 'wrapper jcard has-padding-30 blocks has-no-bottom-padding'
    cases = soup.find_all('div', class_=cases_class)

    # print(f'{len(cases)} cases <{r.status_code} {r.reason}> from {r.url}')
    return [[find_get_text(c, n, a) for n, a in paramsList] for c in cases]

all_ouputs = []
for url in links:
    all_ouputs += scrape_docketsjustia(url)

with open("posts/caselist_output.csv", "w") as f:
    writer = csv.writer(f)
    writer.writerow(['case_title', *th_fields]) # [ header row with column names ]
    writer.writerows(all_ouputs)

This script did not produce any output. Not really sure what's going on...
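A quick way to see where this breaks down is to check what each request actually returns and whether the target class matches anything, for example by uncommenting the print line inside scrape_docketsjustia above, or with a minimal standalone check along these lines (same cases_class string; the URL is just the first one from caselist.csv):

import requests
from bs4 import BeautifulSoup

# Minimal diagnostic: fetch one docket page, then report the HTTP status and how many
# case blocks match the class the scraper is looking for.
url = 'https://dockets.justia.com/docket/alabama/alndce/6:2013cv01516/148887'
r = requests.get(url)
soup = BeautifulSoup(r.content, 'lxml')
cases_class = 'wrapper jcard has-padding-30 blocks has-no-bottom-padding'
cases = soup.find_all('div', class_=cases_class)
print(f'<{r.status_code} {r.reason}> from {r.url} -- {len(cases)} matching div(s)')
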
I also tried @Driftr95's second suggestion:

import requests
from bs4 import BeautifulSoup
import pandas as pd # [I just prefer pandas]

input_fp = 'caselist.csv'
output_fp = 'caselist_output.csv'
th_fields = { 'case_plaintiff': 'Plaintiff', 'case_defendant': 'Defendant', 'case_number': 'Case Number',
              'case_filed': 'Filed', 'court': 'Court', 'case_nature_of_suit': 'Nature of Suit',
              'case_cause_of_action': 'Cause of Action',  'jury_demanded': 'Jury Demanded By' }
fgtParams = [('case_title', 'div', {'class': 'title-wrapper'})] + [(k, 'td', {'data-th': f}) for k,f in th_fields.items()]
## function definitions ##

def find_get_text(bsTag, tName='div', tAttrs=None):
    t = bsTag.find(tName, {} if tAttrs is None else tAttrs)
    if t: return t.get_text(' ',strip=True)

def scrape_docketsjustia(djUrl, paramsList=fgtParams):
    soup = BeautifulSoup((r:=requests.get(djUrl)).content, 'lxml')
    cases_class = 'wrapper jcard has-padding-30 blocks has-no-bottom-padding'
    for c in soup.find_all('div', class_=cases_class):
        return {k:find_get_text(c,n,a) for k,n,a in paramsList}

    # return {} # just return empty row if cases_class can't be found
    return {'error_msg': f'no cases <{r.status_code} {r.reason}> from {r.url}'}
## main logic ##

## load list of links
# links = list(pd.read_csv(input_fp, header=None)[0]) # [ if you're sure ]
links = [l.strip() for l in pd.read_csv(input_fp)[0] # header will get filtered anyway
         if l.strip().startswith('https://dockets.justia.com/docket/')] # safer

## scrape for each link
df = pd.DataFrame([scrape_docketsjustia(u) for u in links])
# df = pd.DataFrame(map(scrape_docketsjustia,links)).dropna(axis='rows') # drop empty rows
# df['links'] = links # [ add another column with the links ]

## save scraped data
# df.to_csv(output_fp, index=False, header=False) # no column headers
df.to_csv(output_fp, index=False)

This produced the following error messages:
Traceback (most recent call last):
  File "C:\Users\cs\AppData\Local\Programs\Python\Python311\Lib\site-packages\pandas\core\indexes\base.py", line 3802, in get_loc
    return self._engine.get_loc(casted_key)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "pandas\_libs\index.pyx", line 138, in pandas._libs.index.IndexEngine.get_loc
  File "pandas\_libs\index.pyx", line 165, in pandas._libs.index.IndexEngine.get_loc
  File "pandas\_libs\hashtable_class_helper.pxi", line 5745, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas\_libs\hashtable_class_helper.pxi", line 5753, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 0
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "C:\Users\cs\PycharmProjects\pythonProject1\solution2.py", line 29, in <module>
    links = [l.strip() for l in pd.read_csv(input_fp)[0] # header will get filtered anyway
                                ~~~~~~~~~~~~~~~~~~~~~^^^
  File "C:\Users\cs\AppData\Local\Programs\Python\Python311\Lib\site-packages\pandas\core\frame.py", line 3807, in __getitem__
    indexer = self.columns.get_loc(key)
              ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\cs\AppData\Local\Programs\Python\Python311\Lib\site-packages\pandas\core\indexes\base.py", line 3804, in get_loc
    raise KeyError(key) from err
KeyError: 0
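(For reference, the KeyError: 0 arises because pd.read_csv without header=None treats the first URL as the header row, so the resulting frame has no column labelled 0. A minimal illustration with inline data standing in for caselist.csv:)

import io
import pandas as pd

csv_text = 'https://example.com/a\nhttps://example.com/b\n'  # stand-in for the one-column caselist.csv

df_default = pd.read_csv(io.StringIO(csv_text))                  # first URL becomes the column name
print(df_default.columns.tolist())                               # ['https://example.com/a'] -> df_default[0] raises KeyError: 0

df_headerless = pd.read_csv(io.StringIO(csv_text), header=None)  # the single column is labelled 0
print(list(df_headerless[0]))                                    # both URLs
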
I had just run the script and thought it was working, but now, all of a sudden, it returns no output, even with the revised line links = [l.strip() for l in pd.read_csv(input_fp, header=None)[0] if l.strip().startswith('https://dockets.justia.com/docket/')]:

import requests
from bs4 import BeautifulSoup
import pandas as pd # [I just prefer pandas]

input_fp = 'caselist.csv'
output_fp = 'caselist_output.csv'
th_fields = { 'case_plaintiff': 'Plaintiff', 'case_defendant': 'Defendant', 'case_number': 'Case Number',
              'case_filed': 'Filed', 'court': 'Court', 'case_nature_of_suit': 'Nature of Suit',
              'case_cause_of_action': 'Cause of Action',  'jury_demanded': 'Jury Demanded By' }
fgtParams = [('case_title', 'div', {'class': 'title-wrapper'})] + [(k, 'td', {'data-th': f}) for k,f in th_fields.items()]
## function definitions ##

def find_get_text(bsTag, tName='div', tAttrs=None):
    t = bsTag.find(tName, {} if tAttrs is None else tAttrs)
    if t: return t.get_text(' ',strip=True)

def scrape_docketsjustia(djUrl, paramsList=fgtParams):
    soup = BeautifulSoup((r:=requests.get(djUrl)).content, 'lxml')
    cases_class = 'wrapper jcard has-padding-30 blocks has-no-bottom-padding'
    for c in soup.find_all('div', class_=cases_class):
        return {k:find_get_text(c,n,a) for k,n,a in paramsList}

    # return {} # just return empty row if cases_class can't be found
    return {'error_msg': f'no cases <{r.status_code} {r.reason}> from {r.url}'}
## main logic ##

## load list of links
# links = list(pd.read_csv(input_fp, header=None)[0]) # [ if you're sure ]
links = [l.strip() for l in pd.read_csv(input_fp , header=None )[0] if l.strip().startswith('https://dockets.justia.com/docket/')] # safer

## scrape for each link
df = pd.DataFrame([scrape_docketsjustia(u) for u in links])
# df = pd.DataFrame(map(scrape_docketsjustia,links)).dropna(axis='rows') # drop empty rows
# df['links'] = links # [ add another column with the links ]

## save scraped data
# df.to_csv(output_fp, index=False, header=False) # no column headers
df.to_csv(output_fp, index=False)
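
If the script now runs without errors but still writes nothing useful, one possibility (an assumption, not something confirmed from the output above) is that the requests are answered with a block or error page rather than the docket HTML; the error_msg fallback in scrape_docketsjustia should make that visible in caselist_output.csv. A common thing to try in that situation is sending a browser-like User-Agent header, sketched below, and passing headers=headers into the requests.get call inside scrape_docketsjustia:

import requests

# Assumption: the site may treat the default python-requests User-Agent differently
# from a browser; a browser-like header is a common workaround, not a guaranteed fix.
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}
r = requests.get('https://dockets.justia.com/docket/alabama/alndce/6:2013cv01516/148887',
                 headers=headers)
print(r.status_code, r.reason)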

Answer 1 (kknvjkwl)

Solution V1

  • Is there a way to (1) call URLs in a one-column 10-item list contained in a csv (i.e., "caselist.csv")

With csv.reader, you could use something like:

# import csv
with open('caselist.csv', newline='') as f:
    links = [l for l, *_ in csv.reader(f)]

However, since it is just a single column with no header to index, you don't really need the csv module; you could simply use f.read(), as in with open('caselist.csv') as f: links = f.read().splitlines(), or, more safely:

with open('caselist.csv') as f:
    links = [l.strip() for l in f.read().splitlines() if l.strip().startswith('https://dockets.justia.com/docket')]
  • and (2) execute a scraping script for each of those URLs (see below)

You can wrap your current code [except for the csv.writer block] in a function that takes the URL as input and returns the output list; but your current code has some repetitive parts, which I think can be simplified to:

th_fields = { 'case_plaintiff': 'Plaintiff', 'case_defendant': 'Defendant', 'case_number': 'Case Number',
              'case_filed': 'Filed', 'court': 'Court', 'case_nature_of_suit': 'Nature of Suit',
              'case_cause_of_action': 'Cause of Action',  'jury_demanded': 'Jury Demanded By' }
fgtParams = [('div', {'class': 'title-wrapper'})] + [('td', {'data-th': f}) for f in th_fields.values()]

def find_get_text(bsTag, tName='div', tAttrs=None):
    t = bsTag.find(tName, {} if tAttrs is None else tAttrs)
    if t: return t.get_text(' ',strip=True) # safer as a conditional

def scrape_docketsjustia(djUrl, paramsList=fgtParams):
    soup = BeautifulSoup((r:=requests.get(djUrl)).content, 'lxml')
    cases_class = 'wrapper jcard has-padding-30 blocks has-no-bottom-padding'
    cases = soup.find_all('div', class_=cases_class)

    # print(f'{len(cases)} cases <{r.status_code} {r.reason}> from {r.url}')
    return [[find_get_text(c, n, a) for n, a in paramsList] for c in cases]
Once you have this function, you can loop through all the URLs to collect all the outputs:

all_ouputs = []
for url in links: 
    all_ouputs += scrape_docketsjustia(url)
  • and output all the data to a second csv file ("caselist_output.csv")

You can save all_ouputs the same way you were saving output, but if you want, you can also use the keys of th_fields as the column headers:

with open("posts/caselist_output.csv", "w") as f:
    writer = csv.writer(f)
    writer.writerow(['case_title', *th_fields]) # [ header row with column names ]
    writer.writerows(all_ouputs)
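
Two small notes on this writing step, both assumptions about the local setup rather than anything specific to the scraper: the posts/ directory has to exist before open() can create a file inside it, and on Windows it is worth opening the file with newline='' so csv.writer does not insert blank lines between rows. A sketch with placeholder rows:

import csv
import os

rows = [['title 1', 'plaintiff 1'], ['title 2', 'plaintiff 2']]  # placeholder for all_ouputs

os.makedirs('posts', exist_ok=True)  # only needed if keeping the posts/ prefix
with open('posts/caselist_output.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['case_title', 'case_plaintiff'])  # shortened header, for illustration only
    writer.writerows(rows)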

Solution V2

  • and rows (each of the 10 cases contained in the csv file)

I didn't notice this at first, but if you're expecting just one row per link, there is no need for scrape_docketsjustia to return a list; it can simply return that one row.

## setup ##

import requests
from bs4 import BeautifulSoup
import pandas as pd # [I just prefer pandas]

input_fp = 'caselist.csv'
output_fp = 'posts/caselist_output.csv'
th_fields = { 'case_plaintiff': 'Plaintiff', 'case_defendant': 'Defendant', 'case_number': 'Case Number', 
              'case_filed': 'Filed', 'court': 'Court', 'case_nature_of_suit': 'Nature of Suit', 
              'case_cause_of_action': 'Cause of Action',  'jury_demanded': 'Jury Demanded By' }
fgtParams = [('case_title', 'div', {'class': 'title-wrapper'})] + [(k, 'td', {'data-th': f}) for k,f in th_fields.items()]
## function definitions ##

def find_get_text(bsTag, tName='div', tAttrs=None):
    t = bsTag.find(tName, {} if tAttrs is None else tAttrs)
    if t: return t.get_text(' ',strip=True)

def scrape_docketsjustia(djUrl, paramsList=fgtParams):
    soup = BeautifulSoup((r:=requests.get(djUrl)).content, 'lxml')
    cases_class = 'wrapper jcard has-padding-30 blocks has-no-bottom-padding'
    for c in soup.find_all('div', class_=cases_class): 
        return {k:find_get_text(c,n,a) for k,n,a in paramsList}

    # return {} # just return empty row if cases_class can't be found
    return {'error_msg': f'no cases <{r.status_code} {r.reason}> from {r.url}'}
## main logic ##

## load list of links 
# links = list(pd.read_csv(input_fp, header=None)[0]) # [ if you're sure ]
links = [l.strip() for l in pd.read_csv(input_fp, header=None)[0]  
         if l.strip().startswith('https://dockets.justia.com/docket/')] # safer

## scrape for each link
df = pd.DataFrame([scrape_docketsjustia(u) for u in links])
# df = pd.DataFrame(map(scrape_docketsjustia,links)).dropna(axis='rows') # drop empty rows
# df['links'] = links # [ add another column with the links ]

## save scraped data
# df.to_csv(output_fp, index=False, header=False) # no column headers
df.to_csv(output_fp, index=False)
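
Since scrape_docketsjustia falls back to an error_msg row when no case block is found, it can be worth checking for such rows before relying on the output; a short follow-up check, assuming the df built just above:

# Report any links that produced no case data (those rows only have 'error_msg' filled in)
if 'error_msg' in df.columns:
    print(df[df['error_msg'].notna()])
else:
    print('every link yielded case data')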

**Added edit:** reading and saving line by line

Modified the definition of scrape_docketsjustia [since, when appending, all rows need to have the same column order in order to keep the rows aligned]:

def scrape_docketsjustia(djUrl, paramsList=fgtParams):
    soup = BeautifulSoup((r:=requests.get(djUrl)).content, 'lxml')
    cases_class = 'wrapper jcard has-padding-30 blocks has-no-bottom-padding'
    print(rStatus:=f'<{r.status_code} {r.reason}> from {r.url}') 

    c1 = soup.find('div', class_=cases_class)
    case1 = {k:find_get_text(c1 if c1 else soup, n, a) for k,n,a in paramsList}
    return {**case1, 'msg': rStatus}
    # return {**case1, 'msg': rStatus, 'from_link': djUrl}

and replace the ## main logic ## block with:

columns = [*[k for k,*_ in fgtParams], 'msg']
# columns = [*[k for k,*_ in fgtParams], 'msg', 'from_link'] # if also returning 'from_link' above

with open(input_fp) as f:
    for li, l in enumerate(f):
        if not l.strip().startswith('https://dockets.justia.com/'): continue
        df = pd.DataFrame([scrape_docketsjustia(l.strip())])

        m, h = ('a', False) if li else ('w', True) ## new file if li==0
        df[columns].to_csv(output_fp, mode=m, index=False, header=h)

Note that this only works for a single-column input_fp.
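
If the input file ever gains extra columns, one way to adapt the row-by-row version (a sketch, assuming the URL sits in the first column and reusing input_fp, output_fp, columns, and scrape_docketsjustia from above) is to parse each row with csv.reader and take the URL from its first cell:

import csv
import pandas as pd

# Parse each row properly instead of treating the raw line as a URL; the first row that
# is actually written creates the file and the header, and later rows are appended.
first_write = True
with open(input_fp, newline='') as f:
    for row in csv.reader(f):
        url = row[0].strip() if row else ''
        if not url.startswith('https://dockets.justia.com/'):
            continue  # also skips header rows and blank lines
        df = pd.DataFrame([scrape_docketsjustia(url)])
        df[columns].to_csv(output_fp, mode='w' if first_write else 'a',
                           index=False, header=first_write)
        first_write = False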
