csv — Most efficient way to request data with Pandas and write the responses to a file

izj3ouym · posted 12 months ago · in: Other

The problem I'm working on is finding lat/long coordinate pairs for a large number of addresses (such as "First St @ Second Ave") stored in an Excel file, and writing the new data to a CSV file. To actually convert the addresses into lat/long pairs, I use the ArcGIS REST API geocoding service (https://geocode.arcgis.com/arcgis/rest/services/World/GeocodeServer/findAddressCandidates?
I wrote a function for this, but running it over roughly 1,600 rows of data takes about 30 minutes. I'm sure it could be faster, but I don't know where I can improve it.
Here is the function I created. It references two helper functions: one that sets the parameters for the request, and one that writes the formatted response to the file.

url = "https://geocode.arcgis.com/arcgis/rest/services/World/GeocodeServer/findAddressCandidates?"
params = {'f':'pjson','outFields':'Addr_type','forStorage':'false','City':'Oakville','Region':'ON'}
path = EXCEL_FILE_PATH

import csv
import pandas as pd
import requests

df = pd.read_excel(path, sheet_name="line_pts", usecols="A", dtype=str)

def writeToFile(path, data):
    # Append one row to the output CSV.
    with open(path, "a", newline="") as outFile:
        writer = csv.writer(outFile, dialect='excel')
        writer.writerow(data)

def setParams(val):
    params['Address'] = val
    return params

def search(query, savePath):
    nonlocated = []

    for q in query:
        setParams(q)
        data = requests.get(url, params).json()
        data_dict = data.get('candidates')[0]

        if data_dict.get('score') >= 95:
            dat = [data_dict.get('address'),
                   data_dict.get('attributes').get('Addr_type'),
                   data_dict.get('location').get('y'),
                   data_dict.get('location').get('x')]
            writeToFile(savePath, dat)
        else:
            nonlocated.append(q)

    with open(r"C:\Users\PC\Documents\missing.txt","w") as missing:
        missing.write('Missed/Non-located places'  + "\n")
        for elem in nonlocated:
            missing.write(str(elem) + "\n")
                    
    print(f"{savePath} was written successfully. % missing data = {len(nonlocated)/len(df['DATA'])*100}")
        

search(df["DATA"],"INPUT-PATH")

The way I've set up the logic, the main inefficiency seems to be that each request has to complete before a single record can be written to the file. Perhaps there is a way to do both tasks in parallel, but I can't even picture what the program logic would look like in that case.
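One way to overlap the two tasks is a producer/consumer layout: several worker threads fetch results and push finished rows onto a queue.Queue, while a single writer thread drains the queue into the CSV, so only one thread ever touches the file. This is a minimal sketch of the wiring; `fetch` here is a hypothetical stand-in for the real geocoding request (no network involved):

```python
import csv
import queue
import threading

def fetch(address):
    # Hypothetical stand-in for the real geocoding request (no network).
    return [address, 'StreetInt', 43.45, -79.68]

def writer_thread(path, q):
    # The single writer owns the file; workers never touch it.
    with open(path, 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(['Address', 'Type', 'Lat', 'Long'])
        while True:
            row = q.get()
            if row is None:   # sentinel: all producers are done
                break
            writer.writerow(row)

def run(addresses, path, n_workers=4):
    q = queue.Queue()
    wt = threading.Thread(target=writer_thread, args=(path, q))
    wt.start()

    def worker(chunk):
        for addr in chunk:
            q.put(fetch(addr))

    # Split the address list across the workers.
    chunks = [addresses[i::n_workers] for i in range(n_workers)]
    workers = [threading.Thread(target=worker, args=(c,)) for c in chunks]
    for t in workers:
        t.start()
    for t in workers:
        t.join()
    q.put(None)               # stop the writer
    wt.join()

run(['First St @ Second Ave'] * 10, 'out.csv')
```

The `None` sentinel tells the writer that all producers are finished; because only the writer thread writes, no lock is needed around `csv.writer`.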


af7jpaap1#

Using a thread pool with a requests session, I was able to pull 1,600 'First St @ Second Ave' lookups in about 8 minutes. Of course, this may vary with your list.
I like imap_unordered because the results come back "as completed". If order matters, use imap instead.
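The difference is easy to see with a small self-contained demo (no network): imap yields results in input order even when later inputs finish first, while imap_unordered yields them as they complete.

```python
from multiprocessing.dummy import Pool  # thread-based pool
import time

def slow_square(n):
    # Later inputs finish sooner, so completion order != input order.
    time.sleep((5 - n) * 0.05)
    return n * n

with Pool(5) as pool:
    ordered = list(pool.imap(slow_square, range(5)))
    unordered = list(pool.imap_unordered(slow_square, range(5)))

print(ordered)            # always [0, 1, 4, 9, 16]
print(sorted(unordered))  # same values; arrival order may differ
```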
On the first run I used Pool(), which defaults to the CPU count (16 on my desktop), but I quickly got a JSON error. Reducing it to Pool(8) ran fine, so there may be some rate or connection limit that I didn't take the time to investigate.
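If those errors were rate-limit or transient server responses, one option is to mount an HTTPAdapter with a Retry policy on the shared session, so urllib3 backs off and retries 429/5xx responses automatically. This is a sketch, not part of the answer above; it assumes urllib3 ≥ 1.26 (older versions spell `allowed_methods` as `method_whitelist`):

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def make_session(pool_size=8):
    # Retry transient failures (429/5xx) with exponential backoff,
    # and size the connection pool to match the thread pool.
    retry = Retry(
        total=3,
        backoff_factor=1.0,
        status_forcelist=[429, 500, 502, 503, 504],
        allowed_methods=["GET"],
    )
    adapter = HTTPAdapter(pool_connections=pool_size,
                          pool_maxsize=pool_size,
                          max_retries=retry)
    session = requests.Session()
    session.mount('https://', adapter)
    session.mount('http://', adapter)
    return session
```

The worker can then use `make_session()` in place of the plain `requests.Session()`; sizing `pool_maxsize` to the thread count avoids the "connection pool is full" warnings.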

from multiprocessing.dummy import Pool
from pprint import pp
import csv, requests

timeout = 10
session = requests.Session()
session.headers.update({'User-Agent': 'Mozilla/5.0'})
DATA = ['First St @ Second Ave'] * 1600

def worker(val):
    url = 'https://geocode.arcgis.com/arcgis/rest/services/World/GeocodeServer/findAddressCandidates?'
    params = {
        'f': 'pjson',
        'outFields': 'Addr_type',
        'forStorage': 'false',
        'City': 'Oakville',
        'Region': 'ON',
        'Address': val
    }

    try:
        data = session.get(url, params=params, timeout=timeout).json()
        return (val, data)
    except Exception as e:
        pp([type(e), e, val])
        return (val, 'missing data')

def search(query, savePath):
    with Pool(8) as pool, open(savePath, 'w', newline='') as outFile:
        query_len = len(query)
        nonlocated = []
        writer = csv.writer(outFile, dialect='excel')
        writer.writerow(['Address', 'Type', 'Lat', 'Long'])
        
        threads = enumerate(pool.imap_unordered(worker, query), 1)
        for i, (val, result) in threads:
            print(f'result {i} of {query_len}')
            if result == 'missing data':
                nonlocated.append(val)
            else:
                candidates = result.get('candidates') or []
                if candidates and candidates[0].get('score') >= 95:
                    data_dict = candidates[0]
                    dat = [
                        data_dict.get('address'),
                        data_dict.get('attributes').get('Addr_type'),
                        data_dict.get('location').get('y'),
                        data_dict.get('location').get('x'),
                    ]
                    writer.writerow(dat)
                else:
                    nonlocated.append(val)
    pool.join()
    
    with open('missing.txt', 'w') as missing:
        for elem in nonlocated: print(elem, file=missing)
    
    print(f'{savePath} was written successfully. % missing data = {len(nonlocated) / query_len * 100}')

if __name__ == '__main__':
    # search(df['DATA'],'INPUT-PATH')
    search(DATA, 'INPUT-PATH')
