debugging: Fixing a well analysis

dfddblmv  posted on 2023-06-30  in  Other

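So, I'm currently stuck on a bug.
I'm working with a huge dataset that contains the following information:
Multiple samples for many wells, where each well is labeled with its own unique well ID number, a radium contamination level, and a sample date.
For example: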

Well ID: AT091
Radium Level: 44.9
Sample Date: 3/18/2015

Well ID: AT091
Radium Level: 50.2
Sample Date: 2/18/2015

Well ID: AT091
Radium Level: 33.7 PCI/L
Sample Date: 7/28/2020

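I was asked to write a Python script that filters the data from the original dataset and creates a new Excel sheet based on the following criteria:
For each well, if the well was sampled only once in a given year, keep that sample. For each well, if the well was sampled multiple times within a year, keep the sample date with the highest contamination level.
For example, if a well was sampled three times: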

Well ID: AT091
Radium Level: 44.9
Sample Date: 3/18/2015

Well ID: AT091
Radium Level: 50.2
Sample Date: 2/18/2015

Well ID: AT091
Radium Level: 33.7 PCI/L
Sample Date: 7/28/2020

The code should update the spreadsheet with the following:

Well ID: AT091
Radium Level: 50.2
Sample Date: 2/18/2015

Well ID: AT091
Radium Level: 33.7 PCI/L
Sample Date: 7/28/2020

Here is the code I wrote:

import pandas as pd

def wells_sampled_once_per_year(well_numbers, formatted_dates, concentration):
    well_count = {}
    max_contamination = {}

    for well, date, conc in zip(well_numbers, formatted_dates, concentration):
        if date is None:
            continue
        try:
            year = pd.to_datetime(date).year
        except AttributeError:
            continue
        well_year = (well, year)
        if well_year in well_count:
            well_count[well_year] += 1
            max_contamination[well_year] = max(max_contamination[well_year], conc)
        else:
            well_count[well_year] = 1
            max_contamination[well_year] = conc

    sampled_once_per_year = [
        (well, date, conc, max_contamination[(well, pd.to_datetime(date).year)])
        for well, date, conc in zip(well_numbers, formatted_dates, concentration)
        if well_count[(well, pd.to_datetime(date).year)] == 1
    ]
    return sorted(sampled_once_per_year)

def wells_sampled_multiple_times_per_year(well_numbers, formatted_dates, concentration):
    well_count = {}
    max_contamination = {}
    
    for well, date, conc in zip(well_numbers, formatted_dates, concentration):
        if date is None:
            continue
        try:
            year = pd.to_datetime(date).year
        except AttributeError:
            continue
        well_year = (well, year)
        if well_year in well_count:
            well_count[well_year] += 1
            if conc > max_contamination[well_year]:
                max_contamination[well_year] = conc
        else:
            well_count[well_year] = 1
            max_contamination[well_year] = conc
    
    sampled_multiple_times_per_year = [
        (well, date, conc, max_contamination[(well, pd.to_datetime(date).year)])
        for well, date, conc in zip(well_numbers, formatted_dates, concentration)
        if well_count[(well, pd.to_datetime(date).year)] > 1 and conc == max_contamination[(well, pd.to_datetime(date).year)]
    ]
    
    # Remove duplicates from the list
    sampled_multiple_times_per_year = list(set(sampled_multiple_times_per_year))
    
    return sorted(sampled_multiple_times_per_year)

yb3bgrhw  #1

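After the for loop, max_contamination contains almost all of the required information, except for the date. To simplify the construction of the return value, I added the date inside the loop, i.e. change the last five lines of the loop to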

…
            if conc > max_contamination[well_year][1]:  # [1]: conc
                max_contamination[well_year] = (date, conc)
        else:
            well_count[well_year] = 1
            max_contamination[well_year] = (date, conc)
    return [(well, date, conc) for (well, _), (date, conc) in max_contamination.items()]

(or a sorted version of it, if needed).
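
For reference, here is a minimal sketch of how the whole function might look with that change folded in. The function name max_sample_per_well_year is made up for illustration, and the sketch assumes the concentrations have already been converted to plain numbers (so a value like "33.7 PCI/L" would need its unit stripped first). Because the highest sample is stored per (well, year), a well sampled only once in a year is kept automatically, so the separate once-per-year function is probably not needed.

```python
import pandas as pd


def max_sample_per_well_year(well_numbers, formatted_dates, concentration):
    """Return one (well, date, conc) tuple per well and calendar year.

    Hypothetical helper: for each (well, year) pair, the sample with the
    highest concentration is kept. Assumes conc values are numeric.
    """
    max_contamination = {}  # (well, year) -> (date, conc)

    for well, date, conc in zip(well_numbers, formatted_dates, concentration):
        if date is None:
            continue
        year = pd.to_datetime(date).year
        well_year = (well, year)
        # Keep the first sample seen, then replace it only when a later
        # sample for the same well and year has a higher concentration.
        if (well_year not in max_contamination
                or conc > max_contamination[well_year][1]):
            max_contamination[well_year] = (date, conc)

    # Tuples sort by well ID first, then by the date as given.
    return sorted(
        (well, date, conc)
        for (well, _), (date, conc) in max_contamination.items()
    )


rows = max_sample_per_well_year(
    ["AT091", "AT091", "AT091"],
    ["3/18/2015", "2/18/2015", "7/28/2020"],
    [44.9, 50.2, 33.7],
)
print(rows)
# [('AT091', '2/18/2015', 50.2), ('AT091', '7/28/2020', 33.7)]
```

Note that if the dates are strings, sorted() compares them as text, so converting them with pd.to_datetime before sorting would be needed for a true chronological order.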
