numpy: k-nearest-neighbor optimal matching in Python without replacement

djmepvbi  posted 12 months ago  in Python

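I want to assign two control plots to each project plot. In the first iteration, I want to assign the nearest control plot (the one with the smallest dist_project_plot_id* value) to the project plot being evaluated. If that control plot has already been assigned to a project plot, we look for the next-nearest control plot.
Once every project plot has been assigned its first control plot, we assign a second control plot to each project plot following the same criterion: find the control plot with the smallest distance, as long as it has not already been assigned to another project plot.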

I have a dataframe which looks like: 
`data = {
    'control_plot_id': [1526258, 1507770, 1539206, 1528123, 2019722, 1504105],
    'dist_project_plot_id1': [3025.22, 2670.43, 2140.41, 1697.68, 3999.77, 2783.97],
    'dist_project_plot_id2': [488.07, 427.82, 1180.68, 1386.38, 4739.51, 590.44],
    'dist_project_plot_id3': [2033.15, 2193.51, 2958.56, 3168.14, 5573.02, 2008.31]
}

df = pd.DataFrame(data)`

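Here control_plot_id is the identifier of the control plot, dist_project_plot_id1 is the distance between the control plot and project plot 1, dist_project_plot_id2 is the distance between the control plot and project plot 2, and so on.
I attempted the first search with the code below: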

import numpy as np
import pandas as pd
df = pd.DataFrame(data)

# Add new columns "PP" and "dist"
df['PP'] = ''
df['dist'] = np.nan

# Get the column names starting with 'dist_project_plot_id'
project_columns = [col for col in df.columns if col.startswith('dist_project_plot_id')]

# Iterate over the project_plot_id columns
for col in project_columns:
    # Sort the dataframe by the current column in ascending order
    df_sorted = df.sort_values(col)

    # Find the k-nearest control plots for the current column
    k = 1  # Set the value of k
    nearest_control_plots = []
    for i in range(k):
        min_value = df_sorted.loc[~df_sorted['control_plot_id'].isin(nearest_control_plots)].head(1)[['control_plot_id', col]]
        nearest_control_plots.append(min_value['control_plot_id'].values[0])
        df.loc[df['control_plot_id'] == min_value['control_plot_id'].values[0], 'PP'] = col
        df.loc[df['control_plot_id'] == min_value['control_plot_id'].values[0], 'dist'] = min_value[col].values[0]


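What I can't work out how to program is this: if a control plot has already been selected for a project plot, the code should keep searching for the next-nearest control plot, which may be the third, the fourth, or even the last one.
The expected output should contain the following columns:

* control_plot_id: the original control plot id
* PP: the project plot for which the control plot was selected
* dist: the distance between the control plot and the project plot it was selected for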

lyr7nygr  #1

Based on the description, this is what you are looking for:

import pandas as pd

data = {
    'control_plot_id': [1526258, 1507770, 1539206, 1528123, 2019722, 1504105],
    'dist_project_plot_id1': [3025.22, 2670.43, 2140.41, 1697.68, 3999.77, 2783.97],
    'dist_project_plot_id2': [488.07, 427.82, 1180.68, 1386.38, 4739.51, 590.44],
    'dist_project_plot_id3': [2033.15, 2193.51, 2958.56, 3168.14, 5573.02, 2008.31]
}

df = pd.DataFrame(data)
cols = [c for c in df.columns if c.startswith("dist")]

project_data = []
assignments = []

for c in cols:
    for i in range(1, 3):
        row = {}
        search_df = df.loc[~df.control_plot_id.isin(assignments)]
        control_plot = search_df.loc[search_df[c].idxmin()]
        row["project_plot"] = c
        row["control_plot"] = control_plot.control_plot_id
        row["dist"] = control_plot[c]
        project_data.append(row)
        assignments.append(control_plot.control_plot_id)

out_data = pd.DataFrame(project_data)

print(out_data)

This returns:

            project_plot  control_plot     dist
0  dist_project_plot_id1     1528123.0  1697.68
1  dist_project_plot_id1     1539206.0  2140.41
2  dist_project_plot_id2     1507770.0   427.82
3  dist_project_plot_id2     1526258.0   488.07
4  dist_project_plot_id3     1504105.0  2008.31
5  dist_project_plot_id3     2019722.0  5573.02
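
The control_plot values above print as floats (1528123.0) because each selected row is a Series that mixes an integer id with float distances, and pandas upcasts such a row to float64. A minimal sketch of casting the ids back to integers, using a two-row subset of the sample data; the names `row` and `out` are only illustrative:

```python
import pandas as pd

# Two-row subset of the sample data, enough to show the dtype change.
df = pd.DataFrame({
    "control_plot_id": [1528123, 1539206],
    "dist_project_plot_id1": [1697.68, 2140.41],
})

# Selecting one row returns a Series; int64 and float64 values are
# combined into a single float64 Series.
row = df.loc[df["dist_project_plot_id1"].idxmin()]
print(row.dtype)

# Build a result frame from the row, then cast the id column back to int.
out = pd.DataFrame([{"control_plot": row.control_plot_id,
                     "dist": row["dist_project_plot_id1"]}])
out["control_plot"] = out["control_plot"].astype("int64")
print(out)
```

The cast is safe here only because every id is a whole number and none is missing.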


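This is what you asked for:
"I want to assign the nearest control plot. If this control plot has already been assigned to a project plot, we look for the next-nearest control plot. Once every project plot has been assigned its first control plot, we assign a second control plot to each project plot following the same criterion: find the control plot with the smallest distance, as long as it has not already been assigned to another project plot."
However, your example output differs from what you describe wanting, and I'm not sure why.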
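
One caveat about ordering: the loop in this answer picks both control plots for one project plot before moving on to the next, while the quoted requirement describes rounds (every project plot gets its first control plot, then every project plot gets its second). A minimal sketch of the round-by-round variant follows. The function name `assign_round_robin` and the `rank` column are hypothetical, and this is still a greedy heuristic rather than a globally optimal matching.

```python
import pandas as pd

data = {
    'control_plot_id': [1526258, 1507770, 1539206, 1528123, 2019722, 1504105],
    'dist_project_plot_id1': [3025.22, 2670.43, 2140.41, 1697.68, 3999.77, 2783.97],
    'dist_project_plot_id2': [488.07, 427.82, 1180.68, 1386.38, 4739.51, 590.44],
    'dist_project_plot_id3': [2033.15, 2193.51, 2958.56, 3168.14, 5573.02, 2008.31],
}
df = pd.DataFrame(data)
dist_cols = [c for c in df.columns if c.startswith("dist_project_plot_id")]


def assign_round_robin(df, dist_cols, k=2):
    """Greedy matching without replacement, one pick per project per round."""
    # Index by control id so idxmin returns the id directly (and keeps it int).
    available = df.set_index("control_plot_id")[dist_cols]
    rows = []
    for rank in range(1, k + 1):          # round 1, round 2, ...
        for col in dist_cols:             # each project plot picks once per round
            if available.empty:           # no control plots left to assign
                break
            best_id = available[col].idxmin()
            rows.append({
                "project_plot": col,
                "rank": rank,
                "control_plot_id": best_id,
                "dist": available.at[best_id, col],
            })
            # Remove the chosen control plot so it cannot be reused.
            available = available.drop(index=best_id)
    return pd.DataFrame(rows)


matches = assign_round_robin(df, dist_cols, k=2)
print(matches.sort_values(["project_plot", "rank"]))
```

On the sample data both orderings happen to select the same pairs, but in general they can differ, since an early project plot may otherwise take a control plot that a later project plot needed as its first choice.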
Note after the edit: if you want one row per project plot, as in my answer before your edit clarified the expected output:

project_data = {}
assignments = []

for c in cols:
    project_data[c] = {}
    for i in range(1, 3):
        search_df = df.loc[~df.control_plot_id.isin(assignments)]
        control_plot = search_df.loc[search_df[c].idxmin()]
        project_data[c][f"control_plot{i}"] = control_plot.control_plot_id
        project_data[c][f"dist{i}"] = control_plot[c]
        assignments.append(control_plot.control_plot_id)

out_data = pd.DataFrame.from_dict(project_data, orient='index').reset_index()
out_data.rename(columns={'index': 'project'}, inplace=True)


This returns:

                 project  control_plot1    dist1  control_plot2    dist2
0  dist_project_plot_id1      1528123.0  1697.68      1539206.0  2140.41
1  dist_project_plot_id2      1507770.0   427.82      1526258.0   488.07
2  dist_project_plot_id3      1504105.0  2008.31      2019722.0  5573.02
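
If the layout listed in the question is preferred (one row per control plot, with PP and dist columns), the long-format result can be merged back onto the list of control plots. The sketch below is only one possible way to do it: it copies the matches from the first output printed in this answer instead of recomputing them, and the names `controls` and `result` are illustrative.

```python
import pandas as pd

# Control plots in their original order.
controls = pd.DataFrame({
    "control_plot_id": [1526258, 1507770, 1539206, 1528123, 2019722, 1504105],
})

# Long-format matches, copied from the first output shown in this answer.
matches = pd.DataFrame({
    "project_plot": ["dist_project_plot_id1", "dist_project_plot_id1",
                     "dist_project_plot_id2", "dist_project_plot_id2",
                     "dist_project_plot_id3", "dist_project_plot_id3"],
    "control_plot": [1528123, 1539206, 1507770, 1526258, 1504105, 2019722],
    "dist": [1697.68, 2140.41, 427.82, 488.07, 2008.31, 5573.02],
})

# A left merge keeps every control plot and preserves the original row order;
# control plots that were never selected would get NaN in PP and dist.
result = (
    controls
    .merge(matches, left_on="control_plot_id", right_on="control_plot", how="left")
    .drop(columns="control_plot")
    .rename(columns={"project_plot": "PP"})
)[["control_plot_id", "PP", "dist"]]

print(result)
```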
