How to find similar data points based on multiple conditions on some columns in Pandas?

eqoofvh9  posted on 2023-01-07  in  Other

I have a dataset with the columns DATE_TIME, ID, VALUE1, VALUE2, VALUE3, VALUE4, MODEL, SOLD, INSPECTION, MODE, TIME and CYCLE_PART. The ID values are usually numeric, but some are abbreviations.

import numpy as np
import pandas as pd
import random

df = pd.DataFrame({'DATE_TIME': pd.date_range('2022-11-01', '2022-11-06 23:00:00', freq='20min'),
                   'ID': [random.randrange(1, 20) for n in range(430)]})

df['VALUE1'] = [random.randrange(110, 140) for n in range(430)]
df['VALUE2'] = [random.randrange(50, 60) for n in range(430)]
df['VALUE3'] = [random.randrange(80, 100) for n in range(430)]
df['VALUE4'] = [random.randrange(30, 50) for n in range(430)]

df['MODEL'] = [random.randrange(1, 3) for n in range(430)]

df['SOLD'] = [random.randrange(0, 2) for n in range(430)]

df['INSPECTION'] = df['DATE_TIME'].dt.day

df['MODE'] = np.select([df['INSPECTION'] == 1, df['INSPECTION'].isin([2, 3])], ['A', 'B'], 'C')

df['TIME'] = df['DATE_TIME'].dt.time
# df['TIME'] = pd.to_timedelta(df['TIME'])
df['TIME'] = df['TIME'].astype('str')

# Label each time of day as 'Morning' or 'Night' (used for the CYCLE_PART column)
def cycle_day_period(dataframe: pd.DataFrame, midnight='00:00:00', start_of_morning='06:00:00',
                     start_of_afternoon='13:00:00',
                     start_of_evening='18:00:00', end_of_evening='23:00:00', start_of_night='24:00:00'):
    bins = [midnight, start_of_morning, start_of_afternoon, start_of_evening, end_of_evening, start_of_night]
    labels = ['Night', 'Morning', 'Morning', 'Night', 'Night']

    return pd.cut(
        pd.to_timedelta(dataframe),
        bins=list(map(pd.Timedelta, bins)),
        labels=labels, right=False, ordered=False
    )

df['CYCLE_PART'] = cycle_day_period(df['TIME'], '00:00:00', '06:00:00', '13:00:00', '18:00:00', '23:00:00', '24:00:00')

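**What I expect:** find the rows whose VALUE1, VALUE2, VALUE3 and VALUE4 values are identical or most similar. In addition, the MODEL column must be the same and the SOLD column must differ.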

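For example, I have the following data table:

| ID | VALUE1 | VALUE2 | VALUE3 | VALUE4 | MODEL | SOLD |
|----|--------|--------|--------|--------|-------|------|
| 25 | 50 | 88 | 32 | 81 | 1 | 0 |
| 25 | 80 | 22 | 19 | 22 | 2 | 0 |
| 25 | 100 | 44 | 72 | 54 | 1 | 0 |
| 18 | 99 | 24 | 29 | 22 | 2 | 1 |
| 18 | 55 | 64 | 46 | 68 | 1 | 1 |
| 18 | 44 | 89 | 115 | 23 | 2 | 1 |

I want rows 2 and 4 as the output for MODEL 2, and rows 1 and 5 as the output for MODEL 1. How can I achieve this output? I tried combining several boolean conditions, but every attempt ended in an error.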

wlzqhblo  #1

Try this. I haven't tested it.

import numpy as np
from scipy.spatial import distance_matrix

VALUE_COLS = ['VALUE1', 'VALUE2', 'VALUE3', 'VALUE4']

# Group by the column name (not a one-element list) so group_name is a scalar
model_groups = df.groupby('MODEL')

smallest = []
for group_name, df_model in model_groups:
    # Split each model into sold and unsold rows
    sold_1 = df_model.loc[df_model['SOLD'] == 1]
    sold_0 = df_model.loc[df_model['SOLD'] == 0]
    values_1 = sold_1[VALUE_COLS].to_numpy()
    values_0 = sold_0[VALUE_COLS].to_numpy()
    # mtrx[i, j] is the Euclidean distance between sold_1 row i and sold_0 row j
    mtrx = distance_matrix(values_1, values_0)
    # Position (i, j) of the smallest distance
    ij_min = np.unravel_index(mtrx.argmin(), mtrx.shape)
    smallest.append([group_name, sold_1.iloc[ij_min[0]], sold_0.iloc[ij_min[1]]])
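
As a rough check, the same idea can be run on the six-row sample table from the question. This is only a sketch: the names sample, VALUE_COLS and pairs are made up for illustration, and the index is set to 1..6 so the labels match the row numbers the question uses.

```python
import numpy as np
import pandas as pd
from scipy.spatial import distance_matrix

VALUE_COLS = ['VALUE1', 'VALUE2', 'VALUE3', 'VALUE4']  # illustrative helper name

# The sample table from the question, with 1-based row labels
sample = pd.DataFrame(
    [[25, 50, 88, 32, 81, 1, 0],
     [25, 80, 22, 19, 22, 2, 0],
     [25, 100, 44, 72, 54, 1, 0],
     [18, 99, 24, 29, 22, 2, 1],
     [18, 55, 64, 46, 68, 1, 1],
     [18, 44, 89, 115, 23, 2, 1]],
    columns=['ID'] + VALUE_COLS + ['MODEL', 'SOLD'],
    index=range(1, 7),
)

pairs = {}
for model, group in sample.groupby('MODEL'):
    sold_1 = group[group['SOLD'] == 1]
    sold_0 = group[group['SOLD'] == 0]
    # Pairwise Euclidean distances: rows are sold_1, columns are sold_0
    mtrx = distance_matrix(sold_1[VALUE_COLS].to_numpy(dtype=float),
                           sold_0[VALUE_COLS].to_numpy(dtype=float))
    i, j = np.unravel_index(mtrx.argmin(), mtrx.shape)
    pairs[model] = (sold_0.index[j], sold_1.index[i], mtrx[i, j])

for model, (unsold_row, sold_row, dist) in pairs.items():
    print(f"MODEL {model}: rows {unsold_row} and {sold_row}, distance {dist:.2f}")
# MODEL 1: rows 1 and 5, distance 31.08
# MODEL 2: rows 2 and 4, distance 21.56
```

On this sample the result agrees with the pairs the question asks for (rows 1 and 5 for MODEL 1, rows 2 and 4 for MODEL 2). Note that a model with no sold rows or no unsold rows would make argmin fail on an empty matrix, so real data may need a guard for that case.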

p4tfgftt  #2

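This should do it, but it is really slow because it computes the Euclidean distance between all pairs of rows. It should still demonstrate the basic idea. If you want something faster, you can look at the matrixprofile library. The result is stored in smallest as [[model_number1, smallest pair], [model_number2, smallest pair]].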

import numpy as np

VALUE_COLS = ['VALUE1', 'VALUE2', 'VALUE3', 'VALUE4']

# Group by the column name (not a one-element list) so group_name is a scalar
model_groups = df.groupby('MODEL')

def euclid_distance(a, b):
    return np.sqrt(np.sum((a - b) ** 2))

smallest = []
for group_name, df_model in model_groups:
    sold_1 = df_model.loc[df_model['SOLD'] == 1]
    sold_0 = df_model.loc[df_model['SOLD'] == 0]
    distances = []
    # Compare every sold row with every unsold row of the same model
    for _, row1 in sold_1.iterrows():
        for _, row2 in sold_0.iterrows():
            dist = euclid_distance(row1[VALUE_COLS].astype(float), row2[VALUE_COLS].astype(float))
            distances.append([row1, row2, dist])

    # Keep the pair with the smallest distance: [row1, row2, dist]
    smallest.append([group_name, min(distances, key=lambda x: x[2])])

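Here smallest[0][1][0] and smallest[0][1][1] give you the two closest rows for the first MODEL group (MODEL 1 in the question's data, since MODEL only takes the values 1 and 2), and smallest[0][1][2] is the distance between them.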
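
If the nested iterrows loops are too slow, one option that needs no extra library is to let NumPy broadcasting compute all pairwise distances at once. The following is a sketch rather than a tuned implementation: closest_pair_per_model and VALUE_COLS are hypothetical names, the random data only imitates the value ranges from the question, and the result keeps the same [model, [row1, row2, dist]] layout as smallest above.

```python
import numpy as np
import pandas as pd

VALUE_COLS = ['VALUE1', 'VALUE2', 'VALUE3', 'VALUE4']  # illustrative helper name

def closest_pair_per_model(df):
    """Return [[model, [sold_1_row, sold_0_row, distance]], ...]."""
    smallest = []
    for model, df_model in df.groupby('MODEL'):
        sold_1 = df_model[df_model['SOLD'] == 1]
        sold_0 = df_model[df_model['SOLD'] == 0]
        if sold_1.empty or sold_0.empty:
            continue  # no sold/unsold pair exists for this model
        a = sold_1[VALUE_COLS].to_numpy(dtype=float)  # shape (n1, 4)
        b = sold_0[VALUE_COLS].to_numpy(dtype=float)  # shape (n0, 4)
        # Broadcasting: (n1, 1, 4) - (1, n0, 4) -> (n1, n0, 4), then reduce to (n1, n0)
        dist = np.sqrt(((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=2))
        i, j = np.unravel_index(dist.argmin(), dist.shape)
        smallest.append([model, [sold_1.iloc[i], sold_0.iloc[j], dist[i, j]]])
    return smallest

# Assumed test data: random values in the same ranges as the question's example
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    'VALUE1': rng.integers(110, 140, n),
    'VALUE2': rng.integers(50, 60, n),
    'VALUE3': rng.integers(80, 100, n),
    'VALUE4': rng.integers(30, 50, n),
    'MODEL': rng.integers(1, 3, n),
    'SOLD': rng.integers(0, 2, n),
})

smallest = closest_pair_per_model(df)
for model, (row_sold, row_unsold, d) in smallest:
    print(model, row_sold.name, row_unsold.name, round(float(d), 3))
```

The intermediate array has n1 x n0 x 4 entries, so this trades memory for speed; for very large groups a chunked computation or a nearest-neighbour index would likely be a better fit.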
