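I have a dataset made up of the columns DATE_TIME, ID, VALUE1, VALUE2, VALUE3, VALUE4, MODEL, SOLD, INSPECTION, MODE, TIME and CYCLE_PART. The ID values are usually numeric, but some are abbreviations.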
import numpy as np
import pandas as pd
import random
df = pd.DataFrame({'DATE_TIME': pd.date_range('2022-11-01', '2022-11-06 23:00:00', freq='20min'),
'ID': [random.randrange(1, 20) for n in range(430)]})
df['VALUE1'] = [random.randrange(110, 140) for n in range(430)]
df['VALUE2'] = [random.randrange(50, 60) for n in range(430)]
df['VALUE3'] = [random.randrange(80, 100) for n in range(430)]
df['VALUE4'] = [random.randrange(30, 50) for n in range(430)]
df['MODEL'] = [random.randrange(1, 3) for n in range(430)]
df['SOLD'] = [random.randrange(0, 2) for n in range(430)]
df['INSPECTION'] = df['DATE_TIME'].dt.day
df['MODE'] = np.select([df['INSPECTION'] == 1, df['INSPECTION'].isin([2, 3])], ['A', 'B'], 'C')
df['TIME'] = df['DATE_TIME'].dt.time
# df['TIME'] = pd.to_timedelta(df['TIME'])
df['TIME'] = df['TIME'].astype('str')
# Create DAY Night columns only-------------------------------------------------------------------------
def cycle_day_period(dataframe: pd.DataFrame, midnight='00:00:00', start_of_morning='06:00:00',
                     start_of_afternoon='13:00:00',
                     start_of_evening='18:00:00', end_of_evening='23:00:00', start_of_night='24:00:00'):
    # Bin edges as time-of-day strings; each half-open interval [edge, next_edge) gets one label
    bins = [midnight, start_of_morning, start_of_afternoon, start_of_evening, end_of_evening, start_of_night]
    # Labels repeat, so the categories must be unordered (ordered=False)
    labels = ['Night', 'Morning', 'Morning', 'Night', 'Night']
    return pd.cut(
        pd.to_timedelta(dataframe),
        bins=list(map(pd.Timedelta, bins)),
        labels=labels, right=False, ordered=False
    )
df['CYCLE_PART'] = cycle_day_period(df['TIME'], '00:00:00', '06:00:00', '13:00:00', '18:00:00', '23:00:00', '24:00:00')
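My expectation: find the most similar or identical values across 'VALUE1', 'VALUE2', 'VALUE3' and 'VALUE4'. At the same time, the MODEL column must be the same and the SOLD column must be different.
So far, this is what I have come up with: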
import numpy as np
from scipy.spatial import distance_matrix
model_groups = df.groupby(by=['MODEL'])
smallest = []
for group_name, df_model in model_groups:
    # Split each MODEL group into sold and unsold rows
    sold_1 = df_model.loc[df_model['SOLD']==1]
    sold_0 = df_model.loc[df_model['SOLD']==0]
    values_1 = sold_1[['VALUE1','VALUE2','VALUE3','VALUE4']].to_numpy()
    values_0 = sold_0[['VALUE1','VALUE2','VALUE3','VALUE4']].to_numpy()
    # Pairwise Euclidean distances: rows are sold_1, columns are sold_0
    mtrx = distance_matrix(values_1, values_0)
    # Position (i, j) of the single smallest distance in this group
    ij_min = np.unravel_index(mtrx.argmin(), mtrx.shape)
    smallest.append([group_name,sold_1.iloc[ij_min[0]], sold_0.iloc[ij_min[1]]])
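This script only shows the 4 identical distances. How can I modify this code so that it finds the nearest point to a given point? I would like to pass an index or a row's values as input and get back the point that is closest or most similar to that input. However, the MODEL column must be the same and the SOLD column must be different.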
1 Answer
oxiaedzo #1
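I think you can solve this efficiently by computing Euclidean distances. That may sound a bit intimidating, but it is fairly simple to implement in pandas.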
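A minimal sketch of that idea, not necessarily the exact code this answer had in mind: take a row index as input, keep only the rows with the same MODEL and a different SOLD value, and rank them by Euclidean distance over the four value columns. The helper name `nearest_match`, its `k` parameter, and the `DISTANCE` column are assumptions made for illustration, and `demo` is a small made-up frame standing in for the question's `df`.

```python
import numpy as np
import pandas as pd

VALUE_COLS = ['VALUE1', 'VALUE2', 'VALUE3', 'VALUE4']


def nearest_match(df, idx, value_cols=VALUE_COLS, k=1):
    """Hypothetical helper: return the k rows closest to row `idx`,
    restricted to rows with the same MODEL and a different SOLD value."""
    row = df.loc[idx]
    # Candidates must share MODEL with the input row and differ in SOLD
    candidates = df[(df['MODEL'] == row['MODEL']) & (df['SOLD'] != row['SOLD'])]
    # Euclidean distance from the input row to every candidate
    diff = (candidates[value_cols].to_numpy(dtype=float)
            - row[value_cols].to_numpy(dtype=float))
    dist = np.sqrt((diff ** 2).sum(axis=1))
    # Smallest distance first; a stable sort keeps ties in their original order
    return (candidates.assign(DISTANCE=dist)
            .sort_values('DISTANCE', kind='stable')
            .head(k))


# Small made-up frame (an assumption) with the columns the lookup needs
demo = pd.DataFrame({
    'VALUE1': [120, 121, 135, 120, 110],
    'VALUE2': [55, 55, 58, 56, 50],
    'VALUE3': [90, 91, 99, 90, 80],
    'VALUE4': [40, 40, 49, 40, 30],
    'MODEL':  [1, 1, 1, 2, 1],
    'SOLD':   [0, 1, 1, 1, 0],
})

match = nearest_match(demo, 0)
print(match)
```

On the question's data the call would be `nearest_match(df, some_index)`. The result is empty when no row shares the MODEL with a different SOLD value, and `k` can be raised to list several near neighbours instead of one.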
Does this work for you?