Pandas: loop within data frame rows and return the matched results

kzipqqlq  posted on 2023-06-04  in Other

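Below is a data frame with two sets of values, A and B, in the column 'ID'. For each row, I need to find the lesser values in the previous rows of the same columns, 'MT' and 'Price'. If a previous row has a price greater than the current price, the loop should break and the same check should then be carried out for the following row. The matched values need to be populated in separate columns; a sample of the required output is shown below.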

Python
import pandas as pd

data = {'ID': ['A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B'],
        'Date': ['01-05-2023', '02-05-2023', '03-05-2023', '04-05-2023', '06-05-2023', '07-05-2023', '01-05-2023', '02-05-2023', '03-05-2023', '04-05-2023', '06-05-2023', '07-05-2023'],
        'MT': [2, 5, 10, 15, 20, 30, 2, 5, 10, 15, 20, 30],
        'Price': [300, 100.5, 200, 150.35, 250, 90, 300, 100.5, 200, 150.35, 250, 90]}

df = pd.DataFrame(data)

print(df)

Available Data frame:
*******************
   ID        Date  MT   Price
0   A  01-05-2023   2  300.00
1   A  02-05-2023   5  100.50
2   A  03-05-2023  10  200.00
3   A  04-05-2023  15  150.35
4   A  06-05-2023  20  250.00
5   A  07-05-2023  30   90.00
6   B  01-05-2023   2  300.00
7   B  02-05-2023   5  100.50
8   B  03-05-2023  10  200.00
9   B  04-05-2023  15  150.35
10  B  06-05-2023  20  250.00
11  B  07-05-2023  30   90.00

Output Required:
*******************
    ID  Date        MT  Price   Matched_Price     Date_Values
0   A   01-05-2023  2   300
1   A   02-05-2023  5   100.5
2   A   03-05-2023  10  200     100.5             02-05-2023
3   A   04-05-2023  15  150.35
4   A   06-05-2023  20  250     100.5,200,150.35  02-05-2023,03-05-2023,04-05-2023
5   A   07-05-2023  30  90
6   B   01-05-2023  2   300
7   B   02-05-2023  5   100.5
8   B   03-05-2023  10  200     100.5             02-05-2023
9   B   04-05-2023  15  150.35
10  B   06-05-2023  20  250     100.5,200,150.35  02-05-2023,03-05-2023,04-05-2023
11  B   07-05-2023  30  90

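The code below works well for identifying all the lesser values in the column, but the second part of the condition still needs to be applied: if a previous row has a price greater than the current price, the loop should break and the check should move on to the following row.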
for i, row in df.iterrows():
    dfa = df.iloc[:i, :]  # slice with rows above current
    dfa = dfa[(dfa.ID == row.ID) & (dfa.MT < row.MT) & (dfa.Price < row.Price)]  # matched rows
    df.loc[i, 'matched_Price'] = ','.join(map(str, dfa.Price))
    df.loc[i, 'matched_dates'] = ','.join(map(str, dfa.Date))

ne5o7dgx  #1

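To meet your requirement, I suggest a small change to your existing code. You need to stop considering earlier rows as soon as you find one whose price is higher than the current row's. One way to do that is to walk the earlier rows of the same ID from the most recent one backwards and break out of the loop at the first higher price.
Here is the complete answer in Python:

import pandas as pd

data = {'ID': ['A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B'],
        'Date': ['01-05-2023', '02-05-2023', '03-05-2023', '04-05-2023', '06-05-2023', '07-05-2023', '01-05-2023', '02-05-2023', '03-05-2023', '04-05-2023', '06-05-2023', '07-05-2023'],
        'MT': [2, 5, 10, 15, 20, 30, 2, 5, 10, 15, 20, 30],
        'Price': [300, 100.5, 200, 150.35, 250, 90, 300, 100.5, 200, 150.35, 250, 90]}

df = pd.DataFrame(data)

matched_prices = []
matched_dates = []

for i, row in df.iterrows():
    dfa = df.iloc[:i, :]         # slice with rows above current
    dfa = dfa[dfa.ID == row.ID]  # filter by ID
    dfa = dfa.iloc[::-1]         # nearest earlier row first
    temp_prices = []
    temp_dates = []
    for j, rowa in dfa.iterrows():
        if rowa.Price > row.Price:
            break  # a higher price ends the backward scan
        if rowa.MT < row.MT and rowa.Price < row.Price:
            temp_prices.append(rowa.Price)
            temp_dates.append(rowa.Date)
    # the scan ran backwards, so reverse to restore chronological order
    matched_prices.append(','.join(map(str, temp_prices[::-1])))
    matched_dates.append(','.join(map(str, temp_dates[::-1])))

df['matched_Price'] = matched_prices
df['matched_dates'] = matched_dates

print(df)

The code iterates over every row of the DataFrame and takes the rows above it that share its ID, nearest first. Each of those rows whose 'MT' and 'Price' are both lower than the current row's is added to temporary lists. As soon as a row with a higher 'Price' is found, the inner loop breaks; the temporary lists are then put back in chronological order, joined, and appended to the matched lists. After all rows have been processed, the matched lists are added to the DataFrame as new columns.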
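
The same backward scan can also be written as a helper that runs once per ID group, which avoids re-filtering the whole frame for every row. This is only a minimal sketch: `scan_back` is a hypothetical helper name, the output column names are taken from the question's required output, and it assumes the rows of each ID are already in chronological order.

```python
import pandas as pd

data = {
    "ID": ["A"] * 6 + ["B"] * 6,
    "Date": ["01-05-2023", "02-05-2023", "03-05-2023",
             "04-05-2023", "06-05-2023", "07-05-2023"] * 2,
    "MT": [2, 5, 10, 15, 20, 30] * 2,
    "Price": [300, 100.5, 200, 150.35, 250, 90] * 2,
}
df = pd.DataFrame(data)


def scan_back(group: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: for each row, collect earlier rows with a lower
    MT and Price, scanning backwards and stopping at the first higher Price."""
    prices = group["Price"].tolist()
    mts = group["MT"].tolist()
    dates = group["Date"].tolist()
    out_prices, out_dates = [], []
    for i in range(len(group)):
        hits = []
        for j in range(i - 1, -1, -1):   # nearest earlier row first
            if prices[j] > prices[i]:
                break                     # higher price: stop scanning
            if prices[j] < prices[i] and mts[j] < mts[i]:
                hits.append(j)
        hits.reverse()                    # back to chronological order
        out_prices.append(",".join(str(prices[j]) for j in hits))
        out_dates.append(",".join(dates[j] for j in hits))
    # Reuse the group's index so the result lines up with the original rows
    return pd.DataFrame(
        {"Matched_Price": out_prices, "Date_Values": out_dates},
        index=group.index,
    )


matched = pd.concat([scan_back(g) for _, g in df.groupby("ID", sort=False)])
result = df.join(matched)
print(result)
```

Because 'Price' is a float column, whole numbers are rendered as `200.0` rather than `200` in the joined strings; format the values explicitly if that matters.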

ovfsdjhp  #2

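I generally prefer to do this kind of thing in an .apply() function rather than in a loop. There is nothing wrong with using a loop, but I find it makes things harder to work with in the pandas world. It is also usually slower than apply or, better still, a vectorized solution.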
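
How the three styles compare depends on the data and the pandas version, so it is worth measuring rather than assuming. Here is a rough sketch on a hypothetical toy frame (unrelated to the question's data) that computes the same column with an `iterrows` loop, a row-wise `apply`, and plain column arithmetic:

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical toy data, only used to compare the three styles
rng = np.random.default_rng(0)
toy = pd.DataFrame({"a": rng.random(2_000), "b": rng.random(2_000)})


def with_loop(frame):
    # Explicit Python loop over the rows
    return pd.Series(
        [row["a"] + row["b"] for _, row in frame.iterrows()],
        index=frame.index,
    )


def with_apply(frame):
    # Row-wise apply: still calls a Python function once per row
    return frame.apply(lambda row: row["a"] + row["b"], axis=1)


def vectorized(frame):
    # Column arithmetic: the work happens inside NumPy
    return frame["a"] + frame["b"]


for fn in (with_loop, with_apply, vectorized):
    seconds = timeit.timeit(lambda: fn(toy), number=3)
    print(f"{fn.__name__:>10}: {seconds:.4f} s for 3 runs")
```

Note that a row-wise `apply` still runs Python code for every row, so it is the column arithmetic that is vectorized in the strict sense.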
Here is what I came up with:

import pandas as pd

df = {'ID': ['A', 'A', 'A', 'A','A','A','B', 'B', 'B', 'B','B','B'],
        'Date': ['01-05-2023','02-05-2023', '03-05-2023', '04-05-2023', '06-05-2023','07-05-2023','01-05-2023','02-05-2023', '03-05-2023', '04-05-2023', '06-05-2023','07-05-2023'],
      'MT': [2, 5, 10, 15, 20, 30,2, 5, 10, 15, 20, 30],
        'Price': [300, 100.5, 200, 150.35, 250, 90,300, 100.5, 200, 150.35, 250, 90,]}

df = pd.DataFrame(df)

# Store in a copy so we can manipulate it and maintain the source
mod_df = df.copy()

# Shift gives us a series "pushed" along one entry. Allows comparison with the previous entry
mod_df["shift"] = df["Price"].shift()
# diff is where we will evaluate the dataframe. Where this diff is positive within
# a product ID is the rows you want to focus on.
mod_df["diff"] = mod_df["Price"] - mod_df["shift"]

# Function for doing the heavy lifting
def pull_values_w_price_less_than(row: pd.Series, column: str):
    # The case that the price decreased or stayed the same
    if row["diff"] <= 0:
        return pd.NA
    # Price increased (where we want to eval)
    else:
        # Entries relevant to this row
        relevant_df = mod_df[
            # Product ID is the same
            (mod_df["ID"] == row["ID"])
            # Price is less than this one
            & (mod_df["Price"] < row["Price"])
            # And it happened before this one
            & (mod_df["Date"] < row["Date"])
        ]

        # The matched values (be it Price or Date) determined by the passed arg
        # column placed in a list
        matched_values = relevant_df[column].to_list()

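        # If nothing was matched, return NA rather than an empty list. This
        # happens on the first row and at ID boundaries, where the shifted
        # price is missing or comes from the previous ID, so the row reaches
        # this branch even though no earlier row of the same ID qualifies.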
        if len(matched_values) == 0:
            return pd.NA
        # Return the values to put in the row
        return matched_values

# Do this for the prices
mod_df["Matched_Price"] = mod_df.apply(
    pull_values_w_price_less_than, args=("Price",), axis=1
)
# And the dates
mod_df["Date_Values"] = mod_df.apply(
    pull_values_w_price_less_than, args=("Date",), axis=1
)

# Remove our helper columns
output = mod_df.drop(columns=["shift", "diff"])

# View the df (optional ofc)
output
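
Two details in the code above are worth keeping in mind if the data changes. The dates are compared as day-first strings, which only happens to work while every row falls in the same month, and the shift is taken over the whole column rather than per ID. A small sketch of both points, using made-up toy values rather than the question's data:

```python
import pandas as pd

# Day-first strings do not sort chronologically across months
print("31-05-2023" < "01-06-2023")    # False, although 31 May precedes 1 June

# Parsing with an explicit format gives real timestamps to compare
dates = pd.to_datetime(pd.Series(["31-05-2023", "01-06-2023"]), format="%d-%m-%Y")
print(dates.iloc[0] < dates.iloc[1])  # True

# Shifting per ID: the first row of each ID gets NaN instead of
# the last price of the previous ID
toy = pd.DataFrame({
    "ID": ["A", "A", "B", "B"],
    "Price": [300.0, 90.0, 300.0, 100.5],
})
toy["prev_price"] = toy.groupby("ID")["Price"].shift()
toy["diff"] = toy["Price"] - toy["prev_price"]
print(toy)
```

With a per-ID shift, the first row of every ID has a missing `diff`, so that case needs an explicit check instead of relying on the empty-match fallback.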

I also put the values into lists rather than comma-separated strings, but you can easily convert those columns with this function:

def list_to_comma_string(value_list) -> str:
    # If it was already null we cant change that
    if value_list is pd.NA:
        return pd.NA
    
    # list comprehension to change all the values to strings in the list
    joiner = [str(x) for x in value_list]

    # The most efficient way to combine strings in python
    return ','.join(joiner)

# Convert both columns
output['Matched_Price'] = output['Matched_Price'].apply(list_to_comma_string)
output['Date_Values'] = output['Date_Values'].apply(list_to_comma_string)

# Again view the df
output
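
If the columns might contain other kinds of missing values (`None` or `NaN` rather than `pd.NA`), one alternative is to test for the list instead of testing for the missing marker. This is a small sketch on a made-up Series; `join_if_list` is a hypothetical name, not part of pandas:

```python
import pandas as pd


def join_if_list(value, sep=","):
    # Join real lists; pass anything else (pd.NA, None, NaN) through unchanged
    if isinstance(value, list):
        return sep.join(str(x) for x in value)
    return value


# Made-up values in the same shape as the Matched_Price column
s = pd.Series([pd.NA, [100.5], [100.5, 200.0, 150.35]], dtype="object")
joined = s.apply(join_if_list)
print(joined)
```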

Hope this helps!
