Setting DataFrame values to the sum of values between dates using a lambda function

nbysray5  posted on 2021-09-29  in Java

I have two DataFrames, df1 and df2. df1 holds a uniform date range, with a second column whose default value is 0. So it looks like this:

df1: 
     date          Val
0    2020-02-01     0
1    2020-02-02     0
2    2020-02-03     0
.
.
.

Meanwhile, df2 only records the events that actually occurred:

df2: 
     date                   Val
0    2020-02-01-10:00:00     98
1    2020-02-01-13:54:09     55
2    2020-02-02-11:33:17     32         
.
.
.

I want df1.Val[i] to be the sum of all values in df2 whose dates fall between df1.date[i] and df1.date[i+1].
I wrote the following lambda function, but it does not work:

df1['val'] = df1.apply( lambda row: df2[ (df2.date < df1.date[row.index])].sum() )

Any help would be appreciated.


busg9geu1#

Use pd.cut:

>>> df2.groupby(pd.cut(df2['date'], bins=df1['date'], labels=df1['date'][:-1])) \
       ['Val'].sum().reset_index()

        date  Val
0 2020-02-01  153
1 2020-02-02   32
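
The snippet above returns a new frame. To write the sums back into df1 instead, as the question asks, a minimal sketch could look like the following. It rebuilds the question's sample data, and it makes two assumptions that are not in the original answer: the timestamps are parsed with an explicit format string, and each interval is treated as half-open, [date[i], date[i+1]), by passing right=False.

```python
import pandas as pd

# Sample data from the question (three boundary dates, three events).
df1 = pd.DataFrame({
    "date": pd.to_datetime(["2020-02-01", "2020-02-02", "2020-02-03"]),
    "Val": 0,
})
df2 = pd.DataFrame({
    "date": pd.to_datetime(
        ["2020-02-01-10:00:00", "2020-02-01-13:54:09", "2020-02-02-11:33:17"],
        format="%Y-%m-%d-%H:%M:%S",  # assumption: dash between date and time
    ),
    "Val": [98, 55, 32],
})

# labels=False returns the integer position of the bin each event falls in,
# which lines up with the row positions of df1.
# right=False (assumption) makes each bin [date[i], date[i+1]).
bin_pos = pd.cut(df2["date"], bins=df1["date"], labels=False, right=False)

# Sum the events per bin, then align with df1; bins without events get 0.
sums = df2.groupby(bin_pos)["Val"].sum()
df1["Val"] = sums.reindex(df1.index, fill_value=0)

print(df1)
```

The last row of df1 has no following date, so no interval starts there and it stays at 0.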

l0oc07j22#

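Here is some code with explanations. I know there are faster and simpler ways, but this one is easy to understand. Just replace the dummy data with your own.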

import pandas as pd
import numpy as np

if __name__ ==  "__main__" :
    # create pandas dataframe with custom series input with column names date and Val
    df1 = pd.DataFrame({'date': 
                        pd.date_range(start='1/1/2000', end='31/12/2000',periods=8, normalize=True), 
                        'Val': np.zeros(8)})
    df2 = pd.DataFrame({'date': 
                        pd.date_range(start='1/1/2000', end='31/12/2000', periods=8, normalize=True), 
                        'Val': np.random.randint(0, 100, 8)})

    # Goal: df1.Val[i] should be the sum of all values in df2 whose dates
    # fall between df1.date[i] and df1.date[i+1].
    # The sums are written back into df1['Val'], so df1 keeps its rows, columns and index.
    print('-' * 50)
    print('df1')
    print(df1)
    print('-' * 50)
    print('df2')
    print(df2)

    for i in range(0, len(df1) - 1):
        # add the sum of all values in df2 up to and including the end date of the interval (row i + 1)
        df1.loc[i, 'Val'] += df2.loc[df2['date']
                                    <= df1['date'][i+1], 'Val'].sum()
        # subtract the sum of all values strictly before the start date (row i),
        # leaving only the values inside [date[i], date[i+1]]
        df1.loc[i, 'Val'] -= df2.loc[df2['date']
                                     < df1['date'][i], 'Val'].sum()

    print('-' * 50)
    print('df1')
    print(df1)
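
The loop computes "sum up to the end date minus sum up to the start date" for every row. The same idea can be sketched without a Python loop by using a running total and np.searchsorted. This is only an illustration on the question's sample data, not part of the original answer; it assumes half-open intervals [date[i], date[i+1]) and an explicit timestamp format.

```python
import numpy as np
import pandas as pd

# Sample data from the question.
df1 = pd.DataFrame({
    "date": pd.to_datetime(["2020-02-01", "2020-02-02", "2020-02-03"]),
    "Val": 0,
})
df2 = pd.DataFrame({
    "date": pd.to_datetime(
        ["2020-02-01-10:00:00", "2020-02-01-13:54:09", "2020-02-02-11:33:17"],
        format="%Y-%m-%d-%H:%M:%S",  # assumption: dash between date and time
    ),
    "Val": [98, 55, 32],
})

# Sort the events and build a running total with a leading 0,
# so prefix[k] is the sum of the first k events.
events = df2.sort_values("date")
prefix = np.concatenate(([0], events["Val"].to_numpy().cumsum()))

# For each boundary date, count the events that happened strictly before it.
counts = np.searchsorted(
    events["date"].to_numpy(), df1["date"].to_numpy(), side="left"
)

# Sum inside [date[i], date[i+1]) = total before the end minus total before the start.
df1.loc[df1.index[:-1], "Val"] = prefix[counts[1:]] - prefix[counts[:-1]]

print(df1)
```

Each event is looked up once via binary search, instead of df2 being scanned twice per row of df1.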

pkbketx93#

I can recommend duckdb for this task:

import pandas as pd
import duckdb

df1 = pd.DataFrame()
df2 = pd.DataFrame()

df1['date'] = ['2020-02-01', '2020-02-02', '2020-02-03']
df1['Val'] = [0,0,0]
df1['date'] = pd.to_datetime(df1['date'])
df1.loc[0:len(df1)-2, 'date_2'] = df1.loc[1:len(df1)-1, 'date'].values

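df2['date'] = ['2020-02-01-10:00:00', '2020-02-01-13:54:09', '2020-02-02-11:33:17']
df2['Val'] = [98,55,32]
# the dash between date and time is non-standard, so give the format explicitly
df2['date'] = pd.to_datetime(df2['date'], format='%Y-%m-%d-%H:%M:%S')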

test_df = duckdb.query(
    """select t.date, sum(t.Val) as Val
    from (
       select t1.date, t2.Val, t2.date as date_validation 
       from df1 as t1 
       left join df2 as t2 on t1.date < t2.date and t1.date_2 >= t2.date
       ) t 
       group by t.date
       order by t.date"""
).to_df()
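
If adding duckdb as a dependency is not an option, the same range join can be approximated in pandas alone with a cross join followed by a filter. The sketch below is an assumption-laden illustration, not the answer author's method: the column name date_event and the variables pairs, in_range and result are made up for the example, and the cross join creates len(df1) * len(df2) rows, so it only suits small frames.

```python
import pandas as pd

# Interval boundaries; date_2 is the end of each interval (NaT on the last row).
df1 = pd.DataFrame(
    {"date": pd.to_datetime(["2020-02-01", "2020-02-02", "2020-02-03"])}
)
df1["date_2"] = df1["date"].shift(-1)

df2 = pd.DataFrame({
    "date": pd.to_datetime(
        ["2020-02-01-10:00:00", "2020-02-01-13:54:09", "2020-02-02-11:33:17"],
        format="%Y-%m-%d-%H:%M:%S",  # assumption: dash between date and time
    ),
    "Val": [98, 55, 32],
})

# Pair every interval with every event; the event's date column is renamed
# to date_event (hypothetical name) to avoid clashing with df1's date.
pairs = df1.merge(df2, how="cross", suffixes=("", "_event"))

# Same condition as the SQL join: date < event date <= date_2.
# Comparisons against NaT are False, so the last row matches nothing.
in_range = (pairs["date"] < pairs["date_event"]) & (
    pairs["date_event"] <= pairs["date_2"]
)

# Sum per interval start, then bring back intervals without any event (NaN).
sums = pairs[in_range].groupby("date")["Val"].sum()
result = sums.reindex(pd.Index(df1["date"], name="date")).reset_index()

print(result)
```

As with the left join in the SQL version, intervals without events come back as missing values rather than 0; result["Val"].fillna(0) turns them into zeros if that is preferred.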
