确定谁拥有给定发货人80%以上的股份

5vf7fwbs  于 2022-10-23  发布在  其他
关注(0)|答案(4)|浏览(159)

我有个问题。我想调查我的假设是否正确,可以说receiver X经常收到consignor Y的邮件。为此,我将receiverconsignor分组。
然而,我现在想做的是,我想将所有receivers输出给自己,其中最频繁的consignor包含receiver中所有m1n 6o1p的80%以上。
例如,我们有receiver c,它从consignor 2, 3收到了什么。consignor 2仅1个数据包,consignor 3 4个数据包。因此,他总共收到了5个包。因此,consignor 3接受receiver c中所有consignor的80%。因此receiver c应在表中表示。这应该在下面为所有其他人完成。
简单地说,我希望所有receivers都具有给定consignor的80%以上。
我该怎么做?
我想要什么

receiver  consignor  count_with_most_consignor  count_all
       c          3                          4          5
       d          2                          4          5

Dataframe

consignor receiver
0           1        a
1           2        a
2           1        a
3           2        c
4           3        c
5           1        b
6           3        c
7           3        c
8           3        c
9           2        d
10          2        d
11          2        d
12          1        d
13          2        d

密码

import pandas as pd
d = {'consignor': [1, 2, 1, 2, 3, 1, 3, 3, 3, 2, 2, 2, 1, 2],  
     'receiver': ['a', 'a', 'a', 'c', 'c', 'b', 'c', 'c', 'c', 'd', 'd', 'd', 'd', 'd']}
df = pd.DataFrame(data=d)
print(df)
df_  = df.groupby(['receiver','consignor']).size().reset_index(name='counts').sort_values(by=['counts'], ascending=False)
print(df_)

X = 'c'
print(df_[df_['receiver'] == X].shape[1])
print(df_[df_['receiver'] == X])

出来

[OUT]
3
  receiver  consignor  counts
4        c          3       4
3        c          2       1
fzwojiic

fzwojiic1#

crosstab


# frequency table

s = pd.crosstab(df['receiver'], df['consignor'])

# agg columns along axis=1

d = {
    'consignor': s.idxmax(1),
    'count_with_most_consignor': s.max(1),
    'count_all': s.sum(1)
}

(
    s.assign(**d) # assign the agg columns
    .drop(s.columns, axis=1) # drop the unwanted columns
    .query('count_with_most_consignor / count_all >= .8') # filter the rows
)

后果

consignor  consignor  count_with_most_consignor  count_all
receiver                                                  
b                  1                          1          1
c                  3                          4          5
d                  2                          4          5
rkue9o1l

rkue9o1l2#

您只需使用value_counts(normalize=True)获取所有共享

In [13]: df.groupby("receiver").value_counts(normalize=True)
Out[13]: 
receiver  consignor
a         1            0.666667
          2            0.333333
b         1            1.000000
c         3            0.800000
          2            0.200000
d         2            0.800000
          1            0.200000
dtype: float64

然后过滤0.8以上的值,例如:

In [15]: res = df.groupby("receiver").value_counts(normalize=True).where(lambda x: x > 0.8).dropna()
Out[15]: 
receiver  consignor
b         1            1.0
c         3            0.8
d         2            0.8
dtype: float64

获取其余信息的最简单方法是分别计算它们

In [40]: count_with_most_consignor = df.value_counts().rename("count_with_most_consignor")
Out[41]: 
consignor  receiver
2          d           4
3          c           4
1          a           2
           b           1
           d           1
2          a           1
           c           1
Name: count_with_most_consignor, dtype: int64

In [42]: count_all = df.groupby("receiver").size().rename("count_all")

In [43]: count_all
Out[43]: 
receiver
a    3
b    1
c    5
d    5
Name: count_all, dtype: int64

并将其连接在一起。

In [39]: res.rename("share").to_frame().join(count_with_most_consignor).join(count_all).reset_index()
Out[39]: 
  receiver  consignor  share  count_with_most_consignor  count_all
0        b          1    1.0                          1          1
1        c          3    0.8                          4          5
2        d          2    0.8                          4          5
goucqfw6

goucqfw63#

我只能想出这种棘手的方法:)


# First, we create a group column

df["receiver+consignor"] = df["receiver"] + "-" + df["consignor"].astype("str")

# Then we count "receiver+consignor" values

df = df.merge(df["receiver+consignor"].value_counts(), left_on="receiver+consignor", right_index=True, suffixes=(None, "_count"))

# Now we count deliveries by receivers

df = df.merge(df["receiver"].value_counts(), left_on="receiver", right_index=True, suffixes=(None, "_count"))

# Now we can calculate the percentage

df["percentage"] = df["receiver+consignor_count"] * 100 / df["receiver_count"]

# At last, we can filter out what we need now

df_result = df[df["percentage"] >= 80].groupby(["receiver", "consignor", "receiver+consignor_count", "receiver_count"]).first().reset_index()[["receiver", "consignor", "receiver+consignor_count", "receiver_count"]]


小时

woobm2wo

woobm2wo4#

字典的简单解决方案如下:

import pandas as pd
d = {'consignor': [1, 2, 1, 2, 3, 1, 3, 3, 3, 2, 2, 2, 1, 2],
     'receiver': ['a', 'a', 'a', 'c', 'c', 'b', 'c', 'c', 'c', 'd', 'd', 'd', 'd', 'd']}
df = pd.DataFrame(data=d)
df_  = df.groupby(['receiver','consignor']).size().reset_index(name='counts').sort_values(by=['counts'], ascending=False)
df__ = df_[['receiver','counts']]
df___ = df__.groupby(['receiver']).aggregate('sum')
dic = df___.to_dict()
for row in range(len(df_)):
    if df_['counts'][row] / dic['counts'][df_['receiver'][row]] < 0.8:
        df_ = df_.drop(row)
print(df_)

输出:

receiver  consignor  counts
4        c          3       4
6        d          2       4
2        b          1       1

相关问题