pandas: How to fill a column with a value if a condition is met in 2+ other columns

hwamh0ep  posted on 2023-02-20 in Other

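My DataFrame looks like the table below. It has 6 antibiotic columns, each holding "Yes" or "No" depending on whether that antibiotic was used.
| AZITH | CLIN | CFTX | METRO | CFTN | DOXY | TREATED |
| --- | --- | --- | --- | --- | --- | --- |
| Yes | Yes | No | No | No | No | |
| No | Yes | No | Yes | No | No | |
| Yes | Yes | No | No | No | No | |
| No | No | No | No | No | No | |
| Yes | Yes | Yes | Yes | Yes | Yes | |
| No | Yes | Yes | Yes | No | Yes | |
| No | No | No | No | No | Yes | |
| No | No | No | No | No | No | |
| Yes | Yes | Yes | No | No | No | |
| Yes | No | Yes | Yes | No | No | |
| Yes | No | No | No | Yes | No | |
| No | No | Yes | Yes | No | Yes | |
| No | No | No | No | Yes | Yes | |
| No | No | No | Yes | No | Yes | |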
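I want to fill the column "TREATED" with "True" if specific combinations of the antibiotic columns contain "Yes". If the condition is not met, I want to fill the "TREATED" column with "False".
if ['AZITH'] & ['CLIN'] == 'Yes' |
['AZITH'] & ['CFTX'] & ['CLIN'] == 'Yes' |
['AZITH'] & ['CFTX'] & ['METRO'] == 'Yes' |
['AZITH'] & ['CFTN'] == 'Yes' |
['CFTX'] & ['DOXY'] & ['METRO'] == 'Yes' |
['CFTN'] & ['DOXY'] == 'Yes' |
['DOXY'] & ['METRO'] == 'Yes',
then return "True" in column "TREATED"
else "False"
What I had in mind was some kind of if statement or a lambda function, but I am having trouble with it.
This must not be limited to exactly the combinations above. It should also cover, for example, the case where all 6 drugs were given. In that case it should return "True", because the condition of at least 2 of the treatment drugs having been given is met.
The desired output is below: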
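| AZITH | CLIN | CFTX | METRO | CFTN | DOXY | TREATED |
| --- | --- | --- | --- | --- | --- | --- |
| Yes | Yes | No | No | No | No | Yes |
| No | Yes | No | Yes | No | No | No |
| Yes | Yes | No | No | No | No | Yes |
| No | No | No | No | No | No | No |
| Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| No | Yes | Yes | Yes | No | Yes | Yes |
| No | No | No | No | No | Yes | No |
| No | No | No | No | No | No | No |
| Yes | Yes | Yes | No | No | No | Yes |
| Yes | No | Yes | Yes | No | No | Yes |
| Yes | No | No | No | Yes | No | Yes |
| No | No | Yes | Yes | No | Yes | Yes |
| No | No | No | No | Yes | Yes | Yes |
| No | No | No | Yes | No | Yes | Yes |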


kiayqfof1#

Using the DataFrame you provided:

import pandas as pd

df = pd.DataFrame(
    {
        "AZITH": [
            "Yes",
            "No",
            "Yes",
            "No",
            "Yes",
            "No",
            "No",
            "No",
            "Yes",
            "Yes",
            "Yes",
            "No",
            "No",
            "No",
        ],
        "CLIN": [
            "Yes",
            "Yes",
            "Yes",
            "No",
            "Yes",
            "Yes",
            "No",
            "No",
            "Yes",
            "No",
            "No",
            "No",
            "No",
            "No",
        ],
        "CFTX": [
            "No",
            "No",
            "No",
            "No",
            "Yes",
            "Yes",
            "No",
            "No",
            "Yes",
            "Yes",
            "No",
            "Yes",
            "No",
            "No",
        ],
        "METRO": [
            "No",
            "Yes",
            "No",
            "No",
            "Yes",
            "Yes",
            "No",
            "No",
            "No",
            "Yes",
            "No",
            "Yes",
            "No",
            "Yes",
        ],
        "CFTN": [
            "No",
            "No",
            "No",
            "No",
            "Yes",
            "No",
            "No",
            "No",
            "No",
            "No",
            "Yes",
            "No",
            "Yes",
            "No",
        ],
        "DOXY": [
            "No",
            "No",
            "No",
            "No",
            "Yes",
            "Yes",
            "Yes",
            "No",
            "No",
            "No",
            "No",
            "Yes",
            "Yes",
            "Yes",
        ],
    }
)

Here is one way to do it:

mask = (
    ((df["AZITH"] == "Yes") & (df["CLIN"] == "Yes"))
    | ((df["AZITH"] == "Yes") & (df["CLIN"] == "Yes") & (df["CFTX"] == "Yes"))
    | ((df["AZITH"] == "Yes") & (df["CFTX"] == "Yes") & (df["METRO"] == "Yes"))
    | ((df["AZITH"] == "Yes") & (df["CFTN"] == "Yes"))
    | ((df["CFTX"] == "Yes") & (df["DOXY"] == "Yes") & (df["METRO"] == "Yes"))
    | ((df["CFTN"] == "Yes") & (df["DOXY"] == "Yes"))
    | ((df["DOXY"] == "Yes") & (df["METRO"] == "Yes"))
)
df.loc[mask, "TREATED"] = "Yes"
df = df.fillna("No")

Then:

print(df)
# Output
   AZITH CLIN CFTX METRO CFTN DOXY TREATED
0    Yes  Yes   No    No   No   No     Yes
1     No  Yes   No   Yes   No   No      No
2    Yes  Yes   No    No   No   No     Yes
3     No   No   No    No   No   No      No
4    Yes  Yes  Yes   Yes  Yes  Yes     Yes
5     No  Yes  Yes   Yes   No  Yes     Yes
6     No   No   No    No   No  Yes      No
7     No   No   No    No   No   No      No
8    Yes  Yes  Yes    No   No   No     Yes
9    Yes   No  Yes   Yes   No   No     Yes
10   Yes   No   No    No  Yes   No     Yes
11    No   No  Yes   Yes   No  Yes     Yes
12    No   No   No    No  Yes  Yes     Yes
13    No   No   No   Yes   No  Yes     Yes
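
One caveat with the approach above: `df.fillna("No")` fills missing values in every column, not only in `TREATED`. As a minimal sketch of an alternative, assuming the same column names, `numpy.where` can write both outcomes in one step. The three-row frame and the shortened two-rule mask below are illustrative only, not the full data or rule set from the question.

```python
import numpy as np
import pandas as pd

# Illustrative three-row frame (not the full data from the question).
df = pd.DataFrame(
    {
        "AZITH": ["Yes", "No", "No"],
        "CLIN": ["Yes", "Yes", "No"],
        "CFTX": ["No", "No", "No"],
        "METRO": ["No", "Yes", "Yes"],
        "CFTN": ["No", "No", "No"],
        "DOXY": ["No", "No", "Yes"],
    }
)

# Convert to booleans once so each condition stays short.
given = df.eq("Yes")

# Only two of the seven rules are spelled out here to keep the sketch short.
mask = (given["AZITH"] & given["CLIN"]) | (given["DOXY"] & given["METRO"])

# np.where assigns "Yes" or "No" to every row, so no fillna is needed.
df["TREATED"] = np.where(mask, "Yes", "No")
print(df["TREATED"].tolist())
```

The remaining five rules would be added to `mask` in the same way, joined with `|`.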

ny6fqffe2#

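This is a bit abstract, but you could use bit flags to represent each Yes (True), assign each one a binary value, and then do the math in your if statements.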
https://dietertack.medium.com/using-bit-flags-in-c-d39ec6e30f08
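
The linked article is about C, so here is a hedged Python sketch of how the bit-flag idea might look with pandas. The names `BITS`, `to_mask` and `VALID` are hypothetical helpers introduced for illustration, and the three-row frame is not the full data from the question.

```python
import pandas as pd

COLS = ["AZITH", "CLIN", "CFTX", "METRO", "CFTN", "DOXY"]

# One bit per drug: AZITH=1, CLIN=2, CFTX=4, METRO=8, CFTN=16, DOXY=32.
BITS = {name: 1 << i for i, name in enumerate(COLS)}


def to_mask(*names):
    """Combine the bits of the named drugs into one integer mask."""
    return sum(BITS[n] for n in names)


# The seven valid combinations from the question, each as a bitmask.
VALID = [
    to_mask("AZITH", "CLIN"),
    to_mask("AZITH", "CFTX", "CLIN"),
    to_mask("AZITH", "CFTX", "METRO"),
    to_mask("AZITH", "CFTN"),
    to_mask("CFTX", "DOXY", "METRO"),
    to_mask("CFTN", "DOXY"),
    to_mask("DOXY", "METRO"),
]

# Illustrative three-row frame (not the full data from the question).
df = pd.DataFrame(
    {
        "AZITH": ["Yes", "No", "No"],
        "CLIN": ["Yes", "Yes", "Yes"],
        "CFTX": ["No", "No", "Yes"],
        "METRO": ["No", "Yes", "Yes"],
        "CFTN": ["No", "No", "No"],
        "DOXY": ["No", "No", "Yes"],
    }
)

# Encode each row as an integer by summing the bits of the drugs marked "Yes".
weights = pd.Series(BITS)
codes = df[COLS].eq("Yes").astype(int).dot(weights)

# A row counts as treated if every bit of at least one valid mask is set.
df["TREATED"] = [
    "Yes" if any((code & v) == v for v in VALID) else "No" for code in codes
]
print(codes.tolist())
print(df["TREATED"].tolist())
```

The test `(code & v) == v` accepts extra drugs on top of a valid combination, which is what the question asks for in the all-six case.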


yvfmudvl3#

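You can use set operations: first aggregate each row into the set of drugs given, then check whether that set is a superset of any of the valid combinations: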

valid_treatments = [{'AZITH', 'CLIN'}, {'AZITH', 'CFTX', 'CLIN'},
                    {'AZITH', 'CFTX', 'METRO'}, {'AZITH', 'CFTN'},
                    {'CFTX', 'DOXY', 'METRO'}, {'CFTN', 'DOXY'},
                    {'DOXY', 'METRO'},
                   ]

def is_valid(row):
    combination = set(df.columns[row])
    return 'Yes' if any(
               combination.issuperset(v)
               for v in valid_treatments
           ) else 'No'

out = df.assign(TREATED=df.eq('Yes').apply(is_valid, axis=1))

Output:

   AZITH CLIN CFTX METRO CFTN DOXY TREATED
0    Yes  Yes   No    No   No   No     Yes
1     No  Yes   No   Yes   No   No      No
2    Yes  Yes   No    No   No   No     Yes
3     No   No   No    No   No   No      No
4    Yes  Yes  Yes   Yes  Yes  Yes     Yes
5     No  Yes  Yes   Yes   No  Yes     Yes
6     No   No   No    No   No  Yes      No
7     No   No   No    No   No   No      No
8    Yes  Yes  Yes    No   No   No     Yes
9    Yes   No  Yes   Yes   No   No     Yes
10   Yes   No   No    No  Yes   No     Yes
11    No   No  Yes   Yes   No  Yes     Yes
12    No   No   No    No  Yes  Yes     Yes
13    No   No   No   Yes   No  Yes     Yes
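
A small sketch of what the superset check implies, assuming the same `valid_treatments` list: the all-six case is covered automatically, and any rule that already contains another rule is redundant. The names `minimal` and `covers` are hypothetical and used only for illustration.

```python
valid_treatments = [
    {"AZITH", "CLIN"},
    {"AZITH", "CFTX", "CLIN"},
    {"AZITH", "CFTX", "METRO"},
    {"AZITH", "CFTN"},
    {"CFTX", "DOXY", "METRO"},
    {"CFTN", "DOXY"},
    {"DOXY", "METRO"},
]

# Keep a rule only if no other rule is a proper subset of it
# (`a < b` on sets means "a is a proper subset of b").
minimal = [
    v for v in valid_treatments
    if not any(other < v for other in valid_treatments)
]


def covers(given):
    """Return True if the given drugs include at least one minimal rule."""
    return any(given >= v for v in minimal)


all_six = {"AZITH", "CLIN", "CFTX", "METRO", "CFTN", "DOXY"}
print(len(minimal))                # 5 rules remain
print(covers(all_six))             # True
print(covers({"CLIN", "METRO"}))   # False
```

Here `{"AZITH", "CFTX", "CLIN"}` and `{"CFTX", "DOXY", "METRO"}` drop out because they contain `{"AZITH", "CLIN"}` and `{"DOXY", "METRO"}` respectively. The result is the same with either list, so pruning is optional.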

ubby3x7f4#

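My answer is an attempt to vectorize this solution. I did get the superset idea from mozway, though. Before that, I did not know how to handle all of the combinations.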

import numpy as np
import pandas as pd
import itertools

df = pd.DataFrame({ "AZITH": ["Yes","No","Yes","No","Yes","No","No","No","Yes","Yes","Yes","No","No","No",],
"CLIN": ["Yes","Yes","Yes","No","Yes","Yes","No","No","Yes","No","No","No","No","No",],
"CFTX": ["No","No","No","No","Yes","Yes","No","No","Yes","Yes","No","Yes","No","No",],
"METRO": ["No","Yes","No","No","Yes","Yes","No","No","No","Yes","No","Yes","No","Yes",],
"CFTN": ["No","No","No","No","Yes","No","No","No","No","No","Yes","No","Yes","No",],
"DOXY": ["No","No","No","No","Yes","Yes","Yes","No","No","No","No","Yes","Yes","Yes",]})
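# Each row is one valid combination; columns follow the DataFrame column order.
combos = np.array([[1,1,0,0,0,0],[1,1,1,0,0,0],[1,0,1,1,0,0],[1,0,0,0,1,0],[0,0,1,1,0,1],[0,0,0,0,1,1],[0,0,0,1,0,1]])
# Encode Yes/No as 1/0 so the rows can be used in a dot product.
df = df.replace("Yes",1)
df = df.replace("No",0)
# Collect index tuples for every subset of the base combinations (sizes 0 to 6).
c = []
for l in range(len(combos)):
    c.extend(itertools.combinations(range(len(combos)),l))
# Add the element-wise sum of each subset; c[0] is the empty subset, so skip it.
all_combos = combos
for combo in c[1:]:
    combined = np.sum(combos[combo,:],axis=0)
    all_combos = np.vstack([all_combos,combined])

# Clip to 0/1, drop duplicates, and record how many drugs each combination needs.
all_combos[all_combos!=0]=1
all_combos = np.unique(all_combos,axis=0)
combo_sum = all_combos.sum(axis=1)
# Penalize drugs outside a combination: every 0 becomes -1.
all_combos[all_combos==0]=-1
new_df = df.dot(all_combos.transpose())
# A score below the expected sum means the row does not match that combination.
for i,x in enumerate(combo_sum):
    new_df.loc[new_df[i]<x,i] = 0

# Mark matches with 1, then flag rows that match at least one combination.
new_df[new_df>0]=1
new_df["res"] = new_df.sum(axis=1)
new_df.loc[new_df.res>0,"res"] = True
new_df.loc[new_df.res==0,"res"] = False
df["res"] = new_df["res"]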

Output:

    AZITH  CLIN  CFTX  METRO  CFTN  DOXY    res
0       1     1     0      0     0     0   True
1       0     1     0      1     0     0  False
2       1     1     0      0     0     0   True
3       0     0     0      0     0     0  False
4       1     1     1      1     1     1   True
5       0     1     1      1     0     1  False
6       0     0     0      0     0     1  False
7       0     0     0      0     0     0  False
8       1     1     1      0     0     0   True
9       1     0     1      1     0     0   True
10      1     0     0      0     1     0   True
11      0     0     1      1     0     1   True
12      0     0     0      0     1     1   True
13      0     0     0      1     0     1   True

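The general explanation of the code is that I create a NumPy array containing every acceptable combination, including combinations of combinations (the sums of two or more of the base combinations).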

np.unique(all_combos,axis=0)
Out[38]: 
array([[0, 0, 0, 0, 1, 1],
       [0, 0, 0, 1, 0, 1],
       [0, 0, 0, 1, 1, 1],
       [0, 0, 1, 1, 0, 1],
       [0, 0, 1, 1, 1, 1],
       [1, 0, 0, 0, 1, 0],
       [1, 0, 0, 0, 1, 1],
       [1, 0, 0, 1, 1, 1],
       [1, 0, 1, 1, 0, 0],
       [1, 0, 1, 1, 0, 1],
       [1, 0, 1, 1, 1, 0],
       [1, 0, 1, 1, 1, 1],
       [1, 1, 0, 0, 0, 0],
       [1, 1, 0, 0, 1, 0],
       [1, 1, 0, 0, 1, 1],
       [1, 1, 0, 1, 0, 1],
       [1, 1, 0, 1, 1, 1],
       [1, 1, 1, 0, 0, 0],
       [1, 1, 1, 0, 1, 0],
       [1, 1, 1, 0, 1, 1],
       [1, 1, 1, 1, 0, 0],
       [1, 1, 1, 1, 0, 1],
       [1, 1, 1, 1, 1, 0],
       [1, 1, 1, 1, 1, 1]])

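Any extra drug that is not part of a combination is penalized by setting its value to -1 in the combination list. (If you do not penalize extra drugs, you do not need the supersets; you can simply compare against the sums of the original combos variable.)
Then take the dot product between the dataset and the set of all combinations, and compare the values with the combination sums (computed before the 0s were replaced with -1). This means that if the value is 3 and the expected result for that combination is 3, it is a valid combination. The combination sums are held in the array combo_sum.
After the dot product, we replace valid values with 1 and invalid values (smaller than the expected sum) with 0. We then sum across all combinations to see whether any of them is valid. If the sum is >= 1, at least one combination is valid. Otherwise, all combinations are invalid.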

ipdb> new_df
    0  1  2  3  4  5  6  7  8  9  ...  15  16  17  18  19  20  21  22  23  res
0   0  0  0  0  0  0  0  0  0  0  ...   0   0   0   0   0   0   0   0   0    1
1   0  0  0  0  0  0  0  0  0  0  ...   0   0   0   0   0   0   0   0   0    0
2   0  0  0  0  0  0  0  0  0  0  ...   0   0   0   0   0   0   0   0   0    1
3   0  0  0  0  0  0  0  0  0  0  ...   0   0   0   0   0   0   0   0   0    0
4   0  0  0  0  0  0  0  0  0  0  ...   0   0   0   0   0   0   0   0   1    1
5   0  0  0  0  0  0  0  0  0  0  ...   0   0   0   0   0   0   0   0   0    0
6   0  0  0  0  0  0  0  0  0  0  ...   0   0   0   0   0   0   0   0   0    0
7   0  0  0  0  0  0  0  0  0  0  ...   0   0   0   0   0   0   0   0   0    0
8   0  0  0  0  0  0  0  0  0  0  ...   0   0   1   0   0   0   0   0   0    1
9   0  0  0  0  0  0  0  0  1  0  ...   0   0   0   0   0   0   0   0   0    1
10  0  0  0  0  0  1  0  0  0  0  ...   0   0   0   0   0   0   0   0   0    1
11  0  0  0  1  0  0  0  0  0  0  ...   0   0   0   0   0   0   0   0   0    1
12  1  0  0  0  0  0  0  0  0  0  ...   0   0   0   0   0   0   0   0   0    1
13  0  1  0  0  0  0  0  0  0  0  ...   0   0   0   0   0   0   0   0   0    1

Replace the final sum column with True or False and apply it to the original DataFrame.
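
With the -1 penalty, the printed result above shows row 5 (CLIN, CFTX, METRO, DOXY) as False, while the question's desired output marks that row as treated. The parenthetical remark about not penalizing extra drugs points to a simpler variant, sketched here under that assumption: compare the dot product with the row sums of the original `combos` array. The three rows are rows 4, 5 and 1 of the question's table, and `given`, `hits` and `treated` are names introduced for illustration.

```python
import numpy as np
import pandas as pd

cols = ["AZITH", "CLIN", "CFTX", "METRO", "CFTN", "DOXY"]

# Rows 4, 5 and 1 of the question's table.
df = pd.DataFrame(
    [
        ["Yes", "Yes", "Yes", "Yes", "Yes", "Yes"],
        ["No", "Yes", "Yes", "Yes", "No", "Yes"],
        ["No", "Yes", "No", "Yes", "No", "No"],
    ],
    columns=cols,
)

# The seven base combinations, one per row, in the same column order.
combos = np.array([
    [1, 1, 0, 0, 0, 0],
    [1, 1, 1, 0, 0, 0],
    [1, 0, 1, 1, 0, 0],
    [1, 0, 0, 0, 1, 0],
    [0, 0, 1, 1, 0, 1],
    [0, 0, 0, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
])

# 1 where a drug was given, 0 otherwise; shape (rows, 6).
given = df[cols].eq("Yes").to_numpy().astype(int)

# hits[r, k] counts how many drugs of combination k were given in row r.
hits = given @ combos.T

# A combination is satisfied when all of its drugs were given; extras are ignored.
treated = (hits == combos.sum(axis=1)).any(axis=1)
print(treated.tolist())  # [True, True, False]
```

Because extra drugs are not penalized, the 24-row superset array is not needed in this variant.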
