pandas 如何将含有相同分数的行相加而不产生冗余?

hgqdbh6s  于 2023-03-11  发布在  其他
关注(0)|答案(2)|浏览(99)

我有一个包含多个列的文件,如下所示:

Ko_class  FPKM  count
0                       Carbon;Pyruvate;vitamins  16.7      1
1  Pyruvate;Carbohydrate;Pentose and glucuronate  30.0      7
2                         Lipid;Carbon;Galactose  40.5      9
3                  Galactose;Pyruvate;Fatty acid  57.0     10
4                               Fatty acid;Lipid  22.0      4

我想根据target_listalls[FPKM]alls[count]的行值求和,同时所有的值只能求和一次。

这是我的数据

# coding=utf-8
import pandas as pd
import numpy as np

classes = [('Carbon;Pyruvate;vitamins', 16.7, 1),
         ('Pyruvate;Carbohydrate;Pentose and glucuronate', 30, 7),
         ('Lipid;Carbon;Galactose', 40.5, 9),
         ('Galactose;Pyruvate;Fatty acid', 57, 10),
         ('Fatty acid;Lipid', 22, 4)]
labels = ['Ko_class','FPKM', 'count']
alls = pd.DataFrame.from_records(classes, columns=labels)
print(alls)

target_list = ['Carbon','Carbon; Pyruvate','Galactose; Pyruvate; Carbon',
               'Galactose;Pyruvate;Fatty acid; Carbon']

这是我想要的

Ko_class   FPKM  count
0                                 Carbon   57.2     10
1                       Carbon; Pyruvate  144.2     27
2            Galactose; Pyruvate; Carbon  144.2     27
3  Galactose;Pyruvate;Fatty acid; Carbon  166.2     31

如您所见,尽管Carbon; PyruvateGalactose; Pyruvate; Carbon行不相同,但FPKMcount值可能相同。
希望有人能帮助我:(

mwyxok5s

mwyxok5s1#

输入

classes = [('Carbon;Pyruvate;vitamins', 16.7, 1),
         ('Pyruvate;Carbohydrate;Pentose and glucuronate', 30, 7),
         ('Lipid;Carbon;Galactose', 40.5, 9),
         ('Galactose;Pyruvate;Fatty acid', 57, 10),
         ('Fatty acid;Lipid', 22, 4)]
labels = ['Ko_class','FPKM', 'count']
alls = pd.DataFrame.from_records(classes, columns=labels)
target_list = ['Carbon','Carbon; Pyruvate','Galactose; Pyruvate; Carbon', 'Galactose;Pyruvate;Fatty acid; Carbon']

创建掩码以过滤ko_class中的所有类,并与目标列表中的所有类进行比较

result = pd.DataFrame()
for target in target_list:
    mask = alls.apply(lambda x: any([cls in target for cls in x['Ko_class'].split(';')]), axis=1) #here the mask compare class
    target_sum = alls.loc[mask, ["FPKM", 'count']].sum().reset_index().rename(columns={0:target}).iloc[:,1:] #sum of each class filtered by target in target list
    result = pd.concat([result,target_sum], axis=1) #concat as result

然后将其转置并重命名列。

result = result.T
result.columns = ['FPKM', 'Count']

产出

FPKM    Count
Carbon                                   57.2   10.0
Carbon; Pyruvate                        144.2   27.0
Galactose; Pyruvate; Carbon             144.2   27.0
Galactose;Pyruvate;Fatty acid; Carbon   166.2   31.0
sqxo8psd

sqxo8psd2#

在循环中使用concat的集合运算:

tmp = alls.assign(target=[set(re.split(r'\s*;\s*', s)) for s in alls['Ko_class']])

t = target_set[0]

out = pd.concat({label: tmp.loc[[x.issuperset(t) for x in tmp['target']], ['FPKM', 'count']].sum()
 for label, t in zip(target_list, target_set)
}).unstack().rename_axis('Ko_class').reset_index()

输出:

Ko_class  FPKM  count
0                                 Carbon  57.2   10.0
1                       Carbon; Pyruvate  16.7    1.0
2            Galactose; Pyruvate; Carbon   0.0    0.0
3  Galactose;Pyruvate;Fatty acid; Carbon   0.0    0.0

相关问题