csv: Python program to generate a single species matrix file from multiple per-sample species count files (using Pandas?)

uurity8g  posted on 2023-03-10  in Python

Suppose I have two files listing species counts, as shown below:
Sample1_sc.tsv

Sample_1 | Sample ID
464 | Bacillus subtilis
116 | Escherichia coli
62 | Vibrio cholerae serotype 1

Sample2_sc.tsv

Sample_2 | Sample ID
364 | Bacillus subtilis
120 | Homo sapiens
16 | Yersinia pestis
16 | Danio rerio

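Is there a function that uses a DataFrame to join the two data files, so that the header contains the species from all sample files (without duplicates), the number of rows equals the number of samples, and each row shows the read count of each species in that sample, or 0 if the species is absent from the sample?
For the example above, I would like the species matrix to look like this: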

Sample ID | Bacillus subtilis | Escherichia coli | Vibrio cholerae serotype 1 | Homo sapiens | Yersinia pestis | Danio rerio
Sample_1 | 464 | 116 | 62 | 0 | 0 | 0
Sample_2 | 364 | 0 | 0 | 120 | 16 | 16

I am not familiar with Pandas, so here is the code I have tried so far:

import pandas as pd
import numpy as np
import glob

path = "/content/"
sc_files = glob.glob(path + "*.tsv")
df_sc = []

for file in sc_files:
  df_sample = pd.read_csv(file, sep = '\t')
  df_sample = df_sample.set_index("SampleID")
  df_sample = df_sample.transpose()
  df_sample = df_sample[~df_sample.index.duplicated(keep='first')]
  df_sc.append(df_sample)

df_matrix = pd.concat(df_sc, axis = 1).fillna(0)

This is the result I get:

SampleID | Bacillus subtilis | Escherichia coli | Vibrio cholerae serotype 1 | Bacillus subtilis | Homo sapiens | Yersinia pestis | Danio rerio
Sample_1 | 464.0 | 116.0 | 62.0 | 0.0 | 0.0 | 0.0 | 0.0
Sample_2 | 0.0 | 0.0 | 0.0 | 364.0 | 120.0 | 16.0 | 16.0

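How can I get the counts from all samples for a given species name (Bacillus subtilis in this example) to appear in the same column?
I tried removing df_sample = df_sample[~df_sample.index.duplicated(keep='first')]
but the result is the same whether I keep it or remove it.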


sy5wg1nm1#

I would use concat like this (mind the axis!):

dfs = [df1, df2]

out = (pd.concat([d.set_index('Sample ID') for d in dfs], axis=1)
         .fillna(0, downcast='infer').T
       )

Output:

Sample ID  Bacillus subtilis  Escherichia coli  Vibrio cholerae serotype 1  Homo sapiens  Yersinia pestis  Danio rerio
Sample_1                 464               116                          62             0                0            0
Sample_2                 364                 0                           0           120               16           16
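
To apply the same idea directly to the files on disk, a minimal sketch might look like the following. It assumes each file is tab-separated, with a species column headed `Sample ID` and one count column headed by the sample name, as in the question; the helper name `build_species_matrix` and its arguments are hypothetical. It uses `.fillna(0).astype(int)` rather than `downcast='infer'`, because recent pandas releases warn that the `downcast` keyword is deprecated.

```python
import glob
import os

import pandas as pd


def build_species_matrix(directory, pattern="*_sc.tsv"):
    """Combine per-sample count files into one samples-by-species matrix.

    Hypothetical helper: assumes every file has a 'Sample ID' column holding
    species names and one count column named after the sample.
    """
    files = sorted(glob.glob(os.path.join(directory, pattern)))
    # Index each table by species so that concat can align rows on species name.
    frames = [pd.read_csv(f, sep="\t").set_index("Sample ID") for f in files]
    # axis=1 yields one column per sample; species missing from a sample become NaN.
    matrix = pd.concat(frames, axis=1).fillna(0).astype(int)
    # Transpose so that rows are samples and columns are species.
    return matrix.T
```

The attempt in the question produced a duplicated Bacillus subtilis column because each frame was transposed first and then concatenated with `axis=1`, which aligns on the sample names in the row index and simply places the species columns side by side. The result of the helper could be written out with something like `build_species_matrix("/content/").to_csv("species_matrix.tsv", sep="\t", index_label="Sample ID")`, where the output file name is only an example.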

d7v8vwbk2#

You can also use pd.concat() together with pivot_table():

# Assumes df1 and df2 are long-format frames with 'Sample ID', 'Species' and 'Count' columns
df = pd.concat([df1, df2])
df = (df.pivot_table(index='Sample ID', columns='Species', values='Count', fill_value=0)
        .reset_index())

Output:

Species Sample ID  Bacillus subtilis  Danio rerio  Escherichia coli  Homo sapiens  Vibrio cholerae serotype 1  Yersinia pestis
0        Sample_1                464            0               116             0                          62                0
1        Sample_2                364           16                 0           120                           0               16

Note that the column order in the output has changed, because df.pivot_table sorts the columns alphabetically by default.
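
The pivot_table call expects long-format frames with `Sample ID`, `Species` and `Count` columns, which is not the layout of the files in the question. One possible way to get there, sketched under the assumption that each input frame has a species column headed `Sample ID` and one count column headed by the sample name, is to reshape each frame with `melt` first. The helper name `to_long` is made up for illustration, `aggfunc="sum"` is an assumption (it would add up a species listed twice in one file), and the `reindex` step restores first-seen species order instead of alphabetical order.

```python
import pandas as pd

# Example frames mirroring the two files in the question.
df1 = pd.DataFrame({
    "Sample_1": [464, 116, 62],
    "Sample ID": ["Bacillus subtilis", "Escherichia coli",
                  "Vibrio cholerae serotype 1"],
})
df2 = pd.DataFrame({
    "Sample_2": [364, 120, 16, 16],
    "Sample ID": ["Bacillus subtilis", "Homo sapiens",
                  "Yersinia pestis", "Danio rerio"],
})


def to_long(df):
    """Hypothetical helper: reshape one sample table to Species / Sample ID / Count."""
    return df.rename(columns={"Sample ID": "Species"}).melt(
        id_vars="Species", var_name="Sample ID", value_name="Count"
    )


long_df = pd.concat([to_long(df1), to_long(df2)], ignore_index=True)

# Remember the order in which species first appear, before pivoting sorts them.
species_order = long_df["Species"].drop_duplicates().tolist()

matrix = (
    long_df.pivot_table(index="Sample ID", columns="Species", values="Count",
                        aggfunc="sum", fill_value=0)
    .reindex(columns=species_order)  # undo the alphabetical column sort
    .reset_index()
)
```

Here `matrix` has one row per sample, a `Sample ID` column, and one column per species in the order the species were first encountered.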
