Match a pandas index against any row value in another pandas DataFrame

Posted by fd3cxomn on 2023-06-28 in Other

I want to retrieve the rows of mrna_kirp whose index label matches a value anywhere in the gmt_c4 DataFrame.

mrna_subset = mrna_kirp.loc[mrna_kirp.index.isin(gmt_c4)]

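As I understand the API, my code only returns matches where both the index and the column labels match, but I want to retrieve every possible match.
Input: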
gmt_c4.iloc[0:5,0:5]

pd.DataFrame({'MORF_ATRX': {('MORF_BCL2',
   'http://www.gsea-msigdb.org/gsea/msigdb/human/geneset/MORF_BCL2',
   'ADCY3',
   'SYT5',
   'LTBP4',
   'A1BG',
   'AQP5',
   'AQP7'): 'TMEM11',
  ('MORF_BNIP1',
   'http://www.gsea-msigdb.org/gsea/msigdb/human/geneset/MORF_BNIP1',
   'PVR',
   'ADCY3',
   'BMP10',
   'NRTN',
   'S100A5',
   'IL16'): 'SYT5',
  ('MORF_BCL2L11',
   'http://www.gsea-msigdb.org/gsea/msigdb/human/geneset/MORF_BCL2L11',
   'LORICRIN',
   'PVR',
   'A2BP1',
   'FGF18',
   'BMP10',
   'F2RL3'): 'NRTN',
  ('MORF_CCNF',
   'http://www.gsea-msigdb.org/gsea/msigdb/human/geneset/MORF_CCNF',
   'A1CF',
   'EIF5B',
   'TMEM11',
   'EEF1AKMT3',
   'PEX3',
   'HMGN4'): 'GTSE1',
  ('MORF_ERCC2',
   'http://www.gsea-msigdb.org/gsea/msigdb/human/geneset/MORF_ERCC2',
   'SEC31A',
   'BTD',
   'GRIK5',
   'EIF5B',
   'TMEM11',
   'BPHL'): 'HNRNPL'},
 'http://www.gsea-msigdb.org/gsea/msigdb/human/geneset/MORF_ATRX': {('MORF_BCL2',
   'http://www.gsea-msigdb.org/gsea/msigdb/human/geneset/MORF_BCL2',
   'ADCY3',
   'SYT5',
   'LTBP4',
   'UTRN',
   'AQP5',
   'AQP7'): 'KIFC3',
  ('MORF_BNIP1',
   'http://www.gsea-msigdb.org/gsea/msigdb/human/geneset/MORF_BNIP1',
   'PVR',
   'ADCY3',
   'BMP10',
   'NRTN',
   'S100A5',
   'IL16'): 'LTBP4',
  ('MORF_BCL2L11',
   'http://www.gsea-msigdb.org/gsea/msigdb/human/geneset/MORF_BCL2L11',
   'LORICRIN',
   'PVR',
   'KLRC4',
   'FGF18',
   'BMP10',
   'F2RL3'): 'S100A5',
  ('MORF_CCNF',
   'http://www.gsea-msigdb.org/gsea/msigdb/human/geneset/MORF_CCNF',
   'BMS1',
   'EIF5B',
   'TMEM11',
   'EEF1AKMT3',
   'PEX3',
   'HMGN4'): 'HNRNPL',
  ('MORF_ERCC2',
   'http://www.gsea-msigdb.org/gsea/msigdb/human/geneset/MORF_ERCC2',
   'SEC31A',
   'BTD',
   'GRIK5',
   'EIF5B',
   'TMEM11',
   'BPHL'): 'MUTYH'},
 'ADCY3': {('MORF_BCL2',
   'http://www.gsea-msigdb.org/gsea/msigdb/human/geneset/MORF_BCL2',
   'ADCY3',
   'SYT5',
   'LTBP4',
   'UTRN',
   'AQP5',
   'AQP7'): 'HTR1B',
  ('MORF_BNIP1',
   'http://www.gsea-msigdb.org/gsea/msigdb/human/geneset/MORF_BNIP1',
   'PVR',
   'ADCY3',
   'BMP10',
   'NRTN',
   'S100A5',
   'IL16'): 'FIG4',
  ('MORF_BCL2L11',
   'http://www.gsea-msigdb.org/gsea/msigdb/human/geneset/MORF_BCL2L11',
   'LORICRIN',
   'PVR',
   'KLRC4',
   'FGF18',
   'BMP10',
   'F2RL3'): 'IL16',
  ('MORF_CCNF',
   'http://www.gsea-msigdb.org/gsea/msigdb/human/geneset/MORF_CCNF',
   'BMS1',
   'EIF5B',
   'TMEM11',
   'EEF1AKMT3',
   'PEX3',
   'HMGN4'): 'PLEKHB1',
  ('MORF_ERCC2',
   'http://www.gsea-msigdb.org/gsea/msigdb/human/geneset/MORF_ERCC2',
   'SEC31A',
   'BTD',
   'GRIK5',
   'EIF5B',
   'TMEM11',
   'BPHL'): 'TAF5L'},
 'SEC31A': {('MORF_BCL2',
   'http://www.gsea-msigdb.org/gsea/msigdb/human/geneset/MORF_BCL2',
   'ADCY3',
   'SYT5',
   'LTBP4',
   'UTRN',
   'AQP5',
   'AQP7'): 'DDX11',
  ('MORF_BNIP1',
   'http://www.gsea-msigdb.org/gsea/msigdb/human/geneset/MORF_BNIP1',
   'PVR',
   'ADCY3',
   'BMP10',
   'NRTN',
   'S100A5',
   'IL16'): 'CYP2D6',
  ('MORF_BCL2L11',
   'http://www.gsea-msigdb.org/gsea/msigdb/human/geneset/MORF_BCL2L11',
   'LORICRIN',
   'PVR',
   'KLRC4',
   'FGF18',
   'BMP10',
   'F2RL3'): 'SLC6A2',
  ('MORF_CCNF',
   'http://www.gsea-msigdb.org/gsea/msigdb/human/geneset/MORF_CCNF',
   'BMS1',
   'EIF5B',
   'TMEM11',
   'EEF1AKMT3',
   'PEX3',
   'HMGN4'): 'PIGF',
  ('MORF_ERCC2',
   'http://www.gsea-msigdb.org/gsea/msigdb/human/geneset/MORF_ERCC2',
   'SEC31A',
   'BTD',
   'GRIK5',
   'EIF5B',
   'TMEM11',
   'BPHL'): 'AGPS'},
 'BTD': {('MORF_BCL2',
   'http://www.gsea-msigdb.org/gsea/msigdb/human/geneset/MORF_BCL2',
   'ADCY3',
   'SYT5',
   'LTBP4',
   'UTRN',
   'AQP5',
   'AQP7'): 'AGPS',
  ('MORF_BNIP1',
   'http://www.gsea-msigdb.org/gsea/msigdb/human/geneset/MORF_BNIP1',
   'PVR',
   'ADCY3',
   'BMP10',
   'NRTN',
   'S100A5',
   'IL16'): 'GRIK5',
  ('MORF_BCL2L11',
   'http://www.gsea-msigdb.org/gsea/msigdb/human/geneset/MORF_BCL2L11',
   'LORICRIN',
   'PVR',
   'KLRC4',
   'FGF18',
   'BMP10',
   'F2RL3'): 'MASP2',
  ('MORF_CCNF',
   'http://www.gsea-msigdb.org/gsea/msigdb/human/geneset/MORF_CCNF',
   'BMS1',
   'EIF5B',
   'TMEM11',
   'EEF1AKMT3',
   'PEX3',
   'HMGN4'): 'TPP2',
  ('MORF_ERCC2',
   'http://www.gsea-msigdb.org/gsea/msigdb/human/geneset/MORF_ERCC2',
   'SEC31A',
   'BTD',
   'GRIK5',
   'EIF5B',
   'TMEM11',
   'BPHL'): 'SFSWAP'}})

mrna_kirp.iloc[0:4,0:4]

pd.DataFrame({'TCGA.2K.A9WE.01': {'A1BG': 391.94,
  'A1CF': 8.0,
  'A2BP1': 1.0,
  'A2LD1': 159.46},
 'TCGA.2Z.A9J1.01': {'A1BG': 68.91,
  'A1CF': 75.0,
  'A2BP1': 0.0,
  'A2LD1': 247.06},
 'TCGA.2Z.A9J3.01': {'A1BG': 71.9,
  'A1CF': 28.0,
  'A2BP1': 33.0,
  'A2LD1': 516.7},
 'TCGA.2Z.A9J5.01': {'A1BG': 325.6,
  'A1CF': 47.0,
  'A2BP1': 4.0,
  'A2LD1': 151.49}})

Desired output:

pd.DataFrame({'TCGA.2K.A9WE.01': {'A1BG': 391.94,
  'A1CF': 8.0,
  'A2BP1': 1.0},
 'TCGA.2Z.A9J1.01': {'A1BG': 68.91,
  'A1CF': 75.0,
  'A2BP1': 0.0},
 'TCGA.2Z.A9J3.01': {'A1BG': 71.9,
  'A1CF': 28.0,
  'A2BP1': 33.0},
 'TCGA.2Z.A9J5.01': {'A1BG': 325.6,
  'A1CF': 47.0,
  'A2BP1': 4.0}})

ar7v8xwq  #1

You can reset the index, flatten the DataFrame into a NumPy array, and then check whether each index label of mrna_kirp is in that array:

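import numpy as np
# Pool the column labels with the flattened index levels and cell values of gmt_c4
m = mrna_kirp.index.isin(np.hstack([gmt_c4.columns.values, gmt_c4.reset_index().values.ravel()]))
out = mrna_kirp[m]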

Output:

>>> out
       TCGA.2K.A9WE.01  TCGA.2Z.A9J1.01  TCGA.2Z.A9J3.01  TCGA.2Z.A9J5.01
A1BG            391.94            68.91             71.9            325.6
A1CF              8.00            75.00             28.0             47.0
A2BP1             1.00             0.00             33.0              4.0

>>> m
array([ True,  True,  True, False])
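
The flattening approach above can be tried end to end on a small toy frame. This is only a sketch: `gmt`, `expr`, and their contents are made-up stand-ins for gmt_c4 and mrna_kirp, not the real data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for gmt_c4: gene symbols appear in the row MultiIndex,
# in the column labels, and in the cells (all names are illustrative).
gmt = pd.DataFrame(
    {"MORF_ATRX": ["TMEM11", "SYT5"], "ADCY3": ["HTR1B", "FIG4"]},
    index=pd.MultiIndex.from_tuples([("MORF_BCL2", "A1BG"), ("MORF_CCNF", "A1CF")]),
)
# Toy stand-in for mrna_kirp: genes as the row index.
expr = pd.DataFrame(
    {"sample_1": [391.94, 8.0, 159.46, 12.5]},
    index=["A1BG", "A1CF", "A2LD1", "ADCY3"],
)

# reset_index() moves the MultiIndex levels into ordinary columns, so
# .values.ravel() covers index levels and cells; column labels are added separately.
pool = np.hstack([gmt.columns.values, gmt.reset_index().values.ravel()])
mask = expr.index.isin(pool)
subset = expr[mask]
print(subset.index.tolist())  # ['A1BG', 'A1CF', 'ADCY3']
```

A1BG and A1CF are found in the row index of the toy frame, ADCY3 in its column labels, and A2LD1 nowhere, so only A2LD1 is dropped.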

Performance:

# @jezrael solution 1:
>>> %timeit mrna_kirp[mrna_kirp.index.isin(np.unique(gmt_c4.stack().reset_index()))]
2.49 ms ± 137 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# @jezrael solution 2:
>>> %timeit mrna_kirp.loc[mrna_kirp.index.intersection(np.unique(gmt_c4.stack().reset_index()), sort=False)]
2.52 ms ± 12.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# @Corralien
>>> %timeit mrna_kirp[mrna_kirp.index.isin(np.hstack([gmt_c4.columns.values, gmt_c4.reset_index().values.ravel()]))]
1.62 ms ± 50.09 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

The stack-based solutions are slower than flattening the array with .values.ravel().

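Note: rather than searching for values in the column names, you may want to change how the gmt_c4 DataFrame is loaded. The column names look like a data record. If you are reading a CSV/Excel file, pass header=None to pd.read_csv / pd.read_excel.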
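
To illustrate the note about header=None, here is a minimal sketch using an in-memory CSV. The file contents and the names `raw`, `with_header`, `no_header`, and `expr` are assumptions for illustration only; the real gmt_c4 source file may use a different delimiter or layout.

```python
import io

import pandas as pd

# Hypothetical three-line file: every line, including the first, is a gene-set record.
raw = "MORF_ATRX,TMEM11,KIFC3\nMORF_BCL2,A1BG,ADCY3\nMORF_CCNF,A1CF,EIF5B\n"

with_header = pd.read_csv(io.StringIO(raw))             # first record becomes column labels
no_header = pd.read_csv(io.StringIO(raw), header=None)  # first record stays in the data

print(with_header.shape)  # (2, 3)
print(no_header.shape)    # (3, 3)

# With header=None every gene is an ordinary cell value, so flattening is enough.
expr = pd.DataFrame({"sample_1": [1.0, 2.0, 3.0]}, index=["A1BG", "TMEM11", "A2LD1"])
subset = expr[expr.index.isin(no_header.to_numpy().ravel())]
print(subset.index.tolist())  # ['A1BG', 'TMEM11']
```

With the default header handling, TMEM11 would only be reachable through the column labels, which is why the answers above have to pool the labels separately.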


ncgqoxb0  #2

To match against any value in gmt_c4 (its MultiIndex, its columns, and its values), use DataFrame.stack, Series.reset_index, and numpy.unique, then filter with boolean indexing:

out = mrna_kirp[mrna_kirp.index.isin(np.unique(gmt_c4.stack().reset_index()))]
print (out)
       TCGA.2K.A9WE.01  TCGA.2Z.A9J1.01  TCGA.2Z.A9J3.01  TCGA.2Z.A9J5.01
A1BG            391.94            68.91             71.9            325.6
A1CF              8.00            75.00             28.0             47.0
A2BP1             1.00             0.00             33.0              4.0

Details:

print (np.unique(gmt_c4.stack().reset_index()))
['A1BG' 'A1CF' 'A2BP1' 'ADCY3' 'AGPS' 'AQP5' 'AQP7' 'BMP10' 'BMS1' 'BPHL'
 'BTD' 'CYP2D6' 'DDX11' 'EEF1AKMT3' 'EIF5B' 'F2RL3' 'FGF18' 'FIG4' 'GRIK5'
 'GTSE1' 'HMGN4' 'HNRNPL' 'HTR1B' 'IL16' 'KIFC3' 'KLRC4' 'LORICRIN'
 'LTBP4' 'MASP2' 'MORF_ATRX' 'MORF_BCL2' 'MORF_BCL2L11' 'MORF_BNIP1'
 'MORF_CCNF' 'MORF_ERCC2' 'MUTYH' 'NRTN' 'PEX3' 'PIGF' 'PLEKHB1' 'PVR'
 'S100A5' 'SEC31A' 'SFSWAP' 'SLC6A2' 'SYT5' 'TAF5L' 'TMEM11' 'TPP2' 'UTRN'
 'http://www.gsea-msigdb.org/gsea/msigdb/human/geneset/MORF_ATRX'
 'http://www.gsea-msigdb.org/gsea/msigdb/human/geneset/MORF_BCL2'
 'http://www.gsea-msigdb.org/gsea/msigdb/human/geneset/MORF_BCL2L11'
 'http://www.gsea-msigdb.org/gsea/msigdb/human/geneset/MORF_BNIP1'
 'http://www.gsea-msigdb.org/gsea/msigdb/human/geneset/MORF_CCNF'
 'http://www.gsea-msigdb.org/gsea/msigdb/human/geneset/MORF_ERCC2']
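
What stack() followed by reset_index() does can be seen on a toy frame. This is a sketch with made-up data (`gmt` is a stand-in for gmt_c4), and it calls `.to_numpy()` explicitly before `np.unique`, which is a small deviation from the one-liner above.

```python
import numpy as np
import pandas as pd

# Illustrative stand-in for gmt_c4 with a two-level row MultiIndex.
gmt = pd.DataFrame(
    {"MORF_ATRX": ["TMEM11", "SYT5"], "ADCY3": ["HTR1B", "FIG4"]},
    index=pd.MultiIndex.from_tuples([("MORF_BCL2", "A1BG"), ("MORF_CCNF", "A1CF")]),
)

# stack() pivots the column labels into an extra index level, giving a Series;
# reset_index() then turns every index level into a regular column.
long_form = gmt.stack().reset_index()
print(long_form.shape)  # (4, 4): two index levels, the old column label, the cell value

# np.unique flattens the 2-D array and returns the sorted distinct strings.
labels = np.unique(long_form.to_numpy())
print(len(labels))  # 10
```

Because index levels, column labels, and cell values all end up as ordinary columns of `long_form`, a single `np.unique` call covers every place a gene symbol can appear.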

A similar approach with Index.intersection and DataFrame.loc:

out = mrna_kirp.loc[mrna_kirp.index.intersection(np.unique(gmt_c4.stack().reset_index()), sort=False)]
print (out)
       TCGA.2K.A9WE.01  TCGA.2Z.A9J1.01  TCGA.2Z.A9J3.01  TCGA.2Z.A9J5.01
A1BG            391.94            68.91             71.9            325.6
A1CF              8.00            75.00             28.0             47.0
A2BP1             1.00             0.00             33.0              4.0
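
Why intersect before calling .loc? A short sketch with made-up data (the `expr` frame and the `labels` pool are illustrative): passing labels that are missing from the index straight to .loc raises a KeyError, whereas intersecting first keeps only labels that exist.

```python
import numpy as np
import pandas as pd

expr = pd.DataFrame({"sample_1": [391.94, 8.0, 159.46]}, index=["A1BG", "A1CF", "A2LD1"])
labels = np.array(["A1CF", "TMEM11", "A1BG"])  # illustrative pool; TMEM11 is not in expr

# Passing the pool straight to .loc fails because TMEM11 is missing from the index.
try:
    expr.loc[labels]
    raised = False
except KeyError:
    raised = True
print(raised)  # True

# Intersecting first keeps only labels that exist, so .loc is safe.
common = expr.index.intersection(labels, sort=False)
subset = expr.loc[common]
print(sorted(subset.index))  # ['A1BG', 'A1CF']
```

The boolean-mask variant with isin never hits this problem, since it only asks which existing index labels appear in the pool.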

For better performance, use:

vals = np.append(np.ravel(gmt_c4.reset_index()), gmt_c4.columns).astype(str)
out = mrna_kirp[mrna_kirp.index.isin(vals)]

Another idea, using sets:

vals = set(np.ravel(gmt_c4.reset_index())).union(gmt_c4.columns)
out = mrna_kirp[mrna_kirp.index.isin(vals)]

For completeness, timings on the sample data (results on real data will likely differ):

In [29]: %timeit mrna_kirp[mrna_kirp.index.isin(np.unique(gmt_c4.stack().reset_index()))]
3.1 ms ± 95 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [30]: %timeit mrna_kirp.loc[mrna_kirp.index.intersection(np.unique(gmt_c4.stack().reset_index()), sort=False)]
3.41 ms ± 136 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [31]: %timeit mrna_kirp[mrna_kirp.index.isin(np.append(np.ravel(gmt_c4.reset_index()), gmt_c4.columns).astype(str))]
2.08 ms ± 11.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [32]: %timeit mrna_kirp[mrna_kirp.index.isin(set(np.ravel(gmt_c4.reset_index())).union(gmt_c4.columns))]
2.05 ms ± 23.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
#Corralien solution
In [33]: %timeit mrna_kirp[mrna_kirp.index.isin(np.hstack([gmt_c4.columns.values, gmt_c4.reset_index().values.ravel()]))]
2.08 ms ± 73.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
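
Since the timings above come from a 5-row sample, it may be worth re-measuring on data closer to the real size. The following is a rough sketch using the standard timeit module on synthetic data; the sizes, the gene names, and the `candidates` dictionary are all assumptions, and only cell values are pooled here for brevity.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
genes = np.array([f"G{i}" for i in range(5000)])               # synthetic gene names
gmt = pd.DataFrame(rng.choice(genes[:2000], size=(200, 50)))   # synthetic gene sets
expr = pd.DataFrame({"sample_1": rng.random(genes.size)}, index=genes)

# Three ways to build the lookup pool from the cell values.
candidates = {
    "ravel": lambda: expr.index.isin(gmt.to_numpy().ravel()),
    "unique": lambda: expr.index.isin(np.unique(gmt.to_numpy())),
    "set": lambda: expr.index.isin(set(gmt.to_numpy().ravel())),
}

reference = candidates["ravel"]()
for name, fn in candidates.items():
    assert (fn() == reference).all()  # every variant must select the same rows
    seconds = timeit.timeit(fn, number=20) / 20
    print(f"{name:>6}: {seconds * 1e3:.3f} ms per call")
```

The ranking can change with the number of distinct values and the size of the index, so no expected timings are given here.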

9vw9lbht  #3

The code below should work:

mrna_kirp.apply(lambda x: x if np.isin(x.name, np.array(list(map(list, gmt_c4.index.values)))) else None, axis=1).dropna()

I have broken the code into smaller parts to better explain how I developed it.

np.array(list(map(list, gmt_c4.index.values)))

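This line converts the array of tuples into a 2-D ndarray. Accessing the values of a MultiIndex returns an array of tuples. By default, the MultiIndex looks like this: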

array([('MORF_BCL2', 'http://www.gsea-msigdb.org/gsea/msigdb/human/geneset/MORF_BCL2', 'ADCY3', 'SYT5', 'LTBP4', 'A1BG', 'AQP5', 'AQP7'),
       ('MORF_BNIP1', 'http://www.gsea-msigdb.org/gsea/msigdb/human/geneset/MORF_BNIP1', 'PVR', 'ADCY3', 'BMP10', 'NRTN', 'S100A5', 'IL16'),
       ('MORF_BCL2L11', 'http://www.gsea-msigdb.org/gsea/msigdb/human/geneset/MORF_BCL2L11', 'LORICRIN', 'PVR', 'A2BP1', 'FGF18', 'BMP10', 'F2RL3'),
       ('MORF_CCNF', 'http://www.gsea-msigdb.org/gsea/msigdb/human/geneset/MORF_CCNF', 'A1CF', 'EIF5B', 'TMEM11', 'EEF1AKMT3', 'PEX3', 'HMGN4'),
       ('MORF_ERCC2', 'http://www.gsea-msigdb.org/gsea/msigdb/human/geneset/MORF_ERCC2', 'SEC31A', 'BTD', 'GRIK5', 'EIF5B', 'TMEM11', 'BPHL'),
       ('MORF_BCL2', 'http://www.gsea-msigdb.org/gsea/msigdb/human/geneset/MORF_BCL2', 'ADCY3', 'SYT5', 'LTBP4', 'UTRN', 'AQP5', 'AQP7'),
       ('MORF_BCL2L11', 'http://www.gsea-msigdb.org/gsea/msigdb/human/geneset/MORF_BCL2L11', 'LORICRIN', 'PVR', 'KLRC4', 'FGF18', 'BMP10', 'F2RL3'),
       ('MORF_CCNF', 'http://www.gsea-msigdb.org/gsea/msigdb/human/geneset/MORF_CCNF', 'BMS1', 'EIF5B', 'TMEM11', 'EEF1AKMT3', 'PEX3', 'HMGN4')],
      dtype=object)

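I use np.isin to check whether an index label taken from mrna_kirp occurs as a value in any of the tuples. np.isin cannot do what we want on an array of tuples, so the array is first converted to a suitable 2-D format.
apply with axis=1 iterates over the rows of mrna_kirp, and the row's name attribute (x.name in the lambda) gives each row's index label.
Finally, dropna removes the None results for rows whose index label was not found in gmt_c4. Note that this checks only the row MultiIndex of gmt_c4, not its column labels or cell values.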
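
The tuple-to-array conversion described above can be checked in isolation. This is a sketch on a made-up three-level MultiIndex (`idx` and `expr_index` are illustrative); the final vectorized line is an alternative that is not part of the answer.

```python
import numpy as np
import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [("MORF_BCL2", "A1BG", "AQP5"), ("MORF_CCNF", "A1CF", "PEX3")]
)

tuples = idx.values                          # 1-D object array whose elements are tuples
levels = np.array(list(map(list, tuples)))   # 2-D array of strings, one row per tuple

print(tuples.shape, levels.shape)  # (2,) (2, 3)
print(np.isin("A1CF", levels))     # True
print(np.isin("A2LD1", levels))    # False

# Vectorized alternative (an assumption, not from the answer): test all labels at once
# instead of calling np.isin once per row inside apply.
expr_index = pd.Index(["A1BG", "A1CF", "A2LD1"])
mask = expr_index.isin(levels.ravel())
print(mask.tolist())  # [True, True, False]
```

Testing the whole index in one call avoids the per-row Python overhead of apply, which matters once mrna_kirp has thousands of rows.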
