pandas 标识具有相同值的列,并循环访问保存原始表中更改的列

iezvtpos  于 2023-05-21  发布在  其他
关注(0)|答案(1)|浏览(171)

我有一个这样的数据框

Chr Start   Ref Alt Revel_Score RefSeq  CHR BP  SNP P
21  9   133710834   A   C   0.429   NM_005157.6 9   133710834   1   0.571001
22  9   133710834   A   G   0.424   NM_005157.6 9   133710834   1   0.576001
23  9   133710834   A   T   0.432   NM_005157.6 9   133710834   1   0.568001
24  9   133710835   T   A   0.395   NM_005157.6 9   133710835   1   0.605001
25  9   133710835   T   C   0.386   NM_005157.6 9   133710835   1   0.614001
26  9   133710835   T   G   0.389   NM_005157.6 9   133710835   1   0.611001
27  9   133710836   G   A   0.429   NM_005157.6 9   133710836   1   0.571001
28  9   133710836   G   C   0.418   NM_005157.6 9   133710836   1   0.582001
29  9   133710836   G   T   0.418   NM_005157.6 9   133710836   1   0.582001
30  9   133710837   T   A   0.381   NM_005157.6 9   133710837   1   0.619001
31  9   133710837   T   G   0.278   NM_005157.6 9   133710837   1   0.722001
32  9   133710838   T   C   0.378   NM_005157.6 9   133710838   1   0.622001
33  9   133710838   T   G   0.327   NM_005157.6 9   133710838   1   0.673001
34  9   133710839   G   C   0.352   NM_005157.6 9   133710839   1   0.648001
35  9   133710839   G   T   0.352   NM_005157.6 9   133710839   1   0.648001

# The same in a dict to allow to work
IGV_table_limited.head(15).to_dict()

{'Chr': {21: '9',
  22: '9',
  23: '9',
  24: '9',
  25: '9',
  26: '9',
  27: '9',
  28: '9',
  29: '9',
  30: '9',
  31: '9',
  32: '9',
  33: '9',
  34: '9',
  35: '9'},
 'Start': {21: 133710834,
  22: 133710834,
  23: 133710834,
  24: 133710835,
  25: 133710835,
  26: 133710835,
  27: 133710836,
  28: 133710836,
  29: 133710836,
  30: 133710837,
  31: 133710837,
  32: 133710838,
  33: 133710838,
  34: 133710839,
  35: 133710839},
 'Ref': {21: 'A',
  22: 'A',
  23: 'A',
  24: 'T',
  25: 'T',
  26: 'T',
  27: 'G',
  28: 'G',
  29: 'G',
  30: 'T',
  31: 'T',
  32: 'T',
  33: 'T',
  34: 'G',
  35: 'G'},
 'Alt': {21: 'C',
  22: 'G',
  23: 'T',
  24: 'A',
  25: 'C',
  26: 'G',
  27: 'A',
  28: 'C',
  29: 'T',
  30: 'A',
  31: 'G',
  32: 'C',
  33: 'G',
  34: 'C',
  35: 'T'},
 'Revel_Score': {21: 0.429,
  22: 0.424,
  23: 0.432,
  24: 0.395,
  25: 0.386,
  26: 0.389,
  27: 0.429,
  28: 0.418,
  29: 0.418,
  30: 0.381,
  31: 0.278,
  32: 0.378,
  33: 0.327,
  34: 0.352,
  35: 0.352},
 'RefSeq': {21: 'NM_005157.6',
  22: 'NM_005157.6',
  23: 'NM_005157.6',
  24: 'NM_005157.6',
  25: 'NM_005157.6',
  26: 'NM_005157.6',
  27: 'NM_005157.6',
  28: 'NM_005157.6',
  29: 'NM_005157.6',
  30: 'NM_005157.6',
  31: 'NM_005157.6',
  32: 'NM_005157.6',
  33: 'NM_005157.6',
  34: 'NM_005157.6',
  35: 'NM_005157.6'},
 'CHR': {21: '9',
  22: '9',
  23: '9',
  24: '9',
  25: '9',
  26: '9',
  27: '9',
  28: '9',
  29: '9',
  30: '9',
  31: '9',
  32: '9',
  33: '9',
  34: '9',
  35: '9'},
 'BP': {21: 133710834,
  22: 133710834,
  23: 133710834,
  24: 133710835,
  25: 133710835,
  26: 133710835,
  27: 133710836,
  28: 133710836,
  29: 133710836,
  30: 133710837,
  31: 133710837,
  32: 133710838,
  33: 133710838,
  34: 133710839,
  35: 133710839},
 'SNP': {21: 1,
  22: 1,
  23: 1,
  24: 1,
  25: 1,
  26: 1,
  27: 1,
  28: 1,
  29: 1,
  30: 1,
  31: 1,
  32: 1,
  33: 1,
  34: 1,
  35: 1},
 'P': {21: 0.571001,
  22: 0.5760010000000001,
  23: 0.5680010000000001,
  24: 0.605001,
  25: 0.614001,
  26: 0.611001,
  27: 0.571001,
  28: 0.5820010000000001,
  29: 0.5820010000000001,
  30: 0.619001,
  31: 0.722001,
  32: 0.622001,
  33: 0.6730010000000001,
  34: 0.648001,
  35: 0.648001}}

我想做下面这件事
如果Chr、Start、Ref和Revel_Score列中的值相同,则通过添加0.01到0.1之间的随机数来修改P列中的值
因此,条件发生在第28、29和34、35行,所以0.08 + 0.582001、0.06 + 0.582001、0.05 + 0.648001和0.07 + 0.648001也是如此

Chr Start   Ref Alt Revel_Score RefSeq  CHR BP  SNP P
21  9   133710834   A   C   0.429   NM_005157.6 9   133710834   1   0.571001
22  9   133710834   A   G   0.424   NM_005157.6 9   133710834   1   0.576001
23  9   133710834   A   T   0.432   NM_005157.6 9   133710834   1   0.568001
24  9   133710835   T   A   0.395   NM_005157.6 9   133710835   1   0.605001
25  9   133710835   T   C   0.386   NM_005157.6 9   133710835   1   0.614001
26  9   133710835   T   G   0.389   NM_005157.6 9   133710835   1   0.611001
27  9   133710836   G   A   0.429   NM_005157.6 9   133710836   1   0.571001
28  9   133710836   G   C   0.418   NM_005157.6 9   133710836   1    0.662001
29  9   133710836   G   T   0.418   NM_005157.6 9   133710836   1    0.642001
30  9   133710837   T   A   0.381   NM_005157.6 9   133710837   1   0.619001
31  9   133710837   T   G   0.278   NM_005157.6 9   133710837   1   0.722001
32  9   133710838   T   C   0.378   NM_005157.6 9   133710838   1   0.622001
33  9   133710838   T   G   0.327   NM_005157.6 9   133710838   1   0.673001
34  9   133710839   G   C   0.352   NM_005157.6 9   133710839   1    0.698001 
35  9   133710839   G   T   0.352   NM_005157.6 9   133710839   1    0.718001

我知道如何应用这个条件

IGV_table_limited[IGV_table_limited.duplicated(subset=['Chr','Start', 'Ref','Revel_Score'  ], keep=False)]

我知道如何在那一列做数学运算,但我不知道如何得到原始表格的变化

8e2ybdfx

8e2ybdfx1#

使用numpy.random.normalDataFrame.loc,对于计数True,使用sum

m = IGV_table_limited.duplicated(subset=['Chr','Start', 'Ref','Revel_Score'  ], keep=False)

IGV_table_limited.loc[m, 'P'] += np.random.normal(0.01, 0.1, size=m.sum())

如果确实需要步骤0.01,则使用numpy.random.choice

arr = np.arange(0.01, 0.1 + 0.01, 0.01)
print (arr)
[0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 0.1 ]

IGV_table_limited.loc[m, 'P'] += np.random.choice(arr, size=m.sum())

我测试P列中索引28发生变化且无负值的数据:

d = {'Chr': {21: '9',
  22: '9',
  23: '9',
  24: '9',
  25: '9',
  26: '9',
  27: '9',
  28: '9',
  29: '9',
  30: '9',
  31: '9',
  32: '9',
  33: '9',
  34: '9',
  35: '9'},
 'Start': {21: 133710834,
  22: 133710834,
  23: 133710834,
  24: 133710835,
  25: 133710835,
  26: 133710835,
  27: 133710836,
  28: 133710836,
  29: 133710836,
  30: 133710837,
  31: 133710837,
  32: 133710838,
  33: 133710838,
  34: 133710839,
  35: 133710839},
 'Ref': {21: 'A',
  22: 'A',
  23: 'A',
  24: 'T',
  25: 'T',
  26: 'T',
  27: 'G',
  28: 'G',
  29: 'G',
  30: 'T',
  31: 'T',
  32: 'T',
  33: 'T',
  34: 'G',
  35: 'G'},
 'Alt': {21: 'C',
  22: 'G',
  23: 'T',
  24: 'A',
  25: 'C',
  26: 'G',
  27: 'A',
  28: 'C',
  29: 'T',
  30: 'A',
  31: 'G',
  32: 'C',
  33: 'G',
  34: 'C',
  35: 'T'},
 'Revel_Score': {21: 0.429,
  22: 0.424,
  23: 0.432,
  24: 0.395,
  25: 0.386,
  26: 0.389,
  27: 0.429,
  28: 0.418,
  29: 0.418,
  30: 0.381,
  31: 0.278,
  32: 0.378,
  33: 0.327,
  34: 0.352,
  35: 0.352},
 'RefSeq': {21: 'NM_005157.6',
  22: 'NM_005157.6',
  23: 'NM_005157.6',
  24: 'NM_005157.6',
  25: 'NM_005157.6',
  26: 'NM_005157.6',
  27: 'NM_005157.6',
  28: 'NM_005157.6',
  29: 'NM_005157.6',
  30: 'NM_005157.6',
  31: 'NM_005157.6',
  32: 'NM_005157.6',
  33: 'NM_005157.6',
  34: 'NM_005157.6',
  35: 'NM_005157.6'},
 'CHR': {21: '9',
  22: '9',
  23: '9',
  24: '9',
  25: '9',
  26: '9',
  27: '9',
  28: '9',
  29: '9',
  30: '9',
  31: '9',
  32: '9',
  33: '9',
  34: '9',
  35: '9'},
 'BP': {21: 133710834,
  22: 133710834,
  23: 133710834,
  24: 133710835,
  25: 133710835,
  26: 133710835,
  27: 133710836,
  28: 133710836,
  29: 133710836,
  30: 133710837,
  31: 133710837,
  32: 133710838,
  33: 133710838,
  34: 133710839,
  35: 133710839},
 'SNP': {21: 1,
  22: 1,
  23: 1,
  24: 1,
  25: 1,
  26: 1,
  27: 1,
  28: 1,
  29: 1,
  30: 1,
  31: 1,
  32: 1,
  33: 1,
  34: 1,
  35: 1},
 'P': {21: 0.571001,
  22: 0.5760010000000001,
  23: 0.5680010000000001,
  24: 0.605001,
  25: 0.614001,
  26: 0.611001,
  27: 0.571001,
  28: 0.3220009999999999,
  29: 0.5820010000000001,
  30: 0.619001,
  31: 0.722001,
  32: 0.622001,
  33: 0.6730010000000001,
  34: 0.648001,
  35: 0.648001}}
IGV_table_limited = pd.DataFrame(d)

m = IGV_table_limited.duplicated(subset=['Chr','Start', 'Ref','Revel_Score'],keep=False)

IGV_table_limited.loc[m, 'P'] += np.random.normal(0.01, 0.1, size=m.sum())
print (IGV_table_limited)
   Chr      Start Ref Alt  Revel_Score       RefSeq CHR         BP  SNP  \
21   9  133710834   A   C        0.429  NM_005157.6   9  133710834    1   
22   9  133710834   A   G        0.424  NM_005157.6   9  133710834    1   
23   9  133710834   A   T        0.432  NM_005157.6   9  133710834    1   
24   9  133710835   T   A        0.395  NM_005157.6   9  133710835    1   
25   9  133710835   T   C        0.386  NM_005157.6   9  133710835    1   
26   9  133710835   T   G        0.389  NM_005157.6   9  133710835    1   
27   9  133710836   G   A        0.429  NM_005157.6   9  133710836    1   
28   9  133710836   G   C        0.418  NM_005157.6   9  133710836    1   
29   9  133710836   G   T        0.418  NM_005157.6   9  133710836    1   
30   9  133710837   T   A        0.381  NM_005157.6   9  133710837    1   
31   9  133710837   T   G        0.278  NM_005157.6   9  133710837    1   
32   9  133710838   T   C        0.378  NM_005157.6   9  133710838    1   
33   9  133710838   T   G        0.327  NM_005157.6   9  133710838    1   
34   9  133710839   G   C        0.352  NM_005157.6   9  133710839    1   
35   9  133710839   G   T        0.352  NM_005157.6   9  133710839    1   

           P  
21  0.571001  
22  0.576001  
23  0.568001  
24  0.605001  
25  0.614001  
26  0.611001  
27  0.571001  
28  0.412194  
29  0.576668  
30  0.619001  
31  0.722001  
32  0.622001  
33  0.673001  
34  0.786122  
35  0.707951
IGV_table_limited = pd.DataFrame(d)

m = IGV_table_limited.duplicated(subset=['Chr','Start', 'Ref','Revel_Score'],keep=False)

arr = np.arange(0.01, 0.1 + 0.01, 0.01)
print (arr)
[0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 0.1 ]

IGV_table_limited.loc[m, 'P'] += np.random.choice(arr, size=m.sum())
print (IGV_table_limited)
   Chr      Start Ref Alt  Revel_Score       RefSeq CHR         BP  SNP  \
21   9  133710834   A   C        0.429  NM_005157.6   9  133710834    1   
22   9  133710834   A   G        0.424  NM_005157.6   9  133710834    1   
23   9  133710834   A   T        0.432  NM_005157.6   9  133710834    1   
24   9  133710835   T   A        0.395  NM_005157.6   9  133710835    1   
25   9  133710835   T   C        0.386  NM_005157.6   9  133710835    1   
26   9  133710835   T   G        0.389  NM_005157.6   9  133710835    1   
27   9  133710836   G   A        0.429  NM_005157.6   9  133710836    1   
28   9  133710836   G   C        0.418  NM_005157.6   9  133710836    1   
29   9  133710836   G   T        0.418  NM_005157.6   9  133710836    1   
30   9  133710837   T   A        0.381  NM_005157.6   9  133710837    1   
31   9  133710837   T   G        0.278  NM_005157.6   9  133710837    1   
32   9  133710838   T   C        0.378  NM_005157.6   9  133710838    1   
33   9  133710838   T   G        0.327  NM_005157.6   9  133710838    1   
34   9  133710839   G   C        0.352  NM_005157.6   9  133710839    1   
35   9  133710839   G   T        0.352  NM_005157.6   9  133710839    1   

           P  
21  0.571001  
22  0.576001  
23  0.568001  
24  0.605001  
25  0.614001  
26  0.611001  
27  0.571001  
28  0.422001  
29  0.652001  
30  0.619001  
31  0.722001  
32  0.622001  
33  0.673001  
34  0.688001  
35  0.688001

相关问题