已关闭。此问题需要超过focused。当前不接受答案。
**想要改进此问题吗?**更新此问题,使其仅关注editing this post的一个问题。
11小时前关门了。
Improve this question
我有一个大数据集,如下所示:
>Cluster 0
0 310aa, >ref_ENST00000279791... at 100.00%
1 415aa, >ref_ENST00000641310... *
>Cluster 1
0 310aa, >ENST00000279791.590... at 100.00%
1 310aa, >ENST00000332650.693... at 100.00%
2 413aa, >ENST00000641310.590... *
3 310aa, >ENST00000279791.590... at 99.35%
4 310aa, >ENST00000332650.693... at 99.35%
>Cluster 2
0 399aa, >ENST00000641310.394... *
>Cluster 3
0 311aa, >ENST00000641081.179... at 96.14%
1 395aa, >ENST00000641310.395... *
2 311aa, >ENST00000641581.842... at 96.14%
3 311aa, >ENST00000641668.842... at 96.14%
4 311aa, >ENST00000641081.179... at 96.14%
5 299aa, >ENST00000641310.395... at 100.00%
6 311aa, >ENST00000641581.842... at 96.14%
7 311aa, >ENST00000641668.842... at 96.14%
>Cluster 4
0 380aa, >ENST00000641310.583... *
1 314aa, >ENST00000332238.915... at 95.86%
2 310aa, >ENST00000641310.583... at 97.10%
>Cluster 5
0 370aa, >ref_ENST00000314644... *
1 316aa, >ref_ENST00000642128... at 100.00%
>Cluster 6
0 367aa, >ENST00000641310.213... *
1 326aa, >ENST00000531945.112... at 96.32%
2 319aa, >ENST00000641123.112... at 98.12%
3 313aa, >ENST00000641310.213... at 99.68%
>Cluster 7
0 367aa, >ENST00000641310.284... *
这是一个具有不同蛋白质序列ID的簇文件。仅提取簇中作为单态存在的那些行或蛋白质序列。
输出必须为(来自群集2和群集7):
ENST00000641310.394
ENST00000641310.284
如果有任何疑问请告诉我。
1条答案
按热度按时间hi3rlvi21#
如果我把你的数据放在一个名为protein.txt的文件中,那么我可以在Linux上这样做(注意
RS='>Cluster'
需要GNU awk
):这给了我一个单态的簇中的行:
这就是你要找的吗?