linux: How to extract protein sequence IDs that exist as singletons in clusters? [closed]

rqqzpn5f, posted on 2023-06-05 in Linux

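I have a large dataset of protein sequence clusters. Each cluster is represented by a header line with the cluster number, followed by lines listing the protein sequences found in that cluster. Some protein sequences appear multiple times within a cluster, while others appear only once (i.e., singletons). I want to extract the protein sequence IDs that exist as singletons in each cluster.
Here is an example of the dataset: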

>Cluster 0
0       310aa, >ref_ENST00000279791... at 100.00%
1       415aa, >ref_ENST00000641310... *
>Cluster 1
0       310aa, >ENST00000279791.590... at 100.00%
1       310aa, >ENST00000332650.693... at 100.00%
2       413aa, >ENST00000641310.590... *
3       310aa, >ENST00000279791.590... at 99.35%
4       310aa, >ENST00000332650.693... at 99.35%
>Cluster 2
0       399aa, >ENST00000641310.394... *
>Cluster 3
0       311aa, >ENST00000641081.179... at 96.14%
1       395aa, >ENST00000641310.395... *
2       311aa, >ENST00000641581.842... at 96.14%
3       311aa, >ENST00000641668.842... at 96.14%
4       311aa, >ENST00000641081.179... at 96.14%
5       299aa, >ENST00000641310.395... at 100.00%
6       311aa, >ENST00000641581.842... at 96.14%
7       311aa, >ENST00000641668.842... at 96.14%
>Cluster 4
0       380aa, >ENST00000641310.583... *
1       314aa, >ENST00000332238.915... at 95.86%
2       310aa, >ENST00000641310.583... at 97.10%
>Cluster 5
0       370aa, >ref_ENST00000314644... *
1       316aa, >ref_ENST00000642128... at 100.00%
>Cluster 6
0       367aa, >ENST00000641310.213... *
1       326aa, >ENST00000531945.112... at 96.32%
2       319aa, >ENST00000641123.112... at 98.12%
3       313aa, >ENST00000641310.213... at 99.68%
>Cluster 7
0       367aa, >ENST00000641310.284... *

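In this example, I want to extract the protein sequence IDs that appear only once in a cluster (i.e., singletons). For the dataset above, the expected output should contain the following protein sequence IDs: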

ENST00000641310.394
ENST00000641310.284

#!/bin/bash

# Assuming the dataset is stored in a file called "dataset.txt"
input_file="dataset.txt"

# Loop through each line in the input file
while IFS= read -r line; do
  # Check if the line starts with ">Cluster"
  if [[ $line == ">Cluster"* ]]; then
    cluster_number=${line#>Cluster }
    cluster_number=${cluster_number//[^0-9]/}
    cluster_found=false
  fi

  # Check if the line contains a singleton protein sequence
  if [[ $line == *"... *" ]]; then
    protein_sequence=$(echo "$line" | awk -F"[>, ]" '{print $4}')
    cluster_found=true
  fi

  # Print the singleton protein sequence if a cluster was found
  if [[ $cluster_found == true ]]; then
    echo "$protein_sequence"
  fi
done < "$input_file"

I tried the script above, but it does not work.
Please let me know if you have any questions.

mmvthczy1#

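If I put your data in a file named protein.txt, I can do the following on Linux (note that a multi-character RS such as RS='>Cluster' is not guaranteed by POSIX awk; GNU awk supports it):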

awk -F'\n' -v RS='>Cluster' 'NF==3' protein.txt

This gives me the lines from the clusters that contain a single sequence:

 2
0       399aa, >ENST00000641310.394... *

 7
0       367aa, >ENST00000641310.284... *
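
The command above prints whole records. With a newline as the field separator, a single-member cluster becomes a record with three fields: the cluster number, the member line, and the empty string after the final newline, which is why NF==3 selects it. A minimal sketch that reduces each such record to the bare ID could look like this. The function name singleton_ids is hypothetical, and the sketch assumes an awk that accepts a multi-character RS (such as GNU awk), an input file that ends with a newline, and IDs that sit between ">" and "..." as in the sample data.

```shell
# Hypothetical helper: print the ID of every cluster that has exactly one member.
# Assumes an awk with multi-character RS support (e.g. GNU awk).
singleton_ids() {
  awk -F'\n' -v RS='>Cluster' 'NF==3 {
    id = $2                    # the only member line of this cluster
    sub(/^.*>/, "", id)        # drop everything up to and including ">"
    sub(/\.\.\..*$/, "", id)   # drop the "..." suffix and whatever follows it
    print id
  }' "$1"
}
```

Calling `singleton_ids protein.txt` on the sample data should then list only the two IDs given as the expected output in the question.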

Is that what you are looking for?
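
If GNU awk is not available, the same result can probably be reached in a single pass without a multi-character RS, by counting the member lines of each cluster and printing the ID only when the count is one. This is a sketch under assumptions, not a tested drop-in: the function name singleton_ids_portable is hypothetical, member lines are assumed to start with their numeric index, and the ID is assumed to sit between ">" and "..." as in the sample.

```shell
# Hypothetical helper: single-pass variant that relies only on POSIX awk features.
singleton_ids_portable() {
  awk '
    # Print the remembered ID if the cluster that just ended had one member.
    function emit() { if (members == 1) print id }

    /^>Cluster/ { emit(); members = 0; next }   # a header line starts a new cluster

    /^[0-9]/ {                                  # member lines start with their index
      members++
      id = $0
      sub(/^.*>/, "", id)        # keep only what follows ">"
      sub(/\.\.\..*$/, "", id)   # cut the "..." suffix and the rest of the line
    }

    END { emit() }                              # handle the last cluster in the file
  ' "$1"
}
```

Unlike the record-based version, this one does not depend on the file ending with a newline, because it never counts fields inside a record.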
