linux How to extract protein sequence IDs that are present as singletons in a cluster? [closed]

Posted by rqqzpn5f on 2023-06-05 in Linux


Closed 2 months ago. This question needs to be more focused and is not currently accepting answers.

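I have a large dataset containing clusters of protein sequences. Each cluster is represented by a cluster number followed by lines listing the protein sequences found in that cluster. Some protein sequences appear several times within a cluster, while others appear only once (i.e., singletons). I want to extract the protein sequence IDs that are present as singletons in each cluster.
Here is an example of the dataset: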

>Cluster 0
0       310aa, >ref_ENST00000279791... at 100.00%
1       415aa, >ref_ENST00000641310... *
>Cluster 1
0       310aa, >ENST00000279791.590... at 100.00%
1       310aa, >ENST00000332650.693... at 100.00%
2       413aa, >ENST00000641310.590... *
3       310aa, >ENST00000279791.590... at 99.35%
4       310aa, >ENST00000332650.693... at 99.35%
>Cluster 2
0       399aa, >ENST00000641310.394... *
>Cluster 3
0       311aa, >ENST00000641081.179... at 96.14%
1       395aa, >ENST00000641310.395... *
2       311aa, >ENST00000641581.842... at 96.14%
3       311aa, >ENST00000641668.842... at 96.14%
4       311aa, >ENST00000641081.179... at 96.14%
5       299aa, >ENST00000641310.395... at 100.00%
6       311aa, >ENST00000641581.842... at 96.14%
7       311aa, >ENST00000641668.842... at 96.14%
>Cluster 4
0       380aa, >ENST00000641310.583... *
1       314aa, >ENST00000332238.915... at 95.86%
2       310aa, >ENST00000641310.583... at 97.10%
>Cluster 5
0       370aa, >ref_ENST00000314644... *
1       316aa, >ref_ENST00000642128... at 100.00%
>Cluster 6
0       367aa, >ENST00000641310.213... *
1       326aa, >ENST00000531945.112... at 96.32%
2       319aa, >ENST00000641123.112... at 98.12%
3       313aa, >ENST00000641310.213... at 99.68%
>Cluster 7
0       367aa, >ENST00000641310.284... *

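In this example, I want to extract the protein sequence IDs that appear only once in each cluster (i.e., singletons). For the dataset above, the expected output should contain the following protein sequence IDs: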

ENST00000641310.394
ENST00000641310.284

#!/bin/bash

# Assuming the dataset is stored in a file called "dataset.txt"
input_file="dataset.txt"

# Loop through each line in the input file
while IFS= read -r line; do
  # Check if the line starts with ">Cluster"
  if [[ $line == ">Cluster"* ]]; then
    cluster_number=${line#>Cluster }
    cluster_number=${cluster_number//[^0-9]/}
    cluster_found=false
  fi

  # Check if the line contains a singleton protein sequence
  if [[ $line == *"... *" ]]; then
    protein_sequence=$(echo "$line" | awk -F"[>, ]" '{print $4}')
    cluster_found=true
  fi

  # Print the singleton protein sequence if a cluster was found
  if [[ $cluster_found == true ]]; then
    echo "$protein_sequence"
  fi
done < "$input_file"

I tried the script above, but it does not work.
Please let me know if you have any questions.

linux

Source: https://stackoverflow.com/questions/75864187/how-to-extract-protein-sequence-ids-that-are-present-as-singletons-in-a-cluster

1 Answer

mmvthczy1#

If I put your data in a file named protein.txt, I can do this on Linux (note that a multi-character record separator such as RS='>Cluster' requires GNU awk):

awk -F'\n' -v RS='>Cluster' 'NF==3' protein.txt

This gives me the lines of the clusters that contain only a single member:

2
0       399aa, >ENST00000641310.394... *

 7
0       367aa, >ENST00000641310.284... *

Is this what you are looking for?
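
The command above prints whole records rather than the bare IDs listed in the question. As a minimal sketch of one way to get only the IDs, the script below counts the member lines of each cluster and prints the ID when the count is exactly one. It assumes every member line has the layout shown in the question (index, length, then `>ID...`); the function name `singleton_ids` and the file name `sample_clusters.txt` are hypothetical and used for illustration only. It does not rely on a multi-character RS, so it should not need GNU awk.

```shell
#!/bin/sh
# Hypothetical helper: print the ID of every cluster that has exactly one member.
singleton_ids() {
  awk '
    # Print the buffered ID if the cluster that just ended had one member.
    function emit() { if (members == 1) print id }

    # A header line closes the previous cluster and starts a new one.
    /^>Cluster/ { emit(); members = 0; next }

    # Any other non-empty line is a member of the current cluster.
    NF {
      members++
      id = $3                  # third field looks like ">ENST00000641310.394..."
      sub(/^>/, "", id)        # drop the leading ">"
      sub(/\.\.\.$/, "", id)   # drop the trailing "..."
    }

    # The last cluster has no header after it, so handle it here.
    END { emit() }
  ' "$1"
}

# Small sample in the same layout as the data in the question (assumed file name).
cat > sample_clusters.txt <<'EOF'
>Cluster 0
0       310aa, >ref_ENST00000279791... at 100.00%
1       415aa, >ref_ENST00000641310... *
>Cluster 2
0       399aa, >ENST00000641310.394... *
>Cluster 7
0       367aa, >ENST00000641310.284... *
EOF

singleton_ids sample_clusters.txt
```

Run against the full dataset with `singleton_ids protein.txt`. Note that this treats a singleton as a cluster with one member line, which matches the expected output in the question; IDs that occur once inside a larger cluster are not reported.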

2023-06-05
