linux 如何使用Bash脚本查找FASTA文件的GC内容?

brtdzjyr  于 2023-08-03  发布在  Linux
关注(0)|答案(3)|浏览(113)

我想使用Bash脚本从FASTA format文件中找到GC-content。GC含量基本上是 ((G + C)的数量)/((A + T + G + C)的数量)
我正在尝试使用wc命令。但我没能得到答案。
在浏览了文档和视频之后,我想出了一个解决方案。

filename=$@                     # Collecting all the filenames as parameters
for f in $filename              # Looping over files
do
    echo " $f is being processed..."
    gc=( $( grep -v ">" < "$f" | grep -io 'g\|c'< "$f" | wc -l))    # Reading lines that don’t start with < using -v. grep -io matches to either g or c and outputs each match on single line. wc -l counts the number of lines or indirectly the number of g and c. This is stored in a variable.
    total=( $( grep -v ">" < "$f" | tr -d '\s\r' | wc -c))          # Spaces, tabs, new line are removed from the file using tr. Then the number of characters are counted by wc -c
    percent=( $( echo "scale=2;100*$gc/$total" |bc -l))             # bc -l is used to get the answer in float format. scale=2 mentions the number of decimal points.
    echo " The GC content of $f is: "$percent"%"
    echo
done

字符串
我在学生物信息学。

gcmastyq

gcmastyq1#

不要重新发明轮子。对于常见的生物信息学任务,请使用专门为这些任务设计的开源工具,这些工具经过良好的测试,广泛使用,并处理边缘情况。例如,使用EMBOSSinfoseq实用程序。EMBOSS可以很容易地安装,例如使用conda
示例如下:
安装EMBOSS软件包(执行一次):

conda create --name emboss emboss --channel iuc

字符串
激活conda环境并使用EMBOSSinfoseq,在这里打印序列名称、长度和GC百分比:

source activate emboss

cat your_sequence_file_name.fasta | infoseq -auto -only -name -length -pgc stdin

source deactivate


这将打印到标准输出中,如下所示:

Name           Length %GC
seq_foo        119    60.50
seq_bar        104    39.42
seq_baz        191    46.60
...

yquaqz18

yquaqz182#

你可以使用基本的文本处理工具,如grep,awk和wc:

#!/bin/bash

fasta_file="your_fasta_file.fa"

# Extract the sequence from the FASTA file, excluding the header
sequence=$(grep -v '^>' "$fasta_file" | tr -d '\n')

# Calculate the number of GC and ATCG bases in the sequence
gc_count=$(echo -n "$sequence" | grep -o '[GCgc]' | wc -l)
atcg_count=$(echo -n "$sequence" | grep -o '[ATCGatcg]' | wc -l)

# Calculate the GC content
gc_content=$(awk "BEGIN { print ($gc_count / $atcg_count) * 100 }")

# Print the GC content
echo "GC content: $gc_content%"

字符串

  • 使用grep从FASTA文件中排除以>(标题行)开头的行,使用tr删除换行符,从而生成存储在sequence变量中的单行序列。
  • 使用grep -o '[GCgc]'查找序列中出现的'G'或'C'(不区分大小写),使用wc -l计算出现次数。将计数存储在gc_count变量中。
  • 使用grep -o '[ATCGatcg]'查找序列中出现的任何核苷酸('A'、'T'、'C'或'G'),使用wc -l计算出现次数。将计数存储在atcg_count变量中。
  • 使用awk计算GC含量,方法是将gc_count除以atcg_count,将结果乘以100并打印出来。
8gsdolmq

8gsdolmq3#

这应该可以工作:

#!/usr/bin/env sh
# Adapted from https://www.biostars.org/p/17680

# Fail on error
set -o errexit
# Disable undefined variable reference
set -o nounset

# ================
# CONFIGURATION
# ================
# Fasta file path
FASTA_FILE="file.fasta"
# Number of digits after decimal point
N_DIGITS=3

# ================
# LOGGER
# ================
# Fatal log message
fatal() {
  printf '[FATAL] %s\n' "$@" >&2
  exit 1
}

# Info log message
info() {
  printf '[INFO ] %s\n' "$@"
}

# ================
# MAIN
# ================
{
  # Check command 'bc' exist
  command -v bc > /dev/null 2>&1 || fatal "Command 'bc' not found"
  # Check file exist
  [ -f "$FASTA_FILE" ] || fatal "File '$FASTA_FILE' not found"

  # Count number of sequences
  _n_sequences=$(grep --count '^>' "$FASTA_FILE")
  info "Analyzing $_n_sequences sequences"
  [ "$_n_sequences" -ne 0 ] || fatal "No sequences found"

  # Remove sequence wrapping
  _fasta_file_content=$(
    sed 's/\(^>.*$\)/#\1#/' "$FASTA_FILE" \
      | tr --delete "\r\n" \
      | sed 's/$/#/' \
      | tr "#" "\n" \
      | sed '/^$/d'
  )

  # Vars
  _sequence=
  _a_count_total=0
  _c_count_total=0
  _g_count_total=0
  _t_count_total=0

  # Read line by line
  while IFS= read -r _line; do
    # Check if header
    if printf '%s\n' "$_line" | grep --quiet '^>'; then
      # Save sequence and continue
      _sequence=${_line#?}
      continue
    fi

    # Count
    _a_count=$(printf '%s\n' "$_line" | tr --delete --complement 'A' | wc --bytes)
    _c_count=$(printf '%s\n' "$_line" | tr --delete --complement 'C' | wc --bytes)
    _g_count=$(printf '%s\n' "$_line" | tr --delete --complement 'G' | wc --bytes)
    _t_count=$(printf '%s\n' "$_line" | tr --delete --complement 'T' | wc --bytes)

    # Add current count to total
    _a_count_total=$((_a_count_total + _a_count))
    _c_count_total=$((_c_count_total + _c_count))
    _g_count_total=$((_g_count_total + _g_count))
    _t_count_total=$((_t_count_total + _t_count))

    # Calculate GC content
    _gc=$(
      printf 'scale = %d; a = %d; c = %d; g = %d; t = %d; (g + c) / (a + c + g + t)\n' \
        "$N_DIGITS" "$_a_count" "$_c_count" "$_g_count" "$_t_count" \
        | bc
    )
    # Add 0 before decimal point
    _gc="$(printf "%.${N_DIGITS}f\n" "$_gc")"

    info "Sequence '$_sequence' GC content: $_gc"
  done << EOF
$_fasta_file_content
EOF

  # Total data
  info "Adenine total count: $_a_count_total"
  info "Cytosine total count: $_c_count_total"
  info "Guanine total count: $_g_count_total"
  info "Thymine total count: $_t_count_total"

  # Calculate total GC content
  _gc=$(
    printf 'scale = %d; a = %d; c = %d; g = %d; t = %d; (g + c) / (a + c + g + t)\n' \
      "$N_DIGITS" "$_a_count_total" "$_c_count_total" "$_g_count_total" "$_t_count_total" \
      | bc
  )
  # Add 0 before decimal point
  _gc="$(printf "%.${N_DIGITS}f\n" "$_gc")"
  info "GC content: $_gc"
}

字符串
Count number of sequences”和“Remove sequence wrapping”代码改编自https://www.biostars.org/p/17680
该脚本仅使用除bc之外的基本命令来执行精度计算(请参见bc installation)。
您可以通过修改CONFIGURATION部分中的变量来配置脚本。
因为您还没有指出您想要哪一个,所以GC含量是为每个序列和整个序列计算的。因此,删除任何不必要的东西:)
尽管我缺乏生物信息学背景,但该脚本成功地解析和分析了一个fasta文件。

相关问题