使用Bash脚本如何找到FASTA文件的GC含量?

huangapple go评论69阅读模式
英文:

How can I find the GC-content of a FASTA file using a Bash script?

问题

以下是翻译好的部分:

"I want to find the GC-content from a FASTA format format file using a Bash script.
The GC-content is basically (number of (G + C)) / (number of (A + T + G + C)).

I am trying to use the wc command. But I was not able to get an answer.


After going through documentation and videos, I came up with a solution.

filename=$@                     # Collecting all the filenames as parameters
for f in $filename              # Looping over files
do
    echo " $f is being processed..."
    gc=( $( grep -v ">" < "$f" | grep -io 'g\|c'< "$f" | wc -l))    # Reading lines that don’t start with < using -v. grep -io matches to either g or c and outputs each match on single line. wc -l counts the number of lines or indirectly the number of g and c. This is stored in a variable.
    total=( $( grep -v ">" < "$f" | tr -d '\s\r' | wc -c))          # Spaces, tabs, new line are removed from the file using tr. Then the number of characters are counted by wc -c
    percent=( $( echo "scale=2;100*$gc/$total" |bc -l))             # bc -l is used to get the answer in float format. scale=2 mentions the number of decimal points.
    echo " The GC content of $f is: "$percent"%"
    echo
done

I am learning bioinformatics.

英文:

I want to find the GC-content from a FASTA format format file using a Bash script.
The GC-content is basically (number of (G + C)) / (number of (A + T + G + C)).

I am trying to use the wc command. But I was not able to get an answer.


After going through documentation and videos, I came up with a solution.

filename=$@                     # Collecting all the filenames as parameters
for f in $filename              # Looping over files
do
    echo " $f is being processed..."
    gc=( $( grep -v ">" < "$f" | grep -io 'g\|c'< "$f" | wc -l))    # Reading lines that don’t start with < using -v. grep -io matches to either g or c and outputs each match on single line. wc -l counts the number of lines or indirectly the number of g and c. This is stored in a variable.
    total=( $( grep -v ">" < "$f" | tr -d '\s\r' | wc -c))          # Spaces, tabs, new line are removed from the file using tr. Then the number of characters are counted by wc -c
    percent=( $( echo "scale=2;100*$gc/$total" |bc -l))             # bc -l is used to get the answer in float format. scale=2 mentions the number of decimal points.
    echo " The GC content of $f is: "$percent"%"
    echo
done

I am learning bioinformatics.

答案1

得分: 1

不要重新发明轮子。对于常见的生物信息学任务,请使用专为这些任务设计、经过充分测试、被广泛使用并处理边缘情况的开源工具。例如,使用 EMBOSSinfoseq 实用程序。EMBOSS 可以轻松安装,例如使用 conda

示例:

安装 EMBOSS 包(仅需执行一次):

conda create --name emboss emboss --channel iuc

激活 conda 环境并使用 EMBOSSinfoseq,这里用于打印序列名称、长度和 GC 含量:

source activate emboss

cat your_sequence_file_name.fasta | infoseq -auto -only -name -length -pgc stdin

source deactivate

这将在 标准输出 中打印类似以下的内容:

Name           Length %GC
seq_foo        119    60.50
seq_bar        104    39.42
seq_baz        191    46.60
...
英文:

Do not reinvent the wheel. For common bioinformatics tasks, use open-source tools that are specifically designed for these tasks, are well-tested, widely used, and handle edge cases. For example, use EMBOSS infoseq utility. EMBOSS can be easily installed, for example using conda.

Example:

Install EMBOSS package (do once):

conda create --name emboss emboss --channel iuc

Activate the conda environment and use EMBOSS infoseq, here to print the sequence name, length and percent GC:

source activate emboss

cat your_sequence_file_name.fasta | infoseq -auto -only -name -length -pgc stdin

source deactivate

This prints into standard output something like this:

Name           Length %GC
seq_foo        119    60.50
seq_bar        104    39.42
seq_baz        191    46.60
...

答案2

得分: -1

这应该有效:

#!/usr/bin/env sh
# 改编自 https://www.biostars.org/p/17680

# 遇到错误时退出
set -o errexit
# 禁用未定义变量引用
set -o nounset

# ===================
# 配置
# ===================
# Fasta 文件路径
FASTA_FILE="file.fasta"
# 小数点后的位数
N_DIGITS=3

# ===================
# 日志记录
# ===================
# 致命错误日志消息
fatal() {
  printf '[FATAL] %s\n' "$@" >&2
  exit 1
}

# 信息日志消息
info() {
  printf '[INFO ] %s\n' "$@"
}

# ===================
# 主程序
# ===================
{
  # 检查命令 'bc' 是否存在
  command -v bc > /dev/null 2>&1 || fatal "找不到命令 'bc'"
  # 检查文件是否存在
  [ -f "$FASTA_FILE" ] || fatal "找不到文件 '$FASTA_FILE'"

  # 统计序列的数量
  _n_sequences=$(grep --count '^>' "$FASTA_FILE")
  info "分析 $_n_sequences 个序列"
  [ "$_n_sequences" -ne 0 ] || fatal "找不到任何序列"

  # 移除序列的换行符
  _fasta_file_content=$(
    sed 's/\(^>.*$\)/#\1#/' "$FASTA_FILE" \
      | tr --delete "\r\n" \
      | sed 's/$/#/' \
      | tr "#" "\n" \
      | sed '/^$/d'
  )

  # 变量
  _sequence=
  _a_count_total=0
  _c_count_total=0
  _g_count_total=0
  _t_count_total=0

  # 逐行读取
  while IFS= read -r _line; do
    # 检查是否为标题行
    if printf '%s\n' "$_line" | grep --quiet '^>'; then
      # 保存序列并继续
      _sequence=${_line#?}
      continue
    fi

    # 统计
    _a_count=$(printf '%s\n' "$_line" | tr --delete --complement 'A' | wc --bytes)
    _c_count=$(printf '%s\n' "$_line" | tr --delete --complement 'C' | wc --bytes)
    _g_count=$(printf '%s\n' "$_line" | tr --delete --complement 'G' | wc --bytes)
    _t_count=$(printf '%s\n' "$_line" | tr --delete --complement 'T' | wc --bytes)

    # 将当前计数添加到总计
    _a_count_total=$((_a_count_total + _a_count))
    _c_count_total=$((_c_count_total + _c_count))
    _g_count_total=$((_g_count_total + _g_count))
    _t_count_total=$((_t_count_total + _t_count))

    # 计算 GC 含量
    _gc=$(
      printf 'scale = %d; a = %d; c = %d; g = %d; t = %d; (g + c) / (a + c + g + t)\n' \
        "$N_DIGITS" "$_a_count" "$_c_count" "$_g_count" "$_t_count" \
        | bc
    )
    # 在小数点前添加 0
    _gc="$(printf "%.${N_DIGITS}f\n" "$_gc")"
    
    info "序列 '$_sequence' 的 GC 含量: $_gc"
  done <<EOF
$_fasta_file_content
EOF

  # 总计数据
  info "腺嘌呤总计数: $_a_count_total"
  info "胞嘧啶总计数: $_c_count_total"
  info "鸟嘌呤总计数: $_g_count_total"
  info "胸腺嘧啶总计数: $_t_count_total"

  # 计算总的 GC 含量
  _gc=$(
    printf 'scale = %d; a = %d; c = %d; g = %d; t = %d; (g + c) / (a + c + g + t)\n' \
      "$N_DIGITS" "$_a_count_total" "$_c_count_total" "$_g_count_total" "$_t_count_total" \
      | bc
  )
  # 在小数点前添加 0
  _gc="$(printf "%.${N_DIGITS}f\n" "$_gc")"
  info "GC 含量: $_gc"
}

"统计序列的数量" 和 "移除序列的换行符" 代码改编自 https://www.biostars.org/p/17680

脚本只使用了基本命令,除了 bc 用于进行精确计算(请参阅 bc 安装)。

您可以通过修改“配置”部分中的变量来配置脚本。

由于您没有指明要哪个部分,因此计算了每个序列和整体的 GC 含量。因此,您可以删除不需要的部分 使用Bash脚本如何找到FASTA文件的GC含量? 尽管我没有生物信息学背景,但该脚本成功解析和分析了 fasta 文件。

英文:

This should work:

#!/usr/bin/env sh
# Adapted from https://www.biostars.org/p/17680

# Fail on error
set -o errexit
# Disable undefined variable reference
set -o nounset

# ================
# CONFIGURATION
# ================
# Fasta file path
FASTA_FILE=&quot;file.fasta&quot;
# Number of digits after decimal point
N_DIGITS=3

# ================
# LOGGER
# ================
# Fatal log message
fatal() {
  printf &#39;[FATAL] %s\n&#39; &quot;$@&quot; &gt;&amp;2
  exit 1
}

# Info log message
info() {
  printf &#39;[INFO ] %s\n&#39; &quot;$@&quot;
}

# ================
# MAIN
# ================
{
  # Check command &#39;bc&#39; exist
  command -v bc &gt; /dev/null 2&gt;&amp;1 || fatal &quot;Command &#39;bc&#39; not found&quot;
  # Check file exist
  [ -f &quot;$FASTA_FILE&quot; ] || fatal &quot;File &#39;$FASTA_FILE&#39; not found&quot;

  # Count number of sequences
  _n_sequences=$(grep --count &#39;^&gt;&#39; &quot;$FASTA_FILE&quot;)
  info &quot;Analyzing $_n_sequences sequences&quot;
  [ &quot;$_n_sequences&quot; -ne 0 ] || fatal &quot;No sequences found&quot;

  # Remove sequence wrapping
  _fasta_file_content=$(
    sed &#39;s/\(^&gt;.*$\)/#\1#/&#39; &quot;$FASTA_FILE&quot; \
      | tr --delete &quot;\r\n&quot; \
      | sed &#39;s/$/#/&#39; \
      | tr &quot;#&quot; &quot;\n&quot; \
      | sed &#39;/^$/d&#39;
  )

  # Vars
  _sequence=
  _a_count_total=0
  _c_count_total=0
  _g_count_total=0
  _t_count_total=0

  # Read line by line
  while IFS= read -r _line; do
    # Check if header
    if printf &#39;%s\n&#39; &quot;$_line&quot; | grep --quiet &#39;^&gt;&#39;; then
      # Save sequence and continue
      _sequence=${_line#?}
      continue
    fi

    # Count
    _a_count=$(printf &#39;%s\n&#39; &quot;$_line&quot; | tr --delete --complement &#39;A&#39; | wc --bytes)
    _c_count=$(printf &#39;%s\n&#39; &quot;$_line&quot; | tr --delete --complement &#39;C&#39; | wc --bytes)
    _g_count=$(printf &#39;%s\n&#39; &quot;$_line&quot; | tr --delete --complement &#39;G&#39; | wc --bytes)
    _t_count=$(printf &#39;%s\n&#39; &quot;$_line&quot; | tr --delete --complement &#39;T&#39; | wc --bytes)

    # Add current count to total
    _a_count_total=$((_a_count_total + _a_count))
    _c_count_total=$((_c_count_total + _c_count))
    _g_count_total=$((_g_count_total + _g_count))
    _t_count_total=$((_t_count_total + _t_count))

    # Calculate GC content
    _gc=$(
      printf &#39;scale = %d; a = %d; c = %d; g = %d; t = %d; (g + c) / (a + c + g + t)\n&#39; \
        &quot;$N_DIGITS&quot; &quot;$_a_count&quot; &quot;$_c_count&quot; &quot;$_g_count&quot; &quot;$_t_count&quot; \
        | bc
    )
    # Add 0 before decimal point
    _gc=&quot;$(printf &quot;%.${N_DIGITS}f\n&quot; &quot;$_gc&quot;)&quot;

    info &quot;Sequence &#39;$_sequence&#39; GC content: $_gc&quot;
  done &lt;&lt; EOF
$_fasta_file_content
EOF

  # Total data
  info &quot;Adenine total count: $_a_count_total&quot;
  info &quot;Cytosine total count: $_c_count_total&quot;
  info &quot;Guanine total count: $_g_count_total&quot;
  info &quot;Thymine total count: $_t_count_total&quot;

  # Calculate total GC content
  _gc=$(
    printf &#39;scale = %d; a = %d; c = %d; g = %d; t = %d; (g + c) / (a + c + g + t)\n&#39; \
      &quot;$N_DIGITS&quot; &quot;$_a_count_total&quot; &quot;$_c_count_total&quot; &quot;$_g_count_total&quot; &quot;$_t_count_total&quot; \
      | bc
  )
  # Add 0 before decimal point
  _gc=&quot;$(printf &quot;%.${N_DIGITS}f\n&quot; &quot;$_gc&quot;)&quot;
  info &quot;GC content: $_gc&quot;
}

The "Count number of sequences" and "Remove sequence wrapping" codes are adapted from https://www.biostars.org/p/17680

The script uses only basic commands except for bc to do the precision calculation (See bc installation).

You can configure the script by modifying the variables in the CONFIGURATION section.

Because you haven't indicated which one you want, the GC content is calculated for both each sequence and the overall. Therefore, get rid of anything that isn't necessary 使用Bash脚本如何找到FASTA文件的GC含量?

Despite my lack of bioinformatics background, the script successfully parses and analyzes a fasta file.

huangapple
  • 本文由 发表于 2023年2月8日 15:58:23
  • 转载请务必保留本文链接:https://go.coder-hub.com/75382780.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定