2023年2月8日 15:58:23go评论93阅读模式

英文:

How can I find the GC-content of a FASTA file using a Bash script?

问题

以下是翻译好的部分：

"I want to find the GC-content from a FASTA format format file using a Bash script.
The GC-content is basically (number of (G + C)) / (number of (A + T + G + C)).

I am trying to use the wc command. But I was not able to get an answer.

After going through documentation and videos, I came up with a solution.

filename=$@                     # Collecting all the filenames as parameters
for f in $filename              # Looping over files
do
    echo &quot; $f is being processed...&quot;
    gc=( $( grep -v &quot;&gt;&quot; &lt; &quot;$f&quot; | grep -io &#39;g\|c&#39;&lt; &quot;$f&quot; | wc -l))    # Reading lines that don’t start with &lt; using -v. grep -io matches to either g or c and outputs each match on single line. wc -l counts the number of lines or indirectly the number of g and c. This is stored in a variable.
    total=( $( grep -v &quot;&gt;&quot; &lt; &quot;$f&quot; | tr -d &#39;\s\r&#39; | wc -c))          # Spaces, tabs, new line are removed from the file using tr. Then the number of characters are counted by wc -c
    percent=( $( echo &quot;scale=2;100*$gc/$total&quot; |bc -l))             # bc -l is used to get the answer in float format. scale=2 mentions the number of decimal points.
    echo &quot; The GC content of $f is: &quot;$percent&quot;%&quot;
    echo
done

I am learning bioinformatics.

英文:

I want to find the GC-content from a FASTA format format file using a Bash script.
The GC-content is basically (number of (G + C)) / (number of (A + T + G + C)).

I am trying to use the wc command. But I was not able to get an answer.

After going through documentation and videos, I came up with a solution.

filename=$@                     # Collecting all the filenames as parameters
for f in $filename              # Looping over files
do
    echo &quot; $f is being processed...&quot;
    gc=( $( grep -v &quot;&gt;&quot; &lt; &quot;$f&quot; | grep -io &#39;g\|c&#39;&lt; &quot;$f&quot; | wc -l))    # Reading lines that don’t start with &lt; using -v. grep -io matches to either g or c and outputs each match on single line. wc -l counts the number of lines or indirectly the number of g and c. This is stored in a variable.
    total=( $( grep -v &quot;&gt;&quot; &lt; &quot;$f&quot; | tr -d &#39;\s\r&#39; | wc -c))          # Spaces, tabs, new line are removed from the file using tr. Then the number of characters are counted by wc -c
    percent=( $( echo &quot;scale=2;100*$gc/$total&quot; |bc -l))             # bc -l is used to get the answer in float format. scale=2 mentions the number of decimal points.
    echo &quot; The GC content of $f is: &quot;$percent&quot;%&quot;
    echo
done

I am learning bioinformatics.

答案1

得分: 1

不要重新发明轮子。对于常见的生物信息学任务，请使用专为这些任务设计、经过充分测试、被广泛使用并处理边缘情况的开源工具。例如，使用 EMBOSS 的 infoseq 实用程序。EMBOSS 可以轻松安装，例如使用 conda。

示例：

安装 EMBOSS 包（仅需执行一次）：

conda create --name emboss emboss --channel iuc

激活 conda 环境并使用 EMBOSS 的 infoseq，这里用于打印序列名称、长度和 GC 含量：

source activate emboss
cat your_sequence_file_name.fasta | infoseq -auto -only -name -length -pgc stdin
source deactivate

这将在标准输出中打印类似以下的内容：

Name           Length %GC
seq_foo        119    60.50
seq_bar        104    39.42
seq_baz        191    46.60
...

英文:

Do not reinvent the wheel. For common bioinformatics tasks, use open-source tools that are specifically designed for these tasks, are well-tested, widely used, and handle edge cases. For example, use EMBOSS infoseq utility. EMBOSS can be easily installed, for example using conda.

Example:

Install EMBOSS package (do once):

conda create --name emboss emboss --channel iuc

Activate the conda environment and use EMBOSS infoseq, here to print the sequence name, length and percent GC:

source activate emboss
cat your_sequence_file_name.fasta | infoseq -auto -only -name -length -pgc stdin
source deactivate

This prints into standard output something like this:

Name           Length %GC
seq_foo        119    60.50
seq_bar        104    39.42
seq_baz        191    46.60
...

答案2

得分: -1

这应该有效：

#!/usr/bin/env sh
# 改编自 https://www.biostars.org/p/17680
# 遇到错误时退出
set -o errexit
# 禁用未定义变量引用
set -o nounset
# ===================
# 配置
# ===================
# Fasta 文件路径
FASTA_FILE="file.fasta"
# 小数点后的位数
N_DIGITS=3
# ===================
# 日志记录
# ===================
# 致命错误日志消息
fatal() {
  printf '[FATAL] %s\n' "$@" >&2
  exit 1
}
# 信息日志消息
info() {
  printf '[INFO ] %s\n' "$@"
}
# ===================
# 主程序
# ===================
{
  # 检查命令 'bc' 是否存在
  command -v bc > /dev/null 2>&1 || fatal "找不到命令 'bc'"
  # 检查文件是否存在
  [ -f "$FASTA_FILE" ] || fatal "找不到文件 '$FASTA_FILE'"
  # 统计序列的数量
  _n_sequences=$(grep --count '^>' "$FASTA_FILE")
  info "分析 $_n_sequences 个序列"
  [ "$_n_sequences" -ne 0 ] || fatal "找不到任何序列"
  # 移除序列的换行符
  _fasta_file_content=$(
    sed 's/\(^>.*$\)/#\1#/' "$FASTA_FILE" \
      | tr --delete "\r\n" \
      | sed 's/$/#/' \
      | tr "#" "\n" \
      | sed '/^$/d'
  )
  # 变量
  _sequence=
  _a_count_total=0
  _c_count_total=0
  _g_count_total=0
  _t_count_total=0
  # 逐行读取
  while IFS= read -r _line; do
    # 检查是否为标题行
    if printf '%s\n' "$_line" | grep --quiet '^>'; then
      # 保存序列并继续
      _sequence=${_line#?}
      continue
    fi
    # 统计
    _a_count=$(printf '%s\n' "$_line" | tr --delete --complement 'A' | wc --bytes)
    _c_count=$(printf '%s\n' "$_line" | tr --delete --complement 'C' | wc --bytes)
    _g_count=$(printf '%s\n' "$_line" | tr --delete --complement 'G' | wc --bytes)
    _t_count=$(printf '%s\n' "$_line" | tr --delete --complement 'T' | wc --bytes)
    # 将当前计数添加到总计
    _a_count_total=$((_a_count_total + _a_count))
    _c_count_total=$((_c_count_total + _c_count))
    _g_count_total=$((_g_count_total + _g_count))
    _t_count_total=$((_t_count_total + _t_count))
    # 计算 GC 含量
    _gc=$(
      printf 'scale = %d; a = %d; c = %d; g = %d; t = %d; (g + c) / (a + c + g + t)\n' \
        "$N_DIGITS" "$_a_count" "$_c_count" "$_g_count" "$_t_count" \
        | bc
    )
    # 在小数点前添加 0
    _gc="$(printf "%.${N_DIGITS}f\n" "$_gc")"
    
    info "序列 '$_sequence' 的 GC 含量: $_gc"
  done <<EOF
$_fasta_file_content
EOF
  # 总计数据
  info "腺嘌呤总计数: $_a_count_total"
  info "胞嘧啶总计数: $_c_count_total"
  info "鸟嘌呤总计数: $_g_count_total"
  info "胸腺嘧啶总计数: $_t_count_total"
  # 计算总的 GC 含量
  _gc=$(
    printf 'scale = %d; a = %d; c = %d; g = %d; t = %d; (g + c) / (a + c + g + t)\n' \
      "$N_DIGITS" "$_a_count_total" "$_c_count_total" "$_g_count_total" "$_t_count_total" \
      | bc
  )
  # 在小数点前添加 0
  _gc="$(printf "%.${N_DIGITS}f\n" "$_gc")"
  info "GC 含量: $_gc"
}

"统计序列的数量" 和 "移除序列的换行符" 代码改编自 https://www.biostars.org/p/17680

脚本只使用了基本命令，除了 bc 用于进行精确计算（请参阅 bc 安装）。

您可以通过修改“配置”部分中的变量来配置脚本。

由于您没有指明要哪个部分，因此计算了每个序列和整体的 GC 含量。因此，您可以删除不需要的部分尽管我没有生物信息学背景，但该脚本成功解析和分析了 fasta 文件。

英文:

This should work:

#!/usr/bin/env sh
# Adapted from https://www.biostars.org/p/17680
# Fail on error
set -o errexit
# Disable undefined variable reference
set -o nounset
# ================
# CONFIGURATION
# ================
# Fasta file path
FASTA_FILE=&quot;file.fasta&quot;
# Number of digits after decimal point
N_DIGITS=3
# ================
# LOGGER
# ================
# Fatal log message
fatal() {
  printf &#39;[FATAL] %s\n&#39; &quot;$@&quot; &gt;&amp;2
  exit 1
}
# Info log message
info() {
  printf &#39;[INFO ] %s\n&#39; &quot;$@&quot;
}
# ================
# MAIN
# ================
{
  # Check command &#39;bc&#39; exist
  command -v bc &gt; /dev/null 2&gt;&amp;1 || fatal &quot;Command &#39;bc&#39; not found&quot;
  # Check file exist
  [ -f &quot;$FASTA_FILE&quot; ] || fatal &quot;File &#39;$FASTA_FILE&#39; not found&quot;
  # Count number of sequences
  _n_sequences=$(grep --count &#39;^&gt;&#39; &quot;$FASTA_FILE&quot;)
  info &quot;Analyzing $_n_sequences sequences&quot;
  [ &quot;$_n_sequences&quot; -ne 0 ] || fatal &quot;No sequences found&quot;
  # Remove sequence wrapping
  _fasta_file_content=$(
    sed &#39;s/\(^&gt;.*$\)/#\1#/&#39; &quot;$FASTA_FILE&quot; \
      | tr --delete &quot;\r\n&quot; \
      | sed &#39;s/$/#/&#39; \
      | tr &quot;#&quot; &quot;\n&quot; \
      | sed &#39;/^$/d&#39;
  )
  # Vars
  _sequence=
  _a_count_total=0
  _c_count_total=0
  _g_count_total=0
  _t_count_total=0
  # Read line by line
  while IFS= read -r _line; do
    # Check if header
    if printf &#39;%s\n&#39; &quot;$_line&quot; | grep --quiet &#39;^&gt;&#39;; then
      # Save sequence and continue
      _sequence=${_line#?}
      continue
    fi
    # Count
    _a_count=$(printf &#39;%s\n&#39; &quot;$_line&quot; | tr --delete --complement &#39;A&#39; | wc --bytes)
    _c_count=$(printf &#39;%s\n&#39; &quot;$_line&quot; | tr --delete --complement &#39;C&#39; | wc --bytes)
    _g_count=$(printf &#39;%s\n&#39; &quot;$_line&quot; | tr --delete --complement &#39;G&#39; | wc --bytes)
    _t_count=$(printf &#39;%s\n&#39; &quot;$_line&quot; | tr --delete --complement &#39;T&#39; | wc --bytes)
    # Add current count to total
    _a_count_total=$((_a_count_total + _a_count))
    _c_count_total=$((_c_count_total + _c_count))
    _g_count_total=$((_g_count_total + _g_count))
    _t_count_total=$((_t_count_total + _t_count))
    # Calculate GC content
    _gc=$(
      printf &#39;scale = %d; a = %d; c = %d; g = %d; t = %d; (g + c) / (a + c + g + t)\n&#39; \
        &quot;$N_DIGITS&quot; &quot;$_a_count&quot; &quot;$_c_count&quot; &quot;$_g_count&quot; &quot;$_t_count&quot; \
        | bc
    )
    # Add 0 before decimal point
    _gc=&quot;$(printf &quot;%.${N_DIGITS}f\n&quot; &quot;$_gc&quot;)&quot;
    info &quot;Sequence &#39;$_sequence&#39; GC content: $_gc&quot;
  done &lt;&lt; EOF
$_fasta_file_content
EOF
  # Total data
  info &quot;Adenine total count: $_a_count_total&quot;
  info &quot;Cytosine total count: $_c_count_total&quot;
  info &quot;Guanine total count: $_g_count_total&quot;
  info &quot;Thymine total count: $_t_count_total&quot;
  # Calculate total GC content
  _gc=$(
    printf &#39;scale = %d; a = %d; c = %d; g = %d; t = %d; (g + c) / (a + c + g + t)\n&#39; \
      &quot;$N_DIGITS&quot; &quot;$_a_count_total&quot; &quot;$_c_count_total&quot; &quot;$_g_count_total&quot; &quot;$_t_count_total&quot; \
      | bc
  )
  # Add 0 before decimal point
  _gc=&quot;$(printf &quot;%.${N_DIGITS}f\n&quot; &quot;$_gc&quot;)&quot;
  info &quot;GC content: $_gc&quot;
}

The "Count number of sequences" and "Remove sequence wrapping" codes are adapted from https://www.biostars.org/p/17680

The script uses only basic commands except for bc to do the precision calculation (See bc installation).

You can configure the script by modifying the variables in the CONFIGURATION section.

Because you haven't indicated which one you want, the GC content is calculated for both each sequence and the overall. Therefore, get rid of anything that isn't necessary

Despite my lack of bioinformatics background, the script successfully parses and analyzes a fasta file.

通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库，让每个人都能够通过互相帮助和分享经验来进步。

使用Bash脚本如何找到FASTA文件的GC含量？

问题

答案1

答案2

Golang正则表达式在文件中精确匹配行

停止子Shell中的后台进程接收SIGINT信号，尽管在前台捕获信号。

刷新在bash中运行程序时的输出

如何使我的CSV比较结果适用于三个单独的列，而不是一个列？

如何在Playwright视觉比较中屏蔽多个定位器？

在C++中，可以使用可变模板参数来检索类型的内部类型。

selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: stale element not found

Creating and opening a URL to log in to Website via Basic Auth with Robot Framework/Selenium (Python)

AG Grid 在上下文菜单中以大文本形式打开

What's the correct way to type hint an empty list as a literal in python?

如何在Highcharts Gantt中更改本地化的星期名称

如何在同一个流中使用多个过滤器和映射函数？

如何使用Map/Set来将代码优化到O(n)？

.NET MAUI Android在GitHub Actions上构建失败，错误代码为1。