使用Bash脚本如何找到FASTA文件的GC含量?

huangapple go评论93阅读模式
英文:

How can I find the GC-content of a FASTA file using a Bash script?

问题

以下是翻译好的部分:

"I want to find the GC-content from a FASTA format format file using a Bash script.
The GC-content is basically (number of (G + C)) / (number of (A + T + G + C)).

I am trying to use the wc command. But I was not able to get an answer.


After going through documentation and videos, I came up with a solution.

  1. filename=$@ # Collecting all the filenames as parameters
  2. for f in $filename # Looping over files
  3. do
  4. echo " $f is being processed..."
  5. gc=( $( grep -v ">" < "$f" | grep -io 'g\|c'< "$f" | wc -l)) # Reading lines that don’t start with < using -v. grep -io matches to either g or c and outputs each match on single line. wc -l counts the number of lines or indirectly the number of g and c. This is stored in a variable.
  6. total=( $( grep -v ">" < "$f" | tr -d '\s\r' | wc -c)) # Spaces, tabs, new line are removed from the file using tr. Then the number of characters are counted by wc -c
  7. percent=( $( echo "scale=2;100*$gc/$total" |bc -l)) # bc -l is used to get the answer in float format. scale=2 mentions the number of decimal points.
  8. echo " The GC content of $f is: "$percent"%"
  9. echo
  10. done

I am learning bioinformatics.

英文:

I want to find the GC-content from a FASTA format format file using a Bash script.
The GC-content is basically (number of (G + C)) / (number of (A + T + G + C)).

I am trying to use the wc command. But I was not able to get an answer.


After going through documentation and videos, I came up with a solution.

  1. filename=$@ # Collecting all the filenames as parameters
  2. for f in $filename # Looping over files
  3. do
  4. echo " $f is being processed..."
  5. gc=( $( grep -v ">" < "$f" | grep -io 'g\|c'< "$f" | wc -l)) # Reading lines that don’t start with < using -v. grep -io matches to either g or c and outputs each match on single line. wc -l counts the number of lines or indirectly the number of g and c. This is stored in a variable.
  6. total=( $( grep -v ">" < "$f" | tr -d '\s\r' | wc -c)) # Spaces, tabs, new line are removed from the file using tr. Then the number of characters are counted by wc -c
  7. percent=( $( echo "scale=2;100*$gc/$total" |bc -l)) # bc -l is used to get the answer in float format. scale=2 mentions the number of decimal points.
  8. echo " The GC content of $f is: "$percent"%"
  9. echo
  10. done

I am learning bioinformatics.

答案1

得分: 1

不要重新发明轮子。对于常见的生物信息学任务,请使用专为这些任务设计、经过充分测试、被广泛使用并处理边缘情况的开源工具。例如,使用 EMBOSSinfoseq 实用程序。EMBOSS 可以轻松安装,例如使用 conda

示例:

安装 EMBOSS 包(仅需执行一次):

  1. conda create --name emboss emboss --channel iuc

激活 conda 环境并使用 EMBOSSinfoseq,这里用于打印序列名称、长度和 GC 含量:

  1. source activate emboss
  2. cat your_sequence_file_name.fasta | infoseq -auto -only -name -length -pgc stdin
  3. source deactivate

这将在 标准输出 中打印类似以下的内容:

  1. Name Length %GC
  2. seq_foo 119 60.50
  3. seq_bar 104 39.42
  4. seq_baz 191 46.60
  5. ...
英文:

Do not reinvent the wheel. For common bioinformatics tasks, use open-source tools that are specifically designed for these tasks, are well-tested, widely used, and handle edge cases. For example, use EMBOSS infoseq utility. EMBOSS can be easily installed, for example using conda.

Example:

Install EMBOSS package (do once):

  1. conda create --name emboss emboss --channel iuc

Activate the conda environment and use EMBOSS infoseq, here to print the sequence name, length and percent GC:

  1. source activate emboss
  2. cat your_sequence_file_name.fasta | infoseq -auto -only -name -length -pgc stdin
  3. source deactivate

This prints into standard output something like this:

  1. Name Length %GC
  2. seq_foo 119 60.50
  3. seq_bar 104 39.42
  4. seq_baz 191 46.60
  5. ...

答案2

得分: -1

这应该有效:

  1. #!/usr/bin/env sh
  2. # 改编自 https://www.biostars.org/p/17680
  3. # 遇到错误时退出
  4. set -o errexit
  5. # 禁用未定义变量引用
  6. set -o nounset
  7. # ===================
  8. # 配置
  9. # ===================
  10. # Fasta 文件路径
  11. FASTA_FILE="file.fasta"
  12. # 小数点后的位数
  13. N_DIGITS=3
  14. # ===================
  15. # 日志记录
  16. # ===================
  17. # 致命错误日志消息
  18. fatal() {
  19. printf '[FATAL] %s\n' "$@" >&2
  20. exit 1
  21. }
  22. # 信息日志消息
  23. info() {
  24. printf '[INFO ] %s\n' "$@"
  25. }
  26. # ===================
  27. # 主程序
  28. # ===================
  29. {
  30. # 检查命令 'bc' 是否存在
  31. command -v bc > /dev/null 2>&1 || fatal "找不到命令 'bc'"
  32. # 检查文件是否存在
  33. [ -f "$FASTA_FILE" ] || fatal "找不到文件 '$FASTA_FILE'"
  34. # 统计序列的数量
  35. _n_sequences=$(grep --count '^>' "$FASTA_FILE")
  36. info "分析 $_n_sequences 个序列"
  37. [ "$_n_sequences" -ne 0 ] || fatal "找不到任何序列"
  38. # 移除序列的换行符
  39. _fasta_file_content=$(
  40. sed 's/\(^>.*$\)/#\1#/' "$FASTA_FILE" \
  41. | tr --delete "\r\n" \
  42. | sed 's/$/#/' \
  43. | tr "#" "\n" \
  44. | sed '/^$/d'
  45. )
  46. # 变量
  47. _sequence=
  48. _a_count_total=0
  49. _c_count_total=0
  50. _g_count_total=0
  51. _t_count_total=0
  52. # 逐行读取
  53. while IFS= read -r _line; do
  54. # 检查是否为标题行
  55. if printf '%s\n' "$_line" | grep --quiet '^>'; then
  56. # 保存序列并继续
  57. _sequence=${_line#?}
  58. continue
  59. fi
  60. # 统计
  61. _a_count=$(printf '%s\n' "$_line" | tr --delete --complement 'A' | wc --bytes)
  62. _c_count=$(printf '%s\n' "$_line" | tr --delete --complement 'C' | wc --bytes)
  63. _g_count=$(printf '%s\n' "$_line" | tr --delete --complement 'G' | wc --bytes)
  64. _t_count=$(printf '%s\n' "$_line" | tr --delete --complement 'T' | wc --bytes)
  65. # 将当前计数添加到总计
  66. _a_count_total=$((_a_count_total + _a_count))
  67. _c_count_total=$((_c_count_total + _c_count))
  68. _g_count_total=$((_g_count_total + _g_count))
  69. _t_count_total=$((_t_count_total + _t_count))
  70. # 计算 GC 含量
  71. _gc=$(
  72. printf 'scale = %d; a = %d; c = %d; g = %d; t = %d; (g + c) / (a + c + g + t)\n' \
  73. "$N_DIGITS" "$_a_count" "$_c_count" "$_g_count" "$_t_count" \
  74. | bc
  75. )
  76. # 在小数点前添加 0
  77. _gc="$(printf "%.${N_DIGITS}f\n" "$_gc")"
  78. info "序列 '$_sequence' 的 GC 含量: $_gc"
  79. done <<EOF
  80. $_fasta_file_content
  81. EOF
  82. # 总计数据
  83. info "腺嘌呤总计数: $_a_count_total"
  84. info "胞嘧啶总计数: $_c_count_total"
  85. info "鸟嘌呤总计数: $_g_count_total"
  86. info "胸腺嘧啶总计数: $_t_count_total"
  87. # 计算总的 GC 含量
  88. _gc=$(
  89. printf 'scale = %d; a = %d; c = %d; g = %d; t = %d; (g + c) / (a + c + g + t)\n' \
  90. "$N_DIGITS" "$_a_count_total" "$_c_count_total" "$_g_count_total" "$_t_count_total" \
  91. | bc
  92. )
  93. # 在小数点前添加 0
  94. _gc="$(printf "%.${N_DIGITS}f\n" "$_gc")"
  95. info "GC 含量: $_gc"
  96. }

"统计序列的数量" 和 "移除序列的换行符" 代码改编自 https://www.biostars.org/p/17680

脚本只使用了基本命令,除了 bc 用于进行精确计算(请参阅 bc 安装)。

您可以通过修改“配置”部分中的变量来配置脚本。

由于您没有指明要哪个部分,因此计算了每个序列和整体的 GC 含量。因此,您可以删除不需要的部分 使用Bash脚本如何找到FASTA文件的GC含量? 尽管我没有生物信息学背景,但该脚本成功解析和分析了 fasta 文件。

英文:

This should work:

  1. #!/usr/bin/env sh
  2. # Adapted from https://www.biostars.org/p/17680
  3. # Fail on error
  4. set -o errexit
  5. # Disable undefined variable reference
  6. set -o nounset
  7. # ================
  8. # CONFIGURATION
  9. # ================
  10. # Fasta file path
  11. FASTA_FILE=&quot;file.fasta&quot;
  12. # Number of digits after decimal point
  13. N_DIGITS=3
  14. # ================
  15. # LOGGER
  16. # ================
  17. # Fatal log message
  18. fatal() {
  19. printf &#39;[FATAL] %s\n&#39; &quot;$@&quot; &gt;&amp;2
  20. exit 1
  21. }
  22. # Info log message
  23. info() {
  24. printf &#39;[INFO ] %s\n&#39; &quot;$@&quot;
  25. }
  26. # ================
  27. # MAIN
  28. # ================
  29. {
  30. # Check command &#39;bc&#39; exist
  31. command -v bc &gt; /dev/null 2&gt;&amp;1 || fatal &quot;Command &#39;bc&#39; not found&quot;
  32. # Check file exist
  33. [ -f &quot;$FASTA_FILE&quot; ] || fatal &quot;File &#39;$FASTA_FILE&#39; not found&quot;
  34. # Count number of sequences
  35. _n_sequences=$(grep --count &#39;^&gt;&#39; &quot;$FASTA_FILE&quot;)
  36. info &quot;Analyzing $_n_sequences sequences&quot;
  37. [ &quot;$_n_sequences&quot; -ne 0 ] || fatal &quot;No sequences found&quot;
  38. # Remove sequence wrapping
  39. _fasta_file_content=$(
  40. sed &#39;s/\(^&gt;.*$\)/#\1#/&#39; &quot;$FASTA_FILE&quot; \
  41. | tr --delete &quot;\r\n&quot; \
  42. | sed &#39;s/$/#/&#39; \
  43. | tr &quot;#&quot; &quot;\n&quot; \
  44. | sed &#39;/^$/d&#39;
  45. )
  46. # Vars
  47. _sequence=
  48. _a_count_total=0
  49. _c_count_total=0
  50. _g_count_total=0
  51. _t_count_total=0
  52. # Read line by line
  53. while IFS= read -r _line; do
  54. # Check if header
  55. if printf &#39;%s\n&#39; &quot;$_line&quot; | grep --quiet &#39;^&gt;&#39;; then
  56. # Save sequence and continue
  57. _sequence=${_line#?}
  58. continue
  59. fi
  60. # Count
  61. _a_count=$(printf &#39;%s\n&#39; &quot;$_line&quot; | tr --delete --complement &#39;A&#39; | wc --bytes)
  62. _c_count=$(printf &#39;%s\n&#39; &quot;$_line&quot; | tr --delete --complement &#39;C&#39; | wc --bytes)
  63. _g_count=$(printf &#39;%s\n&#39; &quot;$_line&quot; | tr --delete --complement &#39;G&#39; | wc --bytes)
  64. _t_count=$(printf &#39;%s\n&#39; &quot;$_line&quot; | tr --delete --complement &#39;T&#39; | wc --bytes)
  65. # Add current count to total
  66. _a_count_total=$((_a_count_total + _a_count))
  67. _c_count_total=$((_c_count_total + _c_count))
  68. _g_count_total=$((_g_count_total + _g_count))
  69. _t_count_total=$((_t_count_total + _t_count))
  70. # Calculate GC content
  71. _gc=$(
  72. printf &#39;scale = %d; a = %d; c = %d; g = %d; t = %d; (g + c) / (a + c + g + t)\n&#39; \
  73. &quot;$N_DIGITS&quot; &quot;$_a_count&quot; &quot;$_c_count&quot; &quot;$_g_count&quot; &quot;$_t_count&quot; \
  74. | bc
  75. )
  76. # Add 0 before decimal point
  77. _gc=&quot;$(printf &quot;%.${N_DIGITS}f\n&quot; &quot;$_gc&quot;)&quot;
  78. info &quot;Sequence &#39;$_sequence&#39; GC content: $_gc&quot;
  79. done &lt;&lt; EOF
  80. $_fasta_file_content
  81. EOF
  82. # Total data
  83. info &quot;Adenine total count: $_a_count_total&quot;
  84. info &quot;Cytosine total count: $_c_count_total&quot;
  85. info &quot;Guanine total count: $_g_count_total&quot;
  86. info &quot;Thymine total count: $_t_count_total&quot;
  87. # Calculate total GC content
  88. _gc=$(
  89. printf &#39;scale = %d; a = %d; c = %d; g = %d; t = %d; (g + c) / (a + c + g + t)\n&#39; \
  90. &quot;$N_DIGITS&quot; &quot;$_a_count_total&quot; &quot;$_c_count_total&quot; &quot;$_g_count_total&quot; &quot;$_t_count_total&quot; \
  91. | bc
  92. )
  93. # Add 0 before decimal point
  94. _gc=&quot;$(printf &quot;%.${N_DIGITS}f\n&quot; &quot;$_gc&quot;)&quot;
  95. info &quot;GC content: $_gc&quot;
  96. }

The "Count number of sequences" and "Remove sequence wrapping" codes are adapted from https://www.biostars.org/p/17680

The script uses only basic commands except for bc to do the precision calculation (See bc installation).

You can configure the script by modifying the variables in the CONFIGURATION section.

Because you haven't indicated which one you want, the GC content is calculated for both each sequence and the overall. Therefore, get rid of anything that isn't necessary 使用Bash脚本如何找到FASTA文件的GC含量?

Despite my lack of bioinformatics background, the script successfully parses and analyzes a fasta file.

huangapple
  • 本文由 发表于 2023年2月8日 15:58:23
  • 转载请务必保留本文链接:https://go.coder-hub.com/75382780.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定