英文:
How can I find the GC-content of a FASTA file using a Bash script?
问题
以下是翻译好的部分:
"I want to find the GC-content from a FASTA format format file using a Bash script.
The GC-content is basically (number of (G + C)) / (number of (A + T + G + C)).
I am trying to use the wc command. But I was not able to get an answer.
After going through documentation and videos, I came up with a solution.
filename=$@ # Collecting all the filenames as parameters
for f in $filename # Looping over files
do
echo " $f is being processed..."
gc=( $( grep -v ">" < "$f" | grep -io 'g\|c'< "$f" | wc -l)) # Reading lines that don’t start with < using -v. grep -io matches to either g or c and outputs each match on single line. wc -l counts the number of lines or indirectly the number of g and c. This is stored in a variable.
total=( $( grep -v ">" < "$f" | tr -d '\s\r' | wc -c)) # Spaces, tabs, new line are removed from the file using tr. Then the number of characters are counted by wc -c
percent=( $( echo "scale=2;100*$gc/$total" |bc -l)) # bc -l is used to get the answer in float format. scale=2 mentions the number of decimal points.
echo " The GC content of $f is: "$percent"%"
echo
done
I am learning bioinformatics.
英文:
I want to find the GC-content from a FASTA format format file using a Bash script.
The GC-content is basically (number of (G + C)) / (number of (A + T + G + C)).
I am trying to use the wc command. But I was not able to get an answer.
After going through documentation and videos, I came up with a solution.
filename=$@ # Collecting all the filenames as parameters
for f in $filename # Looping over files
do
echo " $f is being processed..."
gc=( $( grep -v ">" < "$f" | grep -io 'g\|c'< "$f" | wc -l)) # Reading lines that don’t start with < using -v. grep -io matches to either g or c and outputs each match on single line. wc -l counts the number of lines or indirectly the number of g and c. This is stored in a variable.
total=( $( grep -v ">" < "$f" | tr -d '\s\r' | wc -c)) # Spaces, tabs, new line are removed from the file using tr. Then the number of characters are counted by wc -c
percent=( $( echo "scale=2;100*$gc/$total" |bc -l)) # bc -l is used to get the answer in float format. scale=2 mentions the number of decimal points.
echo " The GC content of $f is: "$percent"%"
echo
done
I am learning bioinformatics.
答案1
得分: 1
不要重新发明轮子。对于常见的生物信息学任务,请使用专为这些任务设计、经过充分测试、被广泛使用并处理边缘情况的开源工具。例如,使用 EMBOSS
的 infoseq
实用程序。EMBOSS
可以轻松安装,例如使用 conda
。
示例:
安装 EMBOSS
包(仅需执行一次):
conda create --name emboss emboss --channel iuc
激活 conda
环境并使用 EMBOSS
的 infoseq
,这里用于打印序列名称、长度和 GC 含量:
source activate emboss
cat your_sequence_file_name.fasta | infoseq -auto -only -name -length -pgc stdin
source deactivate
这将在 标准输出 中打印类似以下的内容:
Name Length %GC
seq_foo 119 60.50
seq_bar 104 39.42
seq_baz 191 46.60
...
英文:
Do not reinvent the wheel. For common bioinformatics tasks, use open-source tools that are specifically designed for these tasks, are well-tested, widely used, and handle edge cases. For example, use EMBOSS
infoseq
utility. EMBOSS
can be easily installed, for example using conda
.
Example:
Install EMBOSS
package (do once):
conda create --name emboss emboss --channel iuc
Activate the conda
environment and use EMBOSS
infoseq
, here to print the sequence name, length and percent GC:
source activate emboss
cat your_sequence_file_name.fasta | infoseq -auto -only -name -length -pgc stdin
source deactivate
This prints into standard output something like this:
Name Length %GC
seq_foo 119 60.50
seq_bar 104 39.42
seq_baz 191 46.60
...
答案2
得分: -1
这应该有效:
#!/usr/bin/env sh
# 改编自 https://www.biostars.org/p/17680
# 遇到错误时退出
set -o errexit
# 禁用未定义变量引用
set -o nounset
# ===================
# 配置
# ===================
# Fasta 文件路径
FASTA_FILE="file.fasta"
# 小数点后的位数
N_DIGITS=3
# ===================
# 日志记录
# ===================
# 致命错误日志消息
fatal() {
printf '[FATAL] %s\n' "$@" >&2
exit 1
}
# 信息日志消息
info() {
printf '[INFO ] %s\n' "$@"
}
# ===================
# 主程序
# ===================
{
# 检查命令 'bc' 是否存在
command -v bc > /dev/null 2>&1 || fatal "找不到命令 'bc'"
# 检查文件是否存在
[ -f "$FASTA_FILE" ] || fatal "找不到文件 '$FASTA_FILE'"
# 统计序列的数量
_n_sequences=$(grep --count '^>' "$FASTA_FILE")
info "分析 $_n_sequences 个序列"
[ "$_n_sequences" -ne 0 ] || fatal "找不到任何序列"
# 移除序列的换行符
_fasta_file_content=$(
sed 's/\(^>.*$\)/#\1#/' "$FASTA_FILE" \
| tr --delete "\r\n" \
| sed 's/$/#/' \
| tr "#" "\n" \
| sed '/^$/d'
)
# 变量
_sequence=
_a_count_total=0
_c_count_total=0
_g_count_total=0
_t_count_total=0
# 逐行读取
while IFS= read -r _line; do
# 检查是否为标题行
if printf '%s\n' "$_line" | grep --quiet '^>'; then
# 保存序列并继续
_sequence=${_line#?}
continue
fi
# 统计
_a_count=$(printf '%s\n' "$_line" | tr --delete --complement 'A' | wc --bytes)
_c_count=$(printf '%s\n' "$_line" | tr --delete --complement 'C' | wc --bytes)
_g_count=$(printf '%s\n' "$_line" | tr --delete --complement 'G' | wc --bytes)
_t_count=$(printf '%s\n' "$_line" | tr --delete --complement 'T' | wc --bytes)
# 将当前计数添加到总计
_a_count_total=$((_a_count_total + _a_count))
_c_count_total=$((_c_count_total + _c_count))
_g_count_total=$((_g_count_total + _g_count))
_t_count_total=$((_t_count_total + _t_count))
# 计算 GC 含量
_gc=$(
printf 'scale = %d; a = %d; c = %d; g = %d; t = %d; (g + c) / (a + c + g + t)\n' \
"$N_DIGITS" "$_a_count" "$_c_count" "$_g_count" "$_t_count" \
| bc
)
# 在小数点前添加 0
_gc="$(printf "%.${N_DIGITS}f\n" "$_gc")"
info "序列 '$_sequence' 的 GC 含量: $_gc"
done <<EOF
$_fasta_file_content
EOF
# 总计数据
info "腺嘌呤总计数: $_a_count_total"
info "胞嘧啶总计数: $_c_count_total"
info "鸟嘌呤总计数: $_g_count_total"
info "胸腺嘧啶总计数: $_t_count_total"
# 计算总的 GC 含量
_gc=$(
printf 'scale = %d; a = %d; c = %d; g = %d; t = %d; (g + c) / (a + c + g + t)\n' \
"$N_DIGITS" "$_a_count_total" "$_c_count_total" "$_g_count_total" "$_t_count_total" \
| bc
)
# 在小数点前添加 0
_gc="$(printf "%.${N_DIGITS}f\n" "$_gc")"
info "GC 含量: $_gc"
}
"统计序列的数量" 和 "移除序列的换行符" 代码改编自 https://www.biostars.org/p/17680
脚本只使用了基本命令,除了 bc 用于进行精确计算(请参阅 bc 安装)。
您可以通过修改“配置”部分中的变量来配置脚本。
由于您没有指明要哪个部分,因此计算了每个序列和整体的 GC 含量。因此,您可以删除不需要的部分 尽管我没有生物信息学背景,但该脚本成功解析和分析了 fasta 文件。
英文:
This should work:
#!/usr/bin/env sh
# Adapted from https://www.biostars.org/p/17680
# Fail on error
set -o errexit
# Disable undefined variable reference
set -o nounset
# ================
# CONFIGURATION
# ================
# Fasta file path
FASTA_FILE="file.fasta"
# Number of digits after decimal point
N_DIGITS=3
# ================
# LOGGER
# ================
# Fatal log message
fatal() {
printf '[FATAL] %s\n' "$@" >&2
exit 1
}
# Info log message
info() {
printf '[INFO ] %s\n' "$@"
}
# ================
# MAIN
# ================
{
# Check command 'bc' exist
command -v bc > /dev/null 2>&1 || fatal "Command 'bc' not found"
# Check file exist
[ -f "$FASTA_FILE" ] || fatal "File '$FASTA_FILE' not found"
# Count number of sequences
_n_sequences=$(grep --count '^>' "$FASTA_FILE")
info "Analyzing $_n_sequences sequences"
[ "$_n_sequences" -ne 0 ] || fatal "No sequences found"
# Remove sequence wrapping
_fasta_file_content=$(
sed 's/\(^>.*$\)/#\1#/' "$FASTA_FILE" \
| tr --delete "\r\n" \
| sed 's/$/#/' \
| tr "#" "\n" \
| sed '/^$/d'
)
# Vars
_sequence=
_a_count_total=0
_c_count_total=0
_g_count_total=0
_t_count_total=0
# Read line by line
while IFS= read -r _line; do
# Check if header
if printf '%s\n' "$_line" | grep --quiet '^>'; then
# Save sequence and continue
_sequence=${_line#?}
continue
fi
# Count
_a_count=$(printf '%s\n' "$_line" | tr --delete --complement 'A' | wc --bytes)
_c_count=$(printf '%s\n' "$_line" | tr --delete --complement 'C' | wc --bytes)
_g_count=$(printf '%s\n' "$_line" | tr --delete --complement 'G' | wc --bytes)
_t_count=$(printf '%s\n' "$_line" | tr --delete --complement 'T' | wc --bytes)
# Add current count to total
_a_count_total=$((_a_count_total + _a_count))
_c_count_total=$((_c_count_total + _c_count))
_g_count_total=$((_g_count_total + _g_count))
_t_count_total=$((_t_count_total + _t_count))
# Calculate GC content
_gc=$(
printf 'scale = %d; a = %d; c = %d; g = %d; t = %d; (g + c) / (a + c + g + t)\n' \
"$N_DIGITS" "$_a_count" "$_c_count" "$_g_count" "$_t_count" \
| bc
)
# Add 0 before decimal point
_gc="$(printf "%.${N_DIGITS}f\n" "$_gc")"
info "Sequence '$_sequence' GC content: $_gc"
done << EOF
$_fasta_file_content
EOF
# Total data
info "Adenine total count: $_a_count_total"
info "Cytosine total count: $_c_count_total"
info "Guanine total count: $_g_count_total"
info "Thymine total count: $_t_count_total"
# Calculate total GC content
_gc=$(
printf 'scale = %d; a = %d; c = %d; g = %d; t = %d; (g + c) / (a + c + g + t)\n' \
"$N_DIGITS" "$_a_count_total" "$_c_count_total" "$_g_count_total" "$_t_count_total" \
| bc
)
# Add 0 before decimal point
_gc="$(printf "%.${N_DIGITS}f\n" "$_gc")"
info "GC content: $_gc"
}
The "Count number of sequences" and "Remove sequence wrapping" codes are adapted from https://www.biostars.org/p/17680
The script uses only basic commands except for bc to do the precision calculation (See bc installation).
You can configure the script by modifying the variables in the CONFIGURATION
section.
Because you haven't indicated which one you want, the GC content is calculated for both each sequence and the overall. Therefore, get rid of anything that isn't necessary
Despite my lack of bioinformatics background, the script successfully parses and analyzes a fasta file.
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论