Merge multiple files based on regex

Question

How do I use a for loop to merge pairs of files whose names match up to the underscore? I have many such files in the directory.

Input:

SRR9200887_1.fastq
SRR9200887_2.fastq
SRR9200888_1.fastq
SRR9200888_2.fastq
SRR9200889_1.fastq
SRR9200889_2.fastq

Expected output:

SRR9200887.fastq
SRR9200888.fastq
SRR9200889.fastq

My attempt:

for l in $(ls *.fastq | cut -d_ -f1 | sort |uniq); do cat ${l}*.fastq

Answer 1

Score: 5

With bash and its Parameter Expansion:

for i in *_1.fastq; do
  cat "${i%_*.fastq}_1.fastq" "${i%_*.fastq}_2.fastq" > "${i%_*.fastq}.fastq";
done

${i%_*.fastq} expands to the value of $i with the shortest trailing match of _*.fastq removed, i.e. without the last _ and everything after it, e.g. SRR9200887.
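
To see the expansion on its own, here is a minimal sketch that applies it to one of the file names from the question without touching any files. The variable names (base, first, second, merged, j, short, long) and the two-underscore name run_A_1.fastq are made up for illustration.

```shell
# One of the file names from the question
i="SRR9200887_1.fastq"

# % removes the shortest suffix that matches the pattern _*.fastq
base="${i%_*.fastq}"
echo "$base"

# The names of the pair and of the merged file are rebuilt from that base
first="${base}_1.fastq"
second="${base}_2.fastq"
merged="${base}.fastq"
echo "$first + $second -> $merged"

# Hypothetical name with two underscores:
# % cuts at the last underscore, %% (longest match) cuts at the first
j="run_A_1.fastq"
short="${j%_*.fastq}"
long="${j%%_*}"
echo "$short $long"
```

For names with a single underscore, as in the question, % and %% give the same result; the difference only matters when the prefix itself may contain underscores.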

Answer 2

Score: 3

for f in *_*.fastq; do cat "$f" >> "${f%_*}.fastq"; done
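
One thing to keep in mind: >> appends, so running the one-liner a second time would add every part to the merged file again. A minimal sketch of one way around that, assuming it is acceptable to empty each target before appending; the scratch directory and file contents are made up for the demonstration.

```shell
# Scratch directory with one made-up pair, so the sketch touches no real data
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'read1\n' > SRR9200887_1.fastq
printf 'read2\n' > SRR9200887_2.fastq

# Run the merge twice to imitate an accidental re-run
for run in 1 2; do
    # Empty (or create) every target first; ':' is the shell's no-op command
    for f in *_*.fastq; do : > "${f%_*}.fastq"; done
    # Then append each part exactly as in the one-liner above
    for f in *_*.fastq; do cat "$f" >> "${f%_*}.fastq"; done
done

# The merged file holds each part once, not twice
cat SRR9200887.fastq
```

The merged file has no underscore in its name, so the *_*.fastq glob does not pick it up on the second pass.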

Answer 3

Score: 2

To cat the files together, assuming you have matching "_1.fastq" and "_2.fastq" for every "SRR", one potential option is:

SRR_array=(*_1.fastq)
for f in "${SRR_array[@]%%_*}"
do
    cat "$f"_1.fastq "$f"_2.fastq > "$f".fastq
done

If you want to delete the _1.fastq and _2.fastq files after merging them (the && makes sure they are only removed when cat succeeded):

SRR_array=(*_1.fastq)
for f in "${SRR_array[@]%%_*}"
do
    cat "$f"_1.fastq "$f"_2.fastq > "$f".fastq &&
        rm "$f"_1.fastq "$f"_2.fastq
done
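
Both loops rely on every _1.fastq having a matching _2.fastq. If that is not guaranteed, a check can be added before merging. The following is only a sketch under that assumption: it uses a plain for loop instead of the array form, runs in a scratch directory with made-up contents, and the variables merged and skipped exist purely to report what happened.

```shell
# Scratch directory: one complete pair and one file without its mate
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'r1\n' > SRR9200887_1.fastq
printf 'r2\n' > SRR9200887_2.fastq
printf 'r1\n' > SRR9200888_1.fastq   # deliberately has no _2 mate

merged=""
skipped=""
for first in *_1.fastq; do
    f=${first%%_*}                   # same prefix extraction as %%_* above
    if [ -f "${f}_2.fastq" ]; then
        # Remove the parts only if cat succeeded
        cat "${f}_1.fastq" "${f}_2.fastq" > "${f}.fastq" &&
            rm "${f}_1.fastq" "${f}_2.fastq"
        merged="$merged $f"
    else
        echo "skipping $f: ${f}_2.fastq not found" >&2
        skipped="$skipped $f"
    fi
done

echo "merged:$merged"
echo "skipped:$skipped"
```

Unpaired files are left in place, so nothing is lost if a download was incomplete.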

Answer 4

Score: 1

One bash idea:

while read -r pfx
do
    cat "${pfx}"_*.fastq >> "${pfx}".fastq
done < <(find . -name "*_*.fastq" | cut -d'_' -f1 | sort -u)

Tweaking OP's current code:

for l in $(ls -1 *_*.fastq | cut -d_ -f1 | sort | uniq)
do
    cat "${l}"_*.fastq >> "${l}".fastq
done

Where:

  • we only look for files with a _ in the name; if the script is run more than once, this ensures we don't pick up the previously concatenated files as input
  • make sure ls lists one file per line (hence the -1; ls already does this when its output goes to a pipe, so the flag mainly makes the intent explicit)
  • in this case sort | uniq could be replaced with sort -u
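
The prefix list can also be built without ls by letting the shell expand the glob and printing one name per line with printf. A minimal sketch of just that step, in a scratch directory with empty made-up files; the extra SRR9200887.fastq stands in for the output of an earlier run.

```shell
# Scratch directory with empty stand-in files
workdir=$(mktemp -d)
cd "$workdir" || exit 1
touch SRR9200887_1.fastq SRR9200887_2.fastq
touch SRR9200888_1.fastq SRR9200888_2.fastq
touch SRR9200887.fastq   # pretend an earlier run already produced this

# The glob only matches names containing _, printf prints one per line,
# cut keeps the part before the first _, and sort -u removes duplicates
prefixes=$(printf '%s\n' *_*.fastq | cut -d_ -f1 | sort -u)
echo "$prefixes"
```

The earlier merged file does not appear in the list, which is the point of the first bullet. Like the ls version, this still assumes the file names contain no whitespace or newlines.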

Answer 5

Score: 0

Using any awk (untested):

awk '
    FNR==1 {
        out = FILENAME
        sub(/_[0-9]+/,"",out)
        if ( out != prev ) {
            close(prev)
            prev = out
        }
    }
    { print > out }
' *_*.fastq

That will concatenate all files that share the same prefix (the part before the underscore), no matter how many such files there are, not just 2.
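
Since the answer is marked untested, here is a small sketch that runs the same awk program in a scratch directory. The file contents and the third part SRR9200887_3.fastq are made up to exercise the "more than 2 files" case.

```shell
# Scratch directory: three parts for one prefix, one part for another
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'one\n'   > SRR9200887_1.fastq
printf 'two\n'   > SRR9200887_2.fastq
printf 'three\n' > SRR9200887_3.fastq
printf 'x\n'     > SRR9200888_1.fastq

awk '
    FNR==1 {                       # first line of each input file
        out = FILENAME
        sub(/_[0-9]+/,"",out)      # SRR9200887_1.fastq becomes SRR9200887.fastq
        if ( out != prev ) {       # prefix changed: close the previous output
            close(prev)
            prev = out
        }
    }
    { print > out }                # write every line to the current output
' *_*.fastq

cat SRR9200887.fastq
cat SRR9200888.fastq
```

The shell expands the glob in sorted order, so the parts are written in the order _1, _2, _3. An empty input file has no first line, so it contributes nothing and does not change the current output name.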

huangapple
  • Posted on April 4, 2023, 07:24:14
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75924408.html