2023-02-23 23:07:39

Bash-awk-parallel select process for each line of a huge file

Question

I am trying to send different lines of a very big file to different processes. To illustrate the problem, I built a toy example: a file with 10 categories, where I want to compute the standard deviation (sd) of the second column for each category. Please keep in mind that my real file has millions of very long lines, and the sd computation stands in for a more complex computation.

STEP 1: build a test file:

seq 1 1000 | awk '{print int(10*rand()),int(100*rand())}' > testfile

STEP 2: split according to column 1 (I want to compute the variance of the second column for each distinct value of the first field):

cat testfile | awk '{print $2 >> "file"$1}'

STEP 3: now I can compute each variance in parallel:

for i in $(seq 0 9); do
    cat file$i | awk '{s+=$1;ss+=$1*$1}END{a=s/NR;print sqrt((ss-a*a)/NR)}' > sd$i &
done

What I would like to do is skip the file$i step and send my numbers directly to 10 processes while reading the initial file.

In a way it's a bit like using parallel, but instead of sending blocks of lines to processes, it uses a field to send specific lines to specific processes.
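
The routing described here can be sketched in plain awk, which keeps one open output pipe per distinct command string. This is only a minimal illustration under assumptions: the helper name `route_by_key`, the `stat<key>` output files and the trivial count/sum consumer are hypothetical stand-ins for the real per-category command.

```shell
# Hypothetical helper: route_by_key OUTDIR
# Reads "key value" lines on stdin and feeds each value to a consumer
# process dedicated to its key (one process per distinct key).
route_by_key() {
    awk -v outdir="$1" '
    {
        # \047 is a single quote; the consumer here just prints count and sum.
        cmd = "awk \047{ s += $1 } END { print NR, s }\047 > " outdir "/stat" $1
        print $2 | cmd      # awk opens the pipe on first use, then reuses it
        seen[cmd] = 1
    }
    END {
        # Closing a pipe sends EOF and waits for that consumer to finish.
        for (cmd in seen) close(cmd)
    }'
}

tmpdir=$(mktemp -d)
printf '%s\n' '0 10' '1 5' '0 20' '1 7' '0 30' | route_by_key "$tmpdir"
cat "$tmpdir/stat0" "$tmpdir/stat1"
```

All consumers stay open until the input ends, so with thousands of categories this may run into the limit on simultaneously open pipes of the awk implementation or the operating system.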

To give an idea of the most recent data I had to deal with: I have 13 million lines in 2377 categories. Each line has 30K fields, on which I compute statistics using a specific bash command.

Please also help me formulate my question!

Answer 1

Score: 3

GNU Parallel has --bin for this.

seq 1 100000 | awk '{print int(10*rand()),int(100*rand()),int(15*rand())}' > testfile

sd() {
    # sd 2 = sd of column 2
    awk '{s+=$'$1';ss+=$'$1'*$'$1'}END{a=s/NR;print sqrt((ss-a*a)/NR)}'
}
export -f sd

# bin on col 1, sd on col 2
cat testfile | parallel -j10 --colsep ' ' --bin 1 --pipe --tagstring {%} sd 2
# bin on col 2, sd on col 3
cat testfile | parallel -j100 --colsep ' ' --bin 2 --pipe --tagstring {%} sd 3
# bin on col 3, sd on col 1
cat testfile | parallel -j15 --colsep ' ' --bin 3 --pipe --tagstring {%} sd 1

Answer 2

Score: 1

<details>
<summary>English:</summary>

# Parallelize stream processing using [tag:bash]

(Full `bash` script using `sed` at end of this post!)

## Using `sed` for *stream filtering*

> a bit like using parallel but instead of sending blocks of lines to processes it's using a field to send some specific lines to specific processes.

In this use case, where a stream has to be *filtered* and distributed to many subtasks, `sed` should be the quickest way (`sed` is a lot lighter than `perl`, and `parallel` is a `perl` script).
Using `sed` will be **noticeably** quicker and lighter, and will consume fewer resources! See the comparison at the end of this post.

First, prepare the `sed` command line:

<!-- language: lang-bash -->
    printf -v sedcmd ' -e \47};/^%d/{s/^. //;w/dev/fd/%d\47 %d> >(exec \
        awk \47{c++;s+=$1;ss+=$1*$1}END{a=s/c;print %d,sqrt((ss-a*a)/c)}\47) ' $(
        for ((i=0;i<10;i++)) { echo $i $((i+4)) $((i+4)) $i  ; })

Then the command is: `eval sed -n "${sedcmd/\};} -e '};'"`:

<!-- language: lang-bash -->
    eval sed -n "${sedcmd/\};} -e '};'" <testfile

or

<!-- language: lang-bash -->
    eval sed -n "${sedcmd/\};} -e '};'" <testfile | cat


where `$sedcmd` looks like:

<!-- language: lang-bash -->
    $ echo -- "$sedcmd"
    --  -e '};/^0/{s/^. //;w/dev/fd/4' 4> >(exec \
        awk '{c++;s+=$1;ss+=$1*$1}END{a=s/c;print 0,sqrt((ss-a*a)/c)}')  -e '};/^1/{s/^. //;w/dev/fd/5' 5> >(exec \
        awk '{c++;s+=$1;ss+=$1*$1}END{a=s/c;print 1,sqrt((ss-a*a)/c)}')  -e '};/^2/{s/^. //;w/dev/fd/6' 6> >(exec \
        awk '{c++;s+=$1;ss+=$1*$1}END{a=s/c;print 2,sqrt((ss-a*a)/c)}')  -e '};/^3/{s/^. //;w/dev/fd/7' 7> >(exec \
        awk '{c++;s+=$1;ss+=$1*$1}END{a=s/c;print 3,sqrt((ss-a*a)/c)}')  -e '};/^4/{s/^. //;w/dev/fd/8' 8> >(exec \
        awk '{c++;s+=$1;ss+=$1*$1}END{a=s/c;print 4,sqrt((ss-a*a)/c)}')  -e '};/^5/{s/^. //;w/dev/fd/9' 9> >(exec \
        awk '{c++;s+=$1;ss+=$1*$1}END{a=s/c;print 5,sqrt((ss-a*a)/c)}')  -e '};/^6/{s/^. //;w/dev/fd/10' 10> >(exec \
        awk '{c++;s+=$1;ss+=$1*$1}END{a=s/c;print 6,sqrt((ss-a*a)/c)}')  -e '};/^7/{s/^. //;w/dev/fd/11' 11> >(exec \
        awk '{c++;s+=$1;ss+=$1*$1}END{a=s/c;print 7,sqrt((ss-a*a)/c)}')  -e '};/^8/{s/^. //;w/dev/fd/12' 12> >(exec \
        awk '{c++;s+=$1;ss+=$1*$1}END{a=s/c;print 8,sqrt((ss-a*a)/c)}')  -e '};/^9/{s/^. //;w/dev/fd/13' 13> >(exec \
        awk '{c++;s+=$1;ss+=$1*$1}END{a=s/c;print 9,sqrt((ss-a*a)/c)}') 

Where
 - `4> >(exec awk ...)` tells `bash` to open file descriptor `4` and run `awk` on it
 - `-e "/^0/{s/^. //;w/dev/fd/4" -e "}"` tells `sed` to strip the leading character and space from lines which begin with `0` and write the rest to `fd/4`.
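
The same routing idea can also be sketched with named pipes (`mkfifo`) instead of `eval` and bash process substitution. This is a variation for illustration, not the mechanism used by the commands above; the two tags, the `workdir` paths and the count/sum consumers are assumptions.

```shell
# Sketch: sed routes lines to two consumers through named pipes.
workdir=$(mktemp -d)
mkfifo "$workdir/fifo0" "$workdir/fifo1"

# One consumer per tag, each reading its own pipe in the background.
awk '{ c++; s += $1 } END { print 0, c, s }' <"$workdir/fifo0" >"$workdir/out0" &
awk '{ c++; s += $1 } END { print 1, c, s }' <"$workdir/fifo1" >"$workdir/out1" &

# sed strips the tag and writes the rest of the line to the matching pipe.
printf '%s\n' '0 10' '1 5' '0 20' '1 7' |
    sed -n \
        -e '/^0 /{' -e 's/^0 //' -e "w $workdir/fifo0" -e '}' \
        -e '/^1 /{' -e 's/^1 //' -e "w $workdir/fifo1" -e '}'

wait    # consumers see EOF once sed exits and closes the pipes
cat "$workdir/out0" "$workdir/out1"
```

Because each consumer writes to its own file and the script waits for all of them, the results do not interleave on a shared output.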

## `parallel.sh` full bash script (draft)

Here is a full parallelFiltering `bash` script using `sed`:

<!-- language: lang-bash -->
    #!/bin/bash
    # parallel.sh - bash script for filtering/parallelising using sed.
    # (C) 2023 Felix Hauri - felix@f-hauri.ch
    # Licensed under terms of GPL v3. www.gnu.org
    
    prog=${0##*/}
    usage() {
        cat <<-EOUsage
            Usage: $prog -t <tags> [-b <re>] [-a <re>] command args
              -h                 show this
              -t <tags>   coma separated liste of tags to send to separated tasks
                               or single tag, '-t' option could be submited multiple times
              -b <re>     sed regex to match before tags
              -a <re>     sed regex to match after tags
              command     Any command to be run once for each tag.
                            Special string "<RE>" will be replaced by current tag.
            EOUsage
    }
    die() {
        echo >&2 "ERROR $prog: $*"
        exit 1
    }

<!-- language: lang-bash --> 
    while getopts "ht:a:b:" opt; do
        case $opt in
            h ) usage; exit ;;
            t ) IFS=, read -a crttags <<<"$OPTARG"
                # append every comma-separated tag, not only the first one
                tags+=("${crttags[@]}");;
            b ) before=$OPTARG ;;
            a ) after=$OPTARG ;;
            *) die Wrong argument. ;;
        esac
    done
    shift $((OPTIND-1))
    
    [[ -v tags ]] || die "No tags submited"
    (( $# )) || die "No command submited"
    
    sedcmd='' paren=''
    declare -i crtFd=4
    for re in "${tags[@]}";do
        printf -v crtcmd '%q ' "${@//\<RE\>/$re}"
        printf -v crtcmd ' -e \47%s/%s/{s/%s//;w/dev/fd/%d\47 %d> >(exec %s) ' \
               "$paren" "$before$re$after"{,} $crtFd $crtFd "$crtcmd"
        paren='};'
        sedcmd+="$crtcmd" crtFd+=1
    done
    sedcmd+=" -e '$paren'"
    
    eval sed -n "$sedcmd" 

>     Usage: parallel.sh -t <tags> [-b <re>] [-a <re>] command args
>       -h            show this
>       -t <tags>   coma separated liste of tags to send to separated tasks
>                     or single tag, '-t' option could be submited multiple times
>       -b <re>     sed regex to match before tags
>       -a <re>     sed regex to match after tags
>       command     Any command to be run once for each tag.
>                     Special string "<RE>" will be replaced by current tag.

This script can be found here: [parallel.sh][1].

Tested on your use case with:

<!-- language: lang-bash -->
     ./parallel.sh -t{0..9} -b ^ awk '{c++;s+=$1;ss+=$1*$1}END{a=s/c;print <RE>,sqrt((ss-a*a)/c)}' <testfile

Note that the only change from your command line is `print <RE>,sqrt...`, where `<RE>` is replaced by the corresponding tag (`-t`) in each subtask.


<!-- language: lang-none --> 
    9 55.6751
    8 58.0447
    7 55.6755
    6 58.3663
    5 58.696
    4 58.2724
    3 54.9797
    2 57.5355
    1 54.6131
    0 57.1334

## Comparison with GNU `parallel`

Of course this is about *line buffered filtering*, not suitable for *block buffered* distribution!!

I've tested with a simple 1000-line random file:

<!-- language: lang-bash -->
    for ((i=1000;i--;)){ echo $((RANDOM%10)) $((RANDOM%100));} >testfile

then using `parallel`:

<!-- language: lang-bash -->
    sd() {
      awk '{s+=$'$1';ss+=$'$1'*$'$1'}END{a=s/NR;print sqrt((ss-a*a)/NR)}'
    }
    export -f sd
    time parallel -j10 --colsep ' ' --bin 1 --pipe \
        --tagstring {%} sd 2 <testfile |sort 

<!-- language: lang-none -->
    10      58.3703
    1       50.7911
    2       56.9009
    3       55.0832
    4       52.5365
    5       65.0864
    6       61.4079
    7       55.5353
    8       62.337
    9       51.2512
    
    real    0m0.488s
    user    0m1.158s
    sys     0m0.272s

 and using `sed` + `bash`:

<!-- language: lang-bash -->
    time ./parallel.sh -t{0..9} -b ^ awk \
      '{c++;s+=$1;ss+=$1*$1}END{a=s/c;print <RE>,sqrt((ss-a*a)/c)}' <testfile |
        sort

<!-- language: lang-none -->
    0 58.3703
    1 50.7911
    2 56.9009
    3 55.0832
    4 52.5365
    5 65.0864
    6 61.4079
    7 55.5353
    8 62.337
    9 51.2512
    
    real    0m0.010s
    user    0m0.009s
    sys     0m0.000s

Fortunately, the computed results are the same! (The `parallel` version outputs `10` instead of `0`.)

The `bash`+`sed` version
 - uses *tags* instead of numbers
 - uses far fewer system resources
 - is quicker

### Test with bigger and smaller files:

> <!-- language: lang-none -->
>                        Number of lines     Real     User   System
>     parallel.sh            100'000'000   73.117   72.598    0.416
>     parallel (perl)        100'000'000  129.264  383.701   36.319
>     
>     parallel.sh              1'000'000    0.744    0.728    0.013
>     parallel (perl)          1'000'000    1.798    5.571    0.613
>     
>     parallel.sh                 10'000    0.018    0.007    0.009
>     parallel (perl)             10'000    0.523    1.148    0.269

Here is the output of `ps --tty pts/4 fw` while `parallel.sh` was running in `pts/4`:

> <!-- language: lang-none -->
>        5352 pts/4    Ss     0:00 -bash
>        5983 pts/4    S+     0:00  \_ /bin/bash ./parallel.sh -t0 -t1 -t2..
>        5985 pts/4    R+     0:13  |   \_ sed -n -e /^0/{s/^0//;w/dev/fd/..
>        5986 pts/4    S+     0:00  |       \_ awk {c++;s+=$1;ss+=$1*$1}EN..
>        5987 pts/4    S+     0:00  |       \_ awk {c++;s+=$1;ss+=$1*$1}EN..
>        5988 pts/4    S+     0:00  |       \_ awk {c++;s+=$1;ss+=$1*$1}EN..
>        5989 pts/4    S+     0:00  |       \_ awk {c++;s+=$1;ss+=$1*$1}EN..
>        5990 pts/4    S+     0:00  |       \_ awk {c++;s+=$1;ss+=$1*$1}EN..
>        5991 pts/4    S+     0:00  |       \_ awk {c++;s+=$1;ss+=$1*$1}EN..
>        5992 pts/4    S+     0:00  |       \_ awk {c++;s+=$1;ss+=$1*$1}EN..
>        5993 pts/4    S+     0:00  |       \_ awk {c++;s+=$1;ss+=$1*$1}EN..
>        5994 pts/4    S+     0:00  |       \_ awk {c++;s+=$1;ss+=$1*$1}EN..
>        5995 pts/4    S+     0:00  |       \_ awk {c++;s+=$1;ss+=$1*$1}EN..
>        5984 pts/4    S+     0:00  \_ sort


Here `bash` executes `sed`, which runs 10 `awk` processes, piped to `sort`. Looks OK!

Here is the output of `ps --tty pts/4 fw` while `parallel` (perl) was running:

> <!-- language: lang-none -->
>        5352 pts/4    Ss     0:00 -bash
>        5777 pts/4    S+     0:00  \_ /usr/bin/perl /usr/bin/parallel -j1..
>        5780 pts/4    R+     0:17  |   \_ /usr/bin/perl /usr/bin/parallel..
>        5956 pts/4    R      0:16  |   |   \_ perl -e  use B; my $sep = s..
>        5957 pts/4    R      0:16  |   |   \_ perl -e  use B; my $sep = s..
>     snip 7 lines
>        5965 pts/4    R      0:16  |   |   \_ perl -e  use B; my $sep = s..
>        5793 pts/4    S      0:00  |   \_ /usr/bin/bash -c perl -e '{use ..
>        5794 pts/4    S      0:00  |   |   \_ perl -e {use POSIX qw(:errn..
>        5795 pts/4    S      0:00  |   |   \_ /usr/bin/bash -c perl -e '{..
>        5796 pts/4    S      0:01  |   |       \_ awk {s+=$2;ss+=$2*$2}EN..
>     snip 33 lines
>        5852 pts/4    S      0:00  |   \_ /usr/bin/bash -c perl -e '{use ..
>        5867 pts/4    S      0:00  |       \_ perl -e {use POSIX qw(:errn..
>        5868 pts/4    S      0:00  |       \_ /usr/bin/bash -c perl -e '{..
>        5870 pts/4    S      0:01  |           \_ awk {s+=$2;ss+=$2*$2}EN..
>        5778 pts/4    S+     0:00  \_ sort

Well!! **52** processes are executed to split one stream across 10 subprocesses!! Each subprocess requires 5 subtasks?!

## Use case:

Quick demo on a *log file*:

<!-- language: lang-bash -->
    { tags=($(cut -d\[ -f 1 | cut -d\  -f 5 | sort -u)) ;} <daemon.log 
    ./parallel.sh "${tags[@]/#/-t}" -b \\b -a \\[ bash -c \
        $'printf \47 - %-20s  %8d %8d %8d\\n\47 "$1" $(wc)' -- "<RE>" <daemon.log  |
        sort

This will run as many tasks as there are *tags*, then output `wc` for each substream.

Note: the syntax `${tags[@]/#/-t}` expands to `-tdhclient -tdnsmasq -tsystemd ...`.

<!-- language: lang-none -->
     - accton                      14      154     1165
     - dbus-daemon                 80     1273    13731
     - dhclient                  6480    79920   542160
     - dnsmasq                   6480    49680   401760
     - systemd                 154608  1474418 10664639
    ...

But you could also create different filters for different targets:

<!-- language: lang-bash -->
    tags=( dhclient dnsmasq systemd )
    ./parallel.sh ${tags[@]/#/-t} -b \\b -a \\[ \
            "./filter-<RE>.sh" <daemon.log

This will run 3 different tasks, `./filter-dnsmasq.sh`, `./filter-dhclient.sh` and `./filter-systemd.sh`, then parse the *log file* and send the matching lines to each specific task.

### Remark about *parallel* output:

Regarding [Ole Tange's comment][2], this seems clear to me: if you ask many tasks to speak together on the ***same single* `STDOUT`**, you may observe strange mixed lines!

If your filter has to output continuously, you have to **drive** its output in a proper way!

    ./parallel.sh -t{0..9} -b ^ -a ' ' <testfile sh \
        -c 'sed "s/^/group <RE> /" >/tmp/file-<RE>.txt'
    cat /tmp/file-?.txt

Note: I've finally managed to run a test in the spirit of the comments:<br />`parallel.sh ... | sort | uniq -c...`.<br />For this to run, the script has to be modified by adding *`fifos`* that are merged by a `cat` at the end of the script, something like:

>     +++ parallel.sh   2023-03-07 19:17:08.976802098 +0100
>     @@ -40,5 +40,8 @@
>      for re in "${tags[@]}";do
>     +    pasteFd=/tmp/fifo-pp-r$crtFd
>     +    mkfifo $pasteFd
>     +    pasteFds+=($pasteFd)
>          printf -v crtcmd '%q ' "${@//\<RE\>/$re}"
>     -    printf -v crtcmd ' -e \47%s/%s/{s/%s//;w/dev/fd/%d\47 %d> >(exec %s) ' \
>     -    "$paren" "$before$re$after"{,} $crtFd $crtFd "$crtcmd"
>     +    printf -v crtcmd ' -e \47%s/%s/{s/%s//;w/dev/fd/%d\47 %d> >(exec %s >%s) '\
>     +    "$paren" "$before$re$after"{,} $crtFd $crtFd "$crtcmd" $pasteFd
>          paren='};'
>     @@ -47,3 +50,4 @@
>      sedcmd+=" -e '$paren'"
>     -
>     -eval sed -n "$sedcmd" 
>     +eval sed -n "$sedcmd" &
>     +parcat ${pasteFds[@]}
>     +rm ${pasteFds[@]}

I will probably add some options to do this properly in my final script (on my website).

  [1]: https://f-hauri.ch/vrac/parallel.sh.txt
  [2]: https://stackoverflow.com/questions/75546641/bash-awk-parallel-select-process-for-each-line-of-a-huge-file/75548027?noredirect=1#comment133485356_75548027

</details>



# Answer 3
**Score**: 1

With awk, do you really need to make it parallel at all?

( time ( nice mawk2 -v ___='98766669' '
              BEGIN { srand()
                    srand()
                  CONVFMT = OFMT = "%.250g"
                  __ = (_+=_^=_<_)+_^++_
                 ___*=(_=__^(++_+_))^!_;
                   _ = 61277761 * 65537

                  while(___--) { print int(__*rand()) %__,
                                       int( _*rand())  } }' | pvE0 |

  mawk2 '
  BEGIN {
      ___ += ___ = _^= SUBSEP = ""
      CONVFMT = OFMT = "%.250g"
  } {
      __ = $(_ =  ___)
      if ( ((_ = $--_) in _____)==(_<_) ) {

          _____[_] = sprintf(" Grp =[ %15s ]= ",_)
      }
      ____[_"|"]++            # counter
      ____[_"]"]+= __         # sum
      ____[_"["]+= __*__      # sum of squares

  } END {
      for (______ = _<_; ______!~".."; ______++) {
         _ = ______
         printf(" %s\f\r\t\t| %23.f #\f\r\t\t| "\
                "%37.13f avg.\f\r\t\t| %37.13f st.dv.\n",
                         _____[_],
                 ___=__ = ____[_"|"],
                     __ = ____[_"]"] * (___^= -_^(_<_)),          # inverting counter
                 (___ * ( ____[_"["] -  __^(_+=_^=_<_)))^_^-_^!_) # n^(1/2) == sqrt

  } }' ) )

      in0: 1.45GiB 0:00:29 [49.9MiB/s] [49.9MiB/s] [ <=>                       ]
Grp =[               0 ]= 
|                 9874382 #
|           2007927772624.7209472656250 avg.
|           2318542792000.9663085937500 st.dv.
Grp =[               1 ]= 
|                 9877831 #
|           2007986083790.7338867187500 avg.
|           2318611292854.9780273437500 st.dv.
Grp =[               2 ]= 
|                 9873525 #
|           2008346714329.9284667968750 avg.
|           2318968400134.7607421875000 st.dv.
Grp =[               3 ]= 
|                 9877464 #
|           2007416025675.7121582031250 avg.
|           2318105286630.4780273437500 st.dv.
Grp =[               4 ]= 
|                 9878524 #
|           2007843011030.0527343750000 avg.
|           2318514145523.2456054687500 st.dv.
Grp =[               5 ]= 
|                 9875712 #
|           2008091963744.2180175781250 avg.
|           2318593468063.0859375000000 st.dv.
Grp =[               6 ]= 
|                 9875784 #
|           2008134171756.2131347656250 avg.
|           2318721989221.3188476562500 st.dv.
Grp =[               7 ]= 
|                 9881377 #
|           2007915282626.9929199218750 avg.
|           2318585484730.8193359375000 st.dv.
Grp =[               8 ]= 
|                 9877341 #
|           2008109181888.2607421875000 avg.
|           2318760855885.9106445312500 st.dv.
Grp =[               9 ]= 
|                 9874729 #
|           2008153989929.5683593750000 avg.
|           2318791539162.9497070312500 st.dv.
( nice mawk2 -v ___=&#39;98766669&#39;  | pvE 0.1 in0 | mawk2 ; )  
50.38s user 0.99s system 172% cpu 29.700 total

It took awk just 29.7 seconds end-to-end to generate 98.7 million rows of random integers between 0 (inclusive) and the product of these 2 primes (exclusive):

>+ 65,537 : 1 + 4^8
>+ 61,277,761 : 8^8, digit reversed

and then to aggregate the 1.45 GB output from step 1 and calculate the relevant stats per group.

Answer 4

Score: 0

If I understand your requirement, this may help:

seq 1 1000 | awk '{print int(10*rand()),int(100*rand())}' \
| awk '{
    sArr[$1]+=$2; sCnt[$1]++;
    ssArr[$1]+=$2*$2
  }
  END{
    for(i in sCnt){
      if ( dbg) { print "processing group "i }
      a=sArr[i]/sCnt[i]
      # parenthesise the file name so every awk parses the redirection the same way
      print sqrt((ssArr[i]-a*a)/sCnt[i]) > ("sd" i)
      close("sd" i)
    }
  }'

But I'm in a rush and my math skills are very rusty, so take this with a grain of salt (-; Thanks to markp-fuso for improving the comments!

Instead of using the $1 value to write an intermediate file, use it as a key for arrays that store the values of interest. Even though you're reading very large files, this pares memory usage down to just the per-key values you need.
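
As a minimal sketch of that array-keyed approach, the helper below (the name `group_sd` and the sample data are hypothetical) accumulates count, sum and sum of squares per key in a single pass. Note that it uses the textbook population formula sqrt(ss/n - mean^2), which differs from the `sqrt((ss-a*a)/NR)` expression used in the question and the answers; substitute whichever statistic you actually need.

```shell
# Hypothetical helper: prints "group count mean sd" for "key value" lines on stdin.
group_sd() {
    awk '
    {
        n[$1]++               # count per group
        s[$1]  += $2          # sum per group
        ss[$1] += $2 * $2     # sum of squares per group
    }
    END {
        for (g in n) {
            mean = s[g] / n[g]
            var  = ss[g] / n[g] - mean * mean   # E[x^2] - (E[x])^2
            if (var < 0) var = 0                # guard against rounding error
            print g, n[g], mean, sqrt(var)
        }
    }' | sort -n    # "for (g in n)" has no defined order, so sort by group
}

printf '%s\n' '0 2' '0 4' '0 4' '0 4' '0 5' '0 5' '0 7' '0 9' '1 1' '1 3' | group_sd
```

Only three numbers per group are held in memory, so the cost depends on the number of categories rather than on the number of lines.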

P.S. Whitespace after the continuation character (\), seen in several places, will cause an error.

$ ls -l sd*
-rw-r--r-- 1 Neil_2 None 8 Feb 23 09:51 sd0
-rw-r--r-- 1 Neil_2 None 8 Feb 23 09:51 sd1
-rw-r--r-- 1 shellter None 8 Feb 23 09:51 sd2
-rw-r--r-- 1 shellter None 8 Feb 23 09:51 sd3
-rw-r--r-- 1 shellter None 8 Feb 23 09:51 sd4
-rw-r--r-- 1 shellter None 8 Feb 23 09:51 sd5
-rw-r--r-- 1 shellter None 8 Feb 23 09:51 sd6
-rw-r--r-- 1 shellter None 8 Feb 23 09:51 sd7
-rw-r--r-- 1 shellter None 8 Feb 23 09:51 sd8
-rw-r--r-- 1 shellter None 8 Feb 23 09:51 sd9
$ awk '{print FILENAME " " $0}' sd*
sd0 58.8944
sd1 57.7931
sd2 43.1367
sd3 45.3593
sd4 69.5813
sd5 58.2163
sd6 39.3107
sd7 53.4005
sd8 60.4184
sd9 65.4446
