Which efficient & portable shell statement on GNU/Linux can zero-pad piped bytes to word boundary?

Question

I need to append NUL bytes to the end of a byte stream that is larger than available storage and memory, so that the output length is divisible by N. Context of the function I am implementing:

#!/bin/sh
generate_arbitrary_length | paddingN | work_with_padded

Working code for N=8192:

padding8192(){ dd status=none bs=8192 conv=sync ; }
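
One caveat worth checking with this form: `conv=sync` pads every short input block, and a read from a pipe can return fewer bytes than `bs`, so NULs may end up in the middle of the stream. A minimal sketch, assuming GNU dd, whose `iflag=fullblock` keeps reading until a block is full (the function name is made up for illustration):

```shell
#!/bin/sh
# Hypothetical variant of padding8192; assumes GNU dd (iflag=fullblock).
padding8192_fullblock() {
    dd status=none bs=8192 iflag=fullblock conv=sync
}

# Two writes separated by a pause arrive as two short reads on the pipe.
# With fullblock they are gathered into one block before padding.
{ printf a; sleep 1; printf b; } | padding8192_fullblock | wc -c
```

The block size and the performance problem for small N are unchanged; only the handling of short reads differs.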

But shrinking the copy block size makes this orders of magnitude slower for small N; this one did not finish:

padding4(){ dd status=none bs=4 conv=sync ; }

I can express the counting & padding using wc and dd, after duplicating the input stream:

padding4(){ { { tee /dev/fd/3 >&2 ; } 3>&1 | wc -c | { read -r isize ; pad=$(( 4 - isize % 4)) ; [ 0 -lt $pad ] && dd status=none if=/dev/zero bs=$pad count=1 >&2 ; } } 2>&1 ; }

Much faster already. But very difficult to read - who could even tell why padding ends up at EOF?
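
One way to see why the padding lands after the last data byte is to lay the same pipeline out over several lines. The sketch below restates the one-liner rather than proposing a different method. The name `padding4_annotated` is made up for illustration, and it assumes `/dev/fd/3` is available, as on GNU/Linux. Unlike the one-liner, it takes the remainder modulo 4 a second time, so already-aligned input gets no padding.

```shell
#!/bin/sh
# Hypothetical name; same plumbing as the one-liner, spread out.
padding4_annotated() {
    {
        # fd 2 of this whole group is the function's real stdout ("2>&1" below).
        # Stage 1: tee copies stdin to fd 2 (the real output) and to fd 3,
        # which "3>&1" points at the pipe feeding wc.
        { tee /dev/fd/3 >&2; } 3>&1 |
        # Stage 2: wc prints the byte count only once it has seen EOF,
        # that is, after tee has passed all data on and exited.
        wc -c |
        # Stage 3: read blocks until wc prints, so the zeros are written last.
        {
            read -r isize
            pad=$(( (4 - isize % 4) % 4 ))   # 0 when already aligned
            if [ "$pad" -gt 0 ]; then
                dd status=none if=/dev/zero bs="$pad" count=1 >&2
            fi
        }
    } 2>&1
}

printf '%s' 12345 | padding4_annotated | od -c
```

For the five input bytes, `od -c` should show the five characters followed by three `\0` bytes. A side effect of this plumbing is that anything the inner commands write to stderr also ends up in the output stream.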

Any better approach?

Though I only need to keep as much state as needed to store byte count modulo word size, I cannot think of a simple yet performant implementation using shell builtins. Dependencies should remain minimal: using GNU coreutils/cpio/tar, no compiler/perl/features that would differ between busybox/dash/bash. I have not come up with an awk solution as I failed to make it perform well (G/s) on binary input not evenly NL/NUL-separated into lines.

Answer 1

Score: 1

Since you mention there's a compiler available, here's a tiny, portable C program. It does not get any faster or more memory-economical than this. It's even readable for most people in the programming community. If not, you can always sprinkle in /* Comments! */ :-)

#!/bin/sh
#
# pad.sh - pad input, reading in large blocks from stdin, writing stdout.

# padding $1:padchar $2:alignment $3:blocksize
padding () {
aout="./a$$.out"
cc -x c -o "$aout" - <<EOF
#include <stdio.h>

int main (void) {
  size_t align = $2, nwritten = 0, nread;
  char buffer[$3];
  while ((nread = fread (buffer, 1, sizeof buffer, stdin)) > 0)
    nwritten += fwrite (buffer, 1, nread, stdout);
  if ((nwritten % align) != 0)
    for (align -= nwritten % align; align != 0; --align)
      putchar ($1);
  return 0;
}
EOF
"$aout" && rm "$aout"
}

printf '%s' 123456789 | padding 0       4 16384                  | od -c
printf '%s' abcdefghi | padding "'\n'" 16 BUFSIZ                 | od -c
printf '%s' PAGE_SIZE | padding 65     32 "$(getconf PAGE_SIZE)" | od -c

In action:

$ ./pad.sh
0000000    1   2   3   4   5   6   7   8   9  \0  \0  \0
0000014
0000000    a   b   c   d   e   f   g   h   i  \n  \n  \n  \n  \n  \n  \n
0000020
0000000    P   A   G   E   _   S   I   Z   E   A   A   A   A   A   A   A
0000020    A   A   A   A   A   A   A   A   A   A   A   A   A   A   A   A
0000040

If you are concerned about the non-POSIX compiler option -x c, you can easily write the C program to pad.c and compile it from there. Advanced error handling for fwrite, fread and putchar is left to the reader.

Note how the here-document spares main from having to parse arguments. You can even pass macro names such as BUFSIZ, as the second example does, or PAGE_SIZE if your stdio makes it available by default.

I just realized that compiling C like this is not much different from a nifty awk script -- awk also compiles an internal program and then executes it. What's better than compiling to the machine's CPU and running the executable?

Answer 2

Score: 0

The POSIX thing to do would be to use a temporary file.

padding() (
   tmpf=$(mktemp) &&
   trap 'rm "$tmpf"' EXIT &&
   tee "$tmpf" &&
   isize=$(wc -c <"$tmpf") &&
   # the second "% $1" yields 0 (not $1) when the input is already aligned
   pad=$(( ($1 - isize % $1) % $1 )) &&
   if [ "$pad" -ne 0 ]; then
       dd status=none if=/dev/zero bs="$pad" count=1
   fi
)
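
The pad length is the only arithmetic in this approach, and it is easy to be off by one block: `N - size % N` evaluates to N, not 0, when the size is already a multiple of N. A small sketch that takes the remainder a second time and prints the result for a few sizes (the helper name `pad_len` is hypothetical):

```shell
#!/bin/sh
# pad_len SIZE N: number of bytes needed to reach the next multiple of N.
# Hypothetical helper, for illustration only.
pad_len() {
    echo $(( ($2 - $1 % $2) % $2 ))
}

# Sizes on and off a 4-byte boundary.
for size in 0 1 3 4 5 8192; do
    printf 'size=%s pad=%s\n' "$size" "$(pad_len "$size" 4)"
done
```

Aligned sizes (0, 4, 8192) should report a pad of 0, while 1, 3 and 5 should report 3, 1 and 3.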

huangapple
  • Posted on February 18, 2023
  • Please keep this link when republishing: https://go.coder-hub.com/75490665.html