2023年2月18日 09:42:11go评论63阅读模式

英文:

Which efficient & portable shell statement on GNU/Linux can zero-pad piped bytes to word boundary?

问题

我需要在字节流的末尾填充NUL字节，使得输出长度可以被N整除，以防超出可用存储和内存。我正在实现的函数背景如下：
#!/bin/sh
generate_arbitrary_length | paddingN | work_with_padded

对于N=8192的有效代码如下：
padding8192(){ dd status=none bs=8192 conv=sync ; }

但是对于小的N值来说，减小复制块的大小会*大幅度*减慢速度，这个方法并不适用：
padding4(){ dd status=none bs=4 conv=sync ; }

我可以使用`wc`和`dd`来表达计数和填充，通过复制输入流的方式：

padding4(){ { { tee /dev/fd/3 >&2 ; } 3>&1 | wc -c | { read -r isize ; pad=$(( 4 - isize % 4)) ; [ 0 -lt $pad ] && dd status=none if=/dev/zero bs=$pad count=1 >&2 ; } } 2>&1 ; }

这个方法已经快了很多。但是非常难以阅读——谁能说清楚为什么填充最终会到达文件结束符（EOF）？

有更好的方法吗？

尽管我只需要保留尽量少的状态来存储字节数除以字大小的余数，但我无法想出一个简单而又高效的shell内建实现方式。依赖应保持最小化：使用GNU coreutils/cpio/tar，不使用会在busybox/dash/bash之间有差异的编译器/perl/特性。我还没有想出一个`awk`解决方案，因为我无法使其在二进制输入上表现良好（每秒的字节数），而且输入并非等量的NL/NUL分隔成行。

英文:

I need to pad NUL bytes at the end of a byte stream exceeding available storage & memory, so output length is divisible by N. Context of the function I am implementing:

#!/bin/sh
generate_arbitrary_length | paddingN | work_with_padded

Working code for N=8192:

padding8192(){ dd status=none bs=8192 conv=sync ; }

But reducing copy block size is orders of magnitude slower for small N, this did not finish:

padding4(){ dd status=none bs=4 conv=sync ; }

I can express the counting & padding using wc and dd, after duplicating the input stream:

padding4(){ { { tee /dev/fd/3 &gt;&amp;2 ; } 3&gt;&amp;1 | wc -c | { read -r isize ; pad=$(( 4 - isize % 4)) ; [ 0 -lt $pad ] &amp;&amp; dd status=none if=/dev/zero bs=$pad count=1 &gt;&amp;2 ; } } 2&gt;&amp;1 ; }

Much faster already. But very difficult to read - who could even tell why padding ends up at EOF?

Any better approach?

Though I only need to keep as much state as needed to store byte count modulo word size, I cannot think of a simple yet performant implementation using shell builtins. Dependencies should remain minimal: using GNU coreutils/cpio/tar, no compiler/perl/features that would differ between busybox/dash/bash. I have not come up with an awk solution as I failed to make it perform well (G/s) on binary input not evenly NL/NUL-separated into lines.

答案1

得分: 1

由于你提到有一个编译器可用，这里有一个小巧、便携的C程序。它不会变得更快，也不会占用更多内存。对于大多数编程社区的人来说，这个程序甚至是可读的。如果不是的话，你总是可以撒上 /* 注释！*/。:-)

#!/bin/sh
#
# pad.sh - 对输入进行填充，从标准输入读取大块数据，写入标准输出。

# 填充 $1:填充字符 $2:对齐 $3:块大小
padding () {
aout="./a$$.out"
cc -x c -o "$aout" - <<EOF
#include <stdio.h>

int main (void) {
  size_t align = $2, nwritten = 0, nread;
  char buffer[$3];
  while ((nread = fread (buffer, 1, sizeof buffer, stdin)) > 0)
    nwritten += fwrite (buffer, 1, nread, stdout);
  if ((nwritten % align) != 0)
    for (align -= nwritten % align; align != 0; --align)
      putchar ($1);
  return 0;
}
EOF
"$aout" && rm "$aout"
}

printf '%s' 123456789 | padding 0       4 16384                  | od -c
printf '%s' abcdefghi | padding "'\n'" 16 BUFSIZ                 | od -c
printf '%s' PAGE_SIZE | padding 65     32 "$(getconf PAGE_SIZE)" | od -c

在运行时：

$ ./pad.sh
0000000    1   2   3   4   5   6   7   8   9  $ ./pad.sh
0000000    1   2   3   4   5   6   7   8   9  \0  \0  \0
0000014
0000000    a   b   c   d   e   f   g   h   i  \n  \n  \n  \n  \n  \n  \n
0000020
0000000    P   A   G   E   _   S   I   Z   E   A   A   A   A   A   A   A
0000020    A   A   A   A   A   A   A   A   A   A   A   A   A   A   A   A
0000040
  $ ./pad.sh
0000000    1   2   3   4   5   6   7   8   9  \0  \0  \0
0000014
0000000    a   b   c   d   e   f   g   h   i  \n  \n  \n  \n  \n  \n  \n
0000020
0000000    P   A   G   E   _   S   I   Z   E   A   A   A   A   A   A   A
0000020    A   A   A   A   A   A   A   A   A   A   A   A   A   A   A   A
0000040
  $ ./pad.sh
0000000    1   2   3   4   5   6   7   8   9  \0  \0  \0
0000014
0000000    a   b   c   d   e   f   g   h   i  \n  \n  \n  \n  \n  \n  \n
0000020
0000000    P   A   G   E   _   S   I   Z   E   A   A   A   A   A   A   A
0000020    A   A   A   A   A   A   A   A   A   A   A   A   A   A   A   A
0000040

0000014
0000000    a   b   c   d   e   f   g   h   i  \n  \n  \n  \n  \n  \n  \n
0000020
0000000    P   A   G   E   _   S   I   Z   E   A   A   A   A   A   A   A
0000020    A   A   A   A   A   A   A   A   A   A   A   A   A   A   A   A
0000040

如果你担心非POSIX编译器选项 -x c，你可以轻松地将C程序写入 pad.c，然后从那里编译它。高级的 fwrite、fread 和 putchar 的错误处理留给读者。

请注意这个文档避免了 main 函数需要解析参数。你甚至可以传递像 PAGE_SIZE 这样的字符串，如果你的stdio默认提供了它们。

我刚意识到像这样编译C程序并没有太大区别于一个巧妙的awk脚本 -- awk也会编译内部程序然后执行它。还有什么比编译到机器的CPU上并运行可执行文件更好的呢？

英文:

Since you mention there's a compiler available, here's a tiny, portable C program. It does not get any faster and memory-economic. It's even readable for most people in the programming community. If not, you can always sprinkle /* Comments! */.

#!/bin/sh
#
# pad.sh - pad input, reading in large blocks from stdin, writing stdout.

# padding $1:padchar $2:alignment $3:blocksize
padding () {
aout=&quot;./a$$.out&quot;
cc -x c -o &quot;$aout&quot; - &lt;&lt;EOF
#include &lt;stdio.h&gt;

int main (void) {
  size_t align = $2, nwritten = 0, nread;
  char buffer[$3];
  while ((nread = fread (buffer, 1, sizeof buffer, stdin)) &gt; 0)
    nwritten += fwrite (buffer, 1, nread, stdout);
  if ((nwritten % align) != 0)
    for (align -= nwritten % align; align != 0; --align)
      putchar ($1);
  return 0;
}
EOF
&quot;$aout&quot; &amp;&amp; rm &quot;$aout&quot;
}

printf &#39;%s&#39; 123456789 | padding 0       4 16384                  | od -c
printf &#39;%s&#39; abcdefghi | padding &quot;&#39;\n&#39;&quot; 16 BUFSIZ                 | od -c
printf &#39;%s&#39; PAGE_SIZE | padding 65     32 &quot;$(getconf PAGE_SIZE)&quot; | od -c

In action:

$ ./pad.sh
0000000    1   2   3   4   5   6   7   8   9  $ ./pad.sh
0000000    1   2   3   4   5   6   7   8   9  \0  \0  \0
0000014
0000000    a   b   c   d   e   f   g   h   i  \n  \n  \n  \n  \n  \n  \n
0000020
0000000    P   A   G   E   _   S   I   Z   E   A   A   A   A   A   A   A
0000020    A   A   A   A   A   A   A   A   A   A   A   A   A   A   A   A
0000040
  $ ./pad.sh
0000000    1   2   3   4   5   6   7   8   9  \0  \0  \0
0000014
0000000    a   b   c   d   e   f   g   h   i  \n  \n  \n  \n  \n  \n  \n
0000020
0000000    P   A   G   E   _   S   I   Z   E   A   A   A   A   A   A   A
0000020    A   A   A   A   A   A   A   A   A   A   A   A   A   A   A   A
0000040
  $ ./pad.sh
0000000    1   2   3   4   5   6   7   8   9  \0  \0  \0
0000014
0000000    a   b   c   d   e   f   g   h   i  \n  \n  \n  \n  \n  \n  \n
0000020
0000000    P   A   G   E   _   S   I   Z   E   A   A   A   A   A   A   A
0000020    A   A   A   A   A   A   A   A   A   A   A   A   A   A   A   A
0000040

0000014
0000000    a   b   c   d   e   f   g   h   i  \n  \n  \n  \n  \n  \n  \n
0000020
0000000    P   A   G   E   _   S   I   Z   E   A   A   A   A   A   A   A
0000020    A   A   A   A   A   A   A   A   A   A   A   A   A   A   A   A
0000040

If you are concerned about the non-POSIXly compiler option -x c you can easily write the C program to pad.c and compile it from there. Advanced error handling for fwrite, fread and putchar left to the reader.

Note how the here-document avoids main having to parse arguments. You can even pass strings like PAGE_SIZE if your stdio makes them available by default.

I just realized that compiling C like this is not much different from a nifty awk script -- awk also compiles an internal program and then executes it. What's better than compiling to the machine's CPU and running the executable?

答案2

得分: 0

严格遵循POSIX规范的做法是使用临时文件。

    padding() (
       tmpf=$(mktemp) &&
       trap 'rm "$tmpf"' EXIT &&
       tee "$tmpf" &&
       isize=$(wc -c <"$tmpf") &&
       pad=$(( $1 - isize % $1 )) &&
       if [ "$pad" -ne 0 ]; then
           dd status=none if=/dev/zero bs="$pad" count=1
       fi
    )

英文:

The POSIX thing to do would be to use a temporary file.

padding() (
   tmpf=$(mktemp) &amp;&amp;
   trap &#39;rm &quot;$tmpf&quot;&#39; EXIT &amp;&amp;
   tee &quot;$tmpf&quot; &amp;&amp;
   isize=$(wc -c &lt;&quot;$tmpf&quot;) &amp;&amp;
   pad=$(( $1 - isize % $1 )) &amp;&amp;
   if [ &quot;$pad&quot; -ne 0 ]; then
       dd status=none if=/dev/zero bs=&quot;$pad&quot; count=1
   fi
)

通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库，让每个人都能够通过互相帮助和分享经验来进步。

Which efficient & portable shell statement on GNU/Linux can zero-pad piped bytes to word boundary?

问题

答案1

答案2

“Explorer ‘浏览文件夹'”

是否可能从受限制的（自定义）shell中逃脱？

Shell command execution on Python2.7

如何验证所有CSV文件的第一行是否相同？

What's the correct way to type hint an empty list as a literal in python?

如何在Highcharts Gantt中更改本地化的星期名称

如何在同一个流中使用多个过滤器和映射函数？

如何使用Map/Set来将代码优化到O(n)？

.NET MAUI Android在GitHub Actions上构建失败，错误代码为1。

如何在Playwright视觉比较中屏蔽多个定位器？

在C++中，可以使用可变模板参数来检索类型的内部类型。

selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: stale element not found

Creating and opening a URL to log in to Website via Basic Auth with Robot Framework/Selenium (Python)

AG Grid 在上下文菜单中以大文本形式打开

发表评论