Which efficient & portable shell statement on GNU/Linux can zero-pad piped bytes to word boundary?
Question
I need to append NUL bytes to the end of a byte stream that exceeds available storage and memory, so that the output length is divisible by N. Context of the function I am implementing:
#!/bin/sh
generate_arbitrary_length | paddingN | work_with_padded
Working code for N=8192:
padding8192(){ dd status=none bs=8192 conv=sync ; }
But reducing the copy block size makes this orders of magnitude slower for small N; the following did not finish:
padding4(){ dd status=none bs=4 conv=sync ; }
I can express the counting and padding using `wc` and `dd`, after duplicating the input stream:
padding4(){ { { tee /dev/fd/3 >&2 ; } 3>&1 | wc -c | { read -r isize ; pad=$(( (4 - isize % 4) % 4 )) ; [ 0 -lt $pad ] && dd status=none if=/dev/zero bs=$pad count=1 >&2 ; } } 2>&1 ; }
Much faster already. But very difficult to read - who could even tell why padding ends up at EOF?
Any better approach?
Though I only need to keep as much state as needed to store the byte count modulo word size, I cannot think of a simple yet performant implementation using shell builtins. Dependencies should remain minimal: using GNU coreutils/cpio/tar, no compiler/perl/features that would differ between busybox/dash/bash. I have not come up with an `awk` solution, as I failed to make it perform well (G/s) on binary input not evenly NL/NUL-separated into lines.
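For readers puzzling over the one-liner above, here is a spread-out sketch of the same tee/wc idea. It is an illustration, not a tested replacement: the function name `padding` taking N as `$1`, and `head -c` reading from /dev/zero in place of `dd`, are assumptions made here. It also assumes /dev/fd/3 is available (as on Linux) and that stdout is a pipe, as in the pipeline shown earlier.

```shell
#!/bin/sh
# padding N - copy stdin to stdout, then append NUL bytes until the
# total length is a multiple of N. Sketch only; see assumptions above.
padding() {
    n=$1
    {
        # tee sends every chunk to fd 3 (the real stdout) and to wc.
        tee /dev/fd/3 |
        wc -c |
        {
            # wc prints the count only after tee has finished and closed
            # the pipe, so all data is already on stdout at this point.
            read -r size
            pad=$(( (n - size % n) % n ))
            # head -c 0 prints nothing, so no special case is needed.
            head -c "$pad" /dev/zero
        }
    } 3>&1
}

# Usage (hypothetical): printf '%s' 123456789 | padding 4 | od -c
```

The padding lands after the data because `wc` cannot report a byte count until it sees EOF, and it sees EOF only once `tee` has written everything and exited.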
Answer 1
Score: 1
Since you mention there's a compiler available, here's a tiny, portable C program. It hardly gets any faster or more memory-economical than this. It's even readable for most people in the programming community. If not, you can always sprinkle in /* Comments! */.
#!/bin/sh
#
# pad.sh - pad input, reading in large blocks from stdin, writing stdout.
# padding $1:padchar $2:alignment $3:blocksize
padding () {
    aout="./a$$.out"
    cc -x c -o "$aout" - <<EOF
#include <stdio.h>
int main (void) {
    size_t align = $2, nwritten = 0, nread;
    char buffer[$3];
    while ((nread = fread (buffer, 1, sizeof buffer, stdin)) > 0)
        nwritten += fwrite (buffer, 1, nread, stdout);
    if ((nwritten % align) != 0)
        for (align -= nwritten % align; align != 0; --align)
            putchar ($1);
    return 0;
}
EOF
    "$aout" && rm "$aout"
}
printf '%s' 123456789 | padding 0 4 16384 | od -c
printf '%s' abcdefghi | padding "'\n'" 16 BUFSIZ | od -c
printf '%s' PAGE_SIZE | padding 65 32 "$(getconf PAGE_SIZE)" | od -c
In action:
$ ./pad.sh
0000000 1 2 3 4 5 6 7 8 9 \0 \0 \0
0000014
0000000 a b c d e f g h i \n \n \n \n \n \n \n
0000020
0000000 P A G E _ S I Z E A A A A A A A
0000020 A A A A A A A A A A A A A A A A
0000040
If you are concerned about the non-POSIX compiler option `-x c`, you can easily write the C program to pad.c and compile it from there. Advanced error handling for fwrite, fread and putchar is left to the reader.
Note how the here-document avoids main having to parse arguments. You can even pass strings like PAGE_SIZE if your stdio makes them available by default.
I just realized that compiling C like this is not much different from a nifty awk script -- awk also compiles an internal program and then executes it. What's better than compiling to the machine's CPU and running the executable?
Answer 2
Score: 0
The POSIX thing to do would be to use a temporary file.
padding() (
    # $1: alignment N; pad is 0 when the input is already a multiple of N
    tmpf=$(mktemp) &&
    trap 'rm "$tmpf"' EXIT &&
    tee "$tmpf" &&
    isize=$(wc -c <"$tmpf") &&
    pad=$(( ($1 - isize % $1) % $1 )) &&
    if [ "$pad" -ne 0 ]; then
        dd status=none if=/dev/zero bs="$pad" count=1
    fi
)
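The pad length in approaches like this comes down to a single arithmetic expression: the distance from the input size to the next multiple of N, which should be 0 when the size is already aligned. A minimal sketch of that calculation follows; `pad_len` is a hypothetical helper name used only for illustration.

```shell
#!/bin/sh
# pad_len SIZE N - print how many bytes must be appended to SIZE bytes
# so that the total becomes a multiple of N (0 if already aligned).
pad_len() {
    echo $(( ($2 - $1 % $2) % $2 ))
}

pad_len 9 4     # prints 3  (9 + 3 = 12)
pad_len 8 4     # prints 0  (already aligned)
pad_len 9 16    # prints 7  (9 + 7 = 16)
```

The outer `% N` is what maps the aligned case to 0; without it, an aligned input would yield N and receive a whole extra word of padding.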