Clickhouse-client Code: 36. DB::Exception: Positional options are not supported. (BAD_ARGUMENTS) bash script
Question
Here is my bash script for inserting Parquet files into ClickHouse in parallel. It keeps giving me the error in the title, and I don't know why. Any help is appreciated.
#!/bin/bash
time (for FILENAME in /mnt/sdc/traces/part-*.snappy.parquet; do
echo $FILENAME
xargs -P 6 -n 1 -0 clickhouse-client --receive_timeout=100000 --query=\"INSERT INTO ethereum.traces FORMAT Parquet\" < $FILENAME
done)
Answer 1
Score: 1
One way to implement this would look like:
#!/bin/bash
cpu_count=6
batch_size=4
printf '%s\0' /mnt/sdc/traces/part-*.snappy.parquet |
  xargs -P"$cpu_count" -n"$batch_size" -0 sh -c '
    for filename in "$@"; do
      echo "$filename"
      clickhouse-client --receive_timeout=100000 --query="INSERT INTO ethereum.traces FORMAT Parquet" <"$filename"
    done
  ' _
- xargs requires its stdin to be a list of arguments to pass to the program it invokes. That wasn't the case at all in your original code, which fed the contents of the parquet files directly to xargs on its stdin. Here, we instead pass it a NUL-delimited list of the names of the parquet files.
- The -n argument to xargs tells it how many files to pass to each copy of sh. Using a low number like 1 reduces the chance that you won't be parallelizing well when the number of files left is below the batch size, but increases the performance overhead of starting up new shells.
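To see the batching behavior described above without touching ClickHouse, here is a minimal sketch. The temporary directory and the five empty file names are made up for illustration; each sh invocation just reports how many file names it received.

```shell
#!/bin/bash
# Hypothetical demo data: five empty files, one with a space in its name.
demo_dir=$(mktemp -d)
touch "$demo_dir/part-1.parquet" "$demo_dir/part 2.parquet" \
      "$demo_dir/part-3.parquet" "$demo_dir/part-4.parquet" \
      "$demo_dir/part-5.parquet"

# Send NUL-delimited file NAMES (not file contents) to xargs.
# -n 2 means each sh invocation gets at most two names in "$@";
# "$#" is the number of names that invocation received.
batch_counts=$(printf '%s\0' "$demo_dir"/part*.parquet |
  xargs -0 -n 2 sh -c 'echo "$#"' _ |
  paste -sd ' ' -)

echo "$batch_counts"   # prints: 2 2 1
rm -rf "$demo_dir"
```

Without -P the batches run one after another, so the output order is stable. Adding -P runs the batches concurrently, and the order of their output may then vary.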
Answer 2
Score: 0
Try it without the backslash in front of those two double-quotes.
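A small sketch of why the backslashes matter: with \" the quotes become literal characters, so the shell splits the query on spaces into several separate words. Presumably the extra words then reach clickhouse-client as positional arguments, which would match the error in the title. The count_args helper below is hypothetical and only counts the arguments it receives.

```shell
#!/bin/bash
# Hypothetical helper: prints how many arguments it was called with.
count_args() { echo "$#"; }

# Escaped quotes are literal characters, so the words are split on spaces.
escaped=$(count_args --query=\"INSERT INTO ethereum.traces FORMAT Parquet\")

# Real quotes keep the whole query together as a single argument.
quoted=$(count_args --query="INSERT INTO ethereum.traces FORMAT Parquet")

echo "escaped: $escaped, quoted: $quoted"   # prints: escaped: 5, quoted: 1
```

Note that removing the backslashes fixes only the quoting. The original script would still feed file contents to xargs as arguments, which the first answer addresses.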