Clickhouse-client Code: 36. DB::Exception: Positional options are not supported. (BAD_ARGUMENTS) bash script

Question

Here is my bash script for inserting Parquet files into ClickHouse in parallel. It keeps failing with the error in the title, and I don't know why. Any help is appreciated.

#!/bin/bash
time (for FILENAME in /mnt/sdc/traces/part-*.snappy.parquet; do
            echo $FILENAME
            xargs -P 6 -n 1 -0 clickhouse-client --receive_timeout=100000 --query=\"INSERT INTO ethereum.traces FORMAT Parquet\" < $FILENAME
        done)

Answer 1

Score: 1

One way to implement this would look like:

#!/bin/bash
cpu_count=6
batch_size=4

printf '%s\0' /mnt/sdc/traces/part-*.snappy.parquet |
  xargs -P"$cpu_count" -n"$batch_size" -0 sh -c '
    for filename in "$@"; do
      echo "$filename"
      clickhouse-client --receive_timeout=100000 --query="INSERT INTO ethereum.traces FORMAT Parquet" <"$filename"
    done
  ' _
  • xargs requires its stdin to be a list of arguments to pass to the program it invokes. That wasn't the case at all in your original code, which was passing xargs parquet files directly on its stdin -- whereas here, we're passing it a NUL-delimited list of names of parquet files.
  • The -n argument to xargs tells it how many files to pass to each copy of sh. Using a low number like 1 reduces the chance that you won't be parallelizing well when the number of files left is below the batch size, but increases the performance overhead of starting up new shells.
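
To see both points in isolation, here is a minimal sketch that swaps clickhouse-client for echo. The file names are made up for illustration and are not from the question. It shows that xargs -0 turns each NUL-terminated item on stdin into one argument, and that -n caps how many arguments each sh invocation receives.

```shell
#!/bin/sh
# Hypothetical names standing in for the parquet files (not from the question).
# printf writes each name followed by a NUL byte; xargs -0 splits on those NULs,
# so a name containing a space still arrives as a single argument.
batch_demo() {
  printf '%s\0' 'a.parquet' 'b c.parquet' 'd.parquet' |
    xargs -0 -n 2 sh -c 'echo "batch of $#: $*"' _
}

batch_demo
```

Run without -P, this prints "batch of 2: a.parquet b c.parquet" and then "batch of 1: d.parquet". The last batch is smaller than the batch size, which is the situation the second bullet warns about.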

Answer 2

Score: 0

Try it without the backslash in front of those two double-quotes.
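
A small sketch of why the backslashes matter, using a helper function (count_args, a name invented here) in place of clickhouse-client. An escaped quote is just a literal character, so the shell still splits the query on spaces. Presumably the client then sees the extra words as positional options, which would match the error in the title, though that link is an assumption and not something confirmed in the thread.

```shell
#!/bin/sh
# count_args is a hypothetical stand-in for clickhouse-client:
# it only reports how many arguments it was given.
count_args() { echo "$#"; }

# Escaped quotes are literal characters, so word splitting still happens.
escaped=$(count_args --query=\"INSERT INTO ethereum.traces FORMAT Parquet\")
# Real quotes keep the whole query together as one argument.
quoted=$(count_args --query="INSERT INTO ethereum.traces FORMAT Parquet")

echo "escaped: $escaped, quoted: $quoted"
```

This prints "escaped: 5, quoted: 1". Removing the backslashes fixes the argument count, but the script would still need the xargs change from Answer 1, since the parquet data itself was being fed to xargs on stdin.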

huangapple
  • Posted on February 27, 2023 at 10:36:27
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75576345.html