Bash scripting question. Potential Error with Testcase?

Question


I am given a logfile containing live trading logs. Now I need to aggregate the symbols we trade along with the number of trades for that symbol in decreasing order of the number of trades. Please note that each line corresponds to a single trade.

My Input - I have the logs in the following format:

Datetime(YYYYMMDDHHMMSS), Stock, Symbol, BUY/SELL, Quantity, Price
20190102091055,IBM,BUY,1,160
20190102091058,BABA,SELL,10,155
20190102091059,IBM,BUY,2,159

Expected Output:

IBM 2
BABA 1

This is the code I have tried:

#!/bin/bash

declare -A symbols  # Declare an associative array to store the symbols and their counts

# Read input line by line
while read line
do
  # Extract the stock symbol from the line
  symbol=$(echo "$line" | cut -d',' -f2)

  # If the symbol already exists in the array, increment its count
  # Otherwise, add the symbol to the array with a count of 1
  if [[ ${symbols[$symbol]+_} ]]; then
    symbols[$symbol]=$((symbols[$symbol]+1))
  else
    symbols[$symbol]=1
  fi
done < "${1:-/dev/stdin}" # Read input from stdin or from a file specified as a command-line argument

# Sort the symbols and their counts in descending order of count
for symbol in "${!symbols[@]}"
do
  echo "$symbol ${symbols[$symbol]}"
done | sort -rnk2

Error we received:

Error code:

Input (stdin)

20190527200918,AMZN,SELL,32,1830
20190527200918,AMZN,BUY,26,1827
20190527200918,IBM,SELL,12,139
20190527200918,IBM,SELL,93,144
20190527200918,IBM,SELL,6,141
20190527200918,AMZN,BUY,44,1833
20190527200918,GOOG,SELL,77,1145
20190527200918,GOOG,BUY,89,1135
20190527200918,IBM,BUY,21,139
20190527200918,AMZN,BUY,89,1834
20190527200918,IBM,SELL,80,139
20190527200918,MSFT,SELL,48,135
20190527200918,MSFT,BUY,66,131
20190527200918,MSFT,SELL,21,141
20190527200918,AMZN,SELL,5,1826
20190527200918,MSFT,BUY,47,141
20190527200918,AMZN,SELL,19,1833
20190527200918,AMZN,BUY,22,1831
20190527200918,IBM,BUY,75,139
20190527200918,GOOG,BUY,70,1141
20190527200918,AAPL,SELL,43,182
20190527200918,MSFT,BUY,7,136
20190527200918,GOOG,SELL,89,1147
20190527200918,AMZN,SELL,54,1828
20190527200918,AAPL,SELL,7,189
20190527200918,MSFT,SELL,66,136
20190527200918,AAPL,SELL,31,189
20190527200918,IBM,BUY,39,137
20190527200918,MSFT,SELL,10,128
20190527200918,IBM,BUY,15,146
20190527200918,IBM,SELL,38,133
20190527200918,IBM,SELL,76,146
20190527200918,G{-truncated-}

Your Output (stdout)

IBM 26
MSFT 22
AMZN 21
GOOG 17
AAPL 13

Expected Output

IBM 27
MSFT 22
AMZN 21
GOOG 17
AAPL 13

Answer 1

Score: 1


A possible explanation for Your Output differing from the Expected Output by one trade for one symbol is that the missing trade is on the last input line and that line is not terminated by a newline. In that case read returns a non-zero status even though it has assigned the partial line, so the loop body never runs for it. This error can be avoided by testing the assigned line rather than the exit status of read, i.e. by changing while read line to while read line; [ "$line" ].
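The effect is easy to reproduce in isolation. The following is only a minimal sketch: the function names sample, count_naive and count_robust are made up for illustration, and the two-line input is invented test data whose last record deliberately lacks a trailing newline.

```bash
#!/bin/bash
# Hypothetical sample input: printf adds no newline after the last record.
sample() { printf '20190527200918,IBM,SELL,12,139\n20190527200918,IBM,BUY,21,139'; }

# Loops on read's exit status only, so the unterminated last line is dropped.
count_naive() {
  n=0
  while IFS= read -r line; do n=$((n + 1)); done
  echo "$n"
}

# Also accepts a non-empty partial line left in $line at end of input.
count_robust() {
  n=0
  while IFS= read -r line || [ -n "$line" ]; do n=$((n + 1)); done
  echo "$n"
}

sample | count_naive    # prints 1
sample | count_robust   # prints 2
```

One design note: the read ... || [ -n "$line" ] form keeps going past blank lines in the middle of the input, whereas while read line; [ "$line" ] stops at the first empty line.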

Answer 2

Score: 1


As Armali suggests, a broken (unterminated) last record will do this.

$: head -9 file | wc
      9      14     314
$: head -9 file | ./tst # listed below (after adding fix suggested below)
IBM 5
AMZN 2
BABA 1

$: head -c 310 file | wc    # knocks off last 4 bytes
      8      14     310
$: head -c 310 file | ./tst # loses broken record
IBM 4
AMZN 2
BABA 1

c.f. BashFAQ #1 - you might try

while IFS=, read -r _ symbol _ || [[ -n "$symbol" ]]

which checks the target variable and processes the line if it received a value, even when read itself reported end of file. With that added, the two outputs above will be the same.

awk will be a lot faster, though, makes it easy to skip the header, and won't choke on that broken record.

awk -F, 'NR>1{sym[$2]++}END{for(k in sym)printf "%s %s\n",k,sym[k]}' "$1" | sort -k2nr
IBM 12
AMZN 8
MSFT 7
GOOG 4
AAPL 3
BABA 1

$: head -9 file | awk -F, 'NR>1{ sym[$2]++; } END { for(k in sym) printf "%s %s\n", k, sym[k]}' | sort -k2nr
IBM 5
AMZN 2
BABA 1

$: head -c 310 file | awk -F, 'NR>1{ sym[$2]++; } END { for(k in sym) printf "%s %s\n", k, sym[k]}' | sort -k2nr
IBM 5
AMZN 2
BABA 1

printf also gives you a LOT of formatting options, if (for example) you'd rather use a tab, or set the fields to justified fixed widths. We could write a sort in the awk as well, but sort is pretty well optimized, and this is less work. YMMV.
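As an illustration of the formatting point, here is a rough sketch that pads the symbol to a fixed width and right-aligns the count. The function name count_aligned and the widths 6 and 5 are arbitrary choices for this example, and it assumes, as above, that the first line is a header and field 2 holds the symbol.

```bash
#!/bin/bash
# Hypothetical helper: count trades per symbol and print aligned columns.
# Reads the files given as arguments, or stdin when there are none.
count_aligned() {
  awk -F, 'NR > 1 { sym[$2]++ }
           END { for (k in sym) printf "%-6s %5d\n", k, sym[k] }' "$@" |
    sort -k2,2nr   # numeric, descending, on the count column
}

# Invented sample data with a header line.
printf '%s\n' \
  'Datetime(YYYYMMDDHHMMSS), Stock, Symbol, BUY/SELL, Quantity, Price' \
  '20190102091055,IBM,BUY,1,160' \
  '20190102091058,BABA,SELL,10,155' \
  '20190102091059,IBM,BUY,2,159' | count_aligned
```

Swapping the format string for "%s\t%d\n" would give tab-separated output instead; sort still finds the count because both spaces and tabs separate fields by default.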

My input -

$: cat file
Datetime(YYYYMMDDHHMMSS), Stock, Symbol, BUY/SELL, Quantity, Price
20190102091055,IBM,BUY,1,160
20190102091058,BABA,SELL,10,155
20190102091059,IBM,BUY,2,159
20190527200918,AMZN,SELL,32,1830
20190527200918,AMZN,BUY,26,1827
20190527200918,IBM,SELL,12,139
20190527200918,IBM,SELL,93,144
20190527200918,IBM,SELL,6,141
20190527200918,AMZN,BUY,44,1833
20190527200918,GOOG,SELL,77,1145
20190527200918,GOOG,BUY,89,1135
20190527200918,IBM,BUY,21,139
20190527200918,AMZN,BUY,89,1834
20190527200918,IBM,SELL,80,139
20190527200918,MSFT,SELL,48,135
20190527200918,MSFT,BUY,66,131
20190527200918,MSFT,SELL,21,141
20190527200918,AMZN,SELL,5,1826
20190527200918,MSFT,BUY,47,141
20190527200918,AMZN,SELL,19,1833
20190527200918,AMZN,BUY,22,1831
20190527200918,IBM,BUY,75,139
20190527200918,GOOG,BUY,70,1141
20190527200918,AAPL,SELL,43,182
20190527200918,MSFT,BUY,7,136
20190527200918,GOOG,SELL,89,1147
20190527200918,AMZN,SELL,54,1828
20190527200918,AAPL,SELL,7,189
20190527200918,MSFT,SELL,66,136
20190527200918,AAPL,SELL,31,189
20190527200918,IBM,BUY,39,137
20190527200918,MSFT,SELL,10,128
20190527200918,IBM,BUY,15,146
20190527200918,IBM,SELL,38,133
20190527200918,IBM,SELL,76,146

If you just really want it in native bash -

$: cat tst
#!/bin/bash
declare -A cnt=() # associative array for symbols counts

while IFS=, read _ sym _ || [[ -n "$sym" ]]     # allows for broken last record
do [[ "${sym:0:1}" == ' ' ]] || ((cnt[$sym]++)) # skips the header
done < "${1:-/dev/stdin}"

for sym in "${!cnt[@]}"; do echo "$sym ${cnt[$sym]}"; done | sort -rnk2

$: ./tst file
IBM 12
AMZN 8
MSFT 7
GOOG 4
AAPL 3
BABA 1

Just for comparison, I copy/pasted the non-header data in the above file over and over to make it big enough for time tests to be informative.

$: ls -l file
-rw-r--r-- 1 paul 1049089 454071699 Mar  9 09:52 file

Using awk:

$: time { awk -F, 'NR>1{ sym[$2]++; } END { for(k in sym) printf "%s %s\n", k, sym[k]}' file | sort -k2nr; }
IBM 4958016
AMZN 3305344
MSFT 2892176
GOOG 1652672
AAPL 1239504
BABA 413168

real    0m4.179s
user    0m3.905s
sys     0m0.341s

Using M. Nejat Aydin's pipeline:

$: time { cut -d, -f2 file | sort | uniq -c | sort -nr; }
4958016 IBM
3305344 AMZN
2892176 MSFT
1652672 GOOG
1239504 AAPL
 413168 BABA
      1  Stock

real    0m13.070s
user    0m14.623s
sys     0m1.512s

Using my version of your bash as listed above:

$: time ./tst file
IBM 4958016
AMZN 3305344
MSFT 2892176
GOOG 1652672
AAPL 1239504
BABA 413168

real    21m35.327s
user    6m16.405s
sys     15m11.232s

You can see that on a file of any real size, the performance difference is significant: about 4 seconds for awk, 13 seconds for the pipeline, and over 21 minutes for the bash loop.

Answer 3

Score: 0


Do not use pure bash for this task; it's slow. This one-liner should do the trick much faster:

cut -d, -f2 file | sort | uniq -c | sort -nr

or, using awk:

awk -F, '{++c[$2]} END{for (i in c) print i,c[i]}' file | sort -k2nr
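Both one-liners count the header line as well, and the cut pipeline prints the count before the symbol. If the output has to match the "SYMBOL COUNT" format from the question, one possible adaptation is sketched below; the function name count_pipeline is hypothetical, and it assumes the first input line is always a header.

```bash
#!/bin/bash
# Hypothetical wrapper around the cut/sort/uniq pipeline.
# Reads the file given as argument, or stdin when there is none.
count_pipeline() {
  tail -n +2 "$@" |            # drop the header line
    cut -d, -f2 |              # keep only the symbol field
    sort | uniq -c |           # count identical symbols
    sort -k1,1nr |             # highest count first
    awk '{ print $2, $1 }'     # reorder to "SYMBOL COUNT"
}

# Invented sample data; the last record has no trailing newline.
printf 'Datetime, Stock, Symbol\n1,IBM,BUY,1,160\n2,BABA,SELL,10,155\n3,IBM,BUY,2,159' |
  count_pipeline
```

Because these tools read to end of file rather than relying on a newline-terminated read, the unterminated last record is still counted.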

huangapple
  • This article was published on March 9, 2023 at 15:47:03
  • When reposting, please be sure to keep the link to this article: https://go.coder-hub.com/75681709.html