Bash scripting question. Potential Error with Testcase?
Question
I am given a logfile containing live trading logs. Now I need to aggregate the symbols we trade along with the number of trades for that symbol in decreasing order of the number of trades. Please note that each line corresponds to a single trade.
My Input - I have the logs in the following format:
Datetime(YYYYMMDDHHMMSS), Stock, Symbol, BUY/SELL, Quantity, Price
20190102091055,IBM,BUY,1,160
20190102091058,BABA,SELL,10,155
20190102091059,IBM,BUY,2,159
Expected Output:
IBM 2
BABA 1
This is the code I have tried:
#!/bin/bash
declare -A symbols # Declare an associative array to store the symbols and their counts
# Read input line by line
while read line
do
    # Extract the stock symbol from the line
    symbol=$(echo "$line" | cut -d',' -f2)
    # If the symbol already exists in the array, increment its count
    # Otherwise, add the symbol to the array with a count of 1
    if [[ ${symbols[$symbol]+_} ]]; then
        symbols[$symbol]=$((symbols[$symbol]+1))
    else
        symbols[$symbol]=1
    fi
done < "${1:-/dev/stdin}" # Read input from stdin or from a file specified as a command-line argument
# Sort the symbols and their counts in descending order of count
for symbol in "${!symbols[@]}"
do
    echo "$symbol ${symbols[$symbol]}"
done | sort -rnk2
Error we received:
Error code:
Input (stdin)
20190527200918,AMZN,SELL,32,1830
20190527200918,AMZN,BUY,26,1827
20190527200918,IBM,SELL,12,139
20190527200918,IBM,SELL,93,144
20190527200918,IBM,SELL,6,141
20190527200918,AMZN,BUY,44,1833
20190527200918,GOOG,SELL,77,1145
20190527200918,GOOG,BUY,89,1135
20190527200918,IBM,BUY,21,139
20190527200918,AMZN,BUY,89,1834
20190527200918,IBM,SELL,80,139
20190527200918,MSFT,SELL,48,135
20190527200918,MSFT,BUY,66,131
20190527200918,MSFT,SELL,21,141
20190527200918,AMZN,SELL,5,1826
20190527200918,MSFT,BUY,47,141
20190527200918,AMZN,SELL,19,1833
20190527200918,AMZN,BUY,22,1831
20190527200918,IBM,BUY,75,139
20190527200918,GOOG,BUY,70,1141
20190527200918,AAPL,SELL,43,182
20190527200918,MSFT,BUY,7,136
20190527200918,GOOG,SELL,89,1147
20190527200918,AMZN,SELL,54,1828
20190527200918,AAPL,SELL,7,189
20190527200918,MSFT,SELL,66,136
20190527200918,AAPL,SELL,31,189
20190527200918,IBM,BUY,39,137
20190527200918,MSFT,SELL,10,128
20190527200918,IBM,BUY,15,146
20190527200918,IBM,SELL,38,133
20190527200918,IBM,SELL,76,146
20190527200918,G{-truncated-}
Your Output (stdout)
IBM 26
MSFT 22
AMZN 21
GOOG 17
AAPL 13
Expected Output
IBM 27
MSFT 22
AMZN 21
GOOG 17
AAPL 13
Answer 1
Score: 1
A possible explanation for Your Output differing from the Expected Output by one trade for one symbol is that the missing trade is on the last input line, and that line is not terminated by a newline. In that case read still assigns the text to the variable but returns a non-zero exit status, so the loop body never runs for that line. The error can be avoided by testing the assigned line rather than the exit status of read, i.e. by changing while read line to while read line; [ "$line" ].
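The difference between the two loop conditions can be seen in a minimal sketch. The function names count_naive and count_testing_line are made up for this illustration and do not appear in the question; the input is three lines with no newline after the last one.

```bash
#!/bin/bash
# Count lines using only read's exit status. A final line without a
# trailing newline is stored in $line, but read returns non-zero,
# so the loop body never sees it.
count_naive() {
    local n=0 line
    while read -r line; do
        n=$((n + 1))
    done
    echo "$n"
}

# Count lines by testing the assigned variable instead, as suggested above.
# Caveat: this form also stops at the first empty line of the input.
count_testing_line() {
    local n=0 line
    while read -r line; [ "$line" ]; do
        n=$((n + 1))
    done
    echo "$n"
}

printf 'a\nb\nc' | count_naive          # prints 2: the unterminated last line is lost
printf 'a\nb\nc' | count_testing_line   # prints 3
```

If the log can contain blank lines, the variant while read -r line || [ -n "$line" ] is probably the safer choice, since it only falls back to the variable test once read has failed.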
Answer 2
Score: 1
As Armali suggests, a broken record will do this.
$: head -9 file | wc
9 14 314
$: head -9 file | ./tst # listed below (after adding fix suggested below)
IBM 5
AMZN 2
BABA 1
$: head -c 310 file | wc # knocks off last 4 bytes
8 14 310
$: head -c 310 file | ./tst # loses broken record
IBM 4
AMZN 2
BABA 1
Cf. BashFAQ #1. You might try
while IFS=, read -r _ symbol _ || [[ -n "$symbol" ]]
which checks the target variable and processes the record if it got a value. With that added, the outputs will be the same.
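As a rough illustration of how that read splits a record, the sketch below prints only the second field of each line. The function name print_symbols is an assumption made for this example; the two sample records come from the question, and the second one deliberately has no trailing newline.

```bash
#!/bin/bash
# Print the second comma-separated field of every record on stdin.
# IFS=, applies to this read only. The first `_` swallows the timestamp,
# the last `_` swallows everything after the symbol.
print_symbols() {
    local symbol
    while IFS=, read -r _ symbol _ || [ -n "$symbol" ]; do
        printf '%s\n' "$symbol"
    done
}

# The final record lacks a newline but is still processed.
printf '20190102091055,IBM,BUY,1,160\n20190102091058,BABA,SELL,10,155' | print_symbols
```

At real end of input read assigns an empty string to symbol, so the fallback test fails and the loop ends.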
awk will be a lot faster, though; it makes it easy to skip the header and won't choke on that broken record.
awk -F, 'NR>1{sym[$2]++}END{for(k in sym)printf "%s %s\n",k,sym[k]}' "$1" | sort -k2nr
IBM 12
AMZN 8
MSFT 7
GOOG 4
AAPL 3
BABA 1
$: head -9 file | awk -F, 'NR>1{ sym[$2]++; } END { for(k in sym) printf "%s %s\n", k, sym[k]}' | sort -k2nr
IBM 5
AMZN 2
BABA 1
$: head -c 310 file | awk -F, 'NR>1{ sym[$2]++; } END { for(k in sym) printf "%s %s\n", k, sym[k]}' | sort -k2nr
IBM 5
AMZN 2
BABA 1
printf also gives you a lot of formatting options if, for example, you'd rather use a tab or set the fields to justified fixed widths. We could write the sort in awk as well, but sort is pretty well optimized, and this is less work. YMMV.
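A fixed-width variant might look like the sketch below. The function name count_trades_aligned and the column widths (6 and 5) are arbitrary choices for this illustration, not something the expected output asks for.

```bash
#!/bin/bash
# Count trades per symbol and print left-justified symbols with
# right-justified counts. Reads the named file, or stdin if none is given.
count_trades_aligned() {
    awk -F, '
        NR > 1 { sym[$2]++ }    # skip the header line, count field 2
        END    { for (k in sym) printf "%-6s %5d\n", k, sym[k] }
    ' "$@" | sort -k2,2nr
}

count_trades_aligned <<'EOF'
Datetime(YYYYMMDDHHMMSS), Stock, Symbol, BUY/SELL, Quantity, Price
20190102091055,IBM,BUY,1,160
20190102091058,BABA,SELL,10,155
20190102091059,IBM,BUY,2,159
EOF
```

The numeric sort key ignores the leading padding, so the ordering is the same as with the unpadded output.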
My input -
$: cat file
Datetime(YYYYMMDDHHMMSS), Stock, Symbol, BUY/SELL, Quantity, Price
20190102091055,IBM,BUY,1,160
20190102091058,BABA,SELL,10,155
20190102091059,IBM,BUY,2,159
20190527200918,AMZN,SELL,32,1830
20190527200918,AMZN,BUY,26,1827
20190527200918,IBM,SELL,12,139
20190527200918,IBM,SELL,93,144
20190527200918,IBM,SELL,6,141
20190527200918,AMZN,BUY,44,1833
20190527200918,GOOG,SELL,77,1145
20190527200918,GOOG,BUY,89,1135
20190527200918,IBM,BUY,21,139
20190527200918,AMZN,BUY,89,1834
20190527200918,IBM,SELL,80,139
20190527200918,MSFT,SELL,48,135
20190527200918,MSFT,BUY,66,131
20190527200918,MSFT,SELL,21,141
20190527200918,AMZN,SELL,5,1826
20190527200918,MSFT,BUY,47,141
20190527200918,AMZN,SELL,19,1833
20190527200918,AMZN,BUY,22,1831
20190527200918,IBM,BUY,75,139
20190527200918,GOOG,BUY,70,1141
20190527200918,AAPL,SELL,43,182
20190527200918,MSFT,BUY,7,136
20190527200918,GOOG,SELL,89,1147
20190527200918,AMZN,SELL,54,1828
20190527200918,AAPL,SELL,7,189
20190527200918,MSFT,SELL,66,136
20190527200918,AAPL,SELL,31,189
20190527200918,IBM,BUY,39,137
20190527200918,MSFT,SELL,10,128
20190527200918,IBM,BUY,15,146
20190527200918,IBM,SELL,38,133
20190527200918,IBM,SELL,76,146
If you really just want it in native bash:
$: cat tst
#!/bin/bash
declare -A cnt=() # associative array for symbols counts
while IFS=, read _ sym _ || [[ -n "$sym" ]] # allows for broken last record
do [[ "${sym:0:1}" == ' ' ]] || ((cnt[$sym]++)) # skips the header
done < "${1:-/dev/stdin}"
for sym in "${!cnt[@]}"; do echo "$sym ${cnt[$sym]}"; done | sort -rnk2
$: ./tst file
IBM 12
AMZN 8
MSFT 7
GOOG 4
AAPL 3
BABA 1
Just for comparison, I copy/pasted the non-header data in the above file over and over to make it big enough for time tests to be informative.
$: ls -l file
-rw-r--r-- 1 paul 1049089 454071699 Mar 9 09:52 file
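The copy/paste step can also be scripted. This is only a sketch of one way to do it, not how the file above was actually built; the function name make_big_file and its three arguments are hypothetical.

```bash
#!/bin/bash
# Build a large test file: one header line followed by the data lines
# of the source file repeated a given number of times.
# Usage: make_big_file SOURCE DEST REPETITIONS
make_big_file() {
    local src="$1" dst="$2" reps="$3" i=0
    head -n 1 "$src" > "$dst"          # keep a single header line
    while [ "$i" -lt "$reps" ]; do
        tail -n +2 "$src"              # every line after the header
        i=$((i + 1))
    done >> "$dst"
}

# Example (file names are placeholders):
# make_big_file file big_file 1000
```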
Using awk:
$: time { awk -F, 'NR>1{ sym[$2]++; } END { for(k in sym) printf "%s %s\n", k, sym[k]}' file | sort -k2nr; }
IBM 4958016
AMZN 3305344
MSFT 2892176
GOOG 1652672
AAPL 1239504
BABA 413168
real 0m4.179s
user 0m3.905s
sys 0m0.341s
Using M. Nejat Aydin's pipeline:
$: time { cut -d, -f2 file | sort | uniq -c | sort -nr; }
4958016 IBM
3305344 AMZN
2892176 MSFT
1652672 GOOG
1239504 AAPL
413168 BABA
1 Stock
real 0m13.070s
user 0m14.623s
sys 0m1.512s
Using my version of your bash script, as listed above:
$: time ./tst file
IBM 4958016
AMZN 3305344
MSFT 2892176
GOOG 1652672
AAPL 1239504
BABA 413168
real 21m35.327s
user 6m16.405s
sys 15m11.232s
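You can see that on a file of any substantial size the performance difference is significant: about 4 seconds for awk, 13 seconds for the cut/sort/uniq pipeline, and over 21 minutes for pure bash.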
Answer 3
Score: 0
Do not use pure bash for this task; it's slow. This one-liner should do the trick much faster:
cut -d, -f2 file | sort | uniq -c | sort -nr
or, using awk:
awk -F, '{++c[$2]} END{for (i in c) print i,c[i]}' file | sort -k2nr
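As written, the cut pipeline counts the header line too (the "1 Stock" row in the timing output of Answer 2) and prints the count before the symbol. A possible adjustment, assuming the first line is always a header, is sketched below; the function name count_trades_pipeline is invented for this example.

```bash
#!/bin/bash
# Count trades per symbol with standard tools, skipping the header and
# printing "SYMBOL COUNT". Reads the named file, or stdin if none is given.
count_trades_pipeline() {
    tail -n +2 "$@" |            # drop the header line
        cut -d, -f2 |            # keep only the symbol field
        sort | uniq -c |         # count identical symbols
        sort -k1,1nr |           # highest count first
        awk '{ print $2, $1 }'   # swap the columns
}

count_trades_pipeline <<'EOF'
Datetime(YYYYMMDDHHMMSS), Stock, Symbol, BUY/SELL, Quantity, Price
20190102091055,IBM,BUY,1,160
20190102091058,BABA,SELL,10,155
20190102091059,IBM,BUY,2,159
EOF
```

For the three sample records from the question this prints IBM 2 followed by BABA 1.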