How to move the point of the decimals in the concerned column (sed)

Question

My assignment is to create a script called script.sed that must satisfy the conditions below:

  • Erase the first two columns (already achieved in the code below)
  • Erase the lines with a 0 in the column "frequency" (already achieved in the code below)
  • Keep the headers of the unmodified fields, erase the titles of the erased columns, and rename the column "freq_prop_headlines" to "%_headlines" (already achieved in the code below)
  • Move the decimal point of freq_prop_headlines two places to the right to obtain the percentage instead of the fraction of one, keeping all the digits and removing any unnecessary zeros before the decimal point. Some records are in scientific notation (all raised to -5); they must also be treated and displayed in decimal notation. (Note by the OP: for example, 0.000814664 has to turn into 0.0814664.)

So only the last point remains to be resolved.

The CSV file I have to work with is called headlines_words.csv, and its first 10 rows are:

Unnamed: 0;Unnamed: 0.1;year;country;word;frequency;count;freq_prop_headlines;word_len;freq_rank;hfreq_rank;theme
12;277;2010;India;cricketer;0;1584;0.0;9;20;20;empowerment
13;278;2011;India;cricketer;0;2438;0.0;9;20;20;empowerment
14;279;2012;India;cricketer;0;3634;0.0;9;20;20;empowerment
15;280;2013;India;cricketer;4;4910;0.000814664;9;20;20;empowerment
16;281;2014;India;cricketer;6;7502;0.0007997869999999;9;20;20;empowerment
17;282;2015;India;cricketer;11;10532;0.001044436;9;20;20;empowerment
18;283;2016;India;cricketer;14;14012;0.000999144;9;20;20;empowerment
19;284;2017;India;cricketer;48;17097;0.00280751;9;20;20;empowerment
20;285;2018;India;cricketer;40;19170;0.002086594;9;20;20;empowerment
21;286;2019;India;cricketer;66;20849;0.003165619;9;20;20;empowerment

The script.sed I have so far is:

# Erase the two first columns:
s/^[^;]*;[^;]*;//

# Erase all rows with count 0 in the frequency column
/^.*;0;/d

# Rename freq_prop_headlines to %_headlines
s/freq_prop_headlines/%_headlines/

# Show the first 10 rows (to ease the code checking by the output)
10q

I have to run the command below (the assignment requires it):

sed -f script.sed headlines_words.csv

When I run my code, I get this:

year;country;word;frequency;count;%_headlines;word_len;freq_rank;hfreq_rank;theme
2013;India;cricketer;4;4910;0.000814664;9;20;20;empowerment
2014;India;cricketer;6;7502;0.0007997869999999;9;20;20;empowerment
2015;India;cricketer;11;10532;0.001044436;9;20;20;empowerment
2016;India;cricketer;14;14012;0.000999144;9;20;20;empowerment
2017;India;cricketer;48;17097;0.00280751;9;20;20;empowerment
2018;India;cricketer;40;19170;0.002086594;9;20;20;empowerment

The expected output must be:

year;country;word;frequency;count;%_headlines;word_len;freq_rank;hfreq_rank;theme
2013;India;cricketer;4;4910;0.0814664;9;20;20;empowerment
2014;India;cricketer;6;7502;0.07997869999999;9;20;20;empowerment
2015;India;cricketer;11;10532;0.1044436;9;20;20;empowerment
2016;India;cricketer;14;14012;0.0999144;9;20;20;empowerment
2017;India;cricketer;48;17097;0.280751;9;20;20;empowerment
2018;India;cricketer;40;19170;0.2086594;9;20;20;empowerment

Now how can I set the last condition of the statement?

Answer 1

Score: 1

Assuming all the values are less than 0.1, the operation can be completed simply by moving the decimal point or, for a value in scientific notation, by changing the exponent. Assuming also that the field to adjust is always the sixth column (after you deleted the first two), try:

s/^\([^;]*;[^;]*;[^;]*;[^;]*;[^;]*;\)0\.0\([0-9]\)\([0-9]*;\)/\1\2.\3/
s/^\([^;]*;[^;]*;[^;]*;[^;]*;[^;]*;1\.[0-9]*[eE]\)-5;/\1-3;/
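
As a rough check, the sketch below runs both basic-regex commands on two rows that already have their first two columns removed. The backreferences in the replacements (\1\2.\3 and \1-3;) are an assumption, reconstructed from the expected output in the question. The second row is a made-up record in scientific notation, not taken from the question's data.

```shell
# Row 1 comes from the question; row 2 is a hypothetical scientific-notation record.
rows='2017;India;cricketer;48;17097;0.00280751;9;20;20;empowerment
2010;India;example;1;66000;1.5e-5;7;20;20;empowerment'

# First command: drop "0.0", keep the next digit, re-insert the point after it.
# Second command: raise the exponent from -5 to -3 (multiplying by 100).
result=$(printf '%s\n' "$rows" | sed \
  -e 's/^\([^;]*;[^;]*;[^;]*;[^;]*;[^;]*;\)0\.0\([0-9]\)\([0-9]*;\)/\1\2.\3/' \
  -e 's/^\([^;]*;[^;]*;[^;]*;[^;]*;[^;]*;1\.[0-9]*[eE]\)-5;/\1-3;/')

printf '%s\n' "$result"
```

Note that the second command only adjusts the exponent, so a value such as 1.5e-5 comes out as 1.5e-3 and stays in scientific notation.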

If you have access to sed -E or -r, the regex can be simplified considerably, for improved legibility and maintainability:

s/^(([^;]*;){5})0\.0([0-9])([0-9]*;)/\1\3.\4/
s/^(([^;]*;){5}1\.[0-9]*[eE])-5;/\1-3;/

In other words, capture the first five columns just so we can replace them with themselves, and capture the parts of the sixth column which we need to keep and reorder. The captured groups are numbered from the left opening parenthesis, so \1 refers to what the first group captured (in these cases, the first five columns), \2 to whatever the second leftmost parentheses captured, etc.
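
To see the group numbering in action, here is a small sketch that applies the -E form of the first command to one row from the question (with the first two columns already removed). The replacement \1\3.\4 is an assumption, reconstructed from the explanation above and the expected output. The variable names are only for illustration.

```shell
# One sample row, already stripped of its first two columns.
row='2013;India;cricketer;4;4910;0.000814664;9;20;20;empowerment'

# \1 = the first five columns, \2 = the last repetition of the inner group (unused),
# \3 = the digit right after "0.0", \4 = the remaining digits plus the trailing ";".
result=$(printf '%s\n' "$row" |
  sed -E 's/^(([^;]*;){5})0\.0([0-9])([0-9]*;)/\1\3.\4/')

printf '%s\n' "$result"
```

Because the inner group ([^;]*;) opens second, it takes number 2 even though its content is never used, which is why the replacement skips from \1 to \3.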

As an aside, the expression which renames one of the headers should strictly speaking only be applied to the first line.

1s/freq_prop_headlines/%_headlines/

Even more strictly speaking, you should perhaps anchor the expression to avoid matching a substring of another field name; but I'll leave that as an exercise.
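
One possible way to do that anchoring, as a sketch only: require a semicolon on both sides of the name, which works here because the column sits in the middle of the header. The header below is hypothetical and includes a made-up column, freq_prop_headlines_raw, to show that a longer name is left alone.

```shell
# Hypothetical header: the first match candidate contains the target name as a prefix.
header='year;freq_prop_headlines_raw;freq_prop_headlines;theme'

# Only rename when the name fills the whole field (delimited by ";" on both sides),
# and only on line 1.
result=$(printf '%s\n' "$header" |
  sed '1s/;freq_prop_headlines;/;%_headlines;/')

printf '%s\n' "$result"
```

This form uses only basic regular expressions, so it should also work with the plain sed -f script.sed invocation the question requires. It would not match if the column were the first or last field of the header.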

Answer 2

Score: 0

If freq_prop_headlines is the only column containing decimals, adding the following to your script might help. The expressions use extended regular expression syntax, so sed needs the -E (or -r) option.

# freq_prop_headlines is not exponential
# ex) 0.0012345 ---> 000.12345
/;([0-9]+)\.([0-9]{2})([0-9]*);/s//;\1\2.\3;/

# freq_prop_headlines is exponential
# ex) 123.45e-5 ---> 000123.45 ---> 000.12345
/;([0-9]+)\.([0-9]+)[eE]-5;/{
  s//;000\1\.\2;/
  s/([0-9]+)([0-9]{3})\.([0-9]+)/\1.\2\3/
}

# ex) 000.12345 ---> 0.12345
s/;0+\./;0./

# ex) 0001.2345 ---> 1.2345
s/;0+([1-9]+\.)/;\1/
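
The steps of this answer can be tried out with the sketch below. It assumes the backreferences shown in the comments' examples (\1\2.\3 and so on), writes the commands to a temporary file created with mktemp, and runs them with sed -E. The second input row is a made-up record in scientific notation, not taken from the question's data.

```shell
# Write the steps to a temporary script file.
script=$(mktemp)
cat > "$script" <<'EOF'
/;([0-9]+)\.([0-9]{2})([0-9]*);/s//;\1\2.\3;/
/;([0-9]+)\.([0-9]+)[eE]-5;/{
  s//;000\1\.\2;/
  s/([0-9]+)([0-9]{3})\.([0-9]+)/\1.\2\3/
}
s/;0+\./;0./
s/;0+([1-9]+\.)/;\1/
EOF

# Row 1 comes from the question; row 2 is a hypothetical scientific-notation record.
rows='2013;India;cricketer;4;4910;0.000814664;9;20;20;empowerment
2010;India;example;1;66000;1.5e-5;7;20;20;empowerment'

# The empty regex in s//.../ reuses the address pattern that just matched.
result=$(printf '%s\n' "$rows" | sed -E -f "$script")
rm -f "$script"

printf '%s\n' "$result"
```

With these inputs the scientific-notation value ends up in plain decimal form (1.5e-5 becomes 0.0015), which is what the assignment asks for.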

huangapple
  • Published on 2023-05-06 19:54:35
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76188719.html