awk delete line from csv if char length of second column is less than 12

Question


I have a CSV which looks like this:

    42242,"France."
    2343242,"Laboratoire de Virologie, AP-HP, Hôpital Necker-Enfants malades, Paris, France."
    234234,"brazil"
    23432423,"colombia"
    234234,"Université de Paris, Institut Cochin, INSERM U1016, CNRS UMR8104, Paris, France."
    234234,"Paris."

I would like to delete rows where the character length of the second column is less than 12.

I think awk can do this:

    awk -F , '$2=length>12' file >filout

but this seems wrong.

I want to delete those lines to get:

    2343242,"Laboratoire de Virologie, AP-HP, Hôpital Necker-Enfants malades, Paris, France."
    234234,"Université de Paris, Institut Cochin, INSERM U1016, CNRS UMR8104, Paris, France."

Answer 1

Score: 3

$ awk -F, 'length($2)>=12' input_file
2343242,"Laboratoire de Virologie, AP-HP, Hôpital Necker-Enfants malades, Paris, France."
234234,"Université de Paris, Institut Cochin, INSERM U1016, CNRS UMR8104, Paris, France."


Answer 2

Score: 1


As your second field is contained within double quotes, you must use the double quote, rather than the comma, as the separator to determine the length of the second field:

    awk -F\" 'length($2)>=12' file

If you just print the length of the second field, you will see what I mean. First, using the comma as the separator:

    awk -F, '{print length($2)}' file
    9
    25
    8
    10
    20
    8

Second, using the double quote as the separator:

    awk -F\" '{print length($2)}' file
    7
    79
    6
    8
    80
    6
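
To see why the two separators give such different lengths, it can help to print the field itself rather than its length. The following is a minimal sketch using the second row of the sample data; the variable names `row`, `by_comma`, and `by_quote` are illustrative and not part of the original answer.

```shell
# Second row of the sample data from the question.
row='2343242,"Laboratoire de Virologie, AP-HP, Hôpital Necker-Enfants malades, Paris, France."'

# Comma as separator: $2 stops at the first comma inside the quotes
# and still carries the opening double quote.
by_comma=$(printf '%s\n' "$row" | awk -F, '{ print $2 }')

# Double quote as separator: $2 is everything between the two quotes.
by_quote=$(printf '%s\n' "$row" | awk -F'"' '{ print $2 }')

printf 'comma: %s\n' "$by_comma"
printf 'quote: %s\n' "$by_quote"
```

With the comma separator the field is cut at the first embedded comma and keeps its opening quote, which is where the 25 in the first listing comes from.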

Answer 3

Score: 0


Add gsub(/[\200-\277]/, "&") to correctly count the number of UTF-8 characters in byte mode, assuming well-formed UTF-8 input. Skip this part if you are using gawk in Unicode mode.

    echo '42242,"France."
    2343242,"Laboratoire de Virologie, AP-HP, Hôpital Necker-Enfants malades, Paris, France."
    234234,"brazil"
    23432423,"colombia"
    234234,"Université de Paris, Institut Cochin, INSERM U1016, CNRS UMR8104, Paris, France."
    234234,"Paris."' |

To do it the CSV way without a proper parser (or even a Unicode-aware awk): this measures the full row length, minus the string index position of the first comma, then minus 2 more for the quotation marks:

    mawk '(_+=++_)^_^_<(-_-- + length($--_) \
                   - index($_, FS) \
                   - gsub(/[\200-\277]/, "&"))' FS=','

To do it the double-quotes ("...") way:

    gawk '((_+=++_)^_^_-_^_)<length($(--_+_--))' FS='"'

Output:

    2343242,"Laboratoire de Virologie, AP-HP, Hôpital Necker-Enfants malades, Paris, France."
    234234,"Université de Paris, Institut Cochin, INSERM U1016, CNRS UMR8104, Paris, France."
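
The one-liners above are hard to read, so here is a minimal sketch of the byte-mode counting idea on its own, written out plainly. It assumes well-formed UTF-8 input and forces byte mode with `LC_ALL=C`; the two test strings and the variable names are illustrative, not from the original answer. Every UTF-8 continuation byte lies in the octal range `\200-\277`, so subtracting the number of such bytes from the byte length gives the character count.

```shell
# Count UTF-8 characters in byte mode: byte length minus continuation bytes.
counts=$(printf '%s\n' '1,"Université"' '2,"Hôpital Necker"' |
LC_ALL=C awk -F'"' '{
    s = $2                               # work on a copy of the quoted field
    bytes = length(s)                    # byte length under LC_ALL=C
    cont = gsub(/[\200-\277]/, "&", s)   # gsub returns the number of matches
    print bytes - cont                   # number of characters
}')

printf '%s\n' "$counts"
```

To turn this into a filter, the `print` statement could be replaced by something like `if (bytes - cont >= 12) print $0`.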

Answer 4

Score: 0

Using `ruby` with `CSV.parse` on slightly modified data to show correct output with commas inside quotes.

    % ruby -r 'csv' -ne 'if CSV.parse($_).map {|i|
    if i[1].length >= 12 then true end}[0] then puts $_
    end' file
    2343242,"Labor, atoire de Virologie, AP-HP, Hôpital Necker-Enfants malades, Paris, France."
    23432423,Institut Cochin,"colombia"
    234234,"Université de Paris, Institut Cochin, INSERM U1016, CNRS UMR8104, Paris, France."

#### Data ####

    % cat file
    42242,"France."
    2343242,"Labor, atoire de Virologie, AP-HP, Hôpital Necker-Enfants malades, Paris, France."
    234234,"brazil"
    23432423,Institut Cochin,"colombia"
    234234,"Université de Paris, Institut Cochin, INSERM U1016, CNRS UMR8104, Paris, France."
    234234,"Paris."

Answer 5

Score: 0

The sample data

    42242,"France."
    2343242,"Laboratoire de Virologie, AP-HP, Hôpital Necker-Enfants malades, Paris, France."
    234234,"brazil"
    23432423,"colombia"
    234234,"Université de Paris, Institut Cochin, INSERM U1016, CNRS UMR8104, Paris, France."
    234234,"Paris."

does have `,` *inside* quoted values, so just setting the field separator to `,` will not work: observe that, for example, the 2nd column of the 2nd row would be `"Laboratoire de Virologie`. To counter that, you might use `FPAT`, as described in the [More CSV chapter of GNU AWK manual][1], as follows:

    awk 'BEGIN{FPAT="([^,]*)|(\"([^\"]|\"\")+\")"}length($2)>12' file.csv

Keep in mind, though, that the `"` characters are included in quoted fields, so you might need to adjust the value inside the comparison.

[1]: https://www.gnu.org/savannah-checkouts/gnu/gawk/manual/html_node/More-CSV.html
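
As a sketch of the adjustment mentioned above, one option is to strip the surrounding quotes from a copy of the field before measuring it. This assumes GNU awk is installed as `gawk` (`FPAT` is a gawk extension) and uses `>= 12` to match the question's requirement; the variable names `kept` and `field` are illustrative and not part of the original answer.

```shell
# FPAT is a GNU awk extension, so only run the example where gawk exists.
if command -v gawk >/dev/null 2>&1; then
    kept=$(printf '%s\n' \
        '42242,"France."' \
        '2343242,"Laboratoire de Virologie, AP-HP, Hôpital Necker-Enfants malades, Paris, France."' \
        '234234,"brazil"' |
    gawk 'BEGIN { FPAT = "([^,]*)|(\"([^\"]|\"\")+\")" }
    {
        field = $2                  # copy, so the printed row keeps its quotes
        gsub(/^"|"$/, "", field)    # drop the surrounding double quotes
        if (length(field) >= 12) print
    }')
    printf '%s\n' "$kept"
else
    echo 'gawk not found; FPAT requires GNU awk' >&2
fi
```

Working on a copy of `$2` leaves the record untouched, so the surviving rows are printed exactly as they appeared in the input.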

huangapple
  • Posted on February 9, 2023, 01:36:59
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75389694.html