awk delete line from csv if char length of second column is less than 12

Question

I have a csv which looks like so:

42242,"France."
2343242,"Laboratoire de Virologie, AP-HP, Hôpital Necker-Enfants malades, Paris, France."
234234,"brazil"
23432423,"colombia"
234234,"Université de Paris, Institut Cochin, INSERM U1016, CNRS UMR8104, Paris, France."
234234,"Paris."

I would like to delete rows where the char length of the second column is <12.

I think awk can do this:

awk -F , '$2=length>12' file > filout

but this seems wrong.

The output I want is:

2343242,"Laboratoire de Virologie, AP-HP, Hôpital Necker-Enfants malades, Paris, France."
234234,"Université de Paris, Institut Cochin, INSERM U1016, CNRS UMR8104, Paris, France."
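
For context, here is a minimal sketch of what the attempted command does. awk reads `$2=length>12` as `$2 = (length($0) > 12)`: it overwrites the second field with 1 or 0 and prints the rebuilt record whenever that value is 1. The scratch file from `mktemp` and the variable name `attempt` are assumptions made for this illustration.

```shell
# Sample rows from the question, written to a scratch file.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
42242,"France."
2343242,"Laboratoire de Virologie, AP-HP, Hôpital Necker-Enfants malades, Paris, France."
234234,"brazil"
23432423,"colombia"
234234,"Université de Paris, Institut Cochin, INSERM U1016, CNRS UMR8104, Paris, France."
234234,"Paris."
EOF

# The pattern is an assignment, not a test on field 2:
#   $2 = (length($0) > 12)
# Every sample row is longer than 12 characters, so every row is printed,
# with field 2 replaced by 1 and the fields rejoined with spaces.
attempt=$(awk -F , '$2=length>12' "$tmp")
printf '%s\n' "$attempt"

rm -f "$tmp"
```

With this data no row is filtered out, and the first row comes out as `42242 1`, which is why the result looks wrong.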

Answer 1

Score: 3

$ awk -F, 'length($2)>=12' input_file
2343242,"Laboratoire de Virologie, AP-HP, Hôpital Necker-Enfants malades, Paris, France."
234234,"Université de Paris, Institut Cochin, INSERM U1016, CNRS UMR8104, Paris, France."

Answer 2

Score: 1

As your second field is contained within double quotes, you must use the double quote, rather than the comma, as the separator to determine the length of the second field:

awk -F\" 'length($2)>=12' file

If you just print the length of the second field, you will see what I mean. First using the comma as separator:

awk -F, '{print length($2)}' file
9
25
8
10
20
8

Second, using the double quote as the separator:

awk -F\" '{print length($2)}' file
7
79
6
8
80
6
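
The two listings above can be put side by side in a single pass. This is a minimal sketch: the scratch file and the names `lengths` and `q` are assumptions for illustration. The counts for the two rows with accented letters may be one higher if your awk counts bytes rather than characters.

```shell
# Sample rows from the question, written to a scratch file.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
42242,"France."
2343242,"Laboratoire de Virologie, AP-HP, Hôpital Necker-Enfants malades, Paris, France."
234234,"brazil"
23432423,"colombia"
234234,"Université de Paris, Institut Cochin, INSERM U1016, CNRS UMR8104, Paris, France."
234234,"Paris."
EOF

# Column 1: length of $2 when fields are split on the comma.
# Column 2: length of the text between the first pair of double quotes,
#           obtained by re-splitting the record with split().
lengths=$(awk -F, '{
    split($0, q, "\"")
    printf "%d\t%d\n", length($2), length(q[2])
}' "$tmp")
printf 'comma\tquote\n%s\n' "$lengths"

rm -f "$tmp"
```

The comma-split length includes the opening quote and stops at the first comma inside the quoted text, which is why only the quote-split column reflects the real field length.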

Answer 3

Score: 0

Add gsub(/[\200-\277]/, "&") to count UTF-8 characters correctly when awk runs in byte mode, assuming well-formed UTF-8 input: gsub returns the number of UTF-8 continuation bytes, which is then subtracted from the byte length. Skip this part if you are using gawk in Unicode mode.

    echo '42242,"France."
    2343242,"Laboratoire de Virologie, AP-HP, Hôpital Necker-Enfants malades, Paris, France."
    234234,"brazil"
    23432423,"colombia"
    234234,"Université de Paris, Institut Cochin, INSERM U1016, CNRS UMR8104, Paris, France."
    234234,"Paris."' |

To do it the CSV way without a proper parser (or even a Unicode-aware awk), measure the full row length, subtract the index of the first comma, then subtract 2 more for the quotation marks:

    mawk '(_+=++_)^_^_<(-_-- + length($--_) \
                               - index($_, FS) \
                               - gsub(/[\200-\277]/, "&"))' FS=','

To do it the double-quotes ("...") way:

    gawk '((_+=++_)^_^_-_^_)<length($(--_+_--))' FS='"'

Output:

    2343242,"Laboratoire de Virologie, AP-HP, Hôpital Necker-Enfants malades, Paris, France."
    234234,"Université de Paris, Institut Cochin, INSERM U1016, CNRS UMR8104, Paris, France."
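
The commands in this answer are deliberately terse. Below is a plainer sketch of the same two ideas, not a line-by-line translation. It uses the question's threshold of 12, forces byte mode with `LC_ALL=C`, and the names `field`, `chars`, `csv_way` and `quote_way` are assumptions for illustration.

```shell
# Sample rows from the question, written to a scratch file.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
42242,"France."
2343242,"Laboratoire de Virologie, AP-HP, Hôpital Necker-Enfants malades, Paris, France."
234234,"brazil"
23432423,"colombia"
234234,"Université de Paris, Institut Cochin, INSERM U1016, CNRS UMR8104, Paris, France."
234234,"Paris."
EOF

# CSV way: take everything after the first comma and subtract 2 for the quotes.
# In byte mode length() counts bytes, so also subtract the number of UTF-8
# continuation bytes (0x80-0xBF) that gsub() reports.
csv_way=$(LC_ALL=C awk '{
    field = substr($0, index($0, ",") + 1)
    chars = length(field) - 2 - gsub(/[\200-\277]/, "&", field)
} chars >= 12' "$tmp")

# Double-quote way: split on the quote so $2 is the quoted text itself.
quote_way=$(LC_ALL=C awk -F'"' '{
    field = $2
    chars = length(field) - gsub(/[\200-\277]/, "&", field)
} chars >= 12' "$tmp")

printf '%s\n' "$csv_way"
rm -f "$tmp"
```

Both variants assume the first column contains no comma or quote and that the second column is always quoted.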

Answer 4

Score: 0

Using `ruby` with `CSV.parse` on slightly modified data to show correct output on commas within quotes.

```ruby
% ruby -r 'csv' -ne 'if CSV.parse($_).map {|i|
    if i[1].length >= 12 then true end}[0] then puts $_ end' file
2343242,"Labor, atoire de Virologie, AP-HP, Hôpital Necker-Enfants malades, Paris, France."
23432423,Institut Cochin,"colombia"
234234,"Université de Paris, Institut Cochin, INSERM U1016, CNRS UMR8104, Paris, France."
```

#### Data ####

    % cat file
    42242,"France."
    2343242,"Labor, atoire de Virologie, AP-HP, Hôpital Necker-Enfants malades, Paris, France."
    234234,"brazil"
    23432423,Institut Cochin,"colombia"
    234234,"Université de Paris, Institut Cochin, INSERM U1016, CNRS UMR8104, Paris, France."
    234234,"Paris."



# Answer 5
**Score**: 0

The data

    42242,"France."
    2343242,"Laboratoire de Virologie, AP-HP, Hôpital Necker-Enfants malades, Paris, France."
    234234,"brazil"
    23432423,"colombia"
    234234,"Université de Paris, Institut Cochin, INSERM U1016, CNRS UMR8104, Paris, France."
    234234,"Paris."

does have `,` *inside* quoted values, so just setting the field separator to `,` will not work: observe that, e.g., the 2nd column of the 2nd row will be `"Laboratoire de Virologie`. To counter that, you might use `FPAT`, described in the [More CSV chapter of the GNU AWK manual][1], as follows:

    awk 'BEGIN{FPAT="([^,]*)|(\"([^\"]|\"\")+\")"}length($2)>12' file.csv

Keep in mind, though, that the `"` characters are included in quoted fields, so you might need to adjust the value in the comparison.


  [1]: https://www.gnu.org/savannah-checkouts/gnu/gawk/manual/html_node/More-CSV.html
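
`FPAT` is a gawk extension. As a rough sketch of the same idea in portable awk, the helper below takes the second field while respecting quotes, then strips the quotes before measuring, so the comparison needs no adjustment. The function `second_field`, the variable `kept`, and the threshold `>= 12` (taken from the question) are assumptions for illustration, not something the GNU AWK manual provides.

```shell
# Sample rows from the question, written to a scratch file.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
42242,"France."
2343242,"Laboratoire de Virologie, AP-HP, Hôpital Necker-Enfants malades, Paris, France."
234234,"brazil"
23432423,"colombia"
234234,"Université de Paris, Institut Cochin, INSERM U1016, CNRS UMR8104, Paris, France."
234234,"Paris."
EOF

kept=$(awk '
# second_field() is a hypothetical helper: it returns column 2 of a CSV
# record with the surrounding quotes removed.
function second_field(rec,    n, i, len, f) {
    for (n = 1; n <= 2; n++) {
        if (substr(rec, 1, 1) == "\"") {
            # Quoted field: runs to the closing quote; "" is an escaped quote.
            len = match(rec, /^"([^"]|"")*"/) ? RLENGTH : length(rec)
        } else {
            # Bare field: runs to the next comma, or to the end of the record.
            i = index(rec, ",")
            len = i ? i - 1 : length(rec)
        }
        f = substr(rec, 1, len)
        rec = substr(rec, len + 2)    # skip the field and the comma after it
    }
    if (f ~ /^".*"$/) {               # strip the quotes and unescape ""
        f = substr(f, 2, length(f) - 2)
        gsub(/""/, "\"", f)
    }
    return f
}
# Keep rows whose unquoted second column has at least 12 characters.
length(second_field($0)) >= 12
' "$tmp")

printf '%s\n' "$kept"
rm -f "$tmp"
```

This handles commas inside quotes and doubled quotes, but not quoted fields that span several lines; for those a real CSV parser is the safer choice.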



huangapple
  • Posted on February 9, 2023 at 01:36:59
  • Please keep this link when reposting: https://go.coder-hub.com/75389694.html