February 9, 2023 01:36:59

awk delete line from csv if char length of second column is less than 12

Question

I have a CSV which looks like this:

42242,"France."
2343242,"Laboratoire de Virologie, AP-HP, Hôpital Necker-Enfants malades, Paris, France."
234234,"brazil"
23432423,"colombia"
234234,"Université de Paris, Institut Cochin, INSERM U1016, CNRS UMR8104, Paris, France."
234234,"Paris."

I would like to delete rows where the character length of the second column is less than 12.

I think awk can do this:

awk -F , '$2=length>12' file >filout

but this seems wrong.

After deleting those rows, I want to get:

2343242,"Laboratoire de Virologie, AP-HP, Hôpital Necker-Enfants malades, Paris, France."
234234,"Université de Paris, Institut Cochin, INSERM U1016, CNRS UMR8104, Paris, France."

Answer 1

Score: 3

$ awk -F, 'length($2)>=12' input_file
2343242,"Laboratoire de Virologie, AP-HP, Hôpital Necker-Enfants malades, Paris, France."
234234,"Université de Paris, Institut Cochin, INSERM U1016, CNRS UMR8104, Paris, France."
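
For comparison, here is a small sketch of what the command from the question appears to do. awk parses `$2=length>12` as `$2 = (length($0) > 12)`: it compares the length of the whole record with 12, assigns the result (1 or 0) to the second field, and uses that value as the pattern. The helper name `demo_original` and the short row `1,"a"` are made up for illustration.

```shell
# Hypothetical helper wrapping the command from the question.
# awk reads the program as: $2 = (length($0) > 12)
demo_original() {
  awk -F, '$2=length>12'
}

# The whole record is 15 characters long, so $2 becomes 1 and the record
# is rebuilt with the default output separator (a space).
printf '%s\n' '42242,"France."' | demo_original   # prints: 42242 1

# The whole record is 5 characters long, so $2 becomes 0, the pattern is
# false, and nothing is printed.
printf '%s\n' '1,"a"' | demo_original
```

So the original attempt both tests the wrong length and overwrites the second column, which is why `length($2)>=12` is used instead.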

Answer 2

Score: 1

As your second field is enclosed in double quotes, you must use the double quote, rather than the comma, as the separator to determine the length of the second field:

awk -F\" 'length($2)>=12' file

If you just print the length of the second field, you will see what I mean. First, using the comma as the separator:

awk -F, '{print length($2)}' file
9
25
8
10
20
8

Second, using the double quote as the separator:

awk -F\" '{print length($2)}' file
7
79
6
8
80
6
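
The two measurements can also be printed side by side. This is only a sketch: the function name `field_lengths` is made up, and only the ASCII rows from the sample are used so that the counts do not depend on the locale or on which awk is installed. Note that the double-quote approach assumes the second column is the first quoted field on the line.

```shell
# Hypothetical helper: for each row, print the length of field 2 when
# splitting on commas, then when splitting on double quotes.
field_lengths() {
  awk '{
    split($0, by_comma, ",")    # by_comma[2] keeps the quotes and stops at the first inner comma
    split($0, by_quote, "\"")   # by_quote[2] is the text between the first pair of quotes
    print length(by_comma[2]), length(by_quote[2])
  }'
}

printf '%s\n' '42242,"France."' '234234,"brazil"' | field_lengths
# 9 7
# 8 6
```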

Answer 3

Score: 0

Add gsub(/[\200-\277]/, "&") to count UTF-8 characters correctly in byte mode, assuming well-formed UTF-8 input. Skip this part if you are using gawk in Unicode mode.

echo '42242,"France."
2343242,"Laboratoire de Virologie, AP-HP, Hôpital Necker-Enfants malades, Paris, France."
234234,"brazil"
23432423,"colombia"
234234,"Université de Paris, Institut Cochin, INSERM U1016, CNRS UMR8104, Paris, France."
234234,"Paris."' |

To do it the CSV way without a proper parser (or even a Unicode-aware awk): measure the full row length, subtract the string index of the first comma, then subtract 2 more for the quotation marks:

mawk '(_+=++_)^_^_<(-_-- + length($--_) \
                         - index($_, FS) \
                         - gsub(/[\200-\277]/, "&"))' FS=','

To do it the double-quotes ("...") way:

gawk '((_+=++_)^_^_-_^_)<length($(--_+_--))' FS='"'

2343242,"Laboratoire de Virologie, AP-HP, Hôpital Necker-Enfants malades, Paris, France."
234234,"Université de Paris, Institut Cochin, INSERM U1016, CNRS UMR8104, Paris, France."
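
The row-length arithmetic described in this answer can be written more readably. The following is a sketch of that idea, not a transcription of the one-liners above: the function and variable names are made up, the threshold of 12 is taken from the question, and it assumes the first column is plain ASCII and the second column is always quoted. `LC_ALL=C` is set so that awk works byte by byte.

```shell
# Hypothetical helper: keep rows whose second column has at least 12 characters.
keep_long_second_column() {
  LC_ALL=C awk '{
    rest  = length($0) - index($0, ",")   # bytes after the first comma
    cont  = gsub(/[\200-\277]/, "&")      # UTF-8 continuation bytes; "&" leaves the text unchanged
    chars = rest - 2 - cont               # drop the two quotes and the extra bytes
  }
  chars >= 12'
}

# Usage: keep_long_second_column < file
```

Subtracting the continuation bytes turns the byte count into a character count, so a name such as `Hôpital` counts as 7 characters rather than 8 bytes.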

Answer 4

Score: 0

Using `ruby` with `CSV.parse` on slightly modified data to show correct output with commas inside quotes.

```ruby
% ruby -r 'csv' -ne 'if CSV.parse($_).map {|i| 
    if i[1].length >= 12 then true end}[0] then puts $_ end' file
2343242,"Labor, atoire de Virologie, AP-HP, Hôpital Necker-Enfants malades, Paris, France."
23432423,Institut Cochin,"colombia"
234234,"Université de Paris, Institut Cochin, INSERM U1016, CNRS UMR8104, Paris, France."
```

#### Data ####

    % cat file
    42242,"France."
    2343242,"Labor, atoire de Virologie, AP-HP, Hôpital Necker-Enfants malades, Paris, France."
    234234,"brazil"
    23432423,Institut Cochin,"colombia"
    234234,"Université de Paris, Institut Cochin, INSERM U1016, CNRS UMR8104, Paris, France."
    234234,"Paris."

Answer 5

Score: 0

The sample data

    42242,"France."
    2343242,"Laboratoire de Virologie, AP-HP, Hôpital Necker-Enfants malades, Paris, France."
    234234,"brazil"
    23432423,"colombia"
    234234,"Université de Paris, Institut Cochin, INSERM U1016, CNRS UMR8104, Paris, France."
    234234,"Paris."

has `,` *inside* quoted values, so just setting the field separator to `,` will not work. Observe that, for example, the 2nd column of the 2nd row will be `"Laboratoire de Virologie`. To counter that, you might use `FPAT`, as described in the [More CSV chapter of the GNU AWK manual][1], as follows:

    awk 'BEGIN{FPAT="([^,]*)|(\"([^\"]|\"\")+\")"}length($2)>12' file.csv

Keep in mind, though, that the `"` characters are included in quoted fields, so you might need to adjust the value inside the comparison.

  [1]: https://www.gnu.org/savannah-checkouts/gnu/gawk/manual/html_node/More-CSV.html
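
One possible way to make that adjustment is to strip the surrounding quotes before measuring. This is a sketch only: it needs GNU awk, since `FPAT` is a gawk extension; the function name `keep_rows_fpat` is made up; and the threshold of 12 comes from the question.

```shell
# Hypothetical helper: requires GNU awk (gawk), because FPAT is a gawk extension.
keep_rows_fpat() {
  gawk 'BEGIN { FPAT = "([^,]*)|(\"([^\"]|\"\")+\")" }
  {
    f = $2
    # Remove one pair of surrounding double quotes, if the field has them.
    if (f ~ /^".*"$/) f = substr(f, 2, length(f) - 2)
  }
  length(f) >= 12' "$@"
}

# Usage: keep_rows_fpat file.csv
```

Because the quotes are removed only when present, the same test also works for an unquoted second column.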
