awk delete line from csv if char length of second column is less than 12
Question
I have a csv which looks like so:
42242,"France."
2343242,"Laboratoire de Virologie, AP-HP, Hôpital Necker-Enfants malades, Paris, France."
234234,"brazil"
23432423,"colombia"
234234,"Université de Paris, Institut Cochin, INSERM U1016, CNRS UMR8104, Paris, France."
234234,"Paris."
I would like to delete rows where the character length of the second column is less than 12.
I think awk can do this:
awk -F , '$2=length>12' file >filout
but this seems wrong.
After deleting those rows, I want to get:
2343242,"Laboratoire de Virologie, AP-HP, Hôpital Necker-Enfants malades, Paris, France."
234234,"Université de Paris, Institut Cochin, INSERM U1016, CNRS UMR8104, Paris, France."
Answer 1
Score: 3
$ awk -F, 'length($2)>=12' input_file
2343242,"Laboratoire de Virologie, AP-HP, Hôpital Necker-Enfants malades, Paris, France."
234234,"Université de Paris, Institut Cochin, INSERM U1016, CNRS UMR8104, Paris, France."
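As a side note on why the attempt in the question misbehaves: awk parses `$2=length>12` as an assignment, `$2 = (length($0) > 12)`, so it overwrites the second field with 1 or 0 and then uses that value as the pattern. Here is a minimal sketch on a made-up two-row input; the rows and the function names `broken` and `fixed` are illustrative, not from the question.

```shell
# Hypothetical sample input: one short row, one long row.
sample='1,"abc"
22,"abcdefghijklmnop"'

# The attempt from the question: assigns (length($0) > 12) to $2,
# then prints the rebuilt record whenever that value is non-zero.
broken() { awk -F, '$2=length>12'; }

# The comparison form from this answer: the record is left untouched.
fixed() { awk -F, 'length($2)>=12'; }

printf '%s\n' "$sample" | broken   # prints: 22 1
printf '%s\n' "$sample" | fixed    # prints: 22,"abcdefghijklmnop"
```

The `22 1` output shows both problems at once: the field was replaced by the result of the comparison, and the record was rebuilt with the default output separator (a space) instead of the comma.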
Answer 2
Score: 1
As your second field is contained within double quotes, you must use the double quote, rather than the comma, as the separator to determine the length of the second field:
awk -F\" 'length($2)>=12' file
If you just print the length of the second field, you will see what I mean. First using the comma as separator:
awk -F, '{print length($2)}' file
9
25
8
10
20
8
Second, using the double quote as the separator:
awk -F\" '{print length($2)}' file
7
79
6
8
80
6
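To make the difference concrete, here is a small sketch on made-up rows where the quoted value has a comma near its start. The rows and the function names `by_comma` and `by_quote` are illustrative assumptions. It also assumes the column of interest is the first quoted field on the line and contains no escaped quotes.

```shell
# Hypothetical rows: the first has an early comma inside the quotes.
rows='1,"Hi, this is a long value"
2,"short"'

# Split on commas: $2 is only the text up to the first inner comma.
by_comma() { awk -F, 'length($2)>=12'; }

# Split on double quotes: $2 is the whole quoted value.
by_quote() { awk -F\" 'length($2)>=12'; }

printf '%s\n' "$rows" | by_comma   # prints nothing
printf '%s\n' "$rows" | by_quote   # prints: 1,"Hi, this is a long value"
```

With the comma separator the first row is dropped because its second field is just `"Hi` (3 characters), even though the quoted value is 24 characters long.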
Answer 3
Score: 0
Adding gsub(/[\200-\277]/, "&") lets you count UTF-8 characters correctly in byte mode, assuming well-formed UTF-8 input. Skip this part if you're using gawk in Unicode mode.

echo '42242,"France."
2343242,"Laboratoire de Virologie, AP-HP, Hôpital Necker-Enfants malades, Paris, France."
234234,"brazil"
23432423,"colombia"
234234,"Université de Paris, Institut Cochin, INSERM U1016, CNRS UMR8104, Paris, France."
234234,"Paris."' |

To do it the CSV way without a proper parser (or even a Unicode-aware awk): this measures the full row length, minus the string index of the first comma, minus 2 more for the quotation marks:

mawk '(_+=++_)^_^_<(-_-- + length($--_) \
                         - index($_, FS) \
                         - gsub(/[\200-\277]/, "&"))' FS=','

To do it the double-quotes ("...") way:

gawk '((_+=++_)^_^_-_^_)<length($(--_+_--))' FS='"'

2343242,"Laboratoire de Virologie, AP-HP, Hôpital Necker-Enfants malades, Paris, France."
234234,"Université de Paris, Institut Cochin, INSERM U1016, CNRS UMR8104, Paris, France."
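The first one-liner is deliberately terse. A more readable sketch of the idea described above (row length, minus the position of the first comma, minus 2 for the quotes, minus the UTF-8 continuation bytes) might look like the following. The function name and the `min` variable are my own, and the threshold of 12 is taken from the question. It assumes a byte-oriented awk (forced here with `LC_ALL=C`), well-formed UTF-8, and that everything after the first comma is a single quoted field.

```shell
# Keep rows whose quoted second column has at least `min` characters.
# Assumes: byte-oriented awk (LC_ALL=C), well-formed UTF-8 input,
# and that everything after the first comma is one quoted field.
long_second_column() {
  LC_ALL=C awk -v min=12 '{
    bytes = length($0) - index($0, ",") - 2   # bytes between the quotes
    cont  = gsub(/[\200-\277]/, "&")          # UTF-8 continuation bytes
    if (bytes - cont >= min) print            # bytes - cont = characters
  }' "$@"
}

printf '%s\n' '1,"brazil"' '2,"Institut Cochin"' | long_second_column
# prints: 2,"Institut Cochin"
```

Replacing each matched byte with `&` (itself) leaves the row unchanged, so `gsub` is used here only for its return value, the number of matches.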
Answer 4
Score: 0
Using `ruby` with `CSV.parse` on slightly modified data to show correct output when commas appear within quotes.
```ruby
% ruby -r 'csv' -ne 'if CSV.parse($_).map {|i|
if i[1].length >= 12 then true end}[0] then puts $_ end' file
2343242,"Labor, atoire de Virologie, AP-HP, Hôpital Necker-Enfants malades, Paris, France."
23432423,Institut Cochin,"colombia"
234234,"Université de Paris, Institut Cochin, INSERM U1016, CNRS UMR8104, Paris, France."
```
#### Data ####
% cat file
42242,"France."
2343242,"Labor, atoire de Virologie, AP-HP, Hôpital Necker-Enfants malades, Paris, France."
234234,"brazil"
23432423,Institut Cochin,"colombia"
234234,"Université de Paris, Institut Cochin, INSERM U1016, CNRS UMR8104, Paris, France."
234234,"Paris."
Answer 5
Score: 0
The sample input
42242,"France."
2343242,"Laboratoire de Virologie, AP-HP, Hôpital Necker-Enfants malades, Paris, France."
234234,"brazil"
23432423,"colombia"
234234,"Université de Paris, Institut Cochin, INSERM U1016, CNRS UMR8104, Paris, France."
234234,"Paris."
does have `,` *inside* quoted values, so just setting the field separator to `,` would not work. Observe that, for example, the 2nd column of the 2nd row would be `"Laboratoire de Virologie`. To counter that, you might use `FPAT` (a GNU AWK feature) as described in the [More CSV chapter of the GNU AWK manual][1]:
awk 'BEGIN{FPAT="([^,]*)|(\"([^\"]|\"\")+\")"}length($2)>12' file.csv
Keep in mind that the `"` characters are included in quoted fields, so you might need to adjust the value inside the comparison.

[1]: https://www.gnu.org/savannah-checkouts/gnu/gawk/manual/html_node/More-CSV.html
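One way to make that adjustment is to strip the surrounding quotes from a copy of the field before measuring it. This is only a sketch: it needs GNU awk (`FPAT` is a gawk extension), the function name `keep_long_field2` and the sample rows are my own, and it uses `>= 12` to match the question's "less than 12" wording rather than the `> 12` above.

```shell
# Requires GNU awk: FPAT is a gawk extension.
# Keep rows whose second CSV field, without its surrounding quotes,
# is at least 12 characters long.
keep_long_field2() {
  gawk 'BEGIN { FPAT = "([^,]*)|(\"([^\"]|\"\")+\")" }
  {
    v = $2
    gsub(/^"|"$/, "", v)      # drop the enclosing quotes from the copy
    if (length(v) >= 12) print
  }' "$@"
}

# Hypothetical rows: the first two hold exactly 12 characters in quotes.
printf '%s\n' '1,"exactly 12.."' '2,"Paris, short"' '3,"Paris."' | keep_long_field2
# prints:
# 1,"exactly 12.."
# 2,"Paris, short"
```

Because only the copy `v` is modified, the printed rows keep their original quoting.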