Parsing a large CSV file with unusual characters, spacing, brackets, and irregular returns in bash
# Question
I have a very large (1.5 GB) malformed CSV file that I need to read into R. While the file itself is a CSV, the records break across a varying number of physical lines because of poorly placed line returns.
I have a reduced example [attached][1], but a [truncated visual representation][2] of that looks like this:
```none
SubID,Date1,date2,var1,var2,broken_column_var
000000000,0000-00-00,0000-00-00,0,FIRST TEXT FOR ZERO,"[[ -0.00000000 0.00000000 -0.00000000 -0.00000000 0.00000000
-0.00000000 -0.00000000 0.00000000 0.00000000 0.00000000
0.00000000 0.00000000 0.00000000]
[ -0.00000000 -0.0000000 -0.00000000 -0.00000000 -0.0000000
-0.0000000 -0.0000000 0.00000000 0.00000000 -0.00000000
-0.00000000 0.00000000 0.0000000 ]]"
000000000,1111-11-11,1111-11-11,1,SECOND TEXT FOR ZERO,"[[ 1.11111111 -1.11111111 -1.1111111 -1.1111111 1.1111111
1.11111111 1.11111111 1.11111111]]"
000000000,2222-22-22,2222-22-22,2,THIRD TEXT FOR ZERO,"[[-2.2222222 2.22222222 -2.22222222 -2.22222222 2.2222222 -2.22222222
-2.22222222 -2.22222222 -2.22222222 2.22222222 2.22222222 2.22222222]
[-2.22222222 -2.22222222 2.22222222 2.2222222 2.22222222 -2.22222222
2.2222222 -2.2222222 2.22222222 2.2222222 2.222222 -2.22222222]
[-2.22222222 -2.2222222 2.22222222 2.2222222 2.22222222 -2.22222222
-2.22222222 -2.2222222 -2.22222222 2.22222222 2.2222222 2.22222222]
[-2.22222222 -2.22222222 2.2222222 2.2222222 2.2222222 -2.22222222
-2.222222 -2.2222222 -2.2222222 -2.22222222 2.22222222 2.2222222 ]
[-2.22222222 -2.222222 2.22222222 2.22222222 2.22222222 -2.2222222
-2.2222222 -2.2222222 -2.2222222 -2.22222222 2.22222222 -2.222222 ]
[ 2.22222222 -2.22222222 -2.222222 -2.222222 -2.2222222 -2.22222222
-2.222222 -2.22222222 2.2222222 -2.2222222 2.2222222 2.22222222]]"
111111111,0000-00-00,0000-00-00,00,FIRST TEXT FOR ONE,"[[ -0.00000000 0.00000000 -0.00000000 0.000000 -0.00000000
-0.00000000 0.00000000 0.00000000]]"
```
The line breaks all appear as `\n`s in the CSV.

To avoid loading it all into memory and attempting to parse it as a data frame in another environment, I have been trying to print relevant snippets from the CSV to the terminal with the line returns removed, runs of spaces collapsed, and commas inserted between values.
Like the following:
```none
000000000,0000-00-00,0000-00-00,0,FIRST TEXT FOR ZERO,"[[-0.00000000,0.00000000,-0.00000000,-0.00000000,0.00000000,-0.00000000,-0.00000000,0.00000000,0.00000000,0.00000000,0.00000000,0.00000000,0.00000000],[-0.00000000,-0.0000000,-0.00000000,-0.00000000,-0.0000000,-0.0000000,-0.0000000,0.00000000,0.00000000,-0.00000000,-0.00000000,0.00000000,0.0000000]]"
```
My main attempt pulls everything from a line between the opening `"[[` and the closing `]]"` with:

```bash
awk '/\"\[\[/{found=1} found{print; if (/]]"/) exit}' Malformed_csv_Abridged.csv | tr -d '\n\r' | tr -s ' ' | tr ' ' ','
```
outputting:
```none
000000000,0000-00-00,0000-00-00,0,FIRST,TEXT,FOR,ZERO,"[[,-0.00000000,0.00000000,-0.00000000,-0.00000000,0.00000000,-0.00000000,-0.00000000,0.00000000,0.00000000,0.00000000,0.00000000,0.00000000,0.00000000],[,-0.00000000,-0.0000000,-0.00000000,-0.00000000,-0.0000000,-0.0000000,-0.0000000,0.00000000,0.00000000,-0.00000000,-0.00000000,0.00000000,0.0000000,]]"
```
This gets close, but:

1. It _only_ prints the first instance, so I need a way to find the other instances (see the sketch after this list).
2. It inserts commas into the blank spaces before the delimiters I'm searching for (`"[[` ... `]]"`), which I don't need it to do.
3. It leaves some extra commas next to the brackets that I haven't found the right `tr` call to remove, because of the escape characters required.
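For point 1, one possible direction (a sketch building on the attempt above, assuming the abridged file from the question; not a definitive fix): gather each record until its closing `]]"`, reset instead of exiting, and only turn spaces into commas inside the quoted field:

```awk
awk '
NR==1 { print; next }                       # pass the header through untouched
{ rec = (rec == "" ? $0 : rec " " $0) }     # accumulate the physical lines of one record
/]]"$/ {                                    # record complete: line ends with ]]"
    q    = index(rec, "\"")                 # locate the opening double quote
    head = substr(rec, 1, q - 1)            # part before the quote keeps its spaces
    data = substr(rec, q)                   # the quoted matrix field
    gsub(/ +/,  ",", data)                  # space runs -> single commas
    gsub(/\[,/, "[", data)                  # drop commas introduced after [
    gsub(/,\]/, "]", data)                  # drop commas introduced before ]
    print head data
    rec = ""                                # reset the accumulator (fixes point 1)
}' Malformed_csv_Abridged.csv
```

Because the substitutions never touch the part before the opening quote, this would also sidestep points 2 and 3.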
[1]: https://www.dropbox.com/s/uz5rxc5lbp3jwk8/Malformed.csv?dl=0
[2]: https://www.dropbox.com/s/ov0mluis9lwbqws/Malformed_csv_Abridged.csv?dl=0
# Answer 1

**Score**: 3
I don't understand your goal. The CSV file seems to me to be a correct CSV file: the line breaks sit inside quoted fields, which standard CSV allows.
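One quick way to see this (a sketch, reusing Miller and the file name from the question): if Miller can convert a record to JSON, the embedded newlines are already valid CSV quoting.

```bash
# print the first record as JSON; a parse error here would mean the file is truly malformed
mlr --icsv --ojson head -n 1 Malformed.csv
```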
If you just want to remove the line breaks, you can use [Miller][1] and the [clean-whitespace verb][2]:

```bash
mlr --csv clean-whitespace Malformed.csv >Malformed_c.csv
```
to get this https://gist.githubusercontent.com/aborruso/538e964c0c84a8b27d4c3d3b61d23bb4/raw/1fa83f43238be4a6aeb9c743aaf2e4da36f6cc74/Malformed_c.csv
[![enter image description here][3]][3]
[1]: https://miller.readthedocs.io/en/6.7.0/installing-miller/
[2]: https://miller.readthedocs.io/en/6.7.0/reference-verbs/index.html#clean-whitespace
[3]: https://i.stack.imgur.com/zDnM8.png
# Answer 2

**Score**: 1
Assumptions:
- the only field that contains double quotes is the last field (`broken_column_var`)
- within the last field we do not have to worry about embedded/escaped double quotes (ie, for each data line the last field has exactly two double quotes)
- all `broken_column_var` values contain at least one embedded linefeed (ie, each `broken_column_var` value spans at least 2 physical lines); otherwise we need to add some code to address both double quotes residing on the same line ... doable, but will skip for now so as to not (further) complicate the proposed code (see the sketch after the sample output below)
One (verbose) `awk` approach to removing the embedded linefeeds from `broken_column_var` while also replacing spaces with commas:
```awk
awk '
NR==1              { print; next }           # print header
!in_merge && /["]/ { split($0,a,"\"")        # 1st double quote found; split line on double quote
                     head = a[1]             # save 1st part of line
                     data = "\"" a[2]        # save double quote and 2nd part of line
                     in_merge = 1            # set flag
                     next
                   }
in_merge           { data = data " " $0     # append current line to "data"
                     if ( $0 ~ /["]/ ) {    # if 2nd double quote found => process "data"
                        gsub(/[ ]+/,",",data)   # replace consecutive spaces with single comma
                        gsub(/,[]]/,"]",data)   # replace ",]" with "]"
                        gsub(/[[],/,"[",data)   # replace "[," with "["
                        print head data         # print new line
                        in_merge = 0            # clear flag
                     }
                   }
' Malformed.csv
```
This generates:
```none
SubID,Date1,date2,var1,var2,broken_column_var
000000000,0000-00-00,0000-00-00,0,FIRST TEXT FOR ZERO,"[[-0.00000000,0.00000000,-0.00000000,-0.00000000,0.00000000,-0.00000000,-0.00000000,0.00000000,0.00000000,0.00000000,0.00000000,0.00000000,0.00000000],[-0.00000000,-0.0000000,-0.00000000,-0.00000000,-0.0000000,-0.0000000,-0.0000000,0.00000000,0.00000000,-0.00000000,-0.00000000,0.00000000,0.0000000]]"
000000000,1111-11-11,1111-11-11,1,SECOND TEXT FOR ZERO,"[[1.11111111,-1.11111111,-1.1111111,-1.1111111,1.1111111,1.11111111,1.11111111,1.11111111]]"
000000000,2222-22-22,2222-22-22,2,THIRD TEXT FOR ZERO,"[[-2.2222222,2.22222222,-2.22222222,-2.22222222,2.2222222,-2.22222222,-2.22222222,-2.22222222,-2.22222222,2.22222222,2.22222222,2.22222222],[-2.22222222,-2.22222222,2.22222222,2.2222222,2.22222222,-2.22222222,2.2222222,-2.2222222,2.22222222,2.2222222,2.222222,-2.22222222],[-2.22222222,-2.2222222,2.22222222,2.2222222,2.22222222,-2.22222222,-2.22222222,-2.2222222,-2.22222222,2.22222222,2.2222222,2.22222222],[-2.22222222,-2.22222222,2.2222222,2.2222222,2.2222222,-2.22222222,-2.222222,-2.2222222,-2.2222222,-2.22222222,2.22222222,2.2222222],[-2.22222222,-2.222222,2.22222222,2.22222222,2.22222222,-2.2222222,-2.2222222,-2.2222222,-2.2222222,-2.22222222,2.22222222,-2.222222],[2.22222222,-2.22222222,-2.222222,-2.222222,-2.2222222,-2.22222222,-2.222222,-2.22222222,2.2222222,-2.2222222,2.2222222,2.22222222]]"
111111111,0000-00-00,0000-00-00,00,FIRST TEXT FOR ONE,"[[-0.00000000,0.00000000,-0.00000000,0.000000,-0.00000000,-0.00000000,0.00000000,0.00000000]]"
```
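If the third assumption ever fails (both double quotes landing on the same physical line), one possible extra rule, a sketch not tested against such data, could be placed before the existing `!in_merge` block:

```awk
!in_merge && gsub(/["]/,"&") == 2 {          # both quotes on one line => self-contained record
                     split($0,a,"\"")        # a[1] = head, a[2] = quoted data (quotes stripped)
                     data = "\"" a[2] "\""   # re-wrap the data in double quotes
                     gsub(/[ ]+/,",",data)   # same clean-up as the multi-line branch
                     gsub(/,[]]/,"]",data)
                     gsub(/[[],/,"[",data)
                     print a[1] data
                     next
                   }
```

The `gsub(/["]/,"&")` call replaces each double quote with itself, leaving the line unchanged while returning the quote count.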
# Answer 3

**Score**: 0
Use the double quote as the field separator: a complete record has 1 or 3 fields, while an even field count means an unbalanced quote, i.e. a record that continues on the next line.
```awk
awk '
BEGIN {FS = OFS = "\""}
{$0 = prev $0; $1=$1}
NF % 2 == 1 {print; prev = ""; next}
{prev = $0}
END {if (prev) print prev}
' file.csv
```

This outputs:

```none
SubID,Date1,date2,var1,var2,broken_column_var
000000000,0000-00-00,0000-00-00,0,FIRST TEXT FOR ZERO,"[[ -0.00000000 0.00000000 -0.00000000 -0.00000000 0.00000000 -0.00000000 -0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000] [ -0.00000000 -0.0000000 -0.00000000 -0.00000000 -0.0000000 -0.0000000 -0.0000000 0.00000000 0.00000000 -0.00000000 -0.00000000 0.00000000 0.0000000 ]]"
000000000,1111-11-11,1111-11-11,1,SECOND TEXT FOR ZERO,"[[ 1.11111111 -1.11111111 -1.1111111 -1.1111111 1.1111111 1.11111111 1.11111111 1.11111111]]"
000000000,2222-22-22,2222-22-22,2,THIRD TEXT FOR ZERO,"[[-2.2222222 2.22222222 -2.22222222 -2.22222222 2.2222222 -2.22222222 -2.22222222 -2.22222222 -2.22222222 2.22222222 2.22222222 2.22222222] [-2.22222222 -2.22222222 2.22222222 2.2222222 2.22222222 -2.22222222 2.2222222 -2.2222222 2.22222222 2.2222222 2.222222 -2.22222222] [-2.22222222 -2.2222222 2.22222222 2.2222222 2.22222222 -2.22222222 -2.22222222 -2.2222222 -2.22222222 2.22222222 2.2222222 2.22222222] [-2.22222222 -2.22222222 2.2222222 2.2222222 2.2222222 -2.22222222 -2.222222 -2.2222222 -2.2222222 -2.22222222 2.22222222 2.2222222 ] [-2.22222222 -2.222222 2.22222222 2.22222222 2.22222222 -2.2222222 -2.2222222 -2.2222222 -2.2222222 -2.22222222 2.22222222 -2.222222 ] [ 2.22222222 -2.22222222 -2.222222 -2.222222 -2.2222222 -2.22222222 -2.222222 -2.22222222 2.2222222 -2.2222222 2.2222222 2.22222222]]"
111111111,0000-00-00,0000-00-00,00,FIRST TEXT FOR ONE,"[[ -0.00000000 0.00000000 -0.00000000 0.000000 -0.00000000 -0.00000000 0.00000000 0.00000000]]"
```
For a language with a CSV library, I've found perl's Text::CSV useful for quoted newlines:
```perl
perl -e '
use Text::CSV;
my $csv = Text::CSV->new ({ binary => 1, auto_diag => 1 });
open my $fh, "<:encoding(utf8)", "file.csv" or die "file.csv: $!";
while (my $row = $csv->getline ($fh)) {
    $row->[-1] =~ s/\n//g;
    $csv->say(STDOUT, $row);
}
'
```
# Answer 4

**Score**: 0
This might work for you (GNU sed):
```sed
sed -E '1b
:a;N;/"$/!ba
s/"/\n&/
h
s/\n/ /2g
s/.*\n//
s/ +/,/g
s/,\]/]/g
s/\[,/[/g
H
g
s/\n.*\n//' file
```
- `1b`: forget the header line.
- `:a;N;/"$/!ba`: gather up each record.
- `s/"/\n&/`: introduce a newline before the last field.
- `h`: make a copy of the ameliorated record.
- `s/\n/ /2g`: replace all newlines from the second onward with spaces.
- `s/.*\n//`: remove everything up to and including the first introduced newline.
- `s/ +/,/g`: replace runs of spaces with single commas.
- `s/,\]/]/g` and `s/\[,/[/g`: remove the commas introduced before `]` or after `[`.
- `H`: append the last field to the copy.
- `g`: make the copy current.
- `s/\n.*\n//`: remove everything between (and including) the introduced newlines.

N.B. This expects only the last field of each record to be double quoted.
---
Alternative:
```bash
sed -E '1b;:a;N;/"$/!ba;y/\n/ /;:b;s/("\S+) +/\1,/;tb;s/\[,/[/g;s/,\]/]/g' file
```
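Unpacked for readability (a sketch of the same alternative, one command per line; save it as e.g. `collapse.sed`, a hypothetical file name, and run `sed -E -f collapse.sed file`):

```sed
# leave the header line alone
1b
# append input lines until the record ends with a double quote
:a;N;/"$/!ba
# turn the embedded newlines into spaces
y/\n/ /
# working rightwards from the opening quote, squeeze each run of
# spaces into a single comma, looping until no substitution is made
:b;s/("\S+) +/\1,/;tb
# drop the commas the loop introduced after [ and before ]
s/\[,/[/g
s/,\]/]/g
```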
# Answer 5

**Score**: 0
You can use [GoCSV's replace command](https://github.com/aotimme/gocsv/#replace) to easily strip out newlines:

```bash
gocsv replace \
  -c broken_column_var \
  -regex '\s+' \
  -repl ' ' \
  input.csv
```
That normalizes all contiguous whitespace (`\s+`) to a single space.
A very small Python script can also handle this:
```python
import csv
import re

ws_re = re.compile(r"\s+")

f_in = open("input.csv", newline="")
reader = csv.reader(f_in)

f_out = open("output.csv", "w", newline="")
writer = csv.writer(f_out)

writer.writerow(next(reader))  # transfer header

for row in reader:
    row[5] = ws_re.sub(" ", row[5])
    writer.writerow(row)
```
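If the goal is the comma-separated form shown in the question rather than just collapsed whitespace, a small variation of the same script could do it (a sketch; the file names and column index 5 are assumptions carried over from the script above):

```python
import csv
import re

# hypothetical input/output names, as in the script above
with open("input.csv", newline="") as f_in, open("output.csv", "w", newline="") as f_out:
    reader = csv.reader(f_in)
    writer = csv.writer(f_out)
    writer.writerow(next(reader))  # transfer header
    for row in reader:
        # whitespace runs -> commas, then drop the commas that land next to a bracket
        cell = re.sub(r"\s+", ",", row[5].strip())
        row[5] = cell.replace("[,", "[").replace(",]", "]")
        writer.writerow(row)
```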