CSV文件未正确读取(几乎有一半的行被删除)。

huangapple go评论57阅读模式
英文:

csv file not read correctly (almost half of the rows are removed)

问题

我想使用R读取一个CSV文件。它以分号作为分隔符,并使用双引号作为引号。

然而,该文件中有一些问题单元格(例如,某些文本单元格中有一个双引号,或者文本中有一个分号),导致R无法正确读取该文件。

由于我无法上传文件,以下是使用文本编辑器打开CSV文件时的外观:

"X1";"X2";"X3"
"1";"0";"test"
"37129";"11746; weird";"weird "thing""
"27929";"1";"xyz Limited"

第1行包含标题。在第3行,您可以看到X2值在"11746"后有一个分号,X3值在"thing"前有一个额外的双引号。

因此,在使用readr::read_csv2(file = my_file.csv)读取此文件时,这个小示例看起来正常,但实际上不是(请注意,我的真实文件有13k行)。例如,我们希望在"thing"前后看到双引号。

当我改用readr::read_csv2(file = my_file.csv, quote = "")时,第2行的值因为分号而错位。

有什么办法可以解决这个问题吗?

预期输出应为:

# A tibble: 3 × 3
     X1    X2                 X3
  <dbl> <chr>              <chr>
1     1     0              "test"
2 37129 11746; weird  "\"weird thing\""
3 27929     1        "xyz Limited"
英文:

I want to read in a csv file with R. It has a semicolon as the separator and uses double quotes as quotes.

However, the file has some problematic cells (e.g. some have one double quote in text cells or an semicolon in the text) which leads to R not reading the file in correctly.

Since I can't upload a file, here's how the csv looks like when opened with a text editor:

&quot;X1&quot;;&quot;X2&quot;;&quot;X3&quot;
&quot;1&quot;;&quot;0&quot;;&quot;test&quot;
&quot;37129&quot;;&quot;11746; weird&quot;;&quot;weird &quot;thing&quot;&quot;
&quot;27929&quot;;&quot;1&quot;;&quot;xyz Limited&quot;

Row 1 contain the headlines. In row 3 you can see that the X2 value has a semicolon after "11746" and the X3 value has an additional double quote before "thing".

So when reading this file in with readr::read_csv2(file = my_file.csv) in this small example it looks ok, but it actually isn't (note that my real file has 13k rows). For example, we'd expect to see double quotes before and after "thing".

When I use readr::read_csv2(file = my_file.csv, quote = &quot;&quot;) instead, row 2's values are shifted due to the semicolon.

Any idea, how I could solve this issue?

The expected output would be:

# A tibble: 3 &#215; 3
     X1 X2           X3             
  &lt;dbl&gt; &lt;chr&gt;        &lt;chr&gt;          
1     1 0            &quot;test&quot;         
2 37129 11746; weird &quot;\&quot;weird thing\&quot;&quot;
3 27929 1            &quot;xyz Limited&quot;

答案1

得分: 2

这不符合标准的分号分隔值(CSV)格式。应该将&quot;thing&quot;引号重复一次以进行转义,就像&quot;37129&quot;;&quot;11746; weird&quot;;&quot;weird &quot;&quot;thing&quot;&quot;&quot;一样。创建该文件的任何工具都没有遵守标准。

我们可以尝试使用正则表达式来解析内容,将嵌套的&quot;引号更改为双重的&quot;&quot;(而不更改封闭的&quot;引号),可能是这样的:

## 在 shell 中,而不是在 R 中
$ sed -E &#39;s/([^;])&quot;([^;])/&quot;&quot;/g&#39; quux.csv &gt; newquux.csv
## 在 R 中
read.csv2(&quot;newquux.csv&quot;)
#      X1           X2            X3
# 1     1            0          test
# 2 37129 11746; weird weird &quot;thing&quot;
# 3 27929            1   xyz Limited

有人可能会诱惑在 R 中完成这个操作,这通常是可以的。但是,如果 CSV 文件非常大,那么这将导致 R 的内存消耗不必要地增加...只有在文件很大的情况下才会出现问题。

read.csv2(
  text = gsub(&#39;(?&lt;=[^;])&quot;(?=[^;])&#39;, &#39;&quot;&quot;&#39;,
              readLines(&quot;quux.csv&quot;), perl = TRUE))
#      X1           X2            X3
# 1     1            0          test
# 2 37129 11746; weird weird &quot;thing&quot;
# 3 27929            1   xyz Limited

上述两种方法在使用 readr::read_csv2 时同样有效:

readr::read_csv2(&quot;newquux.csv&quot;)
readr::read_csv2(
  I(gsub(&#39;(?&lt;=[^;])&quot;(?=[^;])&#39;, &#39;&quot;&quot;&#39;,
         readLines(&quot;quux.csv&quot;), perl = TRUE)))

(这里需要I(..),否则它会尝试查找名为&quot;X1&quot;;&quot;X2&quot;;&quot;X3&quot;的文件。)

英文:

That is not standards-compliant semi-colon-CSV. The &quot;thing&quot; quotes should be doubled to be escaped, as in &quot;37129&quot;;&quot;11746; weird&quot;;&quot;weird &quot;&quot;thing&quot;&quot;&quot;. Whatever is creating that file is not adhering to the standard.

We can try to use regex to parse the contents, changing the embedded &quot; quotes to double &quot;&quot; (and not changing the enclosing &quot;s), perhaps this:

## on the shell, not in R
$ sed -E &#39;s/([^;])&quot;([^;])/&quot;&quot;/g&#39; quux.csv &gt; newquux.csv
## in R
read.csv2(&quot;newquux.csv&quot;)
#      X1           X2            X3
# 1     1            0          test
# 2 37129 11746; weird weird &quot;thing&quot;
# 3 27929            1   xyz Limited

One might be tempted to do this in R instead, which is generally fine. However, if the csv file is huge, then this will cause an unnecessary growth in the R's memory consumption ... again, only an issue if your file is huge.

read.csv2(
  text = gsub(&#39;(?&lt;=[^;])&quot;(?=[^;])&#39;, &#39;&quot;&quot;&#39;,
              readLines(&quot;quux.csv&quot;), perl = TRUE))
#      X1           X2            X3
# 1     1            0          test
# 2 37129 11746; weird weird &quot;thing&quot;
# 3 27929            1   xyz Limited

Both of the above work just as well with readr::read_csv2:

readr::read_csv2(&quot;newquux.csv&quot;)
readr::read_csv2(
  I(gsub(&#39;(?&lt;=[^;])&quot;(?=[^;])&#39;, &#39;&quot;&quot;&#39;,
         readLines(&quot;quux.csv&quot;), perl = TRUE)))

(The I(..) is necessary here since otherwise it opts to look for a file named &quot;X1&quot;;&quot;X2&quot;;&quot;X3&quot;.)

答案2

得分: 1

我认为你需要指定&quot;作为引号字符,我用以下方式得到了我认为你想要的结果:

read.table("test.csv", sep=";", quote = "\"", header= FALSE)

然而,由于它们是引号字符,引号已被移除,你是否想要保留它们?

输出:

     V1           V2          V3
1    X1           X2          X3
2     1            0        test
3 37129 11746; weird weird thing
4 27929            1 xyz Limited
英文:

I think what you need is to speficy &quot; as the quote character, I got what I think you want with this:

read.table(&quot;test.csv&quot;, sep=&quot;;&quot;, quote = &quot;\&quot;&quot;, header= FALSE)

However, the quotes are removed as they are quoting characters, do you want to keep them as well?

The output:

     V1           V2          V3
1    X1           X2          X3
2     1            0        test
3 37129 11746; weird weird thing
4 27929            1 xyz Limited

答案3

得分: 1

这是一个非标准的文件格式,但我们可以使用readLines读取它,然后用一个不在文件中的其他字符替换定界引号 - 在这里,我们使用单引号作为该字符。

"my_file.csv" |>
  readLines() |>
  gsub('^"|"$', "'", x = _) |>
  gsub('" ";" "', "';'", x = _) |>
  read.csv2(text = _, quote = "'")
##      X1           X2            X3
## 1     1            0          test
## 2 37129 11746; weird weird 'thing'
## 3 27929            1   xyz Limited

我们可以使用此代码来检查文件中包含哪些字符。未列出的任何字符都可以用作新的引号字符。

"my_file.csv" |>
  readLines() |>
  strsplit("") |>
  unlist() |>
  table() |>
  names() |>
  cat("\n")
## " ; 0 1 2 3 4 6 7 9 d e g h i L m n r s t w x X y z

或者,我们可以使用此代码来检查特定字符(在此处为单引号)是否出现在文件中。如果返回1,就像在问题的示例输入中一样,那么该字符不会出现,否则会出现。

max(count.fields("my_file.csv", sep = "'", quote = ""))
## [1] 1
英文:

This is a non-standard file format but we can read it in using readLines and then replace the delimiting quotes with some other character not in the file -- here we use a single quote as that character.

&quot;my_file.csv&quot; |&gt;
  readLines() |&gt;
  gsub(&#39;^&quot;|&quot;$&#39;, &quot;&#39;&quot;, x = _) |&gt;  # replace &quot; with &#39; at start &amp; end of each line
  gsub(&#39;&quot;;&quot;&#39;, &quot;&#39;;&#39;&quot;, x = _) |&gt;  # replace &quot;;&quot; with &#39;;&#39;
  read.csv2(text = _, quote = &quot;&#39;&quot;)
##      X1           X2            X3
## 1     1            0          test
## 2 37129 11746; weird weird &quot;thing&quot;
## 3 27929            1   xyz Limited

We can use this code to check which characters are in the file. Any character not listed can be used as the new quote character.

&quot;my_file.csv&quot; |&gt;
  readLines() |&gt;
  strsplit(&quot;&quot;) |&gt;
  unlist() |&gt;
  table() |&gt;
  names() |&gt;
  cat(&quot;\n&quot;)
## &quot; ; 0 1 2 3 4 6 7 9 d e g h i L m n r s t w x X y z 

or we can use this code to check whether a particular character, here single quote, appears in the file. If 1 is returned as it does with the example input in the question then that character does not appear and otherwise it does.

max(count.fields(&quot;my_file.csv&quot;, sep = &quot;&#39;&quot;, quote = &quot;&quot;))
## [1] 1

huangapple
  • 本文由 发表于 2023年7月3日 23:08:30
  • 转载请务必保留本文链接:https://go.coder-hub.com/76606012.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定