Importing a .txt file into R.


Question

I am attempting to import a .txt file into R as a data frame. I have tried using read.table as well as read_table and read_delim from the readr package but to no avail. I have tried skipping lines, changing the encoding, and everything I could think of to import the data, but I always get an error message. Does anyone know how to import this file?


Answer 1

Score: 3

Unless GDrive messed something up, this is a bit more interesting, as the file appears to be encoded as `UTF-16LE`, and both `read.table()` and `readr` could use some help with that. The file structure is also a bit funky: a double header, trailing `\t` characters and "blank" rows of varying lengths.

First let's see if/how parsers would struggle:

``` r
read.table("Bar 64.txt", skip = 4)
#> Warning in readLines(file, skip): line 1 appears to contain an embedded nul
# ...
#> Warning in read.table("Bar 64.txt", skip = 4): line 1 appears to contain
#> embedded nulls
# ...
#> Error in read.table("Bar 64.txt", skip = 4): empty beginning of file

readr::read_lines("Bar 64.txt")
#> Error: The size of the connection buffer (131072) was not large enough
#> to fit a complete line:
#>   * Increase it by setting `Sys.setenv("VROOM_CONNECTION_SIZE")`

# at least some progress:
readr::read_file("Bar 64.txt")
#> [1] "t"
```
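
The "embedded nul" warnings and the lone `"t"` make more sense once you look at the bytes: in UTF-16LE every ASCII character is stored as its own byte followed by `0x00`. Here is a minimal base R sketch using the first ten bytes of the sample posted further down in this answer. The variable names are made up for illustration.

``` r
# First 10 bytes of the posted sample: "t", tab, "L", tab, "d" in UTF-16LE
bytes <- as.raw(c(0x74, 0x00, 0x09, 0x00, 0x4c, 0x00, 0x09, 0x00, 0x64, 0x00))

# Every second byte is a nul; these are the "embedded nuls" from the warnings
high_bytes <- bytes[c(FALSE, TRUE)]
all_nul <- all(high_bytes == as.raw(0))
print(all_nul)

# Decoding with the right encoding recovers the actual text
decoded <- iconv(list(bytes), from = "UTF-16LE", to = "UTF-8")
print(decoded)
```

A reader that assumes a single-byte encoding sees a nul right after the first character, which would be consistent with `read_file()` returning only `"t"`.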

Now let's try to fix this:

``` r
library(readr)
library(stringr)

guess_encoding("Bar 64.txt")
#> # A tibble: 3 × 2
#>   encoding   confidence
#>   <chr>           <dbl>
#> 1 UTF-16LE         1
#> 2 ISO-8859-1       0.29
#> 3 ISO-8859-2       0.21

# set encoding through locale:
read_file("Bar 64.txt", locale = locale(encoding = "UTF-16LE")) |>
  str_trunc(100) |>
  str_view()
#> [1] │ t{\t}L{\t}dL{\t}EpsL{\t}Force{\t}
#>     │ s{\t}mm{\t}mm{\t}%{\t}kN{\t}
#>     │ {\t\t\t\t\t}
#>     │ {\t\t}
#>     │ 0.037{\t}24.0060{\t}0.0000{\t}0.000{\t}0.155{\t}
#>     │ 0.103{\t}24.0060{\t}0.0000{\t}...

# progress, but there are still those annoying trailing \t characters that would
# create an empty column, and the 2nd header row turns all columns into <chr>
# would expect read_tsv(..., col_select = 1:5, comment = "s") to fix both issues, though...
read_tsv("Bar 64.txt", locale = locale(encoding = "UTF-16LE"),
         comment = "s", col_select = 1:5)
#> Error in x:y: argument of length 0

# but we can clean up those lines ourselves too:
read_lines("Bar 64.txt", locale = locale(encoding = "UTF-16LE")) |>
  str_trim() |>
  str_subset("^s|^$", negate = TRUE) |>
  I() |>
  read_tsv()
#> Rows: 1803 Columns: 5
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: "\t"
#> dbl (5): t, L, dL, EpsL, Force
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 1,803 × 5
#>        t     L      dL   EpsL Force
#>    <dbl> <dbl>   <dbl>  <dbl> <dbl>
#>  1 0.037  24.0  0       0     0.155
#>  2 0.103  24.0  0       0     0.142
#>  3 0.637  24.0  0       0     0.185
#>  4 1.09   24.0  0.0001  0.001 0.171
#>  5 1.63   24.0  0       0     0.177
#>  6 2.08   24.0 -0.0002 -0.001 0.153
#>  7 2.62   24.0  0.0004  0.001 0.252
#>  8 3.07   24.0  0.0008  0.003 0.493
#>  9 3.60   24.0  0.0022  0.009 0.773
#> 10 4.06   24.0  0.0032  0.014 1.05
#> # ℹ 1,793 more rows
```
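
If you'd rather avoid the extra packages, the same clean-up can probably be done in base R. The sketch below is not taken from the thread. It builds a tiny UTF-16LE file that mimics the layout shown above (the first two data rows, a temp file instead of "Bar 64.txt") and assumes the real file behaves the same way.

``` r
# Tiny stand-in for the real file: double header, trailing tabs,
# two whitespace-only rows, then data (values copied from the output above)
txt <- paste0(
  "t\tL\tdL\tEpsL\tForce\t\n",
  "s\tmm\tmm\t%\tkN\t\n",
  "\t\t\t\t\t\n",
  "\t\t\n",
  "0.037\t24.0060\t0.0000\t0.000\t0.155\t\n",
  "0.103\t24.0060\t0.0000\t0.000\t0.142\t\n"
)
path <- tempfile(fileext = ".txt")
writeBin(iconv(txt, from = "UTF-8", to = "UTF-16LE", toRaw = TRUE)[[1]], path)

# Read the lines through a connection that knows the encoding
con <- file(path, encoding = "UTF-16LE")
lines <- readLines(con)
close(con)

# Drop trailing tabs, then remove the units row ("s ...") and the empty rows
lines <- trimws(lines)
lines <- lines[!grepl("^s|^$", lines)]

df <- read.table(text = lines, header = TRUE, sep = "\t")
str(df)
```

As in the readr version, the `"^s"` pattern only works because no header or data line other than the units row starts with `s`.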

A sample from the file for reprex:

``` r
# raw_file_sample <- read_file_raw("Bar 64.txt")[1:220]
raw_file_sample <- as.raw(c(0x74, 0x00, 0x09, 0x00, 0x4c, 0x00, 0x09, 0x00, 0x64,
  0x00, 0x4c, 0x00, 0x09, 0x00, 0x45, 0x00, 0x70, 0x00, 0x73, 0x00,
  0x4c, 0x00, 0x09, 0x00, 0x46, 0x00, 0x6f, 0x00, 0x72, 0x00, 0x63,
  0x00, 0x65, 0x00, 0x09, 0x00, 0x0a, 0x00, 0x73, 0x00, 0x09, 0x00,
  0x6d, 0x00, 0x6d, 0x00, 0x09, 0x00, 0x6d, 0x00, 0x6d, 0x00, 0x09,
  0x00, 0x25, 0x00, 0x09, 0x00, 0x6b, 0x00, 0x4e, 0x00, 0x09, 0x00,
  0x0a, 0x00, 0x09, 0x00, 0x09, 0x00, 0x09, 0x00, 0x09, 0x00, 0x09,
  0x00, 0x0a, 0x00, 0x09, 0x00, 0x09, 0x00, 0x0a, 0x00, 0x30, 0x00,
  0x2e, 0x00, 0x30, 0x00, 0x33, 0x00, 0x37, 0x00, 0x09, 0x00, 0x32,
  0x00, 0x34, 0x00, 0x2e, 0x00, 0x30, 0x00, 0x30, 0x00, 0x36, 0x00,
  0x30, 0x00, 0x09, 0x00, 0x30, 0x00, 0x2e, 0x00, 0x30, 0x00, 0x30,
  0x00, 0x30, 0x00, 0x30, 0x00, 0x09, 0x00, 0x30, 0x00, 0x2e, 0x00,
  0x30, 0x00, 0x30, 0x00, 0x30, 0x00, 0x09, 0x00, 0x30, 0x00, 0x2e,
  0x00, 0x31, 0x00, 0x35, 0x00, 0x35, 0x00, 0x09, 0x00, 0x0a, 0x00,
  0x30, 0x00, 0x2e, 0x00, 0x31, 0x00, 0x30, 0x00, 0x33, 0x00, 0x09,
  0x00, 0x32, 0x00, 0x34, 0x00, 0x2e, 0x00, 0x30, 0x00, 0x30, 0x00,
  0x36, 0x00, 0x30, 0x00, 0x09, 0x00, 0x30, 0x00, 0x2e, 0x00, 0x30,
  0x00, 0x30, 0x00, 0x30, 0x00, 0x30, 0x00, 0x09, 0x00, 0x30, 0x00,
  0x2e, 0x00, 0x30, 0x00, 0x30, 0x00, 0x30, 0x00, 0x09, 0x00, 0x30,
  0x00, 0x2e, 0x00, 0x31, 0x00, 0x34, 0x00, 0x32, 0x00, 0x09, 0x00,
  0x0a, 0x00))
write_file(raw_file_sample, "Bar 64.txt")
```

<sup>Created on 2023-06-13 with reprex v2.0.2</sup>

Answer 2

Score: 0


We can read the file twice, once to get the column names, then with skip to get the data rows, and set the names manually:

``` r
x <- "t L dL EpsL Force
s mm mm % kN


0.037 24.0060 0.0000 0.000 0.155
0.103 24.0060 0.0000 0.000 0.142
0.637 24.0060 0.0000 0.000 0.185
"
# the two empty lines stand in for the whitespace-only rows of the real file,
# which is why 4 lines are skipped below

# replace text = x with file = "myFileName.txt"
read.table(text = x, skip = 4,
           col.names = names(read.table(text = x, nrows = 1, header = TRUE)))
#       t      L dL EpsL Force
# 1 0.037 24.006  0    0 0.155
# 2 0.103 24.006  0    0 0.142
# 3 0.637 24.006  0    0 0.185
```
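
Since the first answer found the file to be `UTF-16LE`, the read-twice approach would presumably also need `read.table()`'s `fileEncoding` argument when pointed at the real file. Here is a hedged sketch on a small stand-in file; the temp file and the two data rows are for illustration only, and it has not been run against the poster's "Bar 64.txt".

``` r
# Stand-in UTF-16LE file with the same layout as described in the first answer
txt <- paste0(
  "t\tL\tdL\tEpsL\tForce\t\n",
  "s\tmm\tmm\t%\tkN\t\n",
  "\t\t\t\t\t\n",
  "\t\t\n",
  "0.037\t24.0060\t0.0000\t0.000\t0.155\t\n",
  "0.103\t24.0060\t0.0000\t0.000\t0.142\t\n"
)
path <- tempfile(fileext = ".txt")
writeBin(iconv(txt, from = "UTF-8", to = "UTF-16LE", toRaw = TRUE)[[1]], path)

# First pass: header only, to pick up the column names
col_names <- names(read.table(path, nrows = 1, header = TRUE,
                              fileEncoding = "UTF-16LE"))

# Second pass: skip both header rows and the two whitespace-only rows
dat <- read.table(path, skip = 4, col.names = col_names,
                  fileEncoding = "UTF-16LE")
print(dat)
```

With the default `sep = ""`, runs of whitespace act as the delimiter, so the trailing tabs should not create an extra empty column here.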

huangapple
  • Published on June 13, 2023 at 14:42:44
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76462674.html