Importing a .txt file into R
Question
I am attempting to import a .txt file into R as a data frame. I have tried using read.table as well as read_table and read_delim from the readr package but to no avail. I have tried skipping lines, changing the encoding, and everything I could think of to import the data, but I always get an error message. Does anyone know how to import this file?
Answer 1
Score: 3
Unless GDrive messes something up, this is a bit more interesting, as the file appears to be encoded as `UTF-16LE`, and both `read.table()` and `readr` need some help with that. The file structure is also a bit funky: a double header, trailing `\t` characters, and "blank" rows of varying lengths.
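As a quick check on the encoding claim, here is a minimal sketch that inspects the first few bytes of the reprex sample given at the end of this answer. In UTF-16LE every ASCII character is stored as two bytes, the second being `0x00`, which would explain the "embedded nul" warnings and, presumably, why `read_file()` returns only `"t"` without an encoding hint. The variable names are illustrative only.

``` r
# First 12 bytes of the sample from this answer; they spell "t\tL\tdL"
bytes <- as.raw(c(0x74, 0x00, 0x09, 0x00, 0x4c, 0x00,
                  0x09, 0x00, 0x64, 0x00, 0x4c, 0x00))

# In UTF-16LE each ASCII character is followed by a 0x00 byte
high_bytes <- bytes[seq(2, length(bytes), by = 2)]
all(high_bytes == as.raw(0x00))

# Decoding with the right encoding recovers the text
decoded <- iconv(list(bytes), from = "UTF-16LE", to = "UTF-8")
decoded
```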
First let's see if/how parsers would struggle:
``` r
read.table("Bar 64.txt", skip = 4)
#> Warning in readLines(file, skip): line 1 appears to contain an embedded nul
# ...
#> Warning in read.table("Bar 64.txt", skip = 4): line 1 appears to contain
#> embedded nulls
# ...
#> Error in read.table("Bar 64.txt", skip = 4): empty beginning of file
readr::read_lines("Bar 64.txt")
#> Error: The size of the connection buffer (131072) was not large enough
#> to fit a complete line:
#> * Increase it by setting `Sys.setenv("VROOM_CONNECTION_SIZE")`
# at least some progress:
readr::read_file("Bar 64.txt")
#> [1] "t"
```

Now let's try to fix this:

``` r
library(readr)
library(stringr)
guess_encoding("Bar 64.txt")
#> # A tibble: 3 × 2
#> encoding confidence
#> <chr> <dbl>
#> 1 UTF-16LE 1
#> 2 ISO-8859-1 0.29
#> 3 ISO-8859-2 0.21
# set encoding through locale:
read_file("Bar 64.txt", locale = locale(encoding = "UTF-16LE")) |>
  str_trunc(100) |>
  str_view()
#> [1] │ t{\t}L{\t}dL{\t}EpsL{\t}Force{\t}
#> │ s{\t}mm{\t}mm{\t}%{\t}kN{\t}
#> │ {\t\t\t\t\t}
#> │ {\t\t}
#> │ 0.037{\t}24.0060{\t}0.0000{\t}0.000{\t}0.155{\t}
#> │ 0.103{\t}24.0060{\t}0.0000{\t}...
# progress, but the trailing \t characters would still create an empty column,
# and the 2nd header row (units) would turn every column into <chr>;
# one would expect read_tsv(..., col_select = 1:5, comment = "s") to fix both issues, though...
read_tsv("Bar 64.txt", locale = locale(encoding = "UTF-16LE"),
         comment = "s", col_select = 1:5)
#> Error in x:y: argument of length 0

# but we can clean up those lines ourselves too:
read_lines("Bar 64.txt", locale = locale(encoding = "UTF-16LE")) |>
  str_trim() |>                           # drop trailing \t; tab-only rows become ""
  str_subset("^s|^$", negate = TRUE) |>   # drop the units row and the blank rows
  I() |>                                  # treat the lines as literal data, not a path
  read_tsv()
#> Rows: 1803 Columns: 5
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: "\t"
#> dbl (5): t, L, dL, EpsL, Force
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 1,803 × 5
#> t L dL EpsL Force
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 0.037 24.0 0 0 0.155
#> 2 0.103 24.0 0 0 0.142
#> 3 0.637 24.0 0 0 0.185
#> 4 1.09 24.0 0.0001 0.001 0.171
#> 5 1.63 24.0 0 0 0.177
#> 6 2.08 24.0 -0.0002 -0.001 0.153
#> 7 2.62 24.0 0.0004 0.001 0.252
#> 8 3.07 24.0 0.0008 0.003 0.493
#> 9 3.60 24.0 0.0022 0.009 0.773
#> 10 4.06 24.0 0.0032 0.014 1.05
#> # ℹ 1,793 more rows
```

A sample from the file for reprex:

``` r
# raw_file_sample <- read_file_raw("Bar 64.txt")[1:220]
raw_file_sample <- as.raw(c(0x74, 0x00, 0x09, 0x00, 0x4c, 0x00, 0x09, 0x00, 0x64,
0x00, 0x4c, 0x00, 0x09, 0x00, 0x45, 0x00, 0x70, 0x00, 0x73, 0x00,
0x4c, 0x00, 0x09, 0x00, 0x46, 0x00, 0x6f, 0x00, 0x72, 0x00, 0x63,
0x00, 0x65, 0x00, 0x09, 0x00, 0x0a, 0x00, 0x73, 0x00, 0x09, 0x00,
0x6d, 0x00, 0x6d, 0x00, 0x09, 0x00, 0x6d, 0x00, 0x6d, 0x00, 0x09,
0x00, 0x25, 0x00, 0x09, 0x00, 0x6b, 0x00, 0x4e, 0x00, 0x09, 0x00,
0x0a, 0x00, 0x09, 0x00, 0x09, 0x00, 0x09, 0x00, 0x09, 0x00, 0x09,
0x00, 0x0a, 0x00, 0x09, 0x00, 0x09, 0x00, 0x0a, 0x00, 0x30, 0x00,
0x2e, 0x00, 0x30, 0x00, 0x33, 0x00, 0x37, 0x00, 0x09, 0x00, 0x32,
0x00, 0x34, 0x00, 0x2e, 0x00, 0x30, 0x00, 0x30, 0x00, 0x36, 0x00,
0x30, 0x00, 0x09, 0x00, 0x30, 0x00, 0x2e, 0x00, 0x30, 0x00, 0x30,
0x00, 0x30, 0x00, 0x30, 0x00, 0x09, 0x00, 0x30, 0x00, 0x2e, 0x00,
0x30, 0x00, 0x30, 0x00, 0x30, 0x00, 0x09, 0x00, 0x30, 0x00, 0x2e,
0x00, 0x31, 0x00, 0x35, 0x00, 0x35, 0x00, 0x09, 0x00, 0x0a, 0x00,
0x30, 0x00, 0x2e, 0x00, 0x31, 0x00, 0x30, 0x00, 0x33, 0x00, 0x09,
0x00, 0x32, 0x00, 0x34, 0x00, 0x2e, 0x00, 0x30, 0x00, 0x30, 0x00,
0x36, 0x00, 0x30, 0x00, 0x09, 0x00, 0x30, 0x00, 0x2e, 0x00, 0x30,
0x00, 0x30, 0x00, 0x30, 0x00, 0x30, 0x00, 0x09, 0x00, 0x30, 0x00,
0x2e, 0x00, 0x30, 0x00, 0x30, 0x00, 0x30, 0x00, 0x09, 0x00, 0x30,
0x00, 0x2e, 0x00, 0x31, 0x00, 0x34, 0x00, 0x32, 0x00, 0x09, 0x00,
0x0a, 0x00))
write_file(raw_file_sample, "Bar 64.txt")
```

<sup>Created on 2023-06-13 with reprex v2.0.2</sup>
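For readers who prefer to avoid extra packages, the same cleanup can be sketched in base R. This is not part of the original answer; it assumes the file layout described above and builds a small, hypothetical UTF-16LE sample in a temporary file so that it runs on its own.

``` r
# Hypothetical sample with the same layout: header, units row, two tab-only
# rows, then data rows, each with a trailing tab
txt <- paste0(
  "t\tL\tdL\tEpsL\tForce\t\n",
  "s\tmm\tmm\t%\tkN\t\n",
  "\t\t\t\t\t\n",
  "\t\t\n",
  "0.037\t24.0060\t0.0000\t0.000\t0.155\t\n",
  "0.103\t24.0060\t0.0000\t0.000\t0.142\t\n"
)
path <- tempfile(fileext = ".txt")
writeBin(iconv(txt, from = "UTF-8", to = "UTF-16LE", toRaw = TRUE)[[1]], path)

# Read the lines through a connection that re-encodes from UTF-16LE
con <- file(path, encoding = "UTF-16LE")
lines <- readLines(con)
close(con)

# Strip trailing tabs, then drop the units row and the now-empty rows
lines <- sub("\t+$", "", lines)
lines <- lines[!grepl("^(s\t|$)", lines)]

# Parse what is left as tab-separated text
dat <- read.table(text = lines, header = TRUE, sep = "\t")
str(dat)
```

Stripping only trailing tabs (rather than trimming both ends) keeps the columns aligned even if a first field were ever empty.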
Answer 2
Score: 0
We can read the file twice, once to get the column names, then with skip to get the data rows, and set the names manually:
``` r
# the two empty lines stand in for the file's two "blank" rows,
# so that skip = 4 lands on the first data row
x <- "t L dL EpsL Force
s mm mm % kN


0.037 24.0060 0.0000 0.000 0.155
0.103 24.0060 0.0000 0.000 0.142
0.637 24.0060 0.0000 0.000 0.185
"

# to read from a file, replace text = x with file = "myFileName.txt"
read.table(text = x, skip = 4,
           col.names = names(read.table(text = x, nrows = 1, header = TRUE)))
#       t      L dL EpsL Force
# 1 0.037 24.006  0    0 0.155
# 2 0.103 24.006  0    0 0.142
# 3 0.637 24.006  0    0 0.185
```
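Applied to a UTF-16LE file like the one in the question, the two-pass idea also needs the encoding to be declared. The sketch below is an adaptation, not the answerer's code: it uses a hypothetical temporary file with the same layout, takes the names from the first line with `readLines()`, and passes `fileEncoding = "UTF-16LE"` to `read.table()` for the data rows.

``` r
# Hypothetical UTF-16LE file: header, units row, two tab-only rows, data rows
txt <- paste0(
  "t\tL\tdL\tEpsL\tForce\t\n",
  "s\tmm\tmm\t%\tkN\t\n",
  "\t\t\t\t\t\n",
  "\t\t\n",
  "0.037\t24.0060\t0.0000\t0.000\t0.155\t\n",
  "0.103\t24.0060\t0.0000\t0.000\t0.142\t\n"
)
path <- tempfile(fileext = ".txt")
writeBin(iconv(txt, from = "UTF-8", to = "UTF-16LE", toRaw = TRUE)[[1]], path)

# Pass 1: column names from the first line (strsplit drops the trailing empty field)
con <- file(path, encoding = "UTF-16LE")
hdr <- strsplit(readLines(con, n = 1), "\t")[[1]]
close(con)

# Pass 2: skip both header rows and both blank rows, then read the data;
# the default whitespace separator ignores the trailing tab on each row
dat <- read.table(path, fileEncoding = "UTF-16LE", skip = 4, col.names = hdr)
dat
```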