Importing a .txt file into R.

Question

I am attempting to import a .txt file into R as a data frame. I have tried using read.table as well as read_table and read_delim from the readr package but to no avail. I have tried skipping lines, changing the encoding, and everything I could think of to import the data, but I always get an error message. Does anyone know how to import this file?

Answer 1

Score: 3

Unless GDrive messes something up, this is a bit more interesting, as the file appears to be encoded as `UTF-16LE`, and both `read.table()` and `readr` could use some help with that. The file structure is also a bit funky: a double header, trailing `\t` characters, and "blank" rows of varying lengths.
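
One quick way to check for this kind of encoding before reaching for a parser is to look at the first raw bytes: in UTF-16LE every ASCII character is followed by a `0x00` byte, which is what triggers the "embedded nul" warnings. The following is a minimal base-R sketch, not part of the original answer. `looks_like_utf16le()` is a hypothetical helper name, the heuristic assumes mostly-ASCII content, and a small temporary file stands in for the real one.

``` r
# Heuristic sketch (assumption: the file holds mostly ASCII characters).
# looks_like_utf16le() is a hypothetical helper, not part of any package.
looks_like_utf16le <- function(path, n = 64L) {
  bytes <- readBin(path, what = "raw", n = n)
  if (length(bytes) < 2L) return(FALSE)
  # bytes at even positions are the high bytes of each 16-bit code unit
  high <- bytes[seq(2L, length(bytes), by = 2L)]
  mean(high == as.raw(0x00)) > 0.9
}

# Build a small UTF-16LE file to try it on (stand-in for the real data)
tmp <- tempfile(fileext = ".txt")
utf16 <- iconv("t\tL\tdL\t\n0.037\t24.0060\t0.0000\t\n",
               from = "UTF-8", to = "UTF-16LE", toRaw = TRUE)[[1]]
writeBin(utf16, tmp)

readBin(tmp, what = "raw", n = 8L)
#> [1] 74 00 09 00 4c 00 09 00
looks_like_utf16le(tmp)
```

`readr::guess_encoding()` is the more thorough tool; the byte check is only meant to make the cause of the warnings visible.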

First let's see if/how parsers would struggle:
``` r
read.table("Bar 64.txt", skip = 4)
#> Warning in readLines(file, skip): line 1 appears to contain an embedded nul
#  ...
#> Warning in read.table("Bar 64.txt", skip = 4): line 1 appears to contain
#> embedded nulls
#  ...
#> Error in read.table("Bar 64.txt", skip = 4): empty beginning of file

readr::read_lines("Bar 64.txt")
#> Error: The size of the connection buffer (131072) was not large enough
#> to fit a complete line:
#>   * Increase it by setting `Sys.setenv("VROOM_CONNECTION_SIZE")`

# at least some progress:
readr::read_file("Bar 64.txt")
#> [1] "t"
```

Now let's try to fix this:

``` r
library(readr)
library(stringr)

guess_encoding("Bar 64.txt")
#> # A tibble: 3 × 2
#>   encoding   confidence
#>   <chr>           <dbl>
#> 1 UTF-16LE         1   
#> 2 ISO-8859-1       0.29
#> 3 ISO-8859-2       0.21

# set encoding through locale:
read_file("Bar 64.txt", locale = locale(encoding = "UTF-16LE")) |> 
  str_trunc(100) |> 
  str_view()
#> [1] │ t{\t}L{\t}dL{\t}EpsL{\t}Force{\t}
#>     │ s{\t}mm{\t}mm{\t}%{\t}kN{\t}
#>     │ {\t\t\t\t\t}
#>     │ {\t\t}
#>     │ 0.037{\t}24.0060{\t}0.0000{\t}0.000{\t}0.155{\t}
#>     │ 0.103{\t}24.0060{\t}0.0000{\t}...

# progress, but the trailing \t characters would still create an empty column,
# and the 2nd header row would turn every column into <chr>;
# would expect read_tsv(..., col_select = 1:5, comment = "s") to fix both issues, though...
read_tsv("Bar 64.txt", locale = locale(encoding = "UTF-16LE"), 
          comment = "s", col_select = 1:5)
#> Error in x:y: argument of length 0

# but we can clean up those lines ourselves too:
read_lines("Bar 64.txt", locale = locale(encoding = "UTF-16LE")) |> 
  str_trim() |>
  str_subset("^s|^$", negate = TRUE) |>
  I() |>
  read_tsv()
#> Rows: 1803 Columns: 5
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: "\t"
#> dbl (5): t, L, dL, EpsL, Force
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 1,803 × 5
#>        t     L      dL   EpsL Force
#>    <dbl> <dbl>   <dbl>  <dbl> <dbl>
#>  1 0.037  24.0  0       0     0.155
#>  2 0.103  24.0  0       0     0.142
#>  3 0.637  24.0  0       0     0.185
#>  4 1.09   24.0  0.0001  0.001 0.171
#>  5 1.63   24.0  0       0     0.177
#>  6 2.08   24.0 -0.0002 -0.001 0.153
#>  7 2.62   24.0  0.0004  0.001 0.252
#>  8 3.07   24.0  0.0008  0.003 0.493
#>  9 3.60   24.0  0.0022  0.009 0.773
#> 10 4.06   24.0  0.0032  0.014 1.05 
#> # ℹ 1,793 more rows
```
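
The same cleanup can be sketched in base R, assuming a connection opened with `encoding = "UTF-16LE"` decodes the file the way `locale(encoding = ...)` does for `readr`. This is an illustration rather than a tested replacement for the pipeline above: `read_tester_file()` is a hypothetical helper name, and the temporary file is a shortened stand-in for `Bar 64.txt`.

``` r
# Hypothetical helper: decode, trim, drop units/blank rows, then parse.
read_tester_file <- function(path, encoding = "UTF-16LE") {
  con <- file(path, open = "r", encoding = encoding)
  on.exit(close(con))
  # trimws() removes the trailing tabs and empties the tab-only rows
  lines <- trimws(readLines(con, warn = FALSE))
  # drop the units row (starts with "s") and the now-empty rows
  lines <- lines[!grepl("^s|^$", lines)]
  read.delim(text = lines, sep = "\t", header = TRUE)
}

# Shortened stand-in for the real file, written as UTF-16LE
sample_txt <- paste0(
  "t\tL\tdL\tEpsL\tForce\t\n",
  "s\tmm\tmm\t%\tkN\t\n",
  "\t\t\t\t\t\n",
  "\t\t\n",
  "0.037\t24.0060\t0.0000\t0.000\t0.155\t\n",
  "0.103\t24.0060\t0.0000\t0.000\t0.142\t\n"
)
tmp <- tempfile(fileext = ".txt")
writeBin(iconv(sample_txt, from = "UTF-8", to = "UTF-16LE", toRaw = TRUE)[[1]], tmp)

df <- read_tester_file(tmp)
str(df)
```

As in the `readr` version, the `^s` filter would also drop any data line that happens to start with "s", which is fine for purely numeric rows like these.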

A sample from the file for reprex:

``` r
# raw_file_sample <- read_file_raw("Bar 64.txt")[1:220]
raw_file_sample <- as.raw(c(0x74, 0x00, 0x09, 0x00, 0x4c, 0x00, 0x09, 0x00, 0x64, 
0x00, 0x4c, 0x00, 0x09, 0x00, 0x45, 0x00, 0x70, 0x00, 0x73, 0x00, 
0x4c, 0x00, 0x09, 0x00, 0x46, 0x00, 0x6f, 0x00, 0x72, 0x00, 0x63, 
0x00, 0x65, 0x00, 0x09, 0x00, 0x0a, 0x00, 0x73, 0x00, 0x09, 0x00, 
0x6d, 0x00, 0x6d, 0x00, 0x09, 0x00, 0x6d, 0x00, 0x6d, 0x00, 0x09, 
0x00, 0x25, 0x00, 0x09, 0x00, 0x6b, 0x00, 0x4e, 0x00, 0x09, 0x00, 
0x0a, 0x00, 0x09, 0x00, 0x09, 0x00, 0x09, 0x00, 0x09, 0x00, 0x09, 
0x00, 0x0a, 0x00, 0x09, 0x00, 0x09, 0x00, 0x0a, 0x00, 0x30, 0x00, 
0x2e, 0x00, 0x30, 0x00, 0x33, 0x00, 0x37, 0x00, 0x09, 0x00, 0x32, 
0x00, 0x34, 0x00, 0x2e, 0x00, 0x30, 0x00, 0x30, 0x00, 0x36, 0x00, 
0x30, 0x00, 0x09, 0x00, 0x30, 0x00, 0x2e, 0x00, 0x30, 0x00, 0x30, 
0x00, 0x30, 0x00, 0x30, 0x00, 0x09, 0x00, 0x30, 0x00, 0x2e, 0x00, 
0x30, 0x00, 0x30, 0x00, 0x30, 0x00, 0x09, 0x00, 0x30, 0x00, 0x2e, 
0x00, 0x31, 0x00, 0x35, 0x00, 0x35, 0x00, 0x09, 0x00, 0x0a, 0x00, 
0x30, 0x00, 0x2e, 0x00, 0x31, 0x00, 0x30, 0x00, 0x33, 0x00, 0x09, 
0x00, 0x32, 0x00, 0x34, 0x00, 0x2e, 0x00, 0x30, 0x00, 0x30, 0x00, 
0x36, 0x00, 0x30, 0x00, 0x09, 0x00, 0x30, 0x00, 0x2e, 0x00, 0x30, 
0x00, 0x30, 0x00, 0x30, 0x00, 0x30, 0x00, 0x09, 0x00, 0x30, 0x00, 
0x2e, 0x00, 0x30, 0x00, 0x30, 0x00, 0x30, 0x00, 0x09, 0x00, 0x30, 
0x00, 0x2e, 0x00, 0x31, 0x00, 0x34, 0x00, 0x32, 0x00, 0x09, 0x00, 
0x0a, 0x00))
write_file(raw_file_sample, "Bar 64.txt")
```

<sup>Created on 2023-06-13 with reprex v2.0.2</sup>

Answer 2

Score: 0

We can read the file twice: once to get the column names, then with `skip` to get the data rows, setting the names manually:

``` r
# the two empty lines after the units row stand in for the tab-only rows
# of the real file, so that skip = 4 lands on the first data row
x <- "t	L	dL	EpsL	Force	
s	mm	mm	%	kN	


0.037	24.0060	0.0000	0.000	0.155	
0.103	24.0060	0.0000	0.000	0.142	
0.637	24.0060	0.0000	0.000	0.185	
"

# replace text = x with file = "myFileName.txt"
read.table(text = x, skip = 4, 
           col.names = names(read.table(text = x, nrows = 1, header = TRUE)))
#       t      L dL EpsL Force
# 1 0.037 24.006  0    0 0.155
# 2 0.103 24.006  0    0 0.142
# 3 0.637 24.006  0    0 0.185
```
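
Applied to the actual file rather than to a text string, this approach would presumably also need the encoding passed along, since the first answer found the file to be UTF-16LE. Below is a hedged sketch using the `fileEncoding` argument of `read.table()`. The temporary file is a shortened stand-in for the real data, and `skip = 4` assumes the same four leading rows (names, units, two tab-only rows).

``` r
# Shortened stand-in for the real file, written as UTF-16LE
sample_txt <- paste0(
  "t\tL\tdL\tEpsL\tForce\t\n",
  "s\tmm\tmm\t%\tkN\t\n",
  "\t\t\t\t\t\n",
  "\t\t\n",
  "0.037\t24.0060\t0.0000\t0.000\t0.155\t\n",
  "0.103\t24.0060\t0.0000\t0.000\t0.142\t\n"
)
tmp <- tempfile(fileext = ".txt")
writeBin(iconv(sample_txt, from = "UTF-8", to = "UTF-16LE", toRaw = TRUE)[[1]], tmp)

# first pass: column names only; second pass: data rows after the 4 header rows
col_names <- names(read.table(tmp, nrows = 1, header = TRUE,
                              fileEncoding = "UTF-16LE"))
dat <- read.table(tmp, skip = 4, col.names = col_names,
                  fileEncoding = "UTF-16LE")
dat
```

Because `read.table()` splits on any whitespace by default, the trailing tabs do not create an extra column here.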

huangapple
  • Posted on 2023-06-13 14:42:44
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76462674.html