Cannot read file in data.table?

Question
I downloaded a file from UNCTAD (https://unctadstat.unctad.org/7zip/US_TradeMatrix_Part5.csv.7z), unzipped it, and then tried to read it with the data.table package:

```r
temp <- data.table::fread("US_TradeMatrix_Part5_ST202209201450_v1.csv",
                          drop = c("US dollars at current prices in thousands Footnote", "Flow"),
                          check.names = TRUE, showProgress = TRUE)
```

I get the following error:

    R character strings are limited to 2^31-1 bytes

Please advise.
Answer 1

Score: 2
The error is most likely because your machine does not have enough memory (RAM) to process the file. I have 64GB of RAM; it took well over 6 minutes to read the file, and the resulting object is over 13GB in R (the original file is over 20GB uncompressed). I received no error.

Since it appears you cannot read that entire dataset into RAM at one time, I suggest you would benefit from alternative methods of getting at your data. None of these allow you to read all of the data into R: nothing you can do will allow that. (Technically, you can use `select=` to load only a subset of the columns, but that assumes there are "large memory" columns you know you'll never use.)
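As a minimal sketch of the `select=` route — the column names below are assumptions taken from this dataset's header row, so verify them against your file:

```r
# Sketch: load only a subset of columns with fread(select=).
# The column names here are assumptions based on the dataset's header; verify.
library(data.table)

temp <- fread("US_TradeMatrix_Part5_ST202209201450_v1.csv",
              select = c("Year", "Economy Label", "Partner Label",
                         "US dollars at current prices in thousands"),
              check.names = TRUE, showProgress = TRUE)
```

This only helps if the dropped columns account for most of the memory; it does not change the number of rows read.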
1. Load into a database. One could use a "big server" DBMS such as postgres, mariadb, or sql server, but doing that is a lot more than "just R" and outside the scope of this question. However, one can use RSQLite or DuckDB (single-file databases). For instance, in the shell (not R):

   - SQLite (https://www.sqlitetutorial.net/sqlite-import-csv/):

     ```
     $ sqlite3 US_TradeMatrix.sqlite3
     SQLite version 3.40.1 2022-12-28 14:03:47
     Enter ".help" for usage hints.
     sqlite> .mode csv
     sqlite> .import US_TradeMatrix_Part5_ST202306191712_v1.csv TradeMatrix
     ```

   - DuckDB (https://duckdb.org/docs/data/csv/overview.html). Note that the leading `D` is the prompt string, not typed by the user:

     ```
     $ duckdb US_TradeMatrix.duckdb
     v0.8.1 6536a77232
     Enter ".help" for usage hints.
     D CREATE TABLE TradeMatrix (Year INT, YearLabel INT, Economy INT, EconomyLabel VARCHAR,
         Partner VARCHAR, PartnerLabel VARCHAR, Flow INT, FlowLabel VARCHAR,
         SitcRev3Product VARCHAR, SitcRev3ProductLabel VARCHAR,
         USdollars FLOAT, Footnote VARCHAR);
     D COPY TradeMatrix FROM 'US_TradeMatrix_Part5_ST202306191712_v1.csv' (AUTO_DETECT TRUE);
     ```

   I made guesses on the table schema; please verify.
   From here, one would use `DBI` and either `RSQLite` or `duckdb` to fashion SQL statements that reduce the data (and optionally do other things) before attempting to bring it into R.
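   For example, a minimal sketch against the SQLite file created above — the `TradeMatrix` table name comes from the `.import` command, but the bracketed column names are assumptions based on the CSV header, so verify them:

   ```r
   # Sketch: reduce the data in SQL before pulling it into R.
   # Table name from the .import above; column names are assumptions.
   library(DBI)

   con <- DBI::dbConnect(RSQLite::SQLite(), "US_TradeMatrix.sqlite3")
   imports <- DBI::dbGetQuery(con, "
     SELECT [Year], [Partner Label],
            [US dollars at current prices in thousands] AS usd
     FROM TradeMatrix
     WHERE [Flow Label] = 'Imports'
     LIMIT 1000
   ")
   DBI::dbDisconnect(con)
   ```

   The point is that the `WHERE`/`LIMIT` filtering happens in the database, so only the reduced result ever occupies R's memory.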
2. Lazy operations using `duckdb` on the CSV file itself, still requiring SQL. (Note that this might not be as fast as importing into a `duckdb` file as in #1 above. I haven't benchmarked it, but I cannot imagine random access on a CSV file would compare well with the efficiency of an indexed database.)

   ```r
   duck <- DBI::dbConnect(duckdb::duckdb())
   DBI::dbGetQuery(duck, "select * from 'US_TradeMatrix_Part5_ST202306191712_v1.csv' limit 10")
   #    Year Year Label Economy Economy Label Partner Partner Label Flow Flow Label SitcRev3Product                                     SitcRev3Product Label US dollars at current prices in thousands US dollars at current prices in thousands Footnote
   # 1  2016       2016    0000         World    0000         World   01    Imports               0                                     Food and live animals                                1068138072                                               <NA>
   # 2  2016       2016    0000         World    0000         World   01    Imports              00            Live animals other than animals of division 03                                  21120296                                               <NA>
   # 3  2016       2016    0000         World    0000         World   01    Imports             001            Live animals other than animals of division 03                                  21120296                                               <NA>
   # 4  2016       2016    0000         World    0000         World   01    Imports              01                                Meat and meat preparations                                 128581301                                               <NA>
   # 5  2016       2016    0000         World    0000         World   01    Imports             011          Meat of bovine animals, fresh, chilled or frozen                                  39233046                                               <NA>
   # 6  2016       2016    0000         World    0000         World   01    Imports             012                          Other meat and edible meat offal                                  66138000                                               <NA>
   # 7  2016       2016    0000         World    0000         World   01    Imports             016     Meat, edible meat offal, salted, dried; flours, meals                                   4135049                                               <NA>
   # 8  2016       2016    0000         World    0000         World   01    Imports             017      Meat, edible meat offal, prepared, preserved, n.e.s.                                  19075207                                               <NA>
   # 9  2016       2016    0000         World    0000         World   01    Imports              02                            Dairy products and birds' eggs                                  76369447                                               <NA>
   # 10 2016       2016    0000         World    0000         World   01    Imports             022  Milk, cream and milk products (excluding butter, cheese)                                  37031756                                               <NA>
   ```
3. Lazy operations with `arrow`, using `dplyr` as the language:

   ```r
   tradematrix <- arrow::open_csv_dataset("US_TradeMatrix_Part5_ST202306191712_v1.csv")
   tradematrix %>%
     head(n = 10) %>%
     collect()
   # # A tibble: 10 × 12
   #     Year `Year Label` Economy `Economy Label` Partner `Partner Label`  Flow `Flow Label` SitcRev3Product `SitcRev3Product Label`                                   `US dollars at current prices in thousands` US dollars at current prices…¹
   #    <int>        <int>   <int> <chr>           <chr>   <chr>           <int> <chr>        <chr>           <chr>                                                                                           <dbl> <???>
   #  1  2016         2016       0 World           0000    World               1 Imports      0               Food and live animals                                                                     1068138072. .
   #  2  2016         2016       0 World           0000    World               1 Imports      00              Live animals other than animals of division 03                                              21120296. .
   #  3  2016         2016       0 World           0000    World               1 Imports      001             Live animals other than animals of division 03                                              21120296. .
   #  4  2016         2016       0 World           0000    World               1 Imports      01              Meat and meat preparations                                                                 128581301. .
   #  5  2016         2016       0 World           0000    World               1 Imports      011             Meat of bovine animals, fresh, chilled or frozen                                            39233046. .
   #  6  2016         2016       0 World           0000    World               1 Imports      012             Other meat and edible meat offal                                                            66138000. .
   #  7  2016         2016       0 World           0000    World               1 Imports      016             Meat, edible meat offal, salted, dried; flours, meals                                        4135049. .
   #  8  2016         2016       0 World           0000    World               1 Imports      017             Meat, edible meat offal, prepared, preserved, n.e.s.                                        19075207. .
   #  9  2016         2016       0 World           0000    World               1 Imports      02              Dairy products and birds' eggs                                                              76369447. .
   # 10  2016         2016       0 World           0000    World               1 Imports      022             Milk, cream and milk products (excluding butter, cheese)                                    37031756. .
   # # ℹ abbreviated name: ¹`US dollars at current prices in thousands Footnote`
   ```

   This method supports most of `dplyr`'s verbs, and many base R functions within `mutate` and `summarize`.
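   As a sketch of reducing the data lazily before collecting it into R — the column names are assumptions based on the CSV header shown earlier, so verify them:

   ```r
   # Sketch: lazily filter and aggregate with dplyr verbs on an arrow dataset;
   # only the reduced result is materialized in R by collect().
   # Column names here are assumptions based on the CSV header; verify.
   library(dplyr)

   tradematrix <- arrow::open_csv_dataset("US_TradeMatrix_Part5_ST202306191712_v1.csv")
   imports_by_partner <- tradematrix %>%
     filter(`Flow Label` == "Imports") %>%
     group_by(`Partner Label`) %>%
     summarize(total_usd = sum(`US dollars at current prices in thousands`,
                               na.rm = TRUE)) %>%
     collect()  # only the aggregated rows come into R's memory
   ```

   Since the aggregation runs on the arrow side, the full 20GB file never needs to fit in RAM.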