Cannot read file in data.table?

Question
I downloaded a file from UNCTAD (https://unctadstat.unctad.org/7zip/US_TradeMatrix_Part5.csv.7z), unzipped it, and then tried to read it with the data.table package:

```r
temp <- data.table::fread("US_TradeMatrix_Part5_ST202209201450_v1.csv",
                          drop = c("US dollars at current prices in thousands Footnote", "Flow"),
                          check.names = TRUE, showProgress = TRUE)
```

I get the following error:

    R character strings are limited to 2^31-1 bytes

Please advise.
Answer 1

Score: 2
The error is most likely because your machine does not have enough memory (RAM) to process the file. I have 64GB of RAM; it took well over 6 minutes to read the file, and the resulting object is over 13GB in R (the original file is over 20GB uncompressed). I received no error.

Since it appears you cannot read that entire dataset into RAM at one time, I suggest you would benefit from alternative methods of getting at your data. None of these allow you to read all of the data into R: nothing you can do will allow that. (Technically, you can use `select=` to load only a subset of the columns, but that assumes there are "large memory" columns you know you'll never use.)
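As a minimal sketch of the `select=` route — the column names below are assumptions taken from this dataset's header row, so verify them against your file:

```r
# Sketch: load only a subset of columns with fread(select=).
# The column names here are assumptions based on the dataset's header; verify.
library(data.table)

temp <- fread("US_TradeMatrix_Part5_ST202209201450_v1.csv",
              select = c("Year", "Economy Label", "Partner Label",
                         "US dollars at current prices in thousands"),
              check.names = TRUE, showProgress = TRUE)
```

This only helps if the dropped columns account for most of the memory; it does not change the number of rows read.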
1. Load into a database. One could use a "big server" DBMS such as postgres, mariadb, or sql server, but doing that is a lot more than "just R" and outside the scope of this question. However, one can use RSQLite or DuckDB (single-file databases). For instance, in the shell (not R):

   - SQLite (https://www.sqlitetutorial.net/sqlite-import-csv/):

     ```
     $ sqlite3 US_TradeMatrix.sqlite3
     SQLite version 3.40.1 2022-12-28 14:03:47
     Enter ".help" for usage hints.
     sqlite> .mode csv
     sqlite> .import US_TradeMatrix_Part5_ST202306191712_v1.csv TradeMatrix
     ```

   - DuckDB (https://duckdb.org/docs/data/csv/overview.html). Note that the leading `D` is the prompt string, not typed by the user:

     ```
     $ duckdb US_TradeMatrix.duckdb
     v0.8.1 6536a77232
     Enter ".help" for usage hints.
     D CREATE TABLE TradeMatrix (Year INT, YearLabel INT, Economy INT, EconomyLabel VARCHAR,
         Partner VARCHAR, PartnerLabel VARCHAR, Flow INT, FlowLabel VARCHAR,
         SitcRev3Product VARCHAR, SitcRev3ProductLabel VARCHAR,
         USdollars FLOAT, Footnote VARCHAR);
     D COPY TradeMatrix FROM 'US_TradeMatrix_Part5_ST202306191712_v1.csv' (AUTO_DETECT TRUE);
     ```

   I made guesses on the table schema; please verify.
   From here, one would use `DBI` and either `RSQLite` or `duckdb` to fashion SQL statements that reduce the data (and optionally do other things) before attempting to bring it into R.
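   For example, a minimal sketch against the SQLite file created above — the `TradeMatrix` table name comes from the `.import` command, but the bracketed column names are assumptions based on the CSV header, so verify them:

   ```r
   # Sketch: reduce the data in SQL before pulling it into R.
   # Table name from the .import above; column names are assumptions.
   library(DBI)

   con <- DBI::dbConnect(RSQLite::SQLite(), "US_TradeMatrix.sqlite3")
   imports <- DBI::dbGetQuery(con, "
     SELECT [Year], [Partner Label],
            [US dollars at current prices in thousands] AS usd
     FROM TradeMatrix
     WHERE [Flow Label] = 'Imports'
     LIMIT 1000
   ")
   DBI::dbDisconnect(con)
   ```

   The point is that the `WHERE`/`LIMIT` filtering happens in the database, so only the reduced result ever occupies R's memory.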
2. Lazy operations using `duckdb` on the CSV file itself, still requiring SQL. (Note that this might not be as fast as importing into a `duckdb` file as in #1 above. I haven't benchmarked it, but I cannot imagine random access on a CSV file would compare well with the efficiency of an indexed database.)

   ```r
   duck <- DBI::dbConnect(duckdb::duckdb())
   DBI::dbGetQuery(duck, "select * from 'US_TradeMatrix_Part5_ST202306191712_v1.csv' limit 10")
   #    Year Year Label Economy Economy Label Partner Partner Label Flow Flow Label SitcRev3Product                                     SitcRev3Product Label US dollars at current prices in thousands US dollars at current prices in thousands Footnote
   # 1  2016       2016    0000         World    0000         World   01    Imports               0                                     Food and live animals                                1068138072                                               <NA>
   # 2  2016       2016    0000         World    0000         World   01    Imports              00            Live animals other than animals of division 03                                  21120296                                               <NA>
   # 3  2016       2016    0000         World    0000         World   01    Imports             001            Live animals other than animals of division 03                                  21120296                                               <NA>
   # 4  2016       2016    0000         World    0000         World   01    Imports              01                                Meat and meat preparations                                 128581301                                               <NA>
   # 5  2016       2016    0000         World    0000         World   01    Imports             011          Meat of bovine animals, fresh, chilled or frozen                                  39233046                                               <NA>
   # 6  2016       2016    0000         World    0000         World   01    Imports             012                          Other meat and edible meat offal                                  66138000                                               <NA>
   # 7  2016       2016    0000         World    0000         World   01    Imports             016     Meat, edible meat offal, salted, dried; flours, meals                                   4135049                                               <NA>
   # 8  2016       2016    0000         World    0000         World   01    Imports             017      Meat, edible meat offal, prepared, preserved, n.e.s.                                  19075207                                               <NA>
   # 9  2016       2016    0000         World    0000         World   01    Imports              02                            Dairy products and birds' eggs                                  76369447                                               <NA>
   # 10 2016       2016    0000         World    0000         World   01    Imports             022  Milk, cream and milk products (excluding butter, cheese)                                  37031756                                               <NA>
   ```
3. Lazy operations with `arrow`, using `dplyr` as the language:

   ```r
   tradematrix <- arrow::open_csv_dataset("US_TradeMatrix_Part5_ST202306191712_v1.csv")
   tradematrix %>%
     head(n = 10) %>%
     collect()
   # # A tibble: 10 × 12
   #     Year `Year Label` Economy `Economy Label` Partner `Partner Label`  Flow `Flow Label` SitcRev3Product `SitcRev3Product Label`                                   `US dollars at current prices in thousands` US dollars at current prices…¹
   #    <int>        <int>   <int> <chr>           <chr>   <chr>           <int> <chr>        <chr>           <chr>                                                                                           <dbl> <???>
   #  1  2016         2016       0 World           0000    World               1 Imports      0               Food and live animals                                                                     1068138072. .
   #  2  2016         2016       0 World           0000    World               1 Imports      00              Live animals other than animals of division 03                                              21120296. .
   #  3  2016         2016       0 World           0000    World               1 Imports      001             Live animals other than animals of division 03                                              21120296. .
   #  4  2016         2016       0 World           0000    World               1 Imports      01              Meat and meat preparations                                                                 128581301. .
   #  5  2016         2016       0 World           0000    World               1 Imports      011             Meat of bovine animals, fresh, chilled or frozen                                            39233046. .
   #  6  2016         2016       0 World           0000    World               1 Imports      012             Other meat and edible meat offal                                                            66138000. .
   #  7  2016         2016       0 World           0000    World               1 Imports      016             Meat, edible meat offal, salted, dried; flours, meals                                        4135049. .
   #  8  2016         2016       0 World           0000    World               1 Imports      017             Meat, edible meat offal, prepared, preserved, n.e.s.                                        19075207. .
   #  9  2016         2016       0 World           0000    World               1 Imports      02              Dairy products and birds' eggs                                                              76369447. .
   # 10  2016         2016       0 World           0000    World               1 Imports      022             Milk, cream and milk products (excluding butter, cheese)                                    37031756. .
   # # ℹ abbreviated name: ¹`US dollars at current prices in thousands Footnote`
   ```

   This method supports most of `dplyr`'s verbs, and many base R functions within `mutate` and `summarize`.
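   As a sketch of reducing the data lazily before collecting it into R — the column names are assumptions based on the CSV header shown earlier, so verify them:

   ```r
   # Sketch: lazily filter and aggregate with dplyr verbs on an arrow dataset;
   # only the reduced result is materialized in R by collect().
   # Column names here are assumptions based on the CSV header; verify.
   library(dplyr)

   tradematrix <- arrow::open_csv_dataset("US_TradeMatrix_Part5_ST202306191712_v1.csv")
   imports_by_partner <- tradematrix %>%
     filter(`Flow Label` == "Imports") %>%
     group_by(`Partner Label`) %>%
     summarize(total_usd = sum(`US dollars at current prices in thousands`,
                               na.rm = TRUE)) %>%
     collect()  # only the aggregated rows come into R's memory
   ```

   Since the aggregation runs on the arrow side, the full 20GB file never needs to fit in RAM.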