无法在data.table中读取文件?

huangapple go评论81阅读模式
英文:

Can not read file in data.table?

问题

我从UNCTAD下载了一个文件,然后解压缩,再使用data.table包进行读取。

https://unctadstat.unctad.org/7zip/US_TradeMatrix_Part5.csv.7z

temp <- data.table::fread("US_TradeMatrix_Part5_ST202209201450_v1.csv", 
      drop = c("US dollars at current prices in thousands Footnote", "Flow"), 
      check.names = TRUE, showProgress = TRUE)

我遇到了以下错误
R字符字符串限制为2^31-1字节
请建议。

英文:

i downloaded a file from UNCTAD and unzip then i read using data.table package

https://unctadstat.unctad.org/7zip/US_TradeMatrix_Part5.csv.7z

temp &lt;- data.table::fread(&quot;US_TradeMatrix_Part5_ST202209201450_v1.csv&quot;, 
      drop = c(&quot;US dollars at current prices in thousands Footnote&quot;, &quot;Flow&quot;), 
      check.names = TRUE, showProgress = TRUE)

I get the following error
R character strings are limited to 2^31-1 bytes
please advise.

答案1

得分: 2

错误很可能是因为您的文件没有足够的内存(RAM)来处理它。我有64GB的RAM,读取文件花了超过6分钟,结果的对象在R中超过13GB(原始文件未压缩时超过20GB)。我没有收到错误消息。

由于看起来您无法一次将整个数据集读入内存,我建议您从获取数据的替代方法中受益。这些方法都不允许您将所有数据读入R:无论您做什么都不会允许这样做。 (从技术上讲,您可以select=加载的列的子集,但这假定您知道您永远不会使用“大内存”列。)

  1. 加载到数据库中。可以使用“大服务器”DBMS,如postgres、mariadb或sql server。这不仅仅是“只是R”,超出了这个问题的范围。但是,可以使用RSQLite或DuckDB(单文件数据库)。例如,在shell中(不是R):

    • SQLite: https://www.sqlitetutorial.net/sqlite-import-csv/

      $ sqlite3 US_TradeMatrix.sqlite3
      SQLite version 3.40.1 2022-12-28 14:03:47
      输入“.help”获取使用提示。
      sqlite&gt; .mode csv
      sqlite&gt; .import US_TradeMatrix_Part5_ST202306191712_v1.csv TradeMatrix
      
    • DuckDB: https://duckdb.org/docs/data/csv/overview.html

      (请注意,前导的 D 是提示字符串,不是用户输入的。)

      $ duckdb US_TradeMatrix.duckdb
      v0.8.1 6536a77232
      输入“.help”获取使用提示。
      D CREATE TABLE TradeMatrix (Year INT, YearLabel INT, Economy INT, EconomyLabel VARCHAR, Partner VARCHAR, ParnerLabel VARCHAR, Flow INT, FLowLabel VARCHAR, SitcRev3Product VARCHAR, SitcRev3ProductLabel VARCHAR, USdollars FLOAT, Footnote VARCHAR)
      D COPY TradeMatrix FROM 'US_TradeMatrix_Part5_ST202306191712_v1.csv' (AUTO_DETECT TRUE);
      

      我对表模式进行了猜测,请验证。

    从这里开始,可以使用DBIRSQLiteduckdb来制定在尝试引入R之前减少数据(并可选执行其他操作)的SQL语句。

  2. 使用duckdb对CSV文件本身进行惰性操作,仍然需要SQL。(请注意,这可能不像导入到duckdb文件中那样快速,如上述第1种方法。我没有对其进行基准测试,但我无法想象在CSV文件上进行随机访问与索引数据库的效率相比如何。)

    duck <- DBI::dbConnect(duckdb::duckdb())
    DBI::dbGetQuery(duck, "select * from 'US_TradeMatrix_Part5_ST202306191712_v1.csv' limit 10")
    #    Year Year Label Economy Economy Label Partner Partner Label Flow Flow Label SitcRev3Product                                    SitcRev3Product Label US dollars at current prices in thousands US dollars at current prices in thousands Footnote
    # 1  2016       2016    0000         World    0000         World   01    Imports               0                                    Food and live animals                                1068138072                                               <NA>
    # 2  2016       2016    0000         World    0000         World   01    Imports              00           Live animals other than animals of division 03                                  21120296                                               <NA>
    # 3  2016       2016    0000         World    0000         World   01    Imports             001           Live animals other than animals of division 03                                  21120296                                               <NA>
    # 4  2016       2016    0000         World    0000         World   01    Imports              01                               Meat and meat preparations                                 128581301                                               <NA>
    # 5  2016       2016    0000         World    0000         World   01    Imports             011         Meat of bovine animals, fresh, chilled or frozen                                  39233046                                               <NA>
    # 6  2016       2016    0000         World    0000         World   01    Imports             012                         Other meat and edible meat offal                                  66138000                                               <NA>
    # 7  2016       2016    0000         World    0000         World   01    Imports             016    Meat, edible meat offal, salted, dried; flours, meals                                   4135049                                               <NA>
    # 8  2016       2016    0000         World    0000         World   01    Imports             017     Meat, edible meat offal, prepared, preserved, n.e.s.                                  19075207                                               <NA>
    # 9  2016       2016    0000         World    0000         World   01    Imports              02                           Dairy products and birds' eggs                                  76369447                                               <NA>
    # 10 2016       2016    0000         World    0000         World   01    Imports             022 Milk, cream and milk products (excluding butter, cheese)                                  37031756                                               <NA>
    
  3. 使用arrow进行惰性操作,使用dplyr作为语言:

    tradematrix <- arrow::open_csv_dataset("US_TradeMatrix_Part5_ST202306191712_v1.csv")
    tradematrix %>%
      head(n=10) %>%
      collect()
    # # A tibble: 10 × 12
    #     Year `Year Label` Economy `Economy Label` Partner `Partner Label`  Flow `Flow Label` SitcRev3Product `SitcRev3Product Label`                                  `US dollars at current prices in thousands` US dollars at current prices…¹
    #    <int>        <int>   <int> <chr>           <chr>   <chr>           <int> <chr>        <chr>           <chr>                                                                                          <dbl>                          <???>
    #  1  2016         2016       0 World           0000    World               1 Imports      0
    
    
英文:

The error is likely because your file does not have sufficient memory (RAM) to process it. I have 64GB of RAM, it took well over 6 minutes to read in the file, and the resulting object is over 13GB in R (the original file is over 20GB in size, uncompressed). I received no error.

Since it appears you cannot read that entire dataset into RAM at one time, I suggest you would benefit from alternative methods of getting at your data. None of these allow you to read all data into R: nothing you can do will allow that. (Technically, you can select= a subset of the columns to load in, but that assumes there are "large memory" columns you know you'll never use.)

  1. Load into a database. One could use a "big server" DBMS such as postgres, mariadb, or sql server. Doing this is a lot more than "just R", outside the scope of this question. However, one might be able to use RSQLite or DuckDB (single-file databases). For instance, on the shell (not R):

    • SQLite: https://www.sqlitetutorial.net/sqlite-import-csv/

      $ sqlite3 US_TradeMatrix.sqlite3
      SQLite version 3.40.1 2022-12-28 14:03:47
      Enter &quot;.help&quot; for usage hints.
      sqlite&gt; .mode csv
      sqlite&gt; .import US_TradeMatrix_Part5_ST202306191712_v1.csv TradeMatrix
      
    • DuckDB: https://duckdb.org/docs/data/csv/overview.html

      (Note that the leading D is the prompt string, not typed in by the user.)

      $ duckdb US_TradeMatrix.duckdb
      v0.8.1 6536a77232
      Enter &quot;.help&quot; for usage hints.
      D CREATE TABLE TradeMatrix (Year INT, YearLabel INT, Economy INT, EconomyLabel VARCHAR, Partner VARCHAR, ParnerLabel VARCHAR, Flow INT, FLowLabel VARCHAR, SitcRev3Product VARCHAR, SitcRev3ProductLabel VARCHAR, USdollars FLOAT, Footnote VARCHAR)
      D COPY TradeMatrix FROM &#39;US_TradeMatrix_Part5_ST202306191712_v1.csv&#39; (AUTO_DETECT TRUE);
      

      I made guesses on the table schema, please verify.

    From here, one would use DBI and either RSQLite or duckdb to fashion SQL statements that reduce the data (and optionally do other things) before attempting to bring into R.

  2. Lazy operations using duckdb on the CSV file itself, still requiring SQL. (Note that this might not be as fast as importing into a duckdb file as in #1 above. I haven't benchmarked it, but I cannot imagine random-access on CSV files would compare well with indexed database efficiencies.)

    duck &lt;- DBI::dbConnect(duckdb::duckdb())
    DBI::dbGetQuery(duck, &quot;select * from &#39;US_TradeMatrix_Part5_ST202306191712_v1.csv&#39; limit 10&quot;)
    #    Year Year Label Economy Economy Label Partner Partner Label Flow Flow Label SitcRev3Product                                    SitcRev3Product Label US dollars at current prices in thousands US dollars at current prices in thousands Footnote
    # 1  2016       2016    0000         World    0000         World   01    Imports               0                                    Food and live animals                                1068138072                                               &lt;NA&gt;
    # 2  2016       2016    0000         World    0000         World   01    Imports              00           Live animals other than animals of division 03                                  21120296                                               &lt;NA&gt;
    # 3  2016       2016    0000         World    0000         World   01    Imports             001           Live animals other than animals of division 03                                  21120296                                               &lt;NA&gt;
    # 4  2016       2016    0000         World    0000         World   01    Imports              01                               Meat and meat preparations                                 128581301                                               &lt;NA&gt;
    # 5  2016       2016    0000         World    0000         World   01    Imports             011         Meat of bovine animals, fresh, chilled or frozen                                  39233046                                               &lt;NA&gt;
    # 6  2016       2016    0000         World    0000         World   01    Imports             012                         Other meat and edible meat offal                                  66138000                                               &lt;NA&gt;
    # 7  2016       2016    0000         World    0000         World   01    Imports             016    Meat, edible meat offal, salted, dried; flours, meals                                   4135049                                               &lt;NA&gt;
    # 8  2016       2016    0000         World    0000         World   01    Imports             017     Meat, edible meat offal, prepared, preserved, n.e.s.                                  19075207                                               &lt;NA&gt;
    # 9  2016       2016    0000         World    0000         World   01    Imports              02                           Dairy products and birds&#39; eggs                                  76369447                                               &lt;NA&gt;
    # 10 2016       2016    0000         World    0000         World   01    Imports             022 Milk, cream and milk products (excluding butter, cheese)                                  37031756                                               &lt;NA&gt;
    
  3. Lazy operations with arrow, using dplyr as the language:

    tradematrix &lt;- arrow::open_csv_dataset(&quot;US_TradeMatrix_Part5_ST202306191712_v1.csv&quot;)
    tradematrix %&gt;%
      head(n=10) %&gt;%
      collect()
    # # A tibble: 10 &#215; 12
    #     Year `Year Label` Economy `Economy Label` Partner `Partner Label`  Flow `Flow Label` SitcRev3Product `SitcRev3Product Label`                                  `US dollars at current prices in thousands` US dollars at current prices…&#185;
    #    &lt;int&gt;        &lt;int&gt;   &lt;int&gt; &lt;chr&gt;           &lt;chr&gt;   &lt;chr&gt;           &lt;int&gt; &lt;chr&gt;        &lt;chr&gt;           &lt;chr&gt;                                                                                          &lt;dbl&gt;                          &lt;???&gt;
    #  1  2016         2016       0 World           0000    World               1 Imports      0               Food and live animals                                                                    1068138072.                              .
    #  2  2016         2016       0 World           0000    World               1 Imports      00              Live animals other than animals of division 03                                             21120296.                              .
    #  3  2016         2016       0 World           0000    World               1 Imports      001             Live animals other than animals of division 03                                             21120296.                              .
    #  4  2016         2016       0 World           0000    World               1 Imports      01              Meat and meat preparations                                                                128581301.                              .
    #  5  2016         2016       0 World           0000    World               1 Imports      011             Meat of bovine animals, fresh, chilled or frozen                                           39233046.                              .
    #  6  2016         2016       0 World           0000    World               1 Imports      012             Other meat and edible meat offal                                                           66138000.                              .
    #  7  2016         2016       0 World           0000    World               1 Imports      016             Meat, edible meat offal, salted, dried; flours, meals                                       4135049.                              .
    #  8  2016         2016       0 World           0000    World               1 Imports      017             Meat, edible meat offal, prepared, preserved, n.e.s.                                       19075207.                              .
    #  9  2016         2016       0 World           0000    World               1 Imports      02              Dairy products and birds&#39; eggs                                                             76369447.                              .
    # 10  2016         2016       0 World           0000    World               1 Imports      022             Milk, cream and milk products (excluding butter, cheese)                                   37031756.                              .
    # # ℹ abbreviated name: &#185;​`US dollars at current prices in thousands Footnote`
    

    This method supports most of dplyrs verbs, and many base R functions within mutate and summarize.

huangapple
  • 本文由 发表于 2023年7月17日 21:00:03
  • 转载请务必保留本文链接:https://go.coder-hub.com/76704746.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定