Export a SQLite table to Apache parquet without creating a dataframe

Question

I have multiple huge CSV files that I have to export in Apache Parquet format and split into smaller files based on multiple criteria/keys (= column values). As I understand it, Apache arrow is the R package that allows working with Apache Parquet files.

I work in a shared lab environment and, given the limited RAM (compared to the number of users who work simultaneously in this same environment), we are advised to create our dataframes in local SQLite databases rather than importing them in-memory (into RAM).

The following pseudo-code shows how I import my CSV files into my local SQLite database. In the following code I use the sqldf and tidyverse packages.

```r
input_file_path <- "D:/tmp/mydata.csv"
db_file_path <- "D:/tmp/db_tmp_sqlite.db"
unlink(db_file_path)
sqldf(str_c("attach '", db_file_path, "' as new"))
sqldf(read.csv.sql(
    file = input_file_path,
    sql = "
        create table mytable as
        select
            . . .
        from
            file
    ",
    `field.types` = list(
      . . .
    ),
    header = TRUE,
    sep = ",",
    eol = "\n",
    dbname = db_file_path,
    drv = "SQLite"
))
```

This works as expected: my table is created and I can run all the required SQL queries, in particular adding supplementary variables (columns in my tables) which will later be used as keys to export my tables into Apache Parquet format. However, according to the Apache Arrow for R Cheatsheet, the write_dataset function that allows exporting my data in Apache Parquet format requires a dataframe.

And that is precisely my problem, because a dataframe in R is in-memory whereas my data, as I explained earlier, are in a local SQLite database. This means that I would first have to do a SELECT to pull the whole data into RAM, something like

```r
df <- sqldf("select * from mytable", dbname = ...)
```

Only then would I be able to use write_dataset with the created df dataframe as its first argument in order to export and split my data in Apache Parquet format, roughly as sketched below. But this is not what I want to do. The whole point was to put the data in SQLite and not in memory (RAM), given the existing resource limitations (lack of memory) in our shared environment.
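
For illustration, the in-memory route would look roughly like the sketch below (the output directory and the partitioning key names are placeholders), and this is exactly the step I want to avoid:

```r
# Sketch of the in-memory route to avoid: the whole table is pulled into RAM,
# then arrow splits it into Parquet files by the (hypothetical) key columns.
df <- sqldf("select * from mytable", dbname = db_file_path)
arrow::write_dataset(
  dataset      = df,
  path         = "D:/tmp/parquet_out",   # placeholder output directory
  format       = "parquet",
  partitioning = c("key1", "key2")       # placeholder key columns
)
```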

Is there any way to convert to Apache Parquet directly from SQLite within an R program, without first putting the whole data into a dataframe before the export, or am I trying to do something which is simply not possible?

Answer 1 (score: 1)

DuckDB has several great features, including the ability to both import and export CSV and parquet formats _natively_ without affecting R memory.

# TL;DR

```r
con <- DBI::dbConnect(duckdb::duckdb(), dbdir = ":memory:")
DBI::dbExecute(con, "copy (select * from read_csv_auto('quux.csv', SAMPLE_SIZE=-1)) to 'quux3.pq' (format parquet)")
```

And that is all. The data is never imported into R. (Now, whether duckdb can do it itself without exhausting memory is another issue I did not validate locally ...)
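
(If DuckDB's own memory use is a concern, it can be capped before running the copy; a minimal sketch, assuming a DuckDB build that supports the memory_limit setting:)

```r
# Cap DuckDB's working memory for this connection; the value is illustrative.
DBI::dbExecute(con, "PRAGMA memory_limit='2GB'")
```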

Caveat emptor: before you trust this blindly, however, I strongly urge you to do some validation of classes. Most of it can be done easily in a "lazy" fashion using duckdb without having to load the whole frame into R. I encourage you to read more of its documentation for querying CSV/parquet files natively (without loading into R).
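
As a minimal sketch of that kind of lazy validation (same quux.csv and connection as above): inspect the inferred column types and a few rows without materialising the file in R.

```r
# Column names and the types DuckDB inferred for the CSV, computed lazily
DBI::dbGetQuery(con, "describe select * from read_csv_auto('quux.csv', SAMPLE_SIZE=-1)")

# Spot-check a handful of rows
DBI::dbGetQuery(con, "select * from read_csv_auto('quux.csv', SAMPLE_SIZE=-1) limit 5")
```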

Methodology

To make a comparison of the two methods (via a data.frame which you don't want to do, and via duckdb), we'll use "RSS" (from ps::ps_memory_info()) to indicate the current R process memory usage. From ?ps::ps_memory_info:

        * 'rss': "Resident Set Size", this is the non-swapped physical
          memory a process has used (bytes). On UNIX it matches "top"'s
          'RES' column (see doc). On Windows this is an alias for
          'wset' field and it matches "Memory" column of 'taskmgr.exe'.

Though an imperfect measure of the true impact on R, it does indicate a significantly smaller impact when using DuckDB.
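
(For reference, a quick way to look at that measure yourself, assuming the ps package is installed:)

```r
ps::ps_memory_info()["rss"] / 1024^2   # current R process RSS, in MB
```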

Also, each method is done in a fresh instance of R --vanilla. No .Rprofile or site-init files are loaded. The code you see is the code that is executed, nothing more.

In R via data.frame

```r
Sys.getpid()
# [1] 20860

file.info("quux.csv")["size"] / (1024^2) # MBs
#              size
# quux.csv 299.3079
mem1 <- ps::ps_memory_info()["rss"]
dat <- read.csv("quux.csv")
mem2 <- ps::ps_memory_info()["rss"]
arrow::write_parquet(dat, "quux1.pq")
mem3 <- ps::ps_memory_info()["rss"]
c(mem1, mem2, mem3, diff = mem3 - mem1) / (1024^2)
#        rss        rss        rss   diff.rss 
#   57.70703 1218.55859 1548.54688 1490.83984 
```

This indicates R is 1490MB larger after reading in the full data. (FYI, data.table::fread instead of read.csv results in only 408MB of memory gain, same austere conditions. I'm not trying to optimize this part, though.)

(FYI, these numbers vary for me from run-to-run and are likely to be different based on other factors outside the scope of this answer. My laptop has 64GB of RAM, it might not be comparable to exactly what you see.)

DuckDB, read from CSV, write to parquet

```r
Sys.getpid()
# [1] 32485

mem1 <- ps::ps_memory_info()["rss"]
con <- DBI::dbConnect(duckdb::duckdb(), dbdir = ":memory:")
DBI::dbExecute(con, "copy (select * from read_csv_auto('quux.csv')) to 'quux2.pq' (format parquet)")
# [1] 1000207
mem2 <- ps::ps_memory_info()["rss"]
c(mem1, mem2, diff=mem2 - mem1) / (1024^2)
#      rss      rss diff.rss 
# 63.23828 86.35938 23.12109 
```

showing only 23MB in this process.

Comparing the resulting files.

```r
file.info(list.files(pattern = "quux.*"))["size"] /  (1024^2)
#               size
# quux.csv 299.30786
# quux1.pq  51.69008
# quux2.pq  66.84857
```

The larger file is due to the differences in class noted below. My guess is that if we force some of the character columns to be logical, then its file-size might be reduced.
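
For example, here is a sketch of forcing one such column (V5, which is all-null in this data) inside the copy statement, assuming a DuckDB build that supports SELECT * REPLACE; quux4.pq is just an illustrative output name:

```r
# Re-type one all-null VARCHAR column as BOOLEAN before writing the parquet file.
DBI::dbExecute(con, "
  copy (
    select * replace (V5::boolean as V5)
    from read_csv_auto('quux.csv', SAMPLE_SIZE=-1)
  ) to 'quux4.pq' (format parquet)
")
```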

A little more in-depth look at the contents:

```r
ds1 <- arrow::open_dataset("quux1.pq")
ds2 <- arrow::open_dataset("quux2.pq")
identical(names(ds1), names(ds2))
# [1] TRUE

data.frame(
  ds1 = sapply(head(ds1, 1), function(z) class(z)[1]),
  ds2 = sapply(head(ds2, 1), function(z) class(z)[1])
)
#           ds1       ds2
# V1  character character
# V2    integer   integer
# V3  character character
# V4    integer   integer
# V5    logical character
# V6    integer   integer
# V7  character   POSIXct
# V8    logical character
# V9    numeric   numeric
# V10   numeric   numeric
# V11   numeric   integer
# V12   integer   integer
# V13   integer   integer
# V14   integer   integer
# V15   numeric   numeric
# V16   integer   integer
# V17   integer   integer
# V18   numeric   numeric
# V19   numeric   numeric
# V20   logical character
# V21   numeric   numeric
# V22   numeric   numeric
# V23   numeric   numeric
# V24   integer   integer
# V25   logical character
# V26   integer   integer
# V27   integer   integer
# V28   integer   integer
# V29   integer   integer
# V30   logical character
# V31   logical character
# V32   numeric   numeric
# V33   logical character
# V34   logical character
# V35   logical character
# V36   logical character
# V37   logical character
# V38   logical character
# V39 character   POSIXct
# V40   logical character
# V41   logical character
# V42   numeric   integer
# V43   logical character
# V44   logical character
# V45   logical character
# V46   logical character
# V47   numeric   numeric
# V48   logical character
# V49   logical character
# V50   logical character
# V51   logical character
# V52   logical character
# V53   logical character
# V54   logical character
# V55   logical character
# V56   logical character
# V57   logical character
```

Some interesting things to deduce from this:

  • two fields are timestamps, and the duckdb method correctly identified them, parsed them, and stored as numeric timestamps; since I didn't tell R the column classes explicitly, it defaulted to character for them;
  • all of the columns that are logical in ds1 and character in ds2 are all null (sorry, it was the data I had); the fact that they are different classes indicates that duckdb defaults to string-like nulls instead of "bit", which may or may not be a factor for you;
  • only two columns differ as numeric-vs-integer; V11 was truly integer, so it's fine; the second one, V42, shows that the heuristic used for differentiating between numeric and integer missed something: the first row of V42 that contained any fractional component was row 37159.

Fixing data discrepancies

Column V42 indicates that we need to be very cognizant of what is going in and out of that parquet generator. My guess is that it's in the "CSV Import" step, so looking at CSV Loading suggests the need to change the SAMPLE_SIZE. While relatively inefficient, I'll use -1 indicating that it needs to look at all values in a column to determine its class. Slower, yes, but also safer.

Validation of this assumption:

```r
> str(DBI::dbGetQuery(con, "select * from read_csv_auto('quux.csv') limit 5")[c("V11","V42")])
'data.frame':	5 obs. of  2 variables:
 $ V11: int  4407 4408 4408 4407 4408
 $ V42: int  26 25 23 21 3
> str(DBI::dbGetQuery(con, "select * from read_csv_auto('quux.csv', SAMPLE_SIZE=-1) limit 5")[c("V11","V42")])
'data.frame':	5 obs. of  2 variables:
 $ V11: int  4407 4408 4408 4407 4408
 $ V42: num  26 25 23 21 3
```

Indeed, V11 is still good, and V42 changes from int to num.

After rerunning with this new parameter,

```r
DBI::dbExecute(con, "copy (select * from read_csv_auto('quux.csv', SAMPLE_SIZE=-1)) to 'quux3.pq' (format parquet)")
```

offline validation confirmed that all values are correct.
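
(As a sketch of what such an offline check can look like, reusing the lazy tools from above on the re-written file:)

```r
# Column classes of the re-written file, without loading the rows into R
ds3 <- arrow::open_dataset("quux3.pq")
sapply(head(ds3, 1), function(z) class(z)[1])

# Spot-check a few values of the previously problematic columns via duckdb
DBI::dbGetQuery(con, "select V11, V42 from read_parquet('quux3.pq') limit 5")
```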
