Export a SQLite table to Apache parquet without creating a dataframe
Question
I have multiple huge CSV files that I have to export to Apache Parquet format and split into smaller files based on multiple criteria/keys (= column values). As I understand it, Apache arrow is the R package that allows working with Apache Parquet files.
I work in a shared lab environment, and given the limited RAM (compared to the number of users who work simultaneously in this same environment), we are advised to create our dataframes in local SQLite databases rather than importing them in-memory (into RAM).
The following pseudo-code shows how I import my CSV files into my local SQLite database. In this code I use the sqldf and tidyverse packages.
library(sqldf)      # read.csv.sql(), sqldf()
library(tidyverse)  # str_c()

input_file_path <- "D:/tmp/mydata.csv"
db_file_path <- "D:/tmp/db_tmp_sqlite.db"

# start from a clean database file and attach it
unlink(db_file_path)
sqldf(str_c("attach '", db_file_path, "' as new"))

# import the CSV directly into a SQLite table in db_file_path
sqldf(read.csv.sql(
  file = input_file_path,
  sql = "
    create table mytable as
    select
      . . .
    from
      file
  ",
  `field.types` = list(
    . . .
  ),
  header = TRUE,
  sep = ",",
  eol = "\n",
  dbname = db_file_path,
  drv = "SQLite"
))
This works as expected: my table is created and I can run all the SQL queries I need, in particular adding supplementary variables (columns in my tables) which will later be used as keys to export my tables into Apache Parquet format. However, based on the Apache Arrow for R Cheatsheet, the write_dataset function that allows exporting data in Apache Parquet format requires a dataframe.
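For reference, the kind of call the cheatsheet points to would look roughly like the sketch below, where `df` is an in-memory dataframe and `key1`/`key2` stand for hypothetical partitioning columns:

```r
# Hypothetical sketch only: write an in-memory dataframe as a Parquet dataset,
# split into one directory per combination of key values.
arrow::write_dataset(df, "D:/tmp/parquet_out",
                     format = "parquet",
                     partitioning = c("key1", "key2"))
```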
And that is precisely my problem, because a dataframe in R is in-memory whereas my data, as I explained earlier, are in a local SQLite database. This means that first I would have to do a SELECT to export the whole data into RAM, something like
df <- sqldf("select * from mytable", dbname = ...)
And only then I'd be able to use write_dataset with the created df dataframe as its first argument in order to export and split my data in Apache Parquet format. But this is not what I wanted to do. The whole point was to put the data in SQLite and not in-memory (RAM), given the existing resource limitations (lack of memory) in our shared environment.
Is there any way to convert to Apache Parquet directly from SQLite within an R program, without first putting the whole data into a dataframe before the export, or am I trying to do something that is simply not possible?
Answer 1
Score: 1
DuckDB has several great features, including the ability to both import and export CSV and parquet formats _natively_ without affecting R memory.
# TL;DR
```r
con <- DBI::dbConnect(duckdb::duckdb(), dbdir = ":memory:")
DBI::dbExecute(con, "copy (select * from read_csv_auto('quux.csv', SAMPLE_SIZE=-1)) to 'quux3.pq' (format parquet)")
```
And that is all. The data is never imported into R. (Now, whether duckdb can do it itself without exhausting memory is another issue I did not validate locally ...)
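If duckdb's own memory use turns out to be an issue, two knobs I would try first (assumptions on my part, not something validated in the runs below) are a file-backed database instead of `:memory:`, so it can spill to disk, and its `memory_limit` setting:

```r
# Sketch / assumption: give duckdb a disk-backed working database and cap its RAM.
con <- DBI::dbConnect(duckdb::duckdb(), dbdir = "duck_tmp.duckdb")
DBI::dbExecute(con, "PRAGMA memory_limit='2GB'")
```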
Caveat emptor: before you trust this blindly, however, I strongly urge you to do some validation of classes. Most of it can be done easily in a "lazy" fashion using duckdb without having to load the whole frame into R. I encourage you to read more of its documentation for querying CSV/parquet files natively (without loading into R).
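For example, one lazy way to eyeball the schema duckdb infers (assuming your build supports `DESCRIBE` on a query; only the small schema table is returned to R) might be:

```r
# Sketch: list the column names/types duckdb infers for the CSV,
# without materializing any data rows in R.
DBI::dbGetQuery(con, "describe select * from read_csv_auto('quux.csv')")
```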
# Methodology
To make a comparison of the two methods (via a data.frame, which you don't want to do, and via duckdb), we'll use "RSS" (from ps::ps_memory_info()) to indicate the current R process memory usage. From ?ps::ps_memory_info:
* 'rss': "Resident Set Size", this is the non-swapped physical
memory a process has used (bytes). On UNIX it matches "top"‘s
'RES' column (see doc). On Windows this is an alias for
'wset' field and it matches "Memory" column of 'taskmgr.exe'.
Though an imperfect measure of the true impact to R, it does indicate a significantly smaller impact on R when using DuckDB.
Also, each method is done in a fresh instance of R --vanilla. No .Rprofile or site-init files are loaded. The code you see is the code that is executed, nothing more.
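(Purely as a convenience, the repeated RSS reads below could be wrapped in a small helper like the following; it is not used in the transcripts as shown.)

```r
# Optional helper (not part of the original runs): current R process RSS in MB.
rss_mb <- function() as.numeric(ps::ps_memory_info()["rss"]) / 1024^2
```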
# In R via data.frame
Sys.getpid()
# [1] 20860
file.info("quux.csv")["size"] / (1024^2) # MBs
# size
# quux.csv 299.3079
mem1 <- ps::ps_memory_info()["rss"]
dat <- read.csv("quux.csv")
mem2 <- ps::ps_memory_info()["rss"]
arrow::write_parquet(dat, "quux1.pq")
mem3 <- ps::ps_memory_info()["rss"]
c(mem1, mem2, mem3, diff = mem3 - mem1) / (1024^2)
# rss rss rss diff.rss
# 57.70703 1218.55859 1548.54688 1490.83984
This indicates R is 1490MB larger after reading in the full data. (FYI, data.table::fread instead of read.csv results in only 408MB of memory gain under the same austere conditions. I'm not trying to optimize this part, though.)
(FYI, these numbers vary for me from run-to-run and are likely to be different based on other factors outside the scope of this answer. My laptop has 64GB of RAM, so it might not be comparable to exactly what you see.)
# DuckDB, read from CSV, write to parquet
Sys.getpid()
# [1] 32485
mem1 <- ps::ps_memory_info()["rss"]
con <- DBI::dbConnect(duckdb::duckdb(), dbdir = ":memory:")
DBI::dbExecute(con, "copy (select * from read_csv_auto('quux.csv')) to 'quux2.pq' (format parquet)")
# [1] 1000207
mem2 <- ps::ps_memory_info()["rss"]
c(mem1, mem2, diff=mem2 - mem1) / (1024^2)
# rss rss diff.rss
# 63.23828 86.35938 23.12109
showing only 23MB in this process.
Comparing the resulting files.
file.info(list.files(pattern = "quux.*"))["size"] / (1024^2)
# size
# quux.csv 299.30786
# quux1.pq 51.69008
# quux2.pq 66.84857
The larger file is due to the differences in class noted below. My guess is that if we force some of the character columns to be logical, then its file size might be reduced.
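As an illustration only (hypothetical output file; the `* replace (...)` star modifier and the cast are my assumptions about duckdb's SQL, not something I re-ran against this data), forcing one of those all-null columns could be done inside the copy's select:

```r
# Sketch: V5 is one of the all-null columns typed as character by duckdb (see the
# class table above); casting it to boolean on the way out may shrink the file.
DBI::dbExecute(con, "
  copy (
    select * replace (cast(V5 as boolean) as V5)
    from read_csv_auto('quux.csv')
  ) to 'quux4.pq' (format parquet)
")
```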
A little more in-depth look at the contents:
ds1 <- arrow::open_dataset("quux1.pq")
ds2 <- arrow::open_dataset("quux2.pq")
identical(names(ds1), names(ds2))
# [1] TRUE
data.frame(
ds1 = sapply(head(ds1, 1), function(z) class(z)[1]),
ds2 = sapply(head(ds2, 1), function(z) class(z)[1])
)
# ds1 ds2
# V1 character character
# V2 integer integer
# V3 character character
# V4 integer integer
# V5 logical character
# V6 integer integer
# V7 character POSIXct
# V8 logical character
# V9 numeric numeric
# V10 numeric numeric
# V11 numeric integer
# V12 integer integer
# V13 integer integer
# V14 integer integer
# V15 numeric numeric
# V16 integer integer
# V17 integer integer
# V18 numeric numeric
# V19 numeric numeric
# V20 logical character
# V21 numeric numeric
# V22 numeric numeric
# V23 numeric numeric
# V24 integer integer
# V25 logical character
# V26 integer integer
# V27 integer integer
# V28 integer integer
# V29 integer integer
# V30 logical character
# V31 logical character
# V32 numeric numeric
# V33 logical character
# V34 logical character
# V35 logical character
# V36 logical character
# V37 logical character
# V38 logical character
# V39 character POSIXct
# V40 logical character
# V41 logical character
# V42 numeric integer
# V43 logical character
# V44 logical character
# V45 logical character
# V46 logical character
# V47 numeric numeric
# V48 logical character
# V49 logical character
# V50 logical character
# V51 logical character
# V52 logical character
# V53 logical character
# V54 logical character
# V55 logical character
# V56 logical character
# V57 logical character
Some interesting things to deduce from this:
- two fields are timestamps, and the duckdb method correctly identified them, parsed them, and stored them as numeric timestamps; since I didn't tell R the column classes explicitly, it defaulted to `character` for them;
- all of the columns that are `logical` in `ds1` and `character` in `ds2` are all null (sorry, it was the data I had); the fact that they are different classes indicates that `duckdb` defaults to string-like nulls instead of "bit", which may or may not be a factor for you;
- only two columns were classified as `numeric`-vs-`integer`; `V11` was truly integer, it's fine; the second one, `V42`, shows that the heuristic used for differentiating between `numeric` and `integer` missed something. The first row of `V42` that contained any fractional component was on row 37159.
# Fixing data discrepancies
Column V42 indicates that we need to be very cognizant of what is going into and out of that parquet generator. My guess is that it's in the "CSV Import" step, so looking at duckdb's CSV Loading documentation suggests the need to change SAMPLE_SIZE. While relatively inefficient, I'll use -1, indicating that it needs to look at all values in a column to determine its class. Slower, yes, but also safer.
Validation of this assumption:
> str(DBI::dbGetQuery(con, "select * from read_csv_auto('quux.csv') limit 5")[c("V11","V42")])
'data.frame': 5 obs. of 2 variables:
$ V11: int 4407 4408 4408 4407 4408
$ V42: int 26 25 23 21 3
> str(DBI::dbGetQuery(con, "select * from read_csv_auto('quux.csv', SAMPLE_SIZE=-1) limit 5")[c("V11","V42")])
'data.frame': 5 obs. of 2 variables:
$ V11: int 4407 4408 4408 4407 4408
$ V42: num 26 25 23 21 3
Indeed, V11 is still good, and V42 changes from int to num.
After rerunning with this new parameter,
DBI::dbExecute(con, "copy (select * from read_csv_auto('quux.csv', SAMPLE_SIZE=-1)) to 'quux3.pq' (format parquet)")
offline validation confirmed that all values are correct.
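Finally, since the original question starts from a local SQLite database rather than a CSV file: a possible variant (an assumption I have not run here) is to let duckdb read the SQLite table directly through its sqlite extension and copy it to parquet the same way, still without pulling the data into R:

```r
# Sketch, assuming duckdb's sqlite extension can be installed in your environment.
# The database path and table name come from the question; the output file is hypothetical.
con <- DBI::dbConnect(duckdb::duckdb(), dbdir = ":memory:")
DBI::dbExecute(con, "INSTALL sqlite")
DBI::dbExecute(con, "LOAD sqlite")
DBI::dbExecute(con, "
  copy (select * from sqlite_scan('D:/tmp/db_tmp_sqlite.db', 'mytable'))
  to 'mytable.pq' (format parquet)
")
```

Recent duckdb releases also accept a `PARTITION_BY` option in `COPY ... TO`, which may cover the split-by-key part of the question; I have not verified that against the versions involved here.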