英文:
Str_extract alternative for dplyr/sql query
问题
我尝试通过使用str_extract
或等效的stri_extract_all_regex
函数与stringi
包来更改大型数据框中的列。问题是,我需要保留的一列中包含1000多个字的条目,所以如果我使用已收集的数据框,代码运行得非常慢(大约5分钟的查询时间)。由于这些函数无法直接转换为SQL,如果在表格未收集的情况下运行代码,会产生错误,迫使我进行收集,这需要很长时间。我想知道是否有一种替代函数可以用于提取子字符串(基于正则表达式模式),并且与SQL直接兼容(允许我保持数据未收集以减少运行时间)。
library(DBI)
library(dplyr)
library(dbplyr)
library(dfcr)
library(stringr)
library(lubridate)
library(timeDate)
library(plotly)
library(openxlsx)
library(bit64)
library(cld3)
library(stringi)
doConn("odbc")
sc <- dbConnect(odbc::odbc(),"Impala")
options(scipen = 999)
currentdate <- Sys.Date()
DF1 <- sc %>% tbl(tblsan("rawframe1")) %>%
filter(between (as.Date(createtime), as.Date("2022-11-01"), as.Date(currentdate))) %>%
mutate(id = as.character(id),
userid = as.character(userid)) %>%
select(id, userid, body, createtime)
DF2 <- DF1 %>%
left_join(., sc %>% tbl(tblsan("rawframe2")) %>%
filter(as_of_date == "2023-07-24")) %>%
/*IDEALLY NO COLLECT HERE*/
mutate(user_id = stri_extract_all_regex(raw_user_key,"(?<=:)([0-9]*)")) %>%
select(user_id,name,login) %>%
distinct(),
by = c('userid' = 'user_id'))
数据框的示例条目:
DF1
ID | USERID | BODY | CREATETIME |
---|---|---|---|
1234 | 12345567 | 文本段落 | 2022-11-05 11:05 |
DF2
NAME | LOGIN | RAW_USER_KEY |
---|---|---|
JOHN,DOE | TTT123 | MMDHTT:12345567 |
在我的代码中,我尝试使用stri_extract_all_regex
从冒号后提取所有内容。再次强调的目标是不收集左连接的数据框(rawframe2
),因为接下来将不得不收集df1
,这将花费很长时间。
英文:
I'm trying to mutate a column in a large dataframe by using str_extract or equivelently stri_extract_all_regex with the stringi package. Problem is one column I need to keep has entries with 1000+ words in them so the code is ridiculously slow if I work with a collected frame (~5 minute queries). Since those functions are not translatable to sql directly, running the code while the table is uncollected spits out errors forcing me to collect which takes forever. I want to know if there is an alternative function to use to extract substrings (based on regex patterns) that is directly compatible with sql (allowing me to keep the data uncollected reducing running time)
library(DBI)
library(dplyr)
library(dbplyr)
library(dfcr)
library(stringr)
library(lubridate)
library(timeDate)
library(plotly)
library(openxlsx)
library(bit64)
library(cld3)
library(stringi)
doConn("odbc")
sc <- dbConnect(odbc::odbc(),"Impala")
options(scipen = 999)
currentdate <- Sys.Date()
DF1 <- sc %>% tbl(tblsan("rawframe1")) %>%
filter(between (as.Date(createtime), as.Date("2022-11-01"), as.Date(currentdate))) %>%
mutate(id = as.character(id),
userid = as.character(userid)) %>%
select(id, userid, body, createtime)
DF2 <- DF1 %>%
left_join(., sc %>% tbl(tblsan("rawframe2")) %>%
filter(as_of_date == "2023-07-24") %>%
collect() %>% /*IDEALLY NO COLLECT HERE*/
mutate(user_id = stri_extract_all_regex(raw_user_key,"(?<=:)([0-9]*)")) %>%
select(user_id,name,login) %>%
distinct(),
by = c('userid' = 'user_id'))
Example Entries of the DataFrames
DF1
ID | USERID | BODY | CREATETIME |
---|---|---|---|
1234 | 12345567 | paragraphs of text | 2022-11-05 11:05 |
RAWTABLE2
NAME | LOGIN | RAW_USER_KEY |
---|---|---|
JOHN,DOE | TTT123 | MMDHTT:12345567 |
#in my code, im trying to extract everything after the colon with stri_extract_all_regex.
Again, goal is to not collect the left_joined df (rawframe2) since ill then have to collect df1 which will take forever
答案1
得分: 1
免责声明:我不太熟悉Impala的SQL方言,也没有访问实例进行全面测试。
对于dbplyr
无法自动翻译的任何函数,您可以使用自定义SQL来编写。根据Impala文档,似乎REGEXP_EXTRACT
是最接近的内置函数。
DF2 <- DF1 %>%
left_join(., sc %>%
tbl(tblsan("rawframe2")) %>%
filter(as_of_date == "2023-07-24") %>%
mutate(user_id = sql("REGEXP_EXTRACT(raw_user_key,'(?<=:)([0-9]*)', 0)")) %>%
select(user_id, name, login) %>%
distinct(),
by = c('userid' = 'user_id'))
英文:
Disclaimer: I'm not specifically well versed in Impala's SQL dialect nor do I have access to an instance to fully test this.
Any function for which dbplyr
doesn't offer an automatic translation, you can write in custom SQL. From the Impala documentation, it seems like REGEXP_EXTRACT
would be the closest built-in function.
DF2 <- DF1 %>%
left_join(., sc %>% tbl(tblsan("rawframe2")) %>%
filter(as_of_date == "2023-07-24") %>%
mutate(user_id = sql("REGEXP_EXTRACT(raw_user_key,'(?<=:)([0-9]*)', 0)")) %>%
select(user_id,name,login) %>%
distinct(),
by = c('userid' = 'user_id'))
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论