Str_extract是dplyr/sql查询的替代方法。

huangapple go评论103阅读模式
英文:

Str_extract alternative for dplyr/sql query

问题

我尝试通过使用str_extract或等效的stri_extract_all_regex函数与stringi包来更改大型数据框中的列。问题是,我需要保留的一列中包含1000多个字的条目,所以如果我使用已收集的数据框,代码运行得非常慢(大约5分钟的查询时间)。由于这些函数无法直接转换为SQL,如果在表格未收集的情况下运行代码,会产生错误,迫使我进行收集,这需要很长时间。我想知道是否有一种替代函数可以用于提取子字符串(基于正则表达式模式),并且与SQL直接兼容(允许我保持数据未收集以减少运行时间)。

library(DBI)
library(dplyr)
library(dbplyr)
library(dfcr)
library(stringr)
library(lubridate)
library(timeDate)
library(plotly)
library(openxlsx)
library(bit64)
library(cld3)
library(stringi)

doConn("odbc")
sc <- dbConnect(odbc::odbc(),"Impala")
options(scipen = 999)
currentdate <- Sys.Date()

DF1 <- sc %>% tbl(tblsan("rawframe1")) %>%
  filter(between (as.Date(createtime), as.Date("2022-11-01"), as.Date(currentdate))) %>%
  mutate(id = as.character(id),
         userid = as.character(userid)) %>%
  select(id, userid, body, createtime)
 
DF2 <- DF1 %>%
  left_join(., sc %>% tbl(tblsan("rawframe2")) %>%
              filter(as_of_date == "2023-07-24")) %>%
              /*IDEALLY NO COLLECT HERE*/
              mutate(user_id = stri_extract_all_regex(raw_user_key,"(?<=:)([0-9]*)")) %>%
              select(user_id,name,login) %>%
              distinct(), 
            by = c('userid' = 'user_id'))

数据框的示例条目:

DF1

ID USERID BODY CREATETIME
1234 12345567 文本段落 2022-11-05 11:05

DF2

NAME LOGIN RAW_USER_KEY
JOHN,DOE TTT123 MMDHTT:12345567

在我的代码中,我尝试使用stri_extract_all_regex从冒号后提取所有内容。再次强调的目标是不收集左连接的数据框(rawframe2),因为接下来将不得不收集df1,这将花费很长时间。

英文:

I'm trying to mutate a column in a large dataframe by using str_extract or equivelently stri_extract_all_regex with the stringi package. Problem is one column I need to keep has entries with 1000+ words in them so the code is ridiculously slow if I work with a collected frame (~5 minute queries). Since those functions are not translatable to sql directly, running the code while the table is uncollected spits out errors forcing me to collect which takes forever. I want to know if there is an alternative function to use to extract substrings (based on regex patterns) that is directly compatible with sql (allowing me to keep the data uncollected reducing running time)

library(DBI)
library(dplyr)
library(dbplyr)
library(dfcr)
library(stringr)
library(lubridate)
library(timeDate)
library(plotly)
library(openxlsx)
library(bit64)
library(cld3)
library(stringi)

doConn(&quot;odbc&quot;)
sc &lt;- dbConnect(odbc::odbc(),&quot;Impala&quot;)
options(scipen = 999)
currentdate &lt;- Sys.Date() 

DF1 &lt;- sc %&gt;% tbl(tblsan(&quot;rawframe1&quot;)) %&gt;%
  filter(between (as.Date(createtime), as.Date(&quot;2022-11-01&quot;), as.Date(currentdate))) %&gt;%
  mutate(id = as.character(id),
         userid = as.character(userid)) %&gt;%
  select(id, userid, body, createtime)
  
 
DF2 &lt;- DF1 %&gt;%
  left_join(., sc %&gt;% tbl(tblsan(&quot;rawframe2&quot;)) %&gt;%
              filter(as_of_date == &quot;2023-07-24&quot;) %&gt;%
              collect() %&gt;% /*IDEALLY NO COLLECT HERE*/
              mutate(user_id = stri_extract_all_regex(raw_user_key,&quot;(?&lt;=:)([0-9]*)&quot;)) %&gt;%
              select(user_id,name,login) %&gt;%
              distinct(), 
            by = c(&#39;userid&#39; = &#39;user_id&#39;))

Example Entries of the DataFrames

DF1

ID USERID BODY CREATETIME
1234 12345567 paragraphs of text 2022-11-05 11:05

RAWTABLE2

NAME LOGIN RAW_USER_KEY
JOHN,DOE TTT123 MMDHTT:12345567

#in my code, im trying to extract everything after the colon with stri_extract_all_regex.
Again, goal is to not collect the left_joined df (rawframe2) since ill then have to collect df1 which will take forever

答案1

得分: 1

免责声明:我不太熟悉Impala的SQL方言,也没有访问实例进行全面测试。

对于dbplyr无法自动翻译的任何函数,您可以使用自定义SQL来编写。根据Impala文档,似乎REGEXP_EXTRACT是最接近的内置函数。

DF2 <- DF1 %>%
  left_join(., sc %>%
              tbl(tblsan("rawframe2")) %>%
              filter(as_of_date == "2023-07-24") %>%
              mutate(user_id = sql("REGEXP_EXTRACT(raw_user_key,'(?<=:)([0-9]*)', 0)")) %>%
              select(user_id, name, login) %>%
              distinct(), 
            by = c('userid' = 'user_id'))
英文:

Disclaimer: I'm not specifically well versed in Impala's SQL dialect nor do I have access to an instance to fully test this.

Any function for which dbplyr doesn't offer an automatic translation, you can write in custom SQL. From the Impala documentation, it seems like REGEXP_EXTRACT would be the closest built-in function.

DF2 &lt;- DF1 %&gt;%
  left_join(., sc %&gt;% tbl(tblsan(&quot;rawframe2&quot;)) %&gt;%
              filter(as_of_date == &quot;2023-07-24&quot;) %&gt;%
              mutate(user_id = sql(&quot;REGEXP_EXTRACT(raw_user_key,&#39;(?&lt;=:)([0-9]*)&#39;, 0)&quot;)) %&gt;%
              select(user_id,name,login) %&gt;%
              distinct(), 
            by = c(&#39;userid&#39; = &#39;user_id&#39;))

huangapple
  • 本文由 发表于 2023年7月27日 21:41:07
  • 转载请务必保留本文链接:https://go.coder-hub.com/76780363.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定