2023年7月27日 21:41:07go评论128阅读模式

英文:

Str_extract alternative for dplyr/sql query

问题

我尝试通过使用str_extract或等效的stri_extract_all_regex函数与stringi包来更改大型数据框中的列。问题是，我需要保留的一列中包含1000多个字的条目，所以如果我使用已收集的数据框，代码运行得非常慢（大约5分钟的查询时间）。由于这些函数无法直接转换为SQL，如果在表格未收集的情况下运行代码，会产生错误，迫使我进行收集，这需要很长时间。我想知道是否有一种替代函数可以用于提取子字符串（基于正则表达式模式），并且与SQL直接兼容（允许我保持数据未收集以减少运行时间）。

library(DBI)
library(dplyr)
library(dbplyr)
library(dfcr)
library(stringr)
library(lubridate)
library(timeDate)
library(plotly)
library(openxlsx)
library(bit64)
library(cld3)
library(stringi)
doConn("odbc")
sc <- dbConnect(odbc::odbc(),"Impala")
options(scipen = 999)
currentdate <- Sys.Date()
DF1 <- sc %>% tbl(tblsan("rawframe1")) %>%
  filter(between (as.Date(createtime), as.Date("2022-11-01"), as.Date(currentdate))) %>%
  mutate(id = as.character(id),
         userid = as.character(userid)) %>%
  select(id, userid, body, createtime)
 
DF2 <- DF1 %>%
  left_join(., sc %>% tbl(tblsan("rawframe2")) %>%
              filter(as_of_date == "2023-07-24")) %>%
              /*IDEALLY NO COLLECT HERE*/
              mutate(user_id = stri_extract_all_regex(raw_user_key,"(?<=:)([0-9]*)")) %>%
              select(user_id,name,login) %>%
              distinct(), 
            by = c('userid' = 'user_id'))

数据框的示例条目：

DF1

ID	USERID	BODY	CREATETIME
1234	12345567	文本段落	2022-11-05 11:05

DF2

NAME	LOGIN	RAW_USER_KEY
JOHN,DOE	TTT123	MMDHTT:12345567

在我的代码中，我尝试使用stri_extract_all_regex从冒号后提取所有内容。再次强调的目标是不收集左连接的数据框（rawframe2），因为接下来将不得不收集df1，这将花费很长时间。

英文:

I'm trying to mutate a column in a large dataframe by using str_extract or equivelently stri_extract_all_regex with the stringi package. Problem is one column I need to keep has entries with 1000+ words in them so the code is ridiculously slow if I work with a collected frame (~5 minute queries). Since those functions are not translatable to sql directly, running the code while the table is uncollected spits out errors forcing me to collect which takes forever. I want to know if there is an alternative function to use to extract substrings (based on regex patterns) that is directly compatible with sql (allowing me to keep the data uncollected reducing running time)

library(DBI)
library(dplyr)
library(dbplyr)
library(dfcr)
library(stringr)
library(lubridate)
library(timeDate)
library(plotly)
library(openxlsx)
library(bit64)
library(cld3)
library(stringi)
doConn(&quot;odbc&quot;)
sc &lt;- dbConnect(odbc::odbc(),&quot;Impala&quot;)
options(scipen = 999)
currentdate &lt;- Sys.Date() 
DF1 &lt;- sc %&gt;% tbl(tblsan(&quot;rawframe1&quot;)) %&gt;%
  filter(between (as.Date(createtime), as.Date(&quot;2022-11-01&quot;), as.Date(currentdate))) %&gt;%
  mutate(id = as.character(id),
         userid = as.character(userid)) %&gt;%
  select(id, userid, body, createtime)
  
 
DF2 &lt;- DF1 %&gt;%
  left_join(., sc %&gt;% tbl(tblsan(&quot;rawframe2&quot;)) %&gt;%
              filter(as_of_date == &quot;2023-07-24&quot;) %&gt;%
              collect() %&gt;% /*IDEALLY NO COLLECT HERE*/
              mutate(user_id = stri_extract_all_regex(raw_user_key,&quot;(?&lt;=:)([0-9]*)&quot;)) %&gt;%
              select(user_id,name,login) %&gt;%
              distinct(), 
            by = c(&#39;userid&#39; = &#39;user_id&#39;))

Example Entries of the DataFrames

DF1

ID	USERID	BODY	CREATETIME
1234	12345567	paragraphs of text	2022-11-05 11:05

RAWTABLE2

NAME	LOGIN	RAW_USER_KEY
JOHN,DOE	TTT123	MMDHTT:12345567

#in my code, im trying to extract everything after the colon with stri_extract_all_regex.
Again, goal is to not collect the left_joined df (rawframe2) since ill then have to collect df1 which will take forever

答案1

得分: 1

免责声明：我不太熟悉Impala的SQL方言，也没有访问实例进行全面测试。

对于dbplyr无法自动翻译的任何函数，您可以使用自定义SQL来编写。根据Impala文档，似乎REGEXP_EXTRACT是最接近的内置函数。

DF2 <- DF1 %>%
  left_join(., sc %>%
              tbl(tblsan("rawframe2")) %>%
              filter(as_of_date == "2023-07-24") %>%
              mutate(user_id = sql("REGEXP_EXTRACT(raw_user_key,'(?<=:)([0-9]*)', 0)")) %>%
              select(user_id, name, login) %>%
              distinct(), 
            by = c('userid' = 'user_id'))

英文:

Disclaimer: I'm not specifically well versed in Impala's SQL dialect nor do I have access to an instance to fully test this.

Any function for which dbplyr doesn't offer an automatic translation, you can write in custom SQL. From the Impala documentation, it seems like REGEXP_EXTRACT would be the closest built-in function.

DF2 &lt;- DF1 %&gt;%
  left_join(., sc %&gt;% tbl(tblsan(&quot;rawframe2&quot;)) %&gt;%
              filter(as_of_date == &quot;2023-07-24&quot;) %&gt;%
              mutate(user_id = sql(&quot;REGEXP_EXTRACT(raw_user_key,&#39;(?&lt;=:)([0-9]*)&#39;, 0)&quot;)) %&gt;%
              select(user_id,name,login) %&gt;%
              distinct(), 
            by = c(&#39;userid&#39; = &#39;user_id&#39;))

通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库，让每个人都能够通过互相帮助和分享经验来进步。

Str_extract是dplyr/sql查询的替代方法。

问题

答案1

在golang中删除正则表达式的第一个匹配项。

In R, how can I create a function that can take values from columns using dplyr::mutate, but also still take specific strings as values?

Mutate new columns and intercalate them with old ones.

nops_eval在扫描响应中出现错误。

如何在Playwright视觉比较中屏蔽多个定位器？

在C++中，可以使用可变模板参数来检索类型的内部类型。

selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: stale element not found

Creating and opening a URL to log in to Website via Basic Auth with Robot Framework/Selenium (Python)

AG Grid 在上下文菜单中以大文本形式打开

What's the correct way to type hint an empty list as a literal in python?

如何在Highcharts Gantt中更改本地化的星期名称

如何在同一个流中使用多个过滤器和映射函数？

如何使用Map/Set来将代码优化到O(n)？

.NET MAUI Android在GitHub Actions上构建失败，错误代码为1。