Impala Query(dbplyr) Error : Encountered Identifier : expected: (
Question
I'm currently working with an Impala database and I'm running into problems with dbplyr's SQL translation.
This is the first iteration of my code. It works in R if I collect the table beforehand, which I want to avoid because it takes forever:
DF2_V1 <- DF1 %>%
  filter(indicator != "N") %>%
  group_by(id) %>%
  filter(!("Y" %in% indicator) | (indicator == "Y"),
         !("ANALYSIS" %in% indicator) | (indicator != "RECOMMENDED")) %>%
  filter(time1 == min(time1)) %>%
  ungroup() %>%
  mutate(time_diff = time1 - time2) %>%
  select(id, indicator, time1, time2, time_diff) %>%
  show_query() %>%
  collect()
Essentially, the goal of this code is to take DF1:
ID | INDICATOR | TIME1 | TIME2 |
---|---|---|---|
1 | Y | ... | ..... |
1 | N | ... | ..... |
1 | RECOMMEND | ... | ..... |
2 | RECOMMEND | ... | ..... |
2 | ANALYSIS | ... | ..... |
And perform the following logic: if a given id has any Y indicator, remove the others and keep the earliest occurrence of that indicator. If Y is not present, favor the indicator ANALYSIS and take its first occurrence (earliest time); if ANALYSIS is not present either, take RECOMMENDED. This code works fine in R (when DF1 is collected beforehand) and does what I want. However, when DF1 is left uncollected and the pipeline runs as a SQL query, I get the following error:
> Error in new_result(connection@ptr, statement, immediate) :
> nanodbc/nanodbc.cpp:1412: 00000: [RStudio][ImpalaODBC] (360)
> Syntax error occurred during query execution: [HY000] :
> AnalysisException: Syntax error in line 32:
>
> WHEN ('Y' IN indicator) THEN 'Y'
> ^
> Encountered: IDENTIFIER
> Expected: (
>
> CAUSED BY: Exception: Syntax error
I'm still quite new to database queries and wasn't sure what to make of this, so I tried rewriting the code with raw SQL fragments in dbplyr, making some minor modifications in the hope of clarifying my logic:
DF2_V2 <- DF1 %>%
  filter(indicator != "NULL") %>%
  group_by(id) %>%
  mutate(indicator = case_when(
    sql("'Y' IN indicator") ~ "Y",
    sql("('ANALYSIS' IN indicator) AND (indicator != 'RECOMMENDED')") ~ "ANALYSIS",
    TRUE ~ "RECOMMENDED")) %>%
  filter(time1 == min(time1)) %>%
  mutate(time_diff = time1 - time2) %>%
  select(...) %>%
  collect()
This produced the same error. I also ran the SQL from show_query() directly against the database to check whether the problem was R's connection, but got the same result. I'm not sure whether my code itself is faulty or the translation into SQL is going wrong, but I can't seem to find the problem.
Answer 1
Score: 1
I don't know if it's a bug in dbplyr, but forcing the parens should work:
DF2_V1 <- DF1 %>%
  filter(indicator != "N") %>%
  group_by(id) %>%
  # Wrapping the column in parentheses is meant to make the generated SQL read IN (indicator)
  filter(!("Y" %in% (indicator)) | (indicator == "Y"),
         !("ANALYSIS" %in% (indicator)) | (indicator != "RECOMMENDED")) %>%
  filter(time1 == min(time1)) %>%
  ungroup() %>%
  mutate(time_diff = time1 - time2) %>%
  select(id, indicator, time1, time2, time_diff) %>%
  show_query() %>%
  collect()
(This might be specific to the Impala backend/driver; I'm not sure.)
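If the parenthesized `%in%` still does not translate cleanly, one possible alternative is to avoid `%in%` on a column altogether and rank the indicators instead. The sketch below is only an illustration: `DF1_local`, `priority`, and `DF2_alt` are hypothetical names, the toy values are invented, and it assumes the only indicators left after dropping "N" are Y, ANALYSIS, and RECOMMENDED. It runs on a local data frame; on a lazy table the grouped `min()` calls would normally become window functions, but that has not been checked against Impala here.

```r
library(dplyr)

# Hypothetical toy data standing in for DF1 (values invented for illustration).
DF1_local <- data.frame(
  id        = c(1, 1, 1, 2, 2, 2),
  indicator = c("Y", "N", "RECOMMENDED", "RECOMMENDED", "ANALYSIS", "ANALYSIS"),
  time1     = c(5, 1, 2, 1, 4, 3),
  time2     = c(1, 1, 1, 1, 1, 1),
  stringsAsFactors = FALSE
)

DF2_alt <- DF1_local %>%
  filter(indicator != "N") %>%
  # Lower number = preferred indicator (assumed ordering: Y, ANALYSIS, anything else).
  mutate(priority = case_when(
    indicator == "Y" ~ 1L,
    indicator == "ANALYSIS" ~ 2L,
    TRUE ~ 3L
  )) %>%
  group_by(id) %>%
  # Keep only the rows carrying the best indicator available for each id ...
  filter(priority == min(priority, na.rm = TRUE)) %>%
  # ... and, among those, the earliest one.
  filter(time1 == min(time1, na.rm = TRUE)) %>%
  ungroup() %>%
  mutate(time_diff = time1 - time2) %>%
  select(id, indicator, time1, time2, time_diff)
```

Because the comparison is a plain equality against a grouped minimum, no `IN` clause is generated at all, which sidesteps the syntax error rather than fixing the translation. Calling `show_query()` on the uncollected version before `collect()` is a cheap way to confirm what SQL is actually sent.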