Impala查询(dbplyr)错误:遇到标识符:预期:(

huangapple go评论97阅读模式
英文:

Impala Query(dbplyr) Error : Encountered Identifier : expected: (

问题

我目前正在处理一项野牛数据库(Impala DB)的工作,我在使用dbplyr的SQL翻译时遇到了一些问题。

这是我的代码的第一次迭代,如果在之前收集(collect)表格的话,它可以在R中运行(这并不是我想要的,因为这需要很长时间):

DF2_V1 <- DF1 %>%
  filter(indicator != "N") %>%
  group_by(id) %>%
  filter(!("Y" %in% indicator) | (indicator == "Y"),
         !("ANALYSIS" %in% indicator) | (indicator != "RECOMMENDED")) %>%
  filter(time1 == min(time1)) %>%
  ungroup() %>%
  mutate(time_diff = time1 - time2) %>%
  select(id, indicator, time1, time2, time_diff) %>% 
  show_query() %>%
  collect()

基本上,这段代码的目标是从DF1表格中进行以下逻辑操作:

如果给定的id具有任何Y指示符,则删除其他指示符并保留该指示符的最早迭代。如果没有Y出现,则我们更喜欢指示符ANALYSIS,并选择第一个迭代(最早的时间),如果不存在ANALYSIS,则选择RECOMMENDED。这段代码在R中运行良好(在之前收集DF1的情况下),并且做我想要的事情。但是当DF1表格没有被收集(uncollected)并且我们执行SQL查询时,我遇到了以下错误:

Error in new_result(connection@ptr, statement, immediate) :   
nanodbc/nanodbc.cpp:1412: 00000: [RStudio][ImpalaODBC] (360)   
Syntax error occurred during query execution: [HY000] :   
AnalysisException: Syntax error in line 32:

WHEN ('Y' IN indicator) THEN 'Y'  
                  ^  
Encountered: IDENTIFIER  
Expected: (

CAUSED BY: Exception: Syntax error

我对数据库查询仍然很陌生,不确定如何解决这个问题,所以我尝试在R中使用dbplyr中的SQL脚本进行代码重写,希望能够澄清我的逻辑:

DF2_V2 <- DF1 %>%
  filter(indicator != "NULL") %>%
  group_by(id) %>%
  mutate(indicator = case_when(
    sql("'Y' IN indicator") ~ "Y",
    sql("('ANALYSIS' IN indicator) AND (indicator != 'RECOMMENDED')") ~ "ANALYSIS",
    TRUE ~ "RECOMMENDED")) %>%
  filter(time1 == min(time1)) %>%
  mutate(time_diff = time1 - time2) %>%
  select(...) %>%
  collect()

这也导致了相同的错误。我还尝试直接在数据库中使用show_query翻译来查看是否是R的连接问题,但最终得出了相同的结论。不确定是我的代码本身有问题还是翻译成SQL时出了问题,但我似乎找不到问题所在。

英文:

I'm currently working on an impala db and I'm having some problems with dbplyr's SQL translation.

This is the first iteration of my code which works in R if I collect the table beforehand (which is not something I want to do as it takes forever):

DF2_V1 &lt;- DF1 %&gt;%
  filter(indicator != &quot;N&quot;) %&gt;%
  group_by(id) %&gt;%
  filter(!(&quot;Y&quot; %in% indicator) | (indicator == &quot;Y&quot;),
         !(&quot;ANALYSIS&quot; %in% indicator) | (indicator != &quot;RECOMMENDED&quot;)) %&gt;%
  filter(time1 == min(time1)) %&gt;%
  ungroup() %&gt;%
  mutate(time_diff = time1 - time2) %&gt;%
  select(id,indicator,time1,time2,time_diff %&gt;% show_query() %&gt;% collect()

Essentially the goal of this code is to take DF1 ;

ID INDICATOR TIME1 TIME2
1 Y ... .....
1 N ... .....
1 RECOMMEND ... .....
2 RECOMMEND ... .....
2 ANALYSIS ... .....

And perform the following logic: If a given id has any Y indicator remove the others and keep the earliest iteration of that indicator. If Y is not present, we favor the indicator ANALYSIS instead and take the first iteration (earliest time), if not we take RECOMMENDED. This code works fine in R (when collecting DF1 beforehand) and does what I want however when the DF1 table is uncollected and we're performing a SQL query I get the following error:

> Error in new_result(connection@ptr, statement, immediate) :
> nanodbc/nanodbc.cpp:1412: 00000: [RStudio][ImpalaODBC] (360)
> Syntax error occurred during query execution: [HY000] :
> AnalysisException: Syntax error in line 32:
>
> WHEN ('Y' IN indicator) THEN 'Y'
> ^
> Encountered: IDENTIFIER
> Expected: (
>
> CAUSED BY: Exception: Syntax error

I'm still quite new to db queries and I was not sure what to make of this so I tried rewriting the code in R using SQL script in dbplyr which made some minor modifications hoping to clarify my logic:

DF2_V2 &lt;- DF1 %&gt;%
  filter(indicator != &quot;NULL&quot;) %&gt;%
  group_by(id) %&gt;%
  mutate(indicator = case_when(
    sql(&quot;&#39;Y&#39; IN indicator&quot;) ~ &quot;Y&quot;,
    sql(&quot;(&#39;ANALYSIS&#39; IN indicator) AND (indicator != &#39;RECOMMENDED&#39;)&quot;) ~ &quot;ANALYSIS&quot;,
    TRUE ~ &quot;RECOMMENDED&quot;)) %&gt;%
  filter(time1 == min(time1)) %&gt;%
  mutate(time_diff = time1 - time2) %&gt;% select(...) %&gt;% collect()

This presented with the same error. I also tried my queries directly in the db using the show_query translation to see if it was a problem with R's connection but inevitably came to the same conclusion. Not sure if my code itself is faulty or the translation into SQL is being messed up but I cant seem to find the problem.

答案1

得分: 1

我不确定这是否是dbplyr中的一个错误,但强制使用括号应该有效:

DF2_V1 <- DF1 %>%
  filter(indicator != "N") %>%
  group_by(id) %>%
  filter(!("Y" %in% indicator) | (indicator == "Y"),
         !("ANALYSIS" %in% indicator) | (indicator != "RECOMMENDED")) %>%
  filter(time1 == min(time1)) %>%
  ungroup() %>%
  mutate(time_diff = time1 - time2) %>%
  select(id, indicator, time1, time2, time_diff %>% show_query() %>% collect()

(这可能是特定于Impala后端/驱动程序,不确定。)

英文:

I don't know if it's a bug in dbplyr, but forcing the parens should work:

DF2_V1 &lt;- DF1 %&gt;%
  filter(indicator != &quot;N&quot;) %&gt;%
  group_by(id) %&gt;%
  filter(!(&quot;Y&quot; %in% (indicator)) | (indicator == &quot;Y&quot;),
         !(&quot;ANALYSIS&quot; %in% (indicator)) | (indicator != &quot;RECOMMENDED&quot;)) %&gt;%
  filter(time1 == min(time1)) %&gt;%
  ungroup() %&gt;%
  mutate(time_diff = time1 - time2) %&gt;%
  select(id,indicator,time1,time2,time_diff %&gt;% show_query() %&gt;% collect()

(This might be specific to the impala backend/driver, not sure.)

huangapple
  • 本文由 发表于 2023年8月5日 00:02:01
  • 转载请务必保留本文链接:https://go.coder-hub.com/76837530.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定