Key Phrase Search in String (在字符串中搜索关键短语)

huangapple go评论73阅读模式
英文:

Key Phrase Search in String

问题

我需要查找包含特定短语的 ID 数量在一个长字符串中。通常我会像这样使用 SQL:

select
  count(distinct id)
from
  table
where
  lower(string) like '%phrase 1%'
  or lower(string) like '%phrase 2%'
  or lower(string) like '%phrase 3%'

对于较小的数据集,这种查询不会引起问题,但在这种情况下,我的表格大小和字符串字段的长度肯定会耗费很长时间。

因为我正在使用 Databricks 笔记本,我在考虑使用 Python 或 R 来查询和筛选这个数据集以获得我想要的计数。我对这个过程的经验较少,所以任何建议都将不胜感激。

英文:

I am needing to find a count of IDs that contain certain phrases in a long string. Normally I would use SQL like this:

select
  count(distinct id)
from
  table
where
  lower(string) like '%phrase 1%'
  or lower(string) like '%phrase 2%'
  or lower(string) like '%phrase 3%'

For smaller datasets, a query like this does not cause me any issues, but the size of my table and length of the sting field in this instance will certainly take forever.

Since I am using databricks notebooks, I'm thinking that a solution might be to use either python or R to query and filter through this dataset to get the counts I am looking for. I am less experienced with this process, so any tips would be greatly appreciated.

答案1

得分: 1

在R中,我们可以构建类似这样的where子句:

phrases <- c("phrase 1%", "phrase 2%", "phrase 3%")
where <- paste0("(", paste(paste("lower(string) like", sQuote(phrases, FALSE)), collapse = " OR "), ")")
where
# [1] "(lower(string) like 'phrase 1%' OR lower(string) like 'phrase 2%' OR lower(string) like 'phrase 3%')"

然后,你可以相对容易地将它添加到你查询的其余部分中。

你可能考虑使用参数化查询而不是将编程字符串文字硬编码到你的查询中,你可以像这样操作:

where <- paste0("(", paste(rep("lower(string) like ?", length(phrases)), collapse = " OR "), ")")
where
# [1] "(lower(string) like ? OR lower(string) like ? OR lower(string) like ?)"

res <- DBI::dbGetQuery(con,
  paste("select count(distinct id) from table where", where),
  params = as.list(phrases))

虽然不是必需的,但它可以减轻由于 ANSI 字符串问题、编程错误、格式不正确的引用字符串等而导致的意外 SQL 注入风险。

作为标记的?符号是依赖于DBMS的。例如,Postgres 使用 $1$2 等;SQLite 可以使用 ??1/?2:name;SQL Server 使用 ?。在我的使用中,ODBC 连接(在R中连接到Windows时)总是使用 ?,不确定是否根据操作系统、调用语言或其他因素而有所不同。

英文:

In R, we can build the where clause like this:

phrases &lt;- c(&quot;phrase 1%&quot;, &quot;phrase 2%&quot;, &quot;phrase 3%&quot;)
where &lt;- paste0(&quot;(&quot;, paste(paste(&quot;lower(string) like&quot;, sQuote(phrases, FALSE)), collapse = &quot; OR &quot;), &quot;)&quot;)
where
# [1] &quot;(lower(string) like &#39;phrase 1%&#39; OR lower(string) like &#39;phrase 2%&#39; OR lower(string) like &#39;phrase 3%&#39;)&quot;

which can then be added to the rest of your query fairly easily.

You may want to consider parameterized queries instead of hard-coding programmatic string literals into your query, where you might instead do something like:

where &lt;- paste0(&quot;(&quot;, paste(rep(&quot;lower(string) like ?&quot;, length(phrases)), collapse = &quot; OR &quot;), &quot;)&quot;)
where
# [1] &quot;(lower(string) like ? OR lower(string) like ? OR lower(string) like ?)&quot;

res &lt;- DBI::dbGetQuery(con,
  paste(&quot;select count(distinct id) from table where&quot;, where),
  params = as.list(phrases))

While not required, it does mitigate inadvertent SQL-injection (albeit not malevolent) due to ANSI string issues, programming mistakes, malformed quoted strings, etc.

The use of ? as a marker is DBMS-dependent. For instance, Postgres uses $1, $2, ...; SQLite can use ?, ?1/?2, or :name; SQL Server uses ?. In my use, ODBC-connections (when connecting on windows in R) always use ?, not sure if it differs based on the OS or calling language or something else.

huangapple
  • 本文由 发表于 2023年5月10日 23:01:57
  • 转载请务必保留本文链接:https://go.coder-hub.com/76219976.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定