英文:
Key Phrase Search in String
问题
我需要查找包含特定短语的 ID 数量在一个长字符串中。通常我会像这样使用 SQL:
select
count(distinct id)
from
table
where
lower(string) like '%phrase 1%'
or lower(string) like '%phrase 2%'
or lower(string) like '%phrase 3%'
对于较小的数据集,这种查询不会引起问题,但在这种情况下,我的表格大小和字符串字段的长度肯定会耗费很长时间。
因为我正在使用 Databricks 笔记本,我在考虑使用 Python 或 R 来查询和筛选这个数据集以获得我想要的计数。我对这个过程的经验较少,所以任何建议都将不胜感激。
英文:
I am needing to find a count of IDs that contain certain phrases in a long string. Normally I would use SQL like this:
select
count(distinct id)
from
table
where
lower(string) like '%phrase 1%'
or lower(string) like '%phrase 2%'
or lower(string) like '%phrase 3%'
For smaller datasets, a query like this does not cause me any issues, but the size of my table and length of the sting field in this instance will certainly take forever.
Since I am using databricks notebooks, I'm thinking that a solution might be to use either python or R to query and filter through this dataset to get the counts I am looking for. I am less experienced with this process, so any tips would be greatly appreciated.
答案1
得分: 1
在R中,我们可以构建类似这样的where
子句:
phrases <- c("phrase 1%", "phrase 2%", "phrase 3%")
where <- paste0("(", paste(paste("lower(string) like", sQuote(phrases, FALSE)), collapse = " OR "), ")")
where
# [1] "(lower(string) like 'phrase 1%' OR lower(string) like 'phrase 2%' OR lower(string) like 'phrase 3%')"
然后,你可以相对容易地将它添加到你查询的其余部分中。
你可能考虑使用参数化查询而不是将编程字符串文字硬编码到你的查询中,你可以像这样操作:
where <- paste0("(", paste(rep("lower(string) like ?", length(phrases)), collapse = " OR "), ")")
where
# [1] "(lower(string) like ? OR lower(string) like ? OR lower(string) like ?)"
res <- DBI::dbGetQuery(con,
paste("select count(distinct id) from table where", where),
params = as.list(phrases))
虽然不是必需的,但它可以减轻由于 ANSI 字符串问题、编程错误、格式不正确的引用字符串等而导致的意外 SQL 注入风险。
作为标记的?
符号是依赖于DBMS的。例如,Postgres 使用 $1
、$2
等;SQLite 可以使用 ?
、?1
/?2
或 :name
;SQL Server 使用 ?
。在我的使用中,ODBC 连接(在R中连接到Windows时)总是使用 ?
,不确定是否根据操作系统、调用语言或其他因素而有所不同。
英文:
In R, we can build the where
clause like this:
phrases <- c("phrase 1%", "phrase 2%", "phrase 3%")
where <- paste0("(", paste(paste("lower(string) like", sQuote(phrases, FALSE)), collapse = " OR "), ")")
where
# [1] "(lower(string) like 'phrase 1%' OR lower(string) like 'phrase 2%' OR lower(string) like 'phrase 3%')"
which can then be added to the rest of your query fairly easily.
You may want to consider parameterized queries instead of hard-coding programmatic string literals into your query, where you might instead do something like:
where <- paste0("(", paste(rep("lower(string) like ?", length(phrases)), collapse = " OR "), ")")
where
# [1] "(lower(string) like ? OR lower(string) like ? OR lower(string) like ?)"
res <- DBI::dbGetQuery(con,
paste("select count(distinct id) from table where", where),
params = as.list(phrases))
While not required, it does mitigate inadvertent SQL-injection (albeit not malevolent) due to ANSI string issues, programming mistakes, malformed quoted strings, etc.
The use of ?
as a marker is DBMS-dependent. For instance, Postgres uses $1
, $2
, ...; SQLite can use ?
, ?1
/?2
, or :name
; SQL Server uses ?
. In my use, ODBC-connections (when connecting on windows in R) always use ?
, not sure if it differs based on the OS or calling language or something else.
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论