Regex - extract last term between _ and before . from path
Question
This is the regex that I'm currently testing:
[\w\. ]+(?=[\.])
My ultimate goal is to use this regex with regexp_extract in an Impala/Hive query:
regexp_extract(col, '[\w\. ]+(?=[\.])', 1)
However, this doesn't work in Impala.
Example paths to extract from:
D:\mypath\Temp\abs\device\Program1.lua
D:\mypath\Temp\abs\device\SE1_Test-program.lua
D:\mypath\Temp\abs\device\Test_program.lua
D:\mypath\Temp\abs\device\Device_Test_Case-general.lua
The regex I've tested extracts the term I'm looking for, but it's not good enough: for the second, third, and fourth cases I need to extract only the part after the last underscore.
My expectations are:
Program1
Test-program
program
Case-general
Any suggestions? I'm also open to using something other than regexp_extract.
Answer 1
Score: 1
Note that Impala regex does not support lookarounds, so you need a capturing group to get a submatch out of the overall match. Also, if you use the escape character \ in the pattern, make sure it is doubled.
You can use
regexp_extract(col, '([^-_\\\\]+)\\.\\w+$', 1)
See the regex demo.
The regex means:
([^-_\\]+) - Group 1: one or more chars other than -, _ and \
\. - a dot
\w+ - one or more word chars
$ - end of string.
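As a quick sanity check, here is a minimal Python sketch that runs the pattern against the four sample paths. It assumes Python's re module behaves like Impala's regex engine for this pattern (no lookarounds are involved), and it uses the pattern as the engine sees it once the doubled backslashes in the SQL literal are unescaped. The names ANSWER_PATTERN, HYPHEN_VARIANT and extract are illustrative only. Under that assumption, the pattern as written also stops at hyphens, so the second and fourth paths give program and general; the variant without the hyphen in the character class (not part of the answer, just a suggestion) gives the output the question asks for.

```python
import re

paths = [
    r"D:\mypath\Temp\abs\device\Program1.lua",
    r"D:\mypath\Temp\abs\device\SE1_Test-program.lua",
    r"D:\mypath\Temp\abs\device\Test_program.lua",
    r"D:\mypath\Temp\abs\device\Device_Test_Case-general.lua",
]

# The answer's pattern, written as the regex engine sees it
# (the SQL literal doubles every backslash).
ANSWER_PATTERN = re.compile(r"([^-_\\]+)\.\w+$")

# Hypothetical variant: the hyphen is dropped from the negated class,
# so only underscores and backslashes end the captured term.
HYPHEN_VARIANT = re.compile(r"([^_\\]+)\.\w+$")


def extract(pattern, path):
    """Return capture group 1 of the first match, or None if nothing matches."""
    match = pattern.search(path)
    return match.group(1) if match else None


answer_results = [extract(ANSWER_PATTERN, p) for p in paths]
variant_results = [extract(HYPHEN_VARIANT, p) for p in paths]

print(answer_results)   # ['Program1', 'program', 'program', 'general']
print(variant_results)  # ['Program1', 'Test-program', 'program', 'Case-general']
```

If the hyphenated terms should be kept, the corresponding Impala call would presumably be regexp_extract(col, '([^_\\\\]+)\\.\\w+$', 1), which is worth testing on the real data before relying on it.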
Answer 2
Score: 1
\w also matches an underscore, so you can use [a-zA-Z0-9] instead.
Add a dot and a hyphen to the character class, capture the match in group 1, and then match the expected trailing dot.
Note that you don't have to escape dots in a character class.
([a-zA-Z0-9.-]+)[.]
See a regex101 demo
Example using regexp_extract, where the , 1 argument returns the group 1 value:
regexp_extract(col, '([a-zA-Z0-9.-]+)[.]', 1)
If the match should occur only at the end of the string, match the last dot followed by an extension that contains no backslashes or dots:
regexp_extract(col, '([a-zA-Z0-9.-]+)[.][^\\\\.]+$', 1)
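The difference between the two patterns shows up when a directory name contains a dot. The following Python sketch illustrates it, again assuming that Python's re module matches these patterns the same way Impala does and using the patterns as the engine sees them after the SQL literal is unescaped. The path in dotted_dir and the names FIRST_DOT, END_ANCHORED and extract are made up for illustration.

```python
import re

paths = [
    r"D:\mypath\Temp\abs\device\Program1.lua",
    r"D:\mypath\Temp\abs\device\SE1_Test-program.lua",
    r"D:\mypath\Temp\abs\device\Test_program.lua",
    r"D:\mypath\Temp\abs\device\Device_Test_Case-general.lua",
]

# First pattern: capture letters, digits, dots and hyphens up to a dot.
FIRST_DOT = re.compile(r"([a-zA-Z0-9.-]+)[.]")

# End-anchored pattern: the dot must be followed by an extension
# containing no backslashes or dots, then the end of the string.
END_ANCHORED = re.compile(r"([a-zA-Z0-9.-]+)[.][^\\.]+$")


def extract(pattern, path):
    """Return capture group 1 of the first match, or None if nothing matches."""
    match = pattern.search(path)
    return match.group(1) if match else None


first_dot_results = [extract(FIRST_DOT, p) for p in paths]
end_anchored_results = [extract(END_ANCHORED, p) for p in paths]
print(first_dot_results)     # ['Program1', 'Test-program', 'program', 'Case-general']
print(end_anchored_results)  # ['Program1', 'Test-program', 'program', 'Case-general']

# Hypothetical path with a dot inside a directory name.
dotted_dir = r"D:\my.path\device\Test_program.lua"
print(extract(FIRST_DOT, dotted_dir))     # my
print(extract(END_ANCHORED, dotted_dir))  # program
```

Both patterns agree on the sample paths, but only the end-anchored one is safe when a dot can appear earlier in the path.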