February 14, 2023 19:09:35

Regex - extract last term between _ and before . from path

Question

This is the regex that I'm currently testing:

[\w\. ]+(?=[\.])

My ultimate goal is to use this regex with regexp_extract in an Impala/Hive query:

regexp_extract(col, '[\w\. ]+(?=[\.])', 1)

However, this doesn't work in Impala.

Example paths to extract from:

D:\mypath\Temp\abs\device\Program1.lua
D:\mypath\Temp\abs\device\SE1_Test-program.lua
D:\mypath\Temp\abs\device\Test_program.lua
D:\mypath\Temp\abs\device\Device_Test_Case-general.lua

The regex I've tested extracts the term I'm looking for, but it's not good enough: for the second, third, and fourth cases I need to extract only the part after the last underscore.

My expectations are:

Program1
Test-program
program
Case-general

Any suggestions? I'm also open to using something other than regexp_extract.

Answer 1

Score: 1

Note that Impala regex does not support lookarounds, so you need a capturing group to get a submatch out of the overall match. Also, if you use an escaping backslash (\) in the pattern, make sure it is doubled.
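
Both points can be checked outside Impala with a small Python sketch. It assumes Python's re module behaves like Impala's engine for these simple patterns, and it models the SQL string literal by halving each doubled backslash; the variable names are illustrative only.

```python
import re

path = r"D:\mypath\Temp\abs\device\Test_program.lua"

# The question's pattern relies on a lookahead, which the answer says Impala does not support.
with_lookahead = re.search(r"[\w. ]+(?=[.])", path).group(0)

# Same idea with a capturing group: consume the dot, but return only group 1.
with_group = re.search(r"([\w. ]+)[.]", path).group(1)

print(with_lookahead, with_group)  # Test_program Test_program

# Assumed model of the SQL string literal: each doubled backslash becomes one.
sql_literal = r"([^-_\\\\]+)\\.\\w+$"
regex_seen_by_engine = sql_literal.replace("\\\\", "\\")
print(regex_seen_by_engine)  # ([^-_\\]+)\.\w+$
```

The second print shows why the pattern inside the SQL call carries twice as many backslashes as the regex it describes.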

You can use:

regexp_extract(col, '([^-_\\\\]+)\\.\\w+$', 1)

See the regex demo.

The regex means:

([^-_\\]+) - Group 1: one or more chars other than -, _ and \
\. - a dot
\w+ - one or more word chars
$ - end of string.
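
To see what this pattern returns for the four sample paths, here is a minimal Python sketch. The regexp_extract function below is a hypothetical stand-in for the Impala function (returning an empty string on no match is an assumption), and the pattern is written as the engine sees it, with the SQL-literal backslashes already halved. Because the character class excludes the hyphen, the hyphenated names yield only the part after the hyphen; the second pattern, which is not part of the answer, drops the hyphen from the class and appears to produce the output the question asks for.

```python
import re

# The regex as the engine sees it (SQL-literal backslashes already halved).
PATTERN = r"([^-_\\]+)\.\w+$"

def regexp_extract(text, pattern, group):
    """Hypothetical stand-in for Impala's regexp_extract; '' on no match is assumed."""
    m = re.search(pattern, text)
    return m.group(group) if m else ""

paths = [
    r"D:\mypath\Temp\abs\device\Program1.lua",
    r"D:\mypath\Temp\abs\device\SE1_Test-program.lua",
    r"D:\mypath\Temp\abs\device\Test_program.lua",
    r"D:\mypath\Temp\abs\device\Device_Test_Case-general.lua",
]

results = [regexp_extract(p, PATTERN, 1) for p in paths]
print(results)  # ['Program1', 'program', 'program', 'general']

# Variant (an assumption, not from the answer): exclude only _ and \ so hyphens are kept.
keep_hyphen = [regexp_extract(p, r"([^_\\]+)\.\w+$", 1) for p in paths]
print(keep_hyphen)  # ['Program1', 'Test-program', 'program', 'Case-general']
```

In the Impala call, the variant would presumably be written with doubled backslashes, following the escaping rule described above.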

Answer 2

Score: 1

\w also matches an underscore; you can use [a-zA-Z0-9] instead.

Add a dot and a hyphen to the character class, capture that in group 1, and then match the expected trailing dot.

Note that you don't have to escape dots in a character class.

([a-zA-Z0-9.-]+)[.]

See a regex101 demo

Example using regexp_extract, where the final argument 1 returns the group 1 value:

regexp_extract(col, '([a-zA-Z0-9.-]+)[.]', 1)

If the match should occur only at the end of the string, match the last dot and then any characters other than backslashes and dots up to the end:

regexp_extract(col, '([a-zA-Z0-9.-]+)[.][^\\\\.]+$', 1)
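
The difference between the two patterns shows up when a directory name contains a dot. The Python sketch below assumes re.search mirrors Impala's behavior for these patterns; the extract helper and the dotted path are hypothetical and not taken from the question.

```python
import re

# Both patterns as the engine sees them (SQL-literal backslashes already halved).
SIMPLE = r"([a-zA-Z0-9.-]+)[.]"
ANCHORED = r"([a-zA-Z0-9.-]+)[.][^\\.]+$"

def extract(path, pattern):
    """Return group 1 of the first match, or '' when nothing matches (assumed)."""
    m = re.search(pattern, path)
    return m.group(1) if m else ""

sample = r"D:\mypath\Temp\abs\device\Device_Test_Case-general.lua"
print(extract(sample, SIMPLE), extract(sample, ANCHORED))  # Case-general Case-general

# Hypothetical path with a dot in a directory name.
dotted = r"D:\my.path\Temp\Test_program.lua"
print(extract(dotted, SIMPLE))    # my
print(extract(dotted, ANCHORED))  # program
```

The unanchored pattern stops at the first dot it can reach, so the anchored form looks like the safer choice whenever directory names may contain dots.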
