2023年2月7日 03:50:38go评论101阅读模式

英文:

extracting year from string using regexp_extract pyspark

问题

这是我的结果的一部分：
1995
2006
2013
2009
1952
2017
1957
1453&lt;--
1871&lt;--
   &lt;--最后一个标题的部分是空的，也应该是空的

英文:

This is the portion of my result :

Grumpier Old Men (1995)
Death Note: Desu n&#195;&#180;to (2006&#226;€“2007)
Irwin &amp; Fran 2013
9500 Liberty (2009)
Captive Women (1000 Years from Now) (3000 A.D.) (1952)
The Garden of Afflictions 2017
The Naked Truth (1957) (Your Past Is Showing) 
Conquest 1453 (Fetih 1453) (2012)
Commune, La (Paris, 1871) (2000)
1013 Briar Lane

return:

1995
2006
2013
2009
1952
2017
1957
1453&lt;--
1871&lt;--
   &lt;--this part for last title is empty and supposed to be empty too

As you can see from the above,last 2 title is given wrong result.

This is my code:

import pyspark.sql.functions as F
from pyspark.sql.functions import regexp_extract,col
bracket_regexp = &quot;((?&lt;=\()\d{4}(?=[^\(]*$))&quot;
movies_DF=movies_DF.withColumn(&#39;yearOfRelease&#39;, regexp_extract(&quot;title&quot;, bracket_regexp + &quot;|(\d{4}$)&quot;, 0))
movies_DF.display(10000)

I am trying to get the year portion of the title string.

答案1

得分: 1

你可以尝试使用以下正则表达式：r'(?<=$)(\d+)(?=$)'，受到这个出色的答案的启发。

例如：

movies_DF = movies_DF.withColumn('uu', regexp_extract(col("title"), r'(?<=\()(\d+)(?=\))', 1))

+------------------------------------------------------------+----+
|title |uu |
+------------------------------------------------------------+----+
|Grumpier Old Men (1995) |1995|
|Happy Anniversary (1959) |1959|
|Paths (2017) |2017|
|The Three Amigos - Outrageous! (2003) |2003|
|L'obsession de l'or (1906) |1906|
|Babe Ruth Story, The (1948) |1948|
|11'0901 - September 11 (2002) |2002|
|Blood Trails (2006) |2006|
|Return to the 36th Chamber (Shao Lin da peng da shi) (1980) |1980|
|Off and Running (2009) |2009|
+------------------------------------------------------------+----+

英文:

You can try using the following regex: r'(?<=$)(\d+)(?=$)', which is inspired by this excellent answer.

For example:

movies_DF = movies_DF.withColumn(&#39;uu&#39;, regexp_extract(col(&quot;title&quot;), r&#39;(?&lt;=\()(\d+)(?=\))&#39;,1))
 
+------------------------------------------------------------+----+
|title                                                       |uu  |
+------------------------------------------------------------+----+
|Grumpier Old Men (1995)                                     |1995|
|Happy Anniversary (1959)                                    |1959|
|Paths (2017)                                                |2017|
|The Three Amigos - Outrageous! (2003)                       |2003|
|L&#39;obsession de l&#39;or (1906)                                  |1906|
|Babe Ruth Story, The (1948)                                 |1948|
|11&#39;0901 - September 11 (2002)                               |2002|
|Blood Trails (2006)                                         |2006|
|Return to the 36th Chamber (Shao Lin da peng da shi) (1980) |1980|
|Off and Running (2009)                                      |2009|
+------------------------------------------------------------+----+

答案2

得分: 1

以下是已翻译的内容：

实际上，以下的正则表达式模式似乎在起作用：

(?<=[( ])\d{4}(?=\S*\)|$)

这里是一个工作中的正则表达式演示。

更新后的PySpark代码：

bracket_regexp = "((?<=[( ])\d{4}(?=\S*\)|$))"
movies_DF = movies_DF.withColumn('yearOfRelease', regexp_extract("title", bracket_regexp + "|(\d{4}$)", 0))
movies_DF.display(10000)

这个正则表达式模式的作用是匹配：

(?<=[( ]) 断言之前的内容是(或空格
\d{4} 匹配一个4位数的年份
(?=\S*\)|$) 断言之后跟随)，可能前面有非空白字符，或者跟随字符串的末尾。

英文:

Empirically, the following regex pattern seems to be working:

(?&lt;=[( ])\d{4}(?=\S*\)|$)

Here is a working regex demo.

Updated PySpark code:

bracket_regexp = &quot;((?&lt;=[( ])\d{4}(?=\S*\)|$))&quot;
movies_DF = movies_DF.withColumn(&#39;yearOfRelease&#39;, regexp_extract(&quot;title&quot;, bracket_regexp + &quot;|(\d{4}$)&quot;, 0))
movies_DF.display(10000)

The regex pattern works by matching:

(?<=[( ]) assert that what precedes is ( or a space
\d{4} match a 4 digit year
(?=\S*\)|$) assert that ), possibly prefaced by non whitespace, follows
OR the end of the string follows

答案3

得分: 0

你的正则表达式只能用于第一行。$\d{4}$ 尝试匹配一个(，4位数字和一个)。对于第一行的(1995) 是正确的。其他行没有包含这种模式。

在你的情况下，我们可以使用回顾后顾和前瞻模式来检测括号内的日期。(?<=$) 表示前面是一个开括号。(?=–|(â€“)|$) 表示后面是一个闭括号，或者â€“或–，这是原始字符被错误编码的情况。一旦你涵盖了括号之间的日期，你可以涵盖没有括号的字符串末尾的日期：\d{4}$。

import pyspark.sql.functions as F
bracket_regexp = "((?<=\()\d{4}(?=–|(&#226;€“)|\)))"
movies_DF\
    .withColumn('uu', regexp_extract("title", bracket_regex + "|\d{4}$", 0))\
    .show(truncate=False)

+------------------------------------------------------+-------------+
|title                                                 |yearOfRelease|
+------------------------------------------------------+-------------+
|Grumpier Old Men (1995)                               |1995         |
|Death Note: Desu n&#195;&#180;to (2006&#226;€“2007)                  |2006         |
|Irwin &amp; Fran 2013                                     |2013         |
|9500 Liberty (2009)                                   |2009         |
|test 1234 test 4567                                   |4567         |
|Captive Women (1000 Years from Now) (3000 A.D.) (1952)|1952         |
|The Garden of Afflictions 2017                        |2017         |
|The Naked Truth (1957) (Your Past Is Showing)         |1957         |
|Conquest 1453 (Fetih 1453) (2012)                     |2012         |
|Commune, La (Paris, 1871) (2000)                      |2000         |
|1013 Briar Lane                                       |             |
+------------------------------------------------------+-------------+

另外，在将正则表达式传递给Spark函数时，不需要在字符串前加上r前缀。

英文:

Your regex can only work for the first line. $\d{4}$ tries to match a (, 4 digits and a ). For the first line you have (1995) which is alright. The other lines do not contain that pattern.

In your situation, we can use lookbehind and lookahead patterns to detect dates within brackets. (?<=$) means an open bracket before. (?=–|(â€“)|$) means a closing bracket after, or â€“ or – which is the original character that was misencoded. Once you have covered the date in between brackets, you can cover dates that are at the end of the string without brackets: \d{4}$.

import pyspark.sql.functions as F
bracket_regexp = &quot;((?&lt;=\()\d{4}(?=–|(&#226;€“)|\)))&quot;
movies_DF\
    .withColumn(&#39;uu&#39;, regexp_extract(&quot;title&quot;, bracket_regex + &quot;|(\d{4}$)&quot;, 0))\
    .show(truncate=False)

+------------------------------------------------------+-------------+
|title                                                 |yearOfRelease|
+------------------------------------------------------+-------------+
|Grumpier Old Men (1995)                               |1995         |
|Death Note: Desu n&#195;&#180;to (2006&#226;€“2007)                  |2006         |
|Irwin &amp; Fran 2013                                     |2013         |
|9500 Liberty (2009)                                   |2009         |
|test 1234 test 4567                                   |4567         |
|Captive Women (1000 Years from Now) (3000 A.D.) (1952)|1952         |
|The Garden of Afflictions 2017                        |2017         |
|The Naked Truth (1957) (Your Past Is Showing)         |1957         |
|Conquest 1453 (Fetih 1453) (2012)                     |2012         |
|Commune, La (Paris, 1871) (2000)                      |2000         |
|1013 Briar Lane                                       |             |
+------------------------------------------------------+-------------+

Also you do not need to prefix the string with r when you pass a regex to a spark function.

答案4

得分: 0

以下是翻译好的部分：

"Here is a regexp that would work:"

df = df.withColumn("year", F.regexp_extract("title", "(?:[\\s\\(])(\\d{4})(?:[–\\)])?", 1))

"Definitely overkill for the examples you provide, but I want to avoid capturing e.g. other numbers in the titles. Also, your regexp does not work because not all years are surrounding by brackets in your examples and sometimes you have non-numeric characters inside the brackets,"

英文:

Here is a regexp that would work:

df = df.withColumn(&quot;year&quot;, F.regexp_extract(&quot;title&quot;,  &quot;(?:[\s\(])(\d{4})(?:[&#226;€“\)])?&quot;, 1))

Definitely overkill for the examples you provide, but I want to avoid capturing e.g. other numbers in the titles. Also, your regexp does not work because not all years are surrounding by brackets in your examples and sometimes you have non-numeric characters inside the brackets,.

通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库，让每个人都能够通过互相帮助和分享经验来进步。

使用正则表达式提取字符串中的年份，使用pyspark的regexp_extract函数。

问题

答案1

答案2

答案3

答案4

如何向表中添加一个触发器以删除重复项

如何从切片中填充函数参数？

Liquibase双向关系变更日志顺序

SSRS Lookupset在一个单元格中显示与账户关联的每个项目

如何在Playwright视觉比较中屏蔽多个定位器？

在C++中，可以使用可变模板参数来检索类型的内部类型。

selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: stale element not found

Creating and opening a URL to log in to Website via Basic Auth with Robot Framework/Selenium (Python)

AG Grid 在上下文菜单中以大文本形式打开

What's the correct way to type hint an empty list as a literal in python?

如何在Highcharts Gantt中更改本地化的星期名称

如何在同一个流中使用多个过滤器和映射函数？

如何使用Map/Set来将代码优化到O(n)？

.NET MAUI Android在GitHub Actions上构建失败，错误代码为1。