使用正则表达式提取字符串中的年份,使用pyspark的regexp_extract函数。

huangapple go评论69阅读模式
英文:

extracting year from string using regexp_extract pyspark

问题

这是我的结果的一部分:

1995
2006
2013
2009
1952
2017
1957
1453<--
1871<--
   <--最后一个标题的部分是空的,也应该是空的
英文:

This is the portion of my result :

Grumpier Old Men (1995)
Death Note: Desu nôto (2006–2007)
Irwin & Fran 2013
9500 Liberty (2009)
Captive Women (1000 Years from Now) (3000 A.D.) (1952)
The Garden of Afflictions 2017
The Naked Truth (1957) (Your Past Is Showing) 
Conquest 1453 (Fetih 1453) (2012)
Commune, La (Paris, 1871) (2000)
1013 Briar Lane

return:

1995
2006
2013
2009
1952
2017
1957
1453<--
1871<--
   <--this part for last title is empty and supposed to be empty too

As you can see from the above,last 2 title is given wrong result.

This is my code:

import pyspark.sql.functions as F

from pyspark.sql.functions import regexp_extract,col

bracket_regexp = "((?<=\()\d{4}(?=[^\(]*$))"

movies_DF=movies_DF.withColumn('yearOfRelease', regexp_extract("title", bracket_regexp + "|(\d{4}$)", 0))

movies_DF.display(10000)

I am trying to get the year portion of the title string.

答案1

得分: 1

你可以尝试使用以下正则表达式:r'(?<=\()(\d+)(?=\))',受到这个出色的答案的启发。

例如:

movies_DF = movies_DF.withColumn('uu', regexp_extract(col("title"), r'(?<=\()(\d+)(?=\))', 1))

+------------------------------------------------------------+----+
|title |uu |
+------------------------------------------------------------+----+
|Grumpier Old Men (1995) |1995|
|Happy Anniversary (1959) |1959|
|Paths (2017) |2017|
|The Three Amigos - Outrageous! (2003) |2003|
|L'obsession de l'or (1906) |1906|
|Babe Ruth Story, The (1948) |1948|
|11'0901 - September 11 (2002) |2002|
|Blood Trails (2006) |2006|
|Return to the 36th Chamber (Shao Lin da peng da shi) (1980) |1980|
|Off and Running (2009) |2009|
+------------------------------------------------------------+----+

英文:

You can try using the following regex: r&#39;(?&lt;=\()(\d+)(?=\))&#39;, which is inspired by this excellent answer.

For example:

movies_DF = movies_DF.withColumn(&#39;uu&#39;, regexp_extract(col(&quot;title&quot;), r&#39;(?&lt;=\()(\d+)(?=\))&#39;,1))
 
+------------------------------------------------------------+----+
|title                                                       |uu  |
+------------------------------------------------------------+----+
|Grumpier Old Men (1995)                                     |1995|
|Happy Anniversary (1959)                                    |1959|
|Paths (2017)                                                |2017|
|The Three Amigos - Outrageous! (2003)                       |2003|
|L&#39;obsession de l&#39;or (1906)                                  |1906|
|Babe Ruth Story, The (1948)                                 |1948|
|11&#39;0901 - September 11 (2002)                               |2002|
|Blood Trails (2006)                                         |2006|
|Return to the 36th Chamber (Shao Lin da peng da shi) (1980) |1980|
|Off and Running (2009)                                      |2009|
+------------------------------------------------------------+----+

答案2

得分: 1

以下是已翻译的内容:

实际上,以下的正则表达式模式似乎在起作用:

(?<=[( ])\d{4}(?=\S*\)|$)

这里是一个工作中的正则表达式演示

更新后的PySpark代码:

bracket_regexp = "((?<=[( ])\d{4}(?=\S*\)|$))"
movies_DF = movies_DF.withColumn('yearOfRelease', regexp_extract("title", bracket_regexp + "|(\d{4}$)", 0))
movies_DF.display(10000)

这个正则表达式模式的作用是匹配:

  • (?<=[( ]) 断言之前的内容是(或空格
  • \d{4} 匹配一个4位数的年份
  • (?=\S*\)|$) 断言之后跟随),可能前面有非空白字符,或者跟随字符串的末尾。
英文:

Empirically, the following regex pattern seems to be working:

(?&lt;=[( ])\d{4}(?=\S*\)|$)

Here is a working regex demo.

Updated PySpark code:

bracket_regexp = &quot;((?&lt;=[( ])\d{4}(?=\S*\)|$))&quot;
movies_DF = movies_DF.withColumn(&#39;yearOfRelease&#39;, regexp_extract(&quot;title&quot;, bracket_regexp + &quot;|(\d{4}$)&quot;, 0))
movies_DF.display(10000)

The regex pattern works by matching:

  • (?&lt;=[( ]) assert that what precedes is ( or a space
  • \d{4} match a 4 digit year
  • (?=\S*\)|$) assert that ), possibly prefaced by non whitespace, follows
    OR the end of the string follows

答案3

得分: 0

你的正则表达式只能用于第一行。\(\d{4}\) 尝试匹配一个(,4位数字和一个)。对于第一行的(1995) 是正确的。其他行没有包含这种模式。

在你的情况下,我们可以使用回顾后顾和前瞻模式来检测括号内的日期。(?<=\() 表示前面是一个开括号。(?=–|(&#226;€“)|\)) 表示后面是一个闭括号,或者&#226;€“,这是原始字符被错误编码的情况。一旦你涵盖了括号之间的日期,你可以涵盖没有括号的字符串末尾的日期:\d{4}$

import pyspark.sql.functions as F
bracket_regexp = "((?<=\()\d{4}(?=–|(&#226;€“)|\)))"

movies_DF\
    .withColumn('uu', regexp_extract("title", bracket_regex + "|\d{4}$", 0))\
    .show(truncate=False)
+------------------------------------------------------+-------------+
|title                                                 |yearOfRelease|
+------------------------------------------------------+-------------+
|Grumpier Old Men (1995)                               |1995         |
|Death Note: Desu n&#195;&#180;to (2006&#226;€“2007)                  |2006         |
|Irwin &amp; Fran 2013                                     |2013         |
|9500 Liberty (2009)                                   |2009         |
|test 1234 test 4567                                   |4567         |
|Captive Women (1000 Years from Now) (3000 A.D.) (1952)|1952         |
|The Garden of Afflictions 2017                        |2017         |
|The Naked Truth (1957) (Your Past Is Showing)         |1957         |
|Conquest 1453 (Fetih 1453) (2012)                     |2012         |
|Commune, La (Paris, 1871) (2000)                      |2000         |
|1013 Briar Lane                                       |             |
+------------------------------------------------------+-------------+

另外,在将正则表达式传递给Spark函数时,不需要在字符串前加上r前缀。

英文:

Your regex can only work for the first line. \(\d{4}\) tries to match a (, 4 digits and a ). For the first line you have (1995) which is alright. The other lines do not contain that pattern.

In your situation, we can use lookbehind and lookahead patterns to detect dates within brackets. (?&lt;=\() means an open bracket before. (?=–|(&#226;€“)|\)) means a closing bracket after, or &#226;€“ or which is the original character that was misencoded. Once you have covered the date in between brackets, you can cover dates that are at the end of the string without brackets: \d{4}$.

import pyspark.sql.functions as F
bracket_regexp = &quot;((?&lt;=\()\d{4}(?=|(&#226;€“)|\)))&quot;

movies_DF\
    .withColumn(&#39;uu&#39;, regexp_extract(&quot;title&quot;, bracket_regex + &quot;|(\d{4}$)&quot;, 0))\
    .show(truncate=False)
+------------------------------------------------------+-------------+
|title                                                 |yearOfRelease|
+------------------------------------------------------+-------------+
|Grumpier Old Men (1995)                               |1995         |
|Death Note: Desu n&#195;&#180;to (2006&#226;€“2007)                  |2006         |
|Irwin &amp; Fran 2013                                     |2013         |
|9500 Liberty (2009)                                   |2009         |
|test 1234 test 4567                                   |4567         |
|Captive Women (1000 Years from Now) (3000 A.D.) (1952)|1952         |
|The Garden of Afflictions 2017                        |2017         |
|The Naked Truth (1957) (Your Past Is Showing)         |1957         |
|Conquest 1453 (Fetih 1453) (2012)                     |2012         |
|Commune, La (Paris, 1871) (2000)                      |2000         |
|1013 Briar Lane                                       |             |
+------------------------------------------------------+-------------+

Also you do not need to prefix the string with r when you pass a regex to a spark function.

答案4

得分: 0

以下是翻译好的部分:

"Here is a regexp that would work:"

df = df.withColumn("year", F.regexp_extract("title", "(?:[\\s\\(])(\\d{4})(?:[–\\)])?", 1))

"Definitely overkill for the examples you provide, but I want to avoid capturing e.g. other numbers in the titles. Also, your regexp does not work because not all years are surrounding by brackets in your examples and sometimes you have non-numeric characters inside the brackets,"

英文:

Here is a regexp that would work:

df = df.withColumn(&quot;year&quot;, F.regexp_extract(&quot;title&quot;,  &quot;(?:[\s\(])(\d{4})(?:[&#226;€“\)])?&quot;, 1))

Definitely overkill for the examples you provide, but I want to avoid capturing e.g. other numbers in the titles. Also, your regexp does not work because not all years are surrounding by brackets in your examples and sometimes you have non-numeric characters inside the brackets,.

huangapple
  • 本文由 发表于 2023年2月7日 03:50:38
  • 转载请务必保留本文链接:https://go.coder-hub.com/75365926.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定