How to extract date from a specified column containing different types of date formats of a given Pandas DataFrame using Regex

Question

import re

def find_valid_dates(dt):
    # Numeric dates (14-06-2020) or Dutch text dates (2 oktober 2023)
    result = re.findall(r"\d{1,2}-\d{2}-\d{2,4}|\d{1,2} (?:januari|februari|maart|april|mei|juni|juli|augustus|september|oktober|november|december) \d{1,4}", dt)
    # result = re.findall(r"\d{2}-\d{2}-\d{4}|[a-zA-Z]+\s+\d{4}", dt)
    return result

SaaOne_msi_vervangen['valid_dates'] = SaaOne_msi_vervangen['Oplossingstekst'].apply(find_valid_dates)

The column "Oplossingstekst" of my dataframe SaaOne_msi_vervangen contains multiple dates in different formats, for example 14-06-2020 and 2 oktober 2023. I tried to extract both kinds of date using the or operator (|) in my findall, but so far this code doesn't extract 2 oktober 2023. It may be related to the whitespace. How can I solve this?

Answer 1

Score: 0

The space problem you may have

I would personally replace the literal space " " with \s or \s+. That way you
match any kind of whitespace, including new lines. You could also be more restrictive
and match only horizontal whitespace, \h. That escape does not seem to be available in Python's re module, but in PCRE notation it is equivalent to
[\t\x{00A0}\x{1680}\x{180E}\x{2000}\x{2001}\x{2002}\x{2003}\x{2004}\x{2005}\x{2006}\x{2007}\x{2008}\x{2009}\x{200A}\x{202F}\x{205F}\x{3000} ] (in a Python pattern each \x{NNNN} would be written \uNNNN). The list could be reduced. It is up to you whether to match it once or more than once.
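A minimal sketch of the difference, using made-up sample strings (the names MONTHS, LITERAL_SPACE and ANY_WHITESPACE are only for illustration). In Python 3, \s on a str pattern also matches Unicode whitespace such as the non-breaking space, so a pattern with a literal space misses inputs that \s+ still finds:

```python
import re

MONTHS = (
    "januari|februari|maart|april|mei|juni|juli|"
    "augustus|september|oktober|november|december"
)

# As in the question: exactly one ordinary space on each side of the month.
LITERAL_SPACE = re.compile(r"\d{1,2} (?:" + MONTHS + r") \d{4}")
# Same pattern, but accepting one or more whitespace characters of any kind.
ANY_WHITESPACE = re.compile(r"\d{1,2}\s+(?:" + MONTHS + r")\s+\d{4}")

# Hypothetical inputs: plain, double space, non-breaking space (U+00A0), line break.
samples = [
    "2 oktober 2023",
    "2  oktober 2023",
    "2\u00a0oktober\u00a02023",
    "2 oktober\n2023",
]

for text in samples:
    print(repr(text), LITERAL_SPACE.findall(text), ANY_WHITESPACE.findall(text))
```

Only the first sample is found by the literal-space pattern; the \s+ version finds all four. Printing repr() of a problematic cell is a quick way to see which whitespace character it actually contains.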

A regex with capturing groups

Since you will probably need to parse the date later, let's capture the day,
month and year in named capturing groups. I would suggest this:

import re

regex = r"""
	\b # word boundary
	(?: # non-capturing group for the "or"
	  # Short notation: 14-06-2022, 1-05-23
	  (?P<short>
	    (?P<short_day>\d{1,2})
	    -
	    (?P<short_month>\d{2})
	    -
	    (?P<short_year>\d{2}|\d{4})
	  )
	  | # Or
	  # Text notation: 2 oktober 2023, 31 december 23
	  (?P<text>
	    (?P<text_day>\d{1,2}) # day
	    \s+ # white spaces
	    (?P<text_month>
	      januari|februari|maart|april|mei|juni|juli|
	      augustus|september|oktober|november|december
	    )
	    \s+ # white spaces
	    (?P<text_year>\d{2}|\d{4}) # year with 2 or 4 digits, but not 3.
	  )
	)
	\b # word boundary
	"""
# test_str is the text to search, e.g. one cell of the "Oplossingstekst" column
matches = re.finditer(regex, test_str, re.VERBOSE | re.IGNORECASE)
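As a sketch of how the named groups could be used afterwards, the following turns each match into a datetime.date. The helper name extract_dates, the compact layout of the pattern, and the rule that two-digit years mean 20xx are all assumptions made for this example, not part of the answer above:

```python
import re
from datetime import date

MONTHS = ["januari", "februari", "maart", "april", "mei", "juni", "juli",
          "augustus", "september", "oktober", "november", "december"]

# Same groups as the regex above, written more compactly.
DATE_RE = re.compile(
    r"""
    \b
    (?:
        (?P<short_day>\d{1,2}) - (?P<short_month>\d{2}) - (?P<short_year>\d{2}|\d{4})
      |
        (?P<text_day>\d{1,2}) \s+
        (?P<text_month>
          januari|februari|maart|april|mei|juni|juli|
          augustus|september|oktober|november|december
        )
        \s+ (?P<text_year>\d{2}|\d{4})
    )
    \b
    """,
    re.VERBOSE | re.IGNORECASE,
)

def extract_dates(text):
    """Return every date found in text as a datetime.date (hypothetical helper)."""
    dates = []
    for match in DATE_RE.finditer(text):
        if match.group("short_day") is not None:
            day = int(match.group("short_day"))
            month = int(match.group("short_month"))
            year = int(match.group("short_year"))
        else:
            day = int(match.group("text_day"))
            month = MONTHS.index(match.group("text_month").lower()) + 1
            year = int(match.group("text_year"))
        if year < 100:
            year += 2000  # assumption: two-digit years belong to this century
        try:
            dates.append(date(year, month, day))
        except ValueError:
            continue  # skip impossible dates such as 31-02-2020
    return dates

print(extract_dates("Vervangen op 14-06-2020, controle op 2 Oktober 2023 en 1-05-23."))
```

With the DataFrame from the question, and assuming every cell of the column is a string, this could be applied as SaaOne_msi_vervangen['Oplossingstekst'].apply(extract_dates).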

I used these flags:

  • x = re.VERBOSE. The extended/verbose flag ignores unescaped
    whitespace in the pattern and lets you put comments in your regex.

  • i = re.IGNORECASE, so that month names match regardless of case.
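To illustrate the two flags, here is a small sketch with a shortened month list and an invented sample sentence. The verbose and the compact pattern are meant to behave identically, and dropping re.IGNORECASE loses the upper-case month:

```python
import re

compact = re.compile(r"\d{1,2}\s+(?:mei|juni)\s+\d{4}", re.IGNORECASE)

verbose = re.compile(
    r"""
    \d{1,2}       # day
    \s+           # whitespace has to be written explicitly in verbose mode
    (?:mei|juni)  # month name (shortened list for the example)
    \s+
    \d{4}         # four-digit year
    """,
    re.VERBOSE | re.IGNORECASE,
)

case_sensitive = re.compile(r"\d{1,2}\s+(?:mei|juni)\s+\d{4}")

text = "Geplaatst op 3 MEI 2021 en 12 juni 2022"
print(compact.findall(text))
print(verbose.findall(text))
print(case_sensitive.findall(text))
```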

For the year, I think that \d{2,4} isn't the best choice, as it would also match 3 digits, which is not really a valid year value. I replaced it with \d{2}|\d{4}.
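A quick check of that difference, as a sketch with invented inputs (the names loose_year and strict_year are only for illustration). Note that the stricter alternation relies on the trailing \b to reject a three-digit year:

```python
import re

loose_year = re.compile(r"\b\d{1,2}-\d{2}-\d{2,4}\b")
strict_year = re.compile(r"\b\d{1,2}-\d{2}-(?:\d{2}|\d{4})\b")

for text in ["14-06-20", "14-06-202", "14-06-2020"]:
    print(text, loose_year.findall(text), strict_year.findall(text))
```

Both patterns accept the two-digit and four-digit years; only the loose one accepts 14-06-202.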

I also added word boundaries \b on both sides to avoid matching part of
"1-06-123456", which could be a product id or something else.

You can play with this regex on regex101 and
use its Code Generator to test the Python code.
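The effect of the word boundaries mentioned above can also be checked locally. This is a sketch with an invented sentence; the pattern is reduced to the numeric notation and tries the four-digit year first so that the unbounded version shows the longest false match:

```python
import re

without_boundaries = re.compile(r"\d{1,2}-\d{2}-(?:\d{4}|\d{2})")
with_boundaries = re.compile(r"\b\d{1,2}-\d{2}-(?:\d{4}|\d{2})\b")

text = "Artikel 1-06-123456 vervangen op 14-06-2020"
print(without_boundaries.findall(text))
print(with_boundaries.findall(text))
```

Without \b the first part of the product id is reported as a date; with \b only 14-06-2020 is returned.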

huangapple
  • Posted on June 29, 2023, 18:28:11
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76580186.html