June 29, 2023, 18:28:11

How to extract date from a specified column containing different types of date formats of a given Pandas DataFrame using Regex

Question

import re

def find_valid_dates(dt):
    result = re.findall(r"\d{1,2}-\d{2}-\d{2,4}|\d{1,2} (?:januari|februari|maart|april|mei|juni|juli|augustus|september|oktober|november|december) \d{1,4}", dt)
    # result = re.findall(r"\d{2}-\d{2}-\d{4}|[a-zA-Z]+\s+\d{4}", dt)
    return result

SaaOne_msi_vervangen['valid_dates'] = SaaOne_msi_vervangen['Oplossingstekst'].apply(lambda dt: find_valid_dates(dt))

The column "Oplossingstekst" of my DataFrame SaaOne_msi_vervangen contains multiple dates in different formats, for example 14-06-2020 and 2 oktober 2023. I tried to extract both kinds using the or operator (|) in my findall, but so far this code doesn't extract 2 oktober 2023. It may be related to the white space. How can I solve this?

Answer 1

Score: 0

The space problem you may have

I would personally replace the literal space " " with \s or \s+. This way, you
match all kinds of whitespace (including newlines). You could be more restrictive
and match only horizontal whitespace. PCRE has \h for that; Python's re module
does not support it, but an equivalent character class would be
[\t\u00A0\u1680\u180E\u2000-\u200A\u202F\u205F\u3000 ]. The list could be reduced. It is up to you whether to match it once or more than once.
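A minimal sketch of why the literal space can fail, assuming the text contains a non-breaking space (U+00A0) or repeated spaces. The sample strings are invented for illustration. In Python 3, \s in a str pattern also matches Unicode whitespace such as U+00A0.

```python
import re

MONTHS = "januari|februari|maart|april|mei|juni|juli|augustus|september|oktober|november|december"

# Pattern from the question: one literal space on each side of the month.
literal_space = re.compile(rf"\d{{1,2}} (?:{MONTHS}) \d{{1,4}}")
# Same pattern with \s+ in place of the literal space.
any_whitespace = re.compile(rf"\d{{1,2}}\s+(?:{MONTHS})\s+\d{{1,4}}")

# Hypothetical samples: one with non-breaking spaces, one with a double space.
samples = [
    "op 2\u00a0oktober\u00a02023 vervangen",
    "op 2  oktober 2023 vervangen",
]

for text in samples:
    # The literal-space pattern finds nothing; the \s+ pattern finds the date.
    print(literal_space.findall(text), any_whitespace.findall(text))
```

If neither pattern finds the date in your data, printing repr() of the cell is a quick way to see which characters actually separate the parts.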

A regex with capturing groups

As you are probably having to parse the date later, let's capture the day,
month and year in some named capturing groups. I would suggest this:

import re

regex = r"""
    \b # word boundary
    (?: # non-capturing group for the "or"
      # Short notation: 14-06-2022, 1-05-23
      (?P<short>
        (?P<short_day>\d{1,2})
        -
        (?P<short_month>\d{2})
        -
        (?P<short_year>\d{2}|\d{4})
      )
      | # Or
      # Text notation: 2 oktober 2023, 31 december 23
      (?P<text>
        (?P<text_day>\d{1,2}) # day
        \s+ # white spaces
        (?P<text_month>
          januari|februari|maart|april|mei|juni|juli|
          augustus|september|oktober|november|december
        )
        \s+ # white spaces
        (?P<text_year>\d{2}|\d{4}) # year with 2 or 4 digits, but not 3.
      )
    )
    \b # word boundary
    """
# test_str is the text to search, for example one cell of the column.
matches = re.finditer(regex, test_str, re.VERBOSE | re.IGNORECASE)

I used these flags:

x = re.VERBOSE. The extended/verbose flag lets you put whitespace and
comments in your regex.
i = re.IGNORECASE. Month names are matched regardless of case.

For the year, I think that \d{2,4} isn't the best choice, as it would also match 3 digits, which is not really a valid year value. I replaced it with \d{2}|\d{4}.
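A small sketch of the difference between the two year patterns, using re.fullmatch to test the year on its own. In the full regex, the trailing \b plays a similar anchoring role; without any anchor, \d{2}|\d{4} would still match the first two digits of "123".

```python
import re

loose = re.compile(r"\d{2,4}")       # accepts 2, 3 or 4 digits
strict = re.compile(r"\d{2}|\d{4}")  # accepts 2 or 4 digits only

for year in ["23", "123", "2023"]:
    # Print whether each pattern accepts the whole string as a year.
    print(year, bool(loose.fullmatch(year)), bool(strict.fullmatch(year)))
```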

I also added word boundaries \b around the pattern to avoid matching part of
"1-06-123456", which could be a product ID or something else.
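To illustrate the effect of \b, here is a short sketch comparing the short-notation pattern with and without word boundaries. The sample text is made up for the example.

```python
import re

without_boundaries = re.compile(r"\d{1,2}-\d{2}-(?:\d{2}|\d{4})")
with_boundaries = re.compile(r"\b\d{1,2}-\d{2}-(?:\d{2}|\d{4})\b")

# Hypothetical text: a product-id-like number followed by a real date.
text = "product 1-06-123456, vervangen op 14-06-2020"

# Without \b, part of the product id is matched and the real year is cut short.
print(without_boundaries.findall(text))
# With \b, only the complete date is matched.
print(with_boundaries.findall(text))
```

Note that without the boundaries the real date is also truncated to a two-digit year, because the first alternative \d{2} succeeds and nothing forces the regex to try \d{4}.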

You can experiment with this regex on regex101 and
use its Code Generator to test the Python code.
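As a follow-up to the named groups, here is a minimal sketch of turning each match into a datetime.date using only the standard library. The names DATE_RE and to_date, the sample sentence, and the choice to read two-digit years as 20xx are assumptions for illustration, not part of the question. The inner short and text groups are dropped to keep the sketch compact.

```python
import re
from datetime import date

MONTHS = ["januari", "februari", "maart", "april", "mei", "juni", "juli",
          "augustus", "september", "oktober", "november", "december"]

DATE_RE = re.compile(
    r"""
    \b
    (?:
      (?P<short_day>\d{1,2}) - (?P<short_month>\d{2}) - (?P<short_year>\d{2}|\d{4})
      |
      (?P<text_day>\d{1,2}) \s+ (?P<text_month>""" + "|".join(MONTHS) + r""") \s+
      (?P<text_year>\d{2}|\d{4})
    )
    \b
    """,
    re.VERBOSE | re.IGNORECASE,
)

def to_date(match, century=2000):
    """Convert one match to a date; return None for impossible dates."""
    kind = "short" if match.group("short_day") is not None else "text"
    day = int(match.group(f"{kind}_day"))
    month_raw = match.group(f"{kind}_month")
    if kind == "short":
        month = int(month_raw)
    else:
        month = MONTHS.index(month_raw.lower()) + 1
    year_raw = match.group(f"{kind}_year")
    # Assumption: two-digit years are read as 20xx.
    year = int(year_raw) + century if len(year_raw) == 2 else int(year_raw)
    try:
        return date(year, month, day)
    except ValueError:  # e.g. 31-02-2023 or month 13
        return None

def find_valid_dates(text):
    """Return all dates found in text, skipping matches that are not real dates."""
    dates = (to_date(m) for m in DATE_RE.finditer(text))
    return [d for d in dates if d is not None]

print(find_valid_dates("Vervangen op 14-06-2020, controle op 2 oktober 2023."))
```

A function like this can be passed to SaaOne_msi_vervangen['Oplossingstekst'].apply(...) in the same way as the question's find_valid_dates.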
