July 6, 2023, 16:19:11

Parse a line containing a sequence of entries, inferring data

Question

I have tried every pattern I could think of and am still stuck. Any help, please?

I have this string: 'C+11.,18.,25.6.,2.,23.7.,27.8.23'

I want my regex to extract the day numbers, which appear before each month, and associate each of them only with that month. Days that sit between two months belong only to the month that follows them.

Example:

Case 1: the days of month 6 are the numbers that come before it.

Case 2: the days of month 7 are the numbers that come after month 6 and before month 7.

A month is defined as r'\.\d{1}\.' so 6, 7, and 8 are the months here, and any number before a month is one of its days.

Code:

import re
captured_pattern = "C+11.,18.,25.6.,2.,23.7.,27.8.23"
pattern = r'(\d{1,2})\.(?=(\d{1})\.(?!.*,\d{1,}\.6))'
matches = re.findall(pattern, captured_pattern)
print(matches)

Output: [('25', '6'), ('23', '7'), ('27', '8')]

The days are defined as follows:

Example: 11, 18, and 25 are days of month 6 (June); 2 and 23 are days of month 7 (July).

The year is given at the end in two-digit format: 23.

Thank you in advance

Answer 1

Score: 2

You can match each run of days together with the month that follows it:

import re
s = "C+11.,18.,25.6.,2.,23.7.,27.8.23"
rx = r'\b(\d{1,2}(?:\.,\d+)*)\.(\d{1,2})\b'
results = [(days.split('.,'), month) for days, month in re.findall(rx, s)]
print(results)
# => [(['11', '18', '25'], '6'), (['2', '23'], '7'), (['27'], '8')]

Note that the day part of the regex can be tightened to (0?[1-9]|[12]\d|3[01]) and the month part to (0?[1-9]|1[0-2]).

Regex details:

\b - a word boundary
(\d{1,2}(?:\.,\d+)*) - Group 1 (days): one or two digits, then zero or more repetitions of ., followed by one or more digits
\. - a . char
(\d{1,2}) - Group 2 (month): one or two digits
\b - a word boundary
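
As a rough sketch of the tighter day and month patterns mentioned in the note above, the two sub-patterns can be built once and reused. The names DAY, MONTH and days_by_month are made up for this illustration and are not part of the original answer; the separators are assumed to be exactly as in the question's string.

```python
import re

# Stricter sub-patterns: days 1-31 and months 1-12, optional leading zero.
DAY = r'(?:0?[1-9]|[12]\d|3[01])'
MONTH = r'(?:0?[1-9]|1[0-2])'

# Same shape as the original regex: a run of days separated by ".,"
# followed by "." and the month.
rx = re.compile(rf'\b({DAY}(?:\.,{DAY})*)\.({MONTH})\b')


def days_by_month(s):
    """Return a list of (list_of_day_strings, month_string) pairs."""
    return [(days.split('.,'), month) for days, month in rx.findall(s)]


print(days_by_month("C+11.,18.,25.6.,2.,23.7.,27.8.23"))
# => [(['11', '18', '25'], '6'), (['2', '23'], '7'), (['27'], '8')]
```

With these patterns an out-of-range run such as 32.13. simply produces no match, so it may still be worth checking separately that nothing in the input was skipped.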

Answer 2

Score: 1

import re
string = "C+11.,18.,25.6.,2.,23.7.,27.8.23"
# Remove anything that is not a digit, comma or period
cleaned = re.sub(r'[^\d,.]+', '', string)
matches = {}
days = []
for part in cleaned.split(','):
    # "25.6." splits into ['25', '6', '']; "11." splits into ['11', '']
    day, month, *_ = part.split('.')
    days.append(int(day))
    if month:
        # A month closes the current run of days
        matches[int(month)] = days
        days = []
print(matches)

Result:

{6: [11, 18, 25], 7: [2, 23], 8: [27]}
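
The same split-based idea can also pick up the trailing two-digit year that the question mentions. This is only a sketch: the function name parse_days and the returned (dict, year) pair are choices made for this example, and it assumes every entry contains at least one period, as in the sample string.

```python
import re


def parse_days(line):
    """Return ({month: [days]}, year) for a line like 'C+11.,18.,25.6.,27.8.23'."""
    # Keep only digits, commas and periods (drops the "C+" prefix).
    cleaned = re.sub(r'[^\d,.]+', '', line)
    by_month = {}
    pending = []   # days seen since the last explicit month
    year = None
    for part in cleaned.split(','):
        day, month, *rest = part.split('.')
        pending.append(int(day))
        if month:
            by_month[int(month)] = pending
            pending = []
        # A non-empty third field is taken to be the year.
        if rest and rest[0]:
            year = int(rest[0])
    return by_month, year


print(parse_days("C+11.,18.,25.6.,2.,23.7.,27.8.23"))
# => ({6: [11, 18, 25], 7: [2, 23], 8: [27]}, 23)
```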

Answer 3

Score: 1

If the parsing task is too complex for a single regex, write a custom parser and leave only the simpler parts to regexes.

Regexes are good at parsing simple data structures but poor at parsing sequences of data, and they simply cannot parse recursive structures such as arbitrarily deep nesting. They also cannot perform custom processing logic, such as inferring missing parts of the data in an application-specific way.

import re
from dataclasses import dataclass
from typing import Optional
@dataclass
class MutableDate:
    # fields need type annotations, otherwise @dataclass ignores them
    day: Optional[str] = None
    month: Optional[str] = None
    year: Optional[str] = None
def parse(data: str):
    # split off the prefix
    data = re.match(r'[^+]+\+(.*)', data).group(1)
    # a regex for each entry
    entry_re = re.compile(r'(?P<day>\d+)\.(?:(?P<month>\d+)\.(?P<year>\d+)?)?')
    # parse the sequence into separate entries
    entries = data.split(',')
    # parse individual entries
    result = []
    for entry in entries:
        m = entry_re.match(entry)
        result.append(MutableDate(**m.groupdict()))
    # process the parsed data with your custom logic
    # (infer missing parts of dates)
    last_month_entry = last_year_entry = -1
    for i, entry in enumerate(result):
        if entry.month is not None:
            # back-fill the entries since the previous explicit month
            for earlier_entry in result[last_month_entry + 1:i]:
                earlier_entry.month = entry.month
            last_month_entry = i
        if entry.year is not None:
            # back-fill the entries since the previous explicit year
            for earlier_entry in result[last_year_entry + 1:i]:
                earlier_entry.year = entry.year
            last_year_entry = i
    # optionally convert all entries in the result to `datetime.date`.
    # this would have a useful side effect of validating the dates
    return result

For your data, this returns:

[MutableDate(day='11', month='6', year='23'),
 MutableDate(day='18', month='6', year='23'),
 MutableDate(day='25', month='6', year='23'),
 MutableDate(day='2', month='7', year='23'),
 MutableDate(day='23', month='7', year='23'),
 MutableDate(day='27', month='8', year='23')]
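
The optional conversion to `datetime.date` mentioned in the code comments could look roughly like this. It is a sketch, not part of the original answer: the helper name to_date is made up, it takes plain strings so that it runs on its own, and it assumes a two-digit year belongs to the 2000s.

```python
from datetime import date


def to_date(day: str, month: str, year: str, century: int = 2000) -> date:
    """Build a date from string parts; assumes years like '23' mean 2023."""
    # date() raises ValueError for impossible dates such as 31 June,
    # which gives validation for free.
    return date(century + int(year), int(month), int(day))


# (day, month, year) triples as produced by the parser above
parsed = [('11', '6', '23'), ('18', '6', '23'), ('25', '6', '23'),
          ('2', '7', '23'), ('23', '7', '23'), ('27', '8', '23')]
dates = [to_date(d, m, y) for d, m, y in parsed]
print([d.isoformat() for d in dates])
# => ['2023-06-11', '2023-06-18', '2023-06-25', '2023-07-02', '2023-07-23', '2023-08-27']
```

With the MutableDate objects from the answer, the call would be to_date(entry.day, entry.month, entry.year) for each entry.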

Answer 4

Score: 1

Here's my suggestion:

import re
captured_pattern = "C+11.,18.,25.6.,2.,23.7.,27.8.23"
rx = r"\b(\d{1,2}(?:\.,\d+)*|(?:\.,\d+)\.\d)\.(\d{1,2})\b"
matches = [
    tuple(f"{v}.{value}".split("."))
    for key, value in re.findall(rx, captured_pattern)
    for v in key.split(".,")
    if v
]
print(matches)

Result:

[('11', '6'), ('18', '6'), ('25', '6'), ('2', '7'), ('23', '7'), ('27', '8', '23')]
