How to extract multiple times from the same string in Python?

Question

I'm trying to extract times from single strings that contain other text besides the times. An example is s = 'Dates : 12/Jul/2019 12/Aug/2019, Loc : MEISHAN BRIDGE, Time : 06:00 17:58'.

I've tried using the datefinder module like this:

from datetime import datetime as dt
import datefinder as dfn
for m in dfn.find_dates(s):
    print(dt.strftime(m, "%H:%M:%S"))

Which gives me this:

17:58:00

In this case the time "06:00" is missed. Now if I try without datefinder, using only the datetime module like this:

dt.strftime(s, "%H:%M")

it tells me that the input must already be a datetime object, not a string, with the following error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: descriptor 'strftime' requires a 'datetime.date' object but received a 'str'
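
For context, the error comes from the direction of the call: strftime formats an existing datetime into a string, while strptime parses a string into a datetime. A minimal sketch of the difference, using only the standard library (the variable names are just for illustration):

```python
from datetime import datetime

# strptime: string -> datetime (parsing). The format must describe the whole string.
parsed = datetime.strptime("17:58", "%H:%M")

# strftime: datetime -> string (formatting).
formatted = parsed.strftime("%H:%M:%S")
print(formatted)  # 17:58:00

# An out-of-range value such as "99:99" is rejected while parsing.
try:
    datetime.strptime("99:99", "%H:%M")
    is_valid = True
except ValueError:
    is_valid = False
print(is_valid)  # False
```

Note that strptime only accepts a string that is entirely a time in the given format, so it can validate an already isolated chunk but cannot find times inside a longer string by itself.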

So I tried to use the dateutil module to parse this string s into a datetime object:

from dateutil.parser import parse
parse(s)

But now it says that my string is not in a proper format (in most cases it will not be in any fixed format), showing me this error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/michael/anaconda3/envs/sec_img/lib/python3.7/site-packages/dateutil/parser/_parser.py", line 1358, in parse
    return DEFAULTPARSER.parse(timestr, **kwargs)
  File "/home/michael/anaconda3/envs/sec_img/lib/python3.7/site-packages/dateutil/parser/_parser.py", line 649, in parse
    raise ValueError("Unknown string format:", timestr)
ValueError: ('Unknown string format:', '12/Jul/2019 12/Aug/2019 MEISHAN BRIDGE 06:00 17:58')

I have thought of getting the times with a regex like this:

import re
p = r"\d{2}\:\d{2}"
times = [i.group() for i in re.finditer(p, s)]
# Gives me ['06:00', '17:58']

But doing it this way means I have to check again whether the chunks matched by the regex are actually times, because even "99:99" matches the regex and would wrongly be reported as a time. Is there any workaround without regex to get all the times from a single string?

Please note that the string may or may not contain a date, but it will always contain a time. Even if it contains a date, the date format could be anything, and the string may or may not contain other irrelevant text.

Answer 1

Score: 1

I don't see many options here, so I would go with a heuristic. I would run the following against the whole dataset and extend the config/regexes until it covers all or most of the cases:

import re
from datetime import datetime as dt

s = 'Dates : 12/Jul/2019 12/08/2019, Loc : MEISHAN BRIDGE, Time : 06:00 17:58:59'


# Each key is a regex that finds candidates; each value is the strptime
# format used to validate them.
SUPPORTED_DATE_FMTS = {
    re.compile(r"(\d{2}/\w{3}/\d{4})"): "%d/%b/%Y",
    re.compile(r"(\d{2}/\d{2}/\d{4})"): "%d/%m/%Y",
    re.compile(r"(\d{2}/\w{3}\w+/\d{4})"): "%d/%B/%Y",
    # Capture more here
}

# The lookarounds keep HH:MM from matching inside HH:MM:SS while still
# allowing a time at the very end of the string.
SUPPORTED_TIME_FMTS = {
    re.compile(r"(?<![\d:])((?:[01][0-9]|2[0-3]):[0-5][0-9])(?![\d:])"): "%H:%M",
    re.compile(r"(?<![\d:])((?:[01][0-9]|2[0-3]):[0-5][0-9]:[0-5][0-9])(?![\d:])"): "%H:%M:%S",
    # Capture more here
}


def extract_supported_dt(config, s):
    """
    Loop through the given config (keys are regexes, values are date/time
    formats) and attempt to gather all valid data.
    """
    valid_data = []
    for regex, fmt in config.items():
        # Extract whatever looks like a date or time
        valid_ish_data = regex.findall(s)
        if not valid_ish_data:
            continue
        print("Checking " + str(valid_ish_data))

        # Validate it: strptime raises ValueError for impossible values
        for d in valid_ish_data:
            try:
                valid_data.append(dt.strptime(d, fmt))
            except ValueError:
                pass

    return valid_data


# Handle dates
dates = extract_supported_dt(SUPPORTED_DATE_FMTS, s)
# Handle times
times = extract_supported_dt(SUPPORTED_TIME_FMTS, s)

print("Found dates: ")
for date in dates:
    print("\t" + str(date.date()))

print("Found times: ")
for t in times:
    print("\t" + str(t.time()))

Example output:

Checking ['12/Jul/2019']
Checking ['12/08/2019']
Checking ['06:00']
Checking ['17:58:59']
Found dates:
	2019-07-12
	2019-08-12
Found times:
	06:00:00
	17:58:59

This is a trial-and-error approach, but I do not think there is an alternative in your case. My goal here is therefore to make it as easy as possible to add support for more date/time formats, rather than to find a solution that covers 100% of the data on day one. This way, the more data you run it against, the more complete your config becomes.

One thing to note is that you will have to detect strings that appear to contain no dates or times and log them somewhere. Later you will need to review them manually and see whether something that was missed could be captured.
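
The logging step could be sketched roughly as follows. This is only an illustration under assumptions: the logger name, the single HH:MM pattern, and the helper extract_times_or_log are hypothetical and stand in for whatever config you end up with.

```python
import logging
import re
from datetime import datetime

logging.basicConfig(level=logging.WARNING)
logger = logging.getLogger("time_extraction")  # hypothetical logger name

# Hypothetical single-format pattern; extend it as new formats turn up.
TIME_RE = re.compile(r"(?<![\d:])(\d{2}:\d{2})(?![\d:])")


def extract_times_or_log(text):
    """Return valid HH:MM times found in text; log the text if none are found."""
    found = []
    for candidate in TIME_RE.findall(text):
        try:
            found.append(datetime.strptime(candidate, "%H:%M").time())
        except ValueError:
            pass  # looked like a time but is out of range, e.g. 99:99
    if not found:
        logger.warning("No supported time found in: %r", text)
    return found


samples = [
    "Time : 06:00 17:58",
    "Time : half past six",  # nothing the pattern can capture
    "Time : 99:99",          # matches the regex but fails validation
]
results = [extract_times_or_log(sample) for sample in samples]
# Strings to review by hand later
unmatched = [sample for sample, times in zip(samples, results) if not times]
```

The unmatched list is what you would go through manually to decide which new regex/format pairs are worth adding.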

Now, assuming that your data is generated by another system, sooner or later you will be able to match 100% of it. If the data is entered by humans, you will probably never reach 100%! (People tend to make spelling mistakes and sometimes enter random stuff... date=today :) )

Answer 2

Score: 0

Use a regex, but something like this:

(?=[0-1])[0-1][0-9]\:[0-5][0-9]|(?=2)[2][0-3]\:[0-5][0-9]

This matches:

00:00 00:59 01:00 01:59 02:00 02:59
09:00 10:00 11:59 20:00 21:59 23:59

It does not match:

99:99 23:99 01:99

You can check whether it works for you on Repl.it.
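
As a rough sketch, the pattern can be applied to the string from the question with re.findall. Two things here are assumptions of this sketch rather than part of the answer: the redundant lookaheads are dropped, and a lookbehind/lookahead pair is added so that a match glued to other digits or colons (for example the "23:45" inside "123:45") is not reported.

```python
import re

# Hours 00-23 and minutes 00-59, not adjacent to further digits or colons.
TIME_PATTERN = re.compile(
    r"(?<![\d:])(?:[01][0-9]|2[0-3]):[0-5][0-9](?![\d:])"
)

s = "Dates : 12/Jul/2019 12/Aug/2019, Loc : MEISHAN BRIDGE, Time : 06:00 17:58"
times = TIME_PATTERN.findall(s)
print(times)  # ['06:00', '17:58']

# None of these should be reported as a time.
rejected = TIME_PATTERN.findall("99:99 23:99 01:99 123:45")
print(rejected)  # []
```

Because the range check is built into the pattern, no second validation pass is needed for plain HH:MM values.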

Answer 3

Score: 0

> How to extract multiple time from same string in Python?

If you only need the times, this regex should work fine:

r"[0-2][0-9]\:[0-5][0-9]"

If there could be spaces inside the time, like 23 : 59, use this:

r"[0-2][0-9]\s*\:\s*[0-5][0-9]"
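
One caveat: [0-2][0-9] also accepts hours 24 to 29, so it may be worth validating each match afterwards. A minimal sketch of that idea, assuming the standard library's datetime.strptime is an acceptable validator (the helper name find_valid_times and the sample text are made up for illustration):

```python
import re
from datetime import datetime

# The answer's second pattern, which tolerates spaces around the colon.
LOOSE_TIME = re.compile(r"[0-2][0-9]\s*:\s*[0-5][0-9]")


def find_valid_times(text):
    """Return normalised HH:MM strings that also pass strptime validation."""
    valid = []
    for raw in LOOSE_TIME.findall(text):
        compact = re.sub(r"\s+", "", raw)  # "23 : 59" -> "23:59"
        try:
            datetime.strptime(compact, "%H:%M")
        except ValueError:
            continue  # e.g. "29:59" matches the regex but is not a real time
        valid.append(compact)
    return valid


print(find_valid_times("open 06:00, close 23 : 59, bogus 29:59"))
# ['06:00', '23:59']
```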

Answer 4

Score: 0

You could use a dictionary:

my_dict = {}

# Split into "Key : values" chunks, then split each chunk at the first " : "
for i in s.split(', '):
    m = i.strip().split(' : ', 1)
    my_dict[m[0]] = m[1].split()

my_dict
Out: 
{'Dates': ['12/Jul/2019', '12/Aug/2019'],
 'Loc': ['MEISHAN', 'BRIDGE'],
 'Time': ['06:00', '17:58']}
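
This only works when every string follows the same "Key : values, Key : values" layout, which the question says is not guaranteed. Under that assumption, a slightly more defensive sketch that skips chunks without a key and validates the "Time" values could look like this (parse_fields and valid_times are hypothetical helper names):

```python
from datetime import datetime


def parse_fields(text):
    """Split 'Key : v1 v2, Key : v1 v2' text into a dict of value lists."""
    fields = {}
    for part in text.split(","):
        key, sep, values = part.partition(" : ")
        if not sep:
            continue  # this chunk has no "Key : value" structure
        fields[key.strip()] = values.split()
    return fields


def valid_times(values):
    """Keep only the values that parse as HH:MM."""
    result = []
    for value in values:
        try:
            result.append(datetime.strptime(value, "%H:%M").time())
        except ValueError:
            pass  # not a time, or out of range
    return result


s = "Dates : 12/Jul/2019 12/Aug/2019, Loc : MEISHAN BRIDGE, Time : 06:00 17:58"
fields = parse_fields(s)
times = valid_times(fields.get("Time", []))
print(fields["Time"])  # ['06:00', '17:58']
print(times)
```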

huangapple
  • Posted on January 6, 2020 at 16:15:07
  • When reposting, please keep the link to this article: https://go.coder-hub.com/59608714.html