在Python中检测日期时间字符串的格式

huangapple go评论69阅读模式
英文:

Detect the format of a datetime string in Python

问题

我正在寻找一种方法来检测Python中日期时间字符串的strftime样式格式。我找到的所有日期时间库都具有解析字符串以创建日期时间对象的功能,但我想检测可以与datetime.strptime格式参数一起使用的格式或模式。

为什么这么做呢?我正在处理长列表(或系列)的日期时间字符串,并且使用dateutil.parser来解析它们太不准确且慢。

  • :它会每次都检查所有可能的格式,尽管列表中的每个条目都具有相同的格式(在我的情况下)。
  • 不准确:有歧义的条目将以多种方式之一解析,而不会从其他非模糊的条目中获取知识。

因此,我希望能够检测格式。一旦我有了格式,我就可以使用polars中的to_datetime函数以更快的方式创建日期时间系列。

我在更现代的日期时间库(如pendulum)中找不到这种功能。我还实现了自己的版本,通过迭代固定的格式列表并检查是否可以使用datetime.strptime来读取它,像这样:

patterns = [
    "%Y.%m.%d %H:%M:%S",
    "%Y-%m-%d %H:%M",
    "%Y-%m-%d",
    ...
]

for pattern in patterns:
    try:
        for val in col:
            assert datetime.datetime.strptime(val, pattern)
        return pattern
    except:
        continue

这对我来说不是一个优雅的解决方案,我想知道是否有更好的方法,甚至是否有可用的库可以执行这种操作。

英文:

I'm looking for a way to detect the strftime-style format of a datetime string in Python. All datetime libraries I've found have functionalities for parsing the string to create a datetime object, but I would like to detect the format or pattern that can be used with the datetime.strptime format parameter.

Why? I'm dealing with long lists (or series) of datetime strings and using dateutil.parser to parse them is too inaccurate and slow.

  • Slow: It will check for all potential formats every single time, although all entries per list are of the same format (in my case).
  • Inaccurate: Ambiguous entries will be parsed in one out of multiple ways without drawing knowledge from other entries that are not ambiguous.

So instead I would like to detect the format. Once I have that, I can use the to_datetime function in polars to create a datetime series in a faster manner.

I couldn't find such functionality in more modern datetime libs like pendulum. I also implemented my own version that iterates through a fixed lists of formats and checks if it can be read using datetime.strptime like so:

patterns = [
        "%Y.%m.%d %H:%M:%S",
        "%Y-%m-%d %H:%M",
        "%Y-%m-%d",
        ...
    ]

    for pattern in patterns:
        try:
            for val in col:
                assert datetime.datetime.strptime(val, pattern)
            return pattern
        except:
            continue

This doesn't strike me as an elegant solution and I was wondering if there's a better way to do it or even a library available that does this sort of thing.

答案1

得分: 2

以下是翻译好的部分:

"将你的任务外包给 [tag:pandas],以利用其新的日期解析功能?"

dt_strs = [
"Mon, 29 May 2023 13:15:09 +0000", # Day, DD Month YYYY HH:MM:SS +0000
"10/01/2020 12:15:33", # MM/DD/YYYY HH:MM:SS
"2020-08-01", # YYYY-MM-DD
"08:55", # HH:MM
"2019.09.18T18:51:57", # YYYY.MM.DDTHH:MM:SS
"11:29:10", # HH:MM:SS
"23/05/2022 03:30:00 +0500", # DD/MM/YYYY HH:MM:SS +0000
"02.28.19", # MM.DD.YY
"2023-01-01 22:23", # YYYY-MM-DD HH:MM
"31 jul, 2022", # DD Month, YYYY
"2021/12/18 06:13:08", # YYYY/MM/DD HH:MM:SS
"2023", # YYYY
]

pl_ser = pl.from_pandas(pd.to_datetime(dt_strs, format="mixed").to_series(name="dts"))

输出:

print(pl_ser)

shape: (12,)
Series: 'dts' [datetime[μs, UTC]]
[
2023-05-29 13:15:09 UTC
2020-10-01 12:15:33 UTC
2020-08-01 00:00:00 UTC
2023-05-29 08:55:00 UTC
2019-09-18 18:51:57 UTC
2023-05-29 11:29:10 UTC
2022-05-22 22:30:00 UTC
2019-02-28 00:00:00 UTC
2023-01-01 22:23:00 UTC
2022-07-31 00:00:00 UTC
2021-12-18 06:13:08 UTC
2023-01-01 00:00:00 UTC
]

英文:

What about outsourcing your task to [tag:pandas] to get use of its new mixed date-parsing ?

dt_strs = [
    "Mon, 29 May 2023 13:15:09 +0000",  # Day, DD Month YYYY HH:MM:SS +0000
    "10/01/2020 12:15:33",              # MM/DD/YYYY HH:MM:SS
    "2020-08-01",                       # YYYY-MM-DD
    "08:55",                            # HH:MM
    "2019.09.18T18:51:57",              # YYYY.MM.DDTHH:MM:SS
    "11:29:10",                         # HH:MM:SS
    "23/05/2022 03:30:00 +0500",        # DD/MM/YYYY HH:MM:SS +0000
    "02.28.19",                         # MM.DD.YY
    "2023-01-01 22:23",                 # YYYY-MM-DD HH:MM
    "31 jul, 2022",                     # DD Month, YYYY
    "2021/12/18 06:13:08",              # YYYY/MM/DD HH:MM:SS
    "2023",                             # YYYY
]

pl_ser = pl.from_pandas(pd.to_datetime(dt_strs, format="mixed").to_series(name="dts"))

Output :

print(pl_ser)

shape: (12,)
Series: 'dts' [datetime[μs, UTC]]
[
	2023-05-29 13:15:09 UTC
	2020-10-01 12:15:33 UTC
	2020-08-01 00:00:00 UTC
	2023-05-29 08:55:00 UTC
	2019-09-18 18:51:57 UTC
	2023-05-29 11:29:10 UTC
	2022-05-22 22:30:00 UTC
	2019-02-28 00:00:00 UTC
	2023-01-01 22:23:00 UTC
	2022-07-31 00:00:00 UTC
	2021-12-18 06:13:08 UTC
	2023-01-01 00:00:00 UTC
]

答案2

得分: 2

我看不到另一种只使用Python来将值转换为日期时间的方法。然而,我认为你应该使用另一个变量来记住最后一个成功的模式以提高速度。类似这样:

import datetime

patterns = [
    "%Y-%m-%d %H:%M:%S",
    "%Y.%m.%d %H:%M:%S",
    "%Y-%m-%d",
]


def to_datetime(values, errors='raise'):

    last_success_pattern = patterns[0] 
    dti = []

    for value in values:
        try:
            dt = datetime.datetime.strptime(value, last_success_pattern)
        except ValueError:
            for pattern in [last_success_pattern] + patterns:
                try:
                    dt = datetime.datetime.strptime(value, pattern)
                except ValueError:
                    if errors == 'raise':
                        raise ValueError(f'Unknown format for "{value}"')
                    elif errors == 'coerce':
                        dt = None  # invalid value is set to None
                    elif errors == 'ignore':
                        dt = value  # invalid value returns the input
                else:
                    last_success_pattern = pattern
                    break

        dti.append(dt)
    return dti

if __name__ == '__main__':
    print(to_datetime(['2023.01.01 11:22:33', '2023.01.02 14:15:16', '2023-01-03', '2023-01-01T21:12:33.078196+00:00'], errors='coerce'))

输出:

>>> to_datetime(['2023.01.01 11:22:33', '2023.01.02 14:15:16', 
                 '2023-01-03', '2023-01-01T21:12:33.078196+00:00'],
                errors='coerce')

[datetime.datetime(2023, 1, 1, 11, 22, 33),
 datetime.datetime(2023, 1, 2, 14, 15, 16),
 datetime.datetime(2023, 1, 3, 0, 0),
 None]
英文:

I don't see another method to convert your values as datetime using only Python. However, I think you should use another variable to memorize the last successful pattern to increase the speed. Something like:

import datetime

patterns = [
        "%Y-%m-%d %H:%M:%S",
        "%Y.%m.%d %H:%M:%S",
        "%Y-%m-%d",
]


def to_datetime(values, errors='raise'):

    last_success_pattern = patterns[0] 
    dti = []

    for value in values:
        try:
            dt = datetime.datetime.strptime(value, last_success_pattern)
        except ValueError:
            for pattern in [last_success_pattern] + patterns:
                try:
                    dt = datetime.datetime.strptime(value, pattern)
                except ValueError:
                    if errors == 'raise':
                        raise ValueError(f'Unknown format for "{value}"')
                    elif errors == 'coerce':
                        dt = None  # invalid value is set to None
                    elif errors == 'ignore':
                        dt = value  # invalid value returns the input
                else:
                    last_success_pattern = pattern
                    break

        dti.append(dt)
    return dti

if __name__ == '__main__':
    print(to_datetime(['2023.01.01 11:22:33', '2023.01.02 14:15:16', '2023-01-03', '2023-01-01T21:12:33.078196+00:00'], errors='coerce'))

Output:

>>> to_datetime(['2023.01.01 11:22:33', '2023.01.02 14:15:16', 
                 '2023-01-03', '2023-01-01T21:12:33.078196+00:00'],
                errors='coerce')

[datetime.datetime(2023, 1, 1, 11, 22, 33),
 datetime.datetime(2023, 1, 2, 14, 15, 16),
 datetime.datetime(2023, 1, 3, 0, 0),
 None]

答案3

得分: 0

总结一下,您面临的情况是有多个数据框(dfs),每个数据框都具有内部一致的日期时间格式,但彼此之间的格式不一致。如何处理这种情况?

您有以下内容:

patterns = [
    "%Y.%m.%d %H:%M:%S",
    "%Y-%m-%d %H:%M",
    "%Y-%m-%d",
]

strptime文档中,如果内部格式不一致,您可以使用coalesce

一种想法(可能不适用于您)

(
    df
    .with_columns(Date=pl.coalesce(
        pl.col('Date').str.strptime(pl.Datetime(), x, strict=False) for x in patterns
    ))
)

如果有多种格式可以工作,但只有一种正确的格式,这可能会有问题。

第二个备选方案(可能适合您)

for pattern in patterns:
    try:
        df=df.with_columns(Date=pl.col('Date').str.strptime(pl.Datetime(), pattern))
        break
    except:
        pass

第三个备选方案(可能适合您)

您还可以使用pandas的to_datetime,而无需将整个数据框转换为pandas。

(
    df
        .with_columns(
            Date=pl.col('Date')
              .map(lambda x: pl.Series(pd.to_datetime(x.to_numpy(), format='mixed')))
        )
)

如果您只是在lambda中定义辅助函数,而不是像这样使用lambda,代码看起来会更整洁:

def mixed_strptime(x):
    return pl.Series(pd.to_datetime(x.to_numpy(), format='mixed'))

(
    df
        .with_columns(
            Date=pl.col('Date').map(mixed_strptime))
)
英文:

In summary you are presented with various dfs that each have an internally consistent datetime format but each one is not consistent with the next. How to deal with that?

You have:

 patterns = [
        "%Y.%m.%d %H:%M:%S",
        "%Y-%m-%d %H:%M",
        "%Y-%m-%d",
    ]

From the strptime docs, you can use coalesce if you had internally inconsistent formats:

One idea (probably not for you)

(
    df
    .with_columns(Date=pl.coalesce(
        pl.col('Date').str.strptime(pl.Datetime(), x, strict=False) for x in patterns
        ))
)

This can be problematic if you have a mix of formats where many will work but only one is right.

Second alternative (might be good for you)

for pattern in patterns:
    try:
        df=df.with_columns(Date=pl.col('Date').str.strptime(pl.Datetime(), pattern))
        break
    except:
        pass

Third alternative (might be good for you)

You can also use the pandas to_datetime without converting your whole df to pandas.

(
    df
        .with_columns(
            Date=pl.col('Date')
              .map(lambda x: pl.Series(pd.to_datetime(x.to_numpy(), format='mixed'))))
)

You can that look a bit neater if you just define the helper function in the lambda instead of using the lambda like this:

def mixed_strptime(x):
    return pl.Series(pd.to_datetime(x.to_numpy(), format='mixed'))
(
    df
        .with_columns(
            Date=pl.col('Date').map(mixed_strptime))
)

答案4

得分: 0

感谢所有为此做出贡献的人。我自己也深入研究了一下,只是注意到这确实是一个深度的问题。

第一种方法是基于我的初始想法,通过create_dt_patterns()函数进行扩展,该函数生成了各种各样的模式组合。在这个列表超过500个模式后,我稍微调整了代码,以仅生成适合字符串中分隔符的模式。我将这个与计数器结合在一起,以仅返回最常用的模式。这有助于解决模糊的结果。如果你对我潜在的丑陋和不完整的代码感兴趣,这里是代码:

def get_dt_format(col: list, patterns: list, n: int = 100):
    for val in col[:n]:
        for pattern in patterns.keys():
            try:
                _ = datetime.strptime(val, pattern)
                patterns[pattern] += 1
            except:
                pass

    if sum(patterns.values()) == 0:
        return False

    return max(patterns, key=patterns.get)

def create_dt_patterns(dt_str: str):
    dates = create_date_patterns(dt_str)
    times = create_time_patterns(dt_str)
    seps = create_dt_separators(dt_str)
    dts = [d + s + t for d in dates for s in seps for t in times]
    dts += dates + times
    return {x: 0 for x in dts}

def create_dt_separators(dt_str: str):
    seps = [x for x in ["T", " "] if x in dt_str]
    return [""] if len(seps) == 0 else seps

def create_time_patterns(dt_str: str):
    time_opts = ["%H{0}%M{0}%S.%f", "%H{0}%M{0}%S", "%H{0}%M"]
    return [t.format(":" if ":" in dt_str else "") for t in time_opts]

def create_date_patterns(dt_str: str):
    date_opts = [
        "%Y{0}%m{0}%d",
        "%m{0}%d{0}%Y",
        "%d{0}%m{0}%Y",
        "%y{0}%m{0}%d",
        "%m{0}%d{0}%y",
        "%d{0}%m{0}%y",
    ]
    dates = [d.format(s) for d in date_opts for s in ["-", "/", "."] if s in dt_str]
    if len(dates) == 0:
        dates = [d.format("") for d in date_opts]
    return dates

然后,我洗了个澡,因为写这段代码让我感到很不舒服,还注意到strptime不是一个非常快速的函数。对于n个条目使用m个不同的模式在k列上进行操作需要一些时间。

下一个尝试是正则表达式方法,它要快得多,也让我对自己感到好一点。它大约比尝试不同模式快30倍。

由于我打算随着时间的推移测试和开发这个逻辑,我为此创建了一个GitHub存储库。为了也在这里发布它,尽管我可能不会更新它:

DATE_RE = r"(?P<date>\d{2,4}[-/.]\d{2}[-/.]\d{2,4})?"
SEP_RE = r"(?P<sep>\s|T)?"
TIME_RE = r"(?P<time>\d{2}:\d{2}(:\d{2})?\s*([AP]M)?)?"
FULL_RE = DATE_RE + SEP_RE + TIME_RE
YMD_RE = r"^(?P<ay>(?:[12][0-9])?[0-9]{2})(?P<bs>[-/.])(?P<cm>0[1-9]|1[0-2])(?P<ds>[-/.])(?P<ed>0[1-9]|[12][0-9]|3[01])$"
DMY_RE = r"^(?P<ad>0[1-9]|[12][0-9]|3[01])(?P<bs>[-/.])(?P<cm>0[1-9]|1[0-2])(?P<ds>[-/.])(?P<ey>(?:[12][0-9])?[0-9]{2})$"
MDY_RE = r"^(?P<am>0[1-9]|1[0-2])(?P<bs>[-/.])(?P<cd>0[1-9]|[12][0-9]|3[01])(?P<ds>[-/.])(?P<ey>(?:[12][0-9])?[0-9]{2})$"
HMS_RE = r"^(?P<aH>\d{1,2})(?P<bs>:?)(?P<cM>\d{2})(?:(?P<ds>:?)(?P<eS>\d{2}))?(?:(?P<fs>\s)?(?P<ga>[AP]M))?$"

def guess_datetime_format(values: list[str], n=100, return_dict=False):
    di = {}
    for val in values[:n]:
        if val is None:
            continue
        fmts = datetime_formats(val)
        for fmt in fmts:
            if fmt not in di:
                di[fmt] = 1
            di[fmt] += 1

    if len(di) == 0:
        return None

    if return_dict:
        return di

    return max(di, key=di.get)

def datetime_formats(value: str) -> list:
    assert "," not in value  # TODO: 处理这些情况
    m = re.match(FULL_RE, value)
    dates = "" if m["date"] is None else date_formats(m["date"])
    sep = "" if m["sep"] is None else m["sep"]
    time = "" if m["time"] is None else time_format(m["time"])
    return [date + sep + time for date in dates]

def date_formats(date_value: str) -> list:
    matches = []
    for p in [YMD_RE, MDY_RE, DMY_RE]:
        m = re.match(p, date_value)
        if m is None:
            continue
        fmt = ""
        for c in sorted(m.groupdict().keys()):
            if c[1] == "s":  # 分隔符字符
                fmt

<details>
<summary>英文:</summary>

Thanks to everyone who contributed. I went down the rabbit hole myself, only to notice that it&#39;s indeed a deep rabbit hole.

The first approach was based on my initial idea extended by a create_dt_patterns() function, that generates a wide combination of patterns. After this list exceeded 500 patterns, I tweaked the code a bit to produce only patterns that fit the separators in the string. I combined this with a counter to only return the most frequently used pattern. This helps with ambiguous results. If you&#39;re interested in my potentially ugly and incomplete code, here you go:

    def get_dt_format(col: list, patterns: list, n: int = 100):
        for val in col[:n]:
            for pattern in patterns.keys():
                try:
                    _ = datetime.strptime(val, pattern)
                    patterns[pattern] += 1
                except:
                    pass
    
        if sum(patterns.values()) == 0:
            return False
    
        return max(patterns, key=patterns.get)
    
    def create_dt_patterns(dt_str: str):
        dates = create_date_patterns(dt_str)
        times = create_time_patterns(dt_str)
        seps = create_dt_separators(dt_str)
        dts = [d + s + t for d in dates for s in seps for t in times]
        dts += dates + times
        return {x: 0 for x in dts}
    
    
    def create_dt_separators(dt_str: str):
        seps = [x for x in [&quot;T&quot;, &quot; &quot;] if x in dt_str]
        return [&quot;&quot;] if len(seps) == 0 else seps
    
    
    def create_time_patterns(dt_str: str):
        time_opts = [&quot;%H{0}%M{0}%S.%f&quot;, &quot;%H{0}%M{0}%S&quot;, &quot;%H{0}%M&quot;]
        return [t.format(&quot;:&quot; if &quot;:&quot; in dt_str else &quot;&quot;) for t in time_opts]
    
    
    def create_date_patterns(dt_str: str):
        date_opts = [
            &quot;%Y{0}%m{0}%d&quot;,
            &quot;%m{0}%d{0}%Y&quot;,
            &quot;%d{0}%m{0}%Y&quot;,
            &quot;%y{0}%m{0}%d&quot;,
            &quot;%m{0}%d{0}%y&quot;,
            &quot;%d{0}%m{0}%y&quot;,
        ]
        dates = [d.format(s) for d in date_opts for s in [&quot;-&quot;, &quot;/&quot;, &quot;.&quot;] if s in dt_str]
        if len(dates) == 0:
            dates = [d.format(&quot;&quot;) for d in date_opts]
        return dates


Then I took a shower because I felt dirty writing this code and also noticed that `strptime` is not a very fast function. Doing this for `n` entries using `m` different patterns over `k` columns takes some time.

The next stop was a RegEx approach, which is much quicker and made me feel a little better about myself. It&#39;s roughly 30x faster than trying different patterns.

Since I&#39;m planning on testing and developing this logic over time, I created a [GitHub repo][1] for it. To also post it here, which I probably will not update:

    DATE_RE = r&quot;(?P&lt;date&gt;\d{2,4}[-/.]\d{2}[-/.]\d{2,4})?&quot;
    SEP_RE = r&quot;(?P&lt;sep&gt;\s|T)?&quot;
    TIME_RE = r&quot;(?P&lt;time&gt;\d{2}:\d{2}(:\d{2})?\s*([AP]M)?)?&quot;
    FULL_RE = DATE_RE + SEP_RE + TIME_RE
    YMD_RE = r&quot;^(?P&lt;ay&gt;(?:[12][0-9])?[0-9]{2})(?P&lt;bs&gt;[-/.])(?P&lt;cm&gt;0[1-9]|1[0-2])(?P&lt;ds&gt;[-/.])(?P&lt;ed&gt;0[1-9]|[12][0-9]|3[01])$&quot;
    DMY_RE = r&quot;^(?P&lt;ad&gt;0[1-9]|[12][0-9]|3[01])(?P&lt;bs&gt;[-/.])(?P&lt;cm&gt;0[1-9]|1[0-2])(?P&lt;ds&gt;[-/.])(?P&lt;ey&gt;(?:[12][0-9])?[0-9]{2})$&quot;
    MDY_RE = r&quot;^(?P&lt;am&gt;0[1-9]|1[0-2])(?P&lt;bs&gt;[-/.])(?P&lt;cd&gt;0[1-9]|[12][0-9]|3[01])(?P&lt;ds&gt;[-/.])(?P&lt;ey&gt;(?:[12][0-9])?[0-9]{2})$&quot;
    HMS_RE = r&quot;^(?P&lt;aH&gt;\d{1,2})(?P&lt;bs&gt;:?)(?P&lt;cM&gt;\d{2})(?:(?P&lt;ds&gt;:?)(?P&lt;eS&gt;\d{2}))?(?:(?P&lt;fs&gt;\s)?(?P&lt;ga&gt;[AP]M))?$&quot;
    
    
    def guess_datetime_format(values: list[str], n=100, return_dict=False):
        di = {}
        for val in values[:n]:
            if val is None:
                continue
            fmts = datetime_formats(val)
            for fmt in fmts:
                if fmt not in di:
                    di[fmt] = 1
                di[fmt] += 1
    
        if len(di) == 0:
            return None
    
        if return_dict:
            return di
    
        return max(di, key=di.get)
    
    
    def datetime_formats(value: str) -&gt; list:
        assert &quot;,&quot; not in value  # TODO: handle these cases
        m = re.match(FULL_RE, value)
        dates = &quot;&quot; if m[&quot;date&quot;] is None else date_formats(m[&quot;date&quot;])
        sep = &quot;&quot; if m[&quot;sep&quot;] is None else m[&quot;sep&quot;]
        time = &quot;&quot; if m[&quot;time&quot;] is None else time_format(m[&quot;time&quot;])
        return [date + sep + time for date in dates]
    
    
    def date_formats(date_value: str) -&gt; list:
        matches = []
        for p in [YMD_RE, MDY_RE, DMY_RE]:
            m = re.match(p, date_value)
            if m is None:
                continue
            fmt = &quot;&quot;
            for c in sorted(m.groupdict().keys()):
                if c[1] == &quot;s&quot;:  # separator character
                    fmt += &quot;&quot; if m[c] is None else m[c]
                else:  # year, month, day
                    fmt += &quot;%&quot; + c[1] if len(m[c]) == 2 else &quot;%Y&quot;
            matches.append(fmt)
        return matches
    
    
    def time_format(time_value: str) -&gt; str:
        m = re.match(HMS_RE, time_value)
        fmt = &quot;&quot;
        for c in sorted(m.groupdict().keys()):
            if c[1] == &quot;s&quot;:  # separator character
                fmt += &quot;&quot; if m[c] is None else m[c]
            else:
                fmt += &quot;&quot; if m[c] is None else &quot;%&quot; + c[1]
        if &quot;M&quot; in time_value:  # AM or PM
            fmt = fmt.replace(&quot;%H&quot;, &quot;%I&quot;)
        return fmt


  [1]: https://github.com/pietz/datetime-format

</details>



huangapple
  • 本文由 发表于 2023年5月29日 16:03:22
  • 转载请务必保留本文链接:https://go.coder-hub.com/76355619.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定