How to find the breakpoint counts of a date string

Question

CASE 1:
Here is my dataframe; it has almost 4 million rows.
Each row of the 'date' column holds a comma-separated list of dates, with a different start date, end date and number of days per row.
I want to add two columns: one is the string fragment containing the dates greater than 2023/4/18, and the other is the number of date breakpoints among those dates. For example, '2023-1-01,2023-2-10,2023-4-18,2023-5-16,2023-5-17,2023-6-19' is one row of the 'date' column; the result I want is '2023-4-18,2023-5-16,2023-5-17,2023-6-19', with a breakpoint count of 2.

CASE 2:
With the same table in MySQL, how can I extract the rows containing dates greater than 2023/4/18?

I tried to do this with for loops, but it's too tedious. Please help me, thank you.

Answer 1

Score: 0

This can be achieved by applying two functions to the date column.

Here I create a mock dataframe based on your example:

from datetime import datetime
import pandas as pd

# mock dataframe: each row is a comma-separated string of dates
df = pd.DataFrame({
    "date": [
        "2023-01-01,2023-03-30",
        "2023-01-01,2023-02-10,2023-04-18,2023-05-16,2023-05-17,2023-06-19",
        "2023-01-01,2023-02-10,2023-04-18,2023-04-19,2023-05-16,2023-05-17,2023-06-19,2023-12-19",
    ]
})

Here I create two functions:

  1. filter_dates - keeps only the dates on or after "2023-04-18" (the comparison is >=, so the cutoff date itself is kept, as in your expected result).
  2. count_date_breaks - counts how many times the difference between two neighbouring dates is not equal to 1 day.

date_format = "%Y-%m-%d"

def filter_dates(dates, start_date="2023-04-18"):
    """Keep only the dates on or after start_date, as a comma-separated string."""
    # parse the cutoff once instead of once per compared date
    start = datetime.strptime(start_date, date_format)
    parsed = (datetime.strptime(d, date_format) for d in dates.split(","))
    return ",".join(d.strftime(date_format) for d in parsed if d >= start)

def count_date_breaks(dates):
    """Count neighbouring date pairs that are not exactly 1 day apart."""
    parts = dates.split(",")
    if len(parts) == 1:
        # a single date or an empty string has no neighbours to compare
        return 0

    dates_parsed = [datetime.strptime(d, date_format) for d in parts]

    date_breaks = 0
    for current, previous in zip(dates_parsed[1:], dates_parsed[:-1]):
        if (current - previous).days != 1:
            date_breaks += 1
    return date_breaks
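
With close to 4 million rows, parsing every string twice (once to filter, once to count) may add up. Below is a minimal sketch, not part of the original answer, that applies the same two rules while parsing each row only once. The name filter_and_count is hypothetical, and the sample row is the one from the question, written with unpadded months (assuming the "20-5-17" there was meant as "2023-5-17"); strptime accepts these and the output comes back zero-padded.

```python
from datetime import datetime

DATE_FORMAT = "%Y-%m-%d"

def filter_and_count(dates, start_date="2023-04-18"):
    """Return (filtered_string, break_count) for one comma-separated row.

    Hypothetical helper: same rules as filter_dates and count_date_breaks,
    but each date is parsed only once.
    """
    start = datetime.strptime(start_date, DATE_FORMAT)
    kept = [
        d
        for d in (datetime.strptime(s, DATE_FORMAT) for s in dates.split(","))
        if d >= start
    ]
    # a break is any neighbouring pair that is not exactly 1 day apart
    breaks = sum((cur - prev).days != 1 for prev, cur in zip(kept, kept[1:]))
    return ",".join(d.strftime(DATE_FORMAT) for d in kept), breaks

row = "2023-1-01,2023-2-10,2023-4-18,2023-5-16,2023-5-17,2023-6-19"
filtered, n_breaks = filter_and_count(row)
print(filtered)  # 2023-04-18,2023-05-16,2023-05-17,2023-06-19
print(n_breaks)  # 2
```

On the dataframe, both columns could then be filled in one pass with df["date_filtered"], df["date_breaks"] = zip(*df["date"].map(filter_and_count)).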

Applying the functions gives the final result:

df["date_filtered"] = df["date"].apply(filter_dates)
df["date_breaks"] = df["date_filtered"].apply(count_date_breaks)
df

Output:

                                                date  \
0                              2023-01-01,2023-03-30   
1  2023-01-01,2023-02-10,2023-04-18,2023-05-16,20...   
2  2023-01-01,2023-02-10,2023-04-18,2023-04-19,20...   

                                       date_filtered  date_breaks  
0                                                               0  
1        2023-04-18,2023-05-16,2023-05-17,2023-06-19            2  
2  2023-04-18,2023-04-19,2023-05-16,2023-05-17,20...            3  
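
In the output above, row 0 ends up with an empty date_filtered because none of its dates reach the cutoff. If only the rows that do contain such a date are wanted (the row selection asked about in CASE 2, sketched here on the Python side rather than in MySQL), a small predicate is enough. This is an illustrative sketch; has_date_on_or_after is a hypothetical name, not something from the original answer.

```python
from datetime import datetime

def has_date_on_or_after(dates, start_date="2023-04-18", fmt="%Y-%m-%d"):
    """True if any date in the comma-separated string is >= start_date."""
    start = datetime.strptime(start_date, fmt)
    return any(datetime.strptime(s, fmt) >= start for s in dates.split(","))

rows = [
    "2023-01-01,2023-03-30",
    "2023-01-01,2023-02-10,2023-04-18,2023-05-16,2023-05-17,2023-06-19",
    "2023-01-01,2023-02-10,2023-04-18,2023-04-19,2023-05-16,2023-05-17,2023-06-19,2023-12-19",
]
# keep only the rows with at least one date on or after the cutoff
selected = [r for r in rows if has_date_on_or_after(r)]
print(len(selected))  # 2
```

With the dataframe, df[df["date"].apply(has_date_on_or_after)] would keep the same rows.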

huangapple
  • Posted on July 17, 2023 at 20:37:42
  • Please keep this link when reposting: https://go.coder-hub.com/76704526.html