2023-07-17 20:37:42

How to find the breakpoint count of a date string

Question

CASE 1:
Here is my dataframe; it has almost 4 million rows.
[image: screenshot of the dataframe]
Each row of the 'date' column has a different start date, a different end date, and a different number of days; the dates are separated by commas.
I want to add two columns: one is the string fragment containing the dates greater than 2023/4/18, and the other is the number of date breakpoints among those dates. For example, if '2023-1-01,2023-2-10,2023-4-18,2023-5-16,2023-5-17,2023-6-19' is a row in the 'date' column, the result I want is '2023-4-18,2023-5-16,2023-5-17,2023-6-19', and its breakpoint count is 2.

CASE 2:
With the same table as in the picture above stored in MySQL, how can I extract the rows containing dates greater than 2023/4/18?

I tried to do this with for loops, but it is too tedious. Please help me, thank you.

Answer 1

Score: 0

This can be achieved by applying two functions to the date column.

Here I create a mock dataframe based on your example:

from datetime import datetime
import pandas as pd

# mock dataframe
df = pd.DataFrame({
    "date": [
        "2023-01-01,2023-03-30",
        "2023-01-01,2023-02-10,2023-04-18,2023-05-16,2023-05-17,2023-06-19",
        "2023-01-01,2023-02-10,2023-04-18,2023-04-19,2023-05-16,2023-05-17,2023-06-19,2023-12-19",
    ]
})

Here I create two functions:

filter_dates - keeps only the dates on or after "2023-04-18" (the comparison is >=, so the cutoff date itself is kept).
count_date_breaks - counts how many times the difference between two neighbouring dates is not exactly 1 day.

date_format = "%Y-%m-%d"

def filter_dates(dates, start_date="2023-04-18"):
    # Keep only the dates on or after start_date; return them comma-separated.
    start = datetime.strptime(start_date, date_format)
    dates_parsed = (datetime.strptime(x, date_format) for x in dates.split(","))
    return ",".join(d.strftime(date_format) for d in dates_parsed if d >= start)

def count_date_breaks(dates):
    # Count neighbouring date pairs whose gap is not exactly 1 day.
    parts = dates.split(",")
    if len(parts) < 2:
        # Empty string or a single date: nothing to compare.
        return 0

    dates_parsed = [datetime.strptime(x, date_format) for x in parts]

    date_breaks = 0
    for curr_date, prev_date in zip(dates_parsed[1:], dates_parsed[:-1]):
        if (curr_date - prev_date).days != 1:
            date_breaks += 1
    return date_breaks
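
To see where the count of 2 for the question's sample row comes from, the gaps between neighbouring dates can be listed directly. This is only a minimal sketch: gaps_in_days is a hypothetical helper introduced for illustration, and it assumes the dates are already sorted in ascending order.

```python
from datetime import datetime

DATE_FORMAT = "%Y-%m-%d"

def gaps_in_days(dates):
    """Return the day gaps between neighbouring dates in a comma-separated string."""
    parsed = [datetime.strptime(d, DATE_FORMAT) for d in dates.split(",")]
    return [(curr - prev).days for prev, curr in zip(parsed, parsed[1:])]

# The filtered fragment from the question's example row.
sample = "2023-04-18,2023-05-16,2023-05-17,2023-06-19"
gaps = gaps_in_days(sample)

# A "break" is any gap that is not exactly one day.
breaks = sum(gap != 1 for gap in gaps)

print(gaps)    # [28, 1, 33]
print(breaks)  # 2
```

Only the 2023-05-16 to 2023-05-17 step is a one-day gap, so the other two steps count as breaks.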

Function applications and the final result:

df["date_filtered"] = df["date"].apply(filter_dates)
df["date_breaks"] = df["date_filtered"].apply(count_date_breaks)
df

Output:

                                                date  \
0                              2023-01-01,2023-03-30   
1  2023-01-01,2023-02-10,2023-04-18,2023-05-16,20...   
2  2023-01-01,2023-02-10,2023-04-18,2023-04-19,20...   

                                       date_filtered  date_breaks  
0                                                               0  
1        2023-04-18,2023-05-16,2023-05-17,2023-06-19            2  
2  2023-04-18,2023-04-19,2023-05-16,2023-05-17,20...            3
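
Since the question mentions almost 4 million rows, a row-by-row apply may be slow. One possible alternative, sketched here and not benchmarked, is to explode the strings into one date per row and let pandas do the parsing and gap computation. The variable names (cutoff, exploded, gap_days) are illustrative assumptions, and the sketch assumes the dates in each row are already in ascending order.

```python
import pandas as pd

df = pd.DataFrame({
    "date": [
        "2023-01-01,2023-03-30",
        "2023-01-01,2023-02-10,2023-04-18,2023-05-16,2023-05-17,2023-06-19",
    ]
})

cutoff = pd.Timestamp("2023-04-18")

# One date per row; the index still points at the original dataframe row.
exploded = pd.to_datetime(df["date"].str.split(",").explode(), format="%Y-%m-%d")

# Keep dates on or after the cutoff (positional mask avoids index alignment).
exploded = exploded[(exploded >= cutoff).to_numpy()]

# Gap in days to the previous date within the same original row.
gap_days = exploded.groupby(level=0).diff().dt.days

# The first date of each row has no predecessor (NaN), so it is never a break.
breaks = (gap_days.notna() & (gap_days != 1)).groupby(level=0).sum()

# Re-join the surviving dates into a comma-separated string per original row.
filtered = exploded.dt.strftime("%Y-%m-%d").groupby(level=0).agg(",".join)

# Rows with no surviving dates are missing from the grouped results; fill them in.
df["date_filtered"] = filtered.reindex(df.index, fill_value="")
df["date_breaks"] = breaks.reindex(df.index, fill_value=0).astype(int)

print(df[["date_filtered", "date_breaks"]])
```

On this two-row mock frame the result matches the apply-based version: an empty string with 0 breaks for the first row, and 2 breaks for the second.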
