How to find the breakpoint count of a date string
Question
CASE 1:
Here is my dataframe; it has almost 4 million rows.
Each row of the 'date' column holds a comma-separated list of dates, with a different start date, end date, and number of days in each row.
I want to add two columns: one with the fragment of the string containing the dates on or after 2023/4/18, and one with the number of date breakpoints among those dates. For example, if '2023-1-01,2023-2-10,2023-4-18,2023-5-16,2023-5-17,2023-6-19' is a row in the 'date' column, the result I want is '2023-4-18,2023-5-16,2023-5-17,2023-6-19' with a breakpoint count of 2.
CASE 2:
With the same table in MySQL, how can I extract the rows containing dates on or after 2023/4/18?
I tried to do this with for loops, but it is too tedious. Please help me, thank you.
Answer 1
Score: 0
This can be achieved by applying two functions to the 'date' column.
Here I create a mock dataframe based on your example:
    from datetime import datetime
    import pandas as pd

    # mock dataframe
    df = pd.DataFrame({
        "date": [
            "2023-01-01,2023-03-30",
            "2023-01-01,2023-02-10,2023-04-18,2023-05-16,2023-05-17,2023-06-19",
            "2023-01-01,2023-02-10,2023-04-18,2023-04-19,2023-05-16,2023-05-17,2023-06-19,2023-12-19",
        ]
    })
Here I create two functions:
- filter_dates: keeps only the dates on or after "2023-04-18" (the comparison is >=, so the start date itself is kept).
- count_date_breaks: counts the number of times the difference between two neighbouring dates isn't equal to 1 day.
    date_format = "%Y-%m-%d"

    def filter_dates(dates, start_date="2023-04-18"):
        # Parse the threshold once instead of once per date.
        threshold = datetime.strptime(start_date, date_format)
        dates_parsed = (datetime.strptime(x, date_format) for x in dates.split(","))
        # >= keeps the threshold date itself, matching the expected output.
        dates_filtered = (d for d in dates_parsed if d >= threshold)
        return ",".join(d.strftime(date_format) for d in dates_filtered)

    def count_date_breaks(dates):
        # An empty string or a single date cannot contain a break.
        if len(dates.split(",")) == 1:
            return 0
        dates_parsed = [datetime.strptime(x, date_format) for x in dates.split(",")]
        date_breaks = 0
        # Compare each date with its predecessor; any gap other than 1 day is a break.
        for current, prev_date in zip(dates_parsed[1:], dates_parsed[:-1]):
            if (current - prev_date).days != 1:
                date_breaks += 1
        return date_breaks
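Note that filter_dates re-formats every date, so an input such as '2023-4-18' comes back as '2023-04-18'. If the original text of each date should be preserved, as in the question's expected output, one possible variation is a single pass that keeps the original tokens and counts the breaks at the same time. This is only a sketch: filter_and_count is a hypothetical name, and it assumes the dates in each row are already sorted in ascending order.

```python
from datetime import datetime

DATE_FORMAT = "%Y-%m-%d"


def filter_and_count(dates, start_date="2023-04-18"):
    """Return (filtered_string, break_count) for one comma-separated row.

    Hypothetical helper: keeps the original token text instead of
    re-formatting it, and counts the gaps in the same pass.
    Assumes the dates in the row are sorted in ascending order.
    """
    threshold = datetime.strptime(start_date, DATE_FORMAT)
    kept_tokens = []
    breaks = 0
    prev = None
    for token in dates.split(","):
        token = token.strip()
        if not token:
            continue  # skip empty fragments such as a trailing comma
        parsed = datetime.strptime(token, DATE_FORMAT)
        if parsed < threshold:
            continue
        # A break is any gap other than exactly one day between kept dates.
        if prev is not None and (parsed - prev).days != 1:
            breaks += 1
        kept_tokens.append(token)
        prev = parsed
    return ",".join(kept_tokens), breaks


row = "2023-1-01,2023-2-10,2023-4-18,2023-5-16,2023-5-17,2023-6-19"
filtered, breaks = filter_and_count(row)
print(filtered)  # 2023-4-18,2023-5-16,2023-5-17,2023-6-19
print(breaks)    # 2
```

With a dataframe, the two values could be unpacked into columns, for example with df["date"].apply(filter_and_count) followed by splitting the resulting tuples.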
Applying the functions and the final result:
    df["date_filtered"] = df["date"].apply(filter_dates)
    df["date_breaks"] = df["date_filtered"].apply(count_date_breaks)
    df
Output:
                                                    date  \
    0                              2023-01-01,2023-03-30
    1  2023-01-01,2023-02-10,2023-04-18,2023-05-16,20...
    2  2023-01-01,2023-02-10,2023-04-18,2023-04-19,20...

                                           date_filtered  date_breaks
    0                                                               0
    1        2023-04-18,2023-05-16,2023-05-17,2023-06-19            2
    2  2023-04-18,2023-04-19,2023-05-16,2023-05-17,20...            3
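Since the question mentions almost 4 million rows, a row-by-row apply with Python-level parsing may be slow. One alternative worth timing is to split the strings into one row per date and let pandas do the parsing and grouping. The sketch below is not part of the original answer: the names long, summary and rows_with_recent_dates are illustrative, and it assumes zero-padded dates sorted in ascending order within each row, as in the mock data. Exploding the column also creates one row per individual date, so memory use grows accordingly.

```python
import pandas as pd

df = pd.DataFrame({
    "date": [
        "2023-01-01,2023-03-30",
        "2023-01-01,2023-02-10,2023-04-18,2023-05-16,2023-05-17,2023-06-19",
        "2023-01-01,2023-02-10,2023-04-18,2023-04-19,2023-05-16,2023-05-17,2023-06-19,2023-12-19",
    ]
})

threshold = pd.Timestamp("2023-04-18")

# One row per individual date; "row" records the original row label.
long = (
    df["date"].str.split(",").explode()
    .rename("token").rename_axis("row").reset_index()
)
long["parsed"] = pd.to_datetime(long["token"], format="%Y-%m-%d")

# Keep dates on or after the threshold (the threshold itself is kept).
long = long[long["parsed"] >= threshold].copy()

# Gap to the previous kept date within the same original row.
gap = long.groupby("row")["parsed"].diff()
long["is_break"] = gap.notna() & (gap != pd.Timedelta(days=1))

summary = long.groupby("row").agg(
    date_filtered=("token", ",".join),
    date_breaks=("is_break", "sum"),
).reindex(df.index)  # rows with no kept dates become NaN here

df["date_filtered"] = summary["date_filtered"].fillna("")
df["date_breaks"] = summary["date_breaks"].fillna(0).astype(int)

# A pandas analogue of CASE 2: rows with at least one date on or after the threshold.
rows_with_recent_dates = df[df["date_filtered"] != ""]
print(df[["date_filtered", "date_breaks"]])
```

On the mock data this gives the same date_filtered and date_breaks values as the apply-based version. Whether it is actually faster on the real table depends on how many dates each row holds, so it is worth measuring on a sample first.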