英文:
Inserting Rows with Consecutive Dates for Different Projects with Different Start and End Dates in pandas
问题
以下是您提供的信息的翻译:
# 有一个名为data的数据框,其中包含项目记录,大致如下:
project = ['Project 1', 'Project 1', 'Project 1', 'Project 1', 'Project 2', 'Project 2', 'Project 2', 'Project 3', 'Project 3', 'Project 3', 'Project 4', 'Project 5', 'Project 5', 'Project 5']
date = ['2010-10-12', '2010-10-15', '2010-10-20', '2010-10-22', '2012-05-05', '2012-05-07', '2012-05-10', '2018-01-01', '2018-01-05', '2018-01-06', '2019-10-02', '2010-02-02', '2010-02-04', '2010-02-07']
date = pd.to_datetime(date)
hours = [0, 1, 0, 2, 4, 0, 2, 1, 0, 2, 4, 2, 4, 3]
taskcount = [0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 0, 0, 2]
data = pd.DataFrame({'Project': project, 'Date': date, 'Hours': hours, 'TaskCount': taskcount})
列"Hours"显示了在给定日期上为特定项目工作的小时数,而列"TaskCount"则给出了在给定日期上完成的任务数。
我有第二个数据框"project_duration",其中包含了有关数据框"data"中每个项目的持续时间信息:
column_DateFirstRecord = data.groupby('Project').apply(lambda df: df.Date.min())
column_DateLastRecord = data.groupby('Project').apply(lambda df: df.Date.max())
project_duration = pd.concat([column_DateFirstRecord, column_DateLastRecord], axis=1)
project_duration.columns = ['DateFirstRecord', 'DateLastRecord']
project_duration = project_duration.assign(ProjectLength=(project_duration.DateLastRecord - project_duration.DateFirstRecord))
对于数据框"data"中的每个项目,我需要将缺少的日期添加到数据框"data"中,从第一条记录的日期到该特定项目的最后一条记录的日期。例如,对于项目1,我需要为2010年1月13日至16日、17日至19日以及21日添加行到数据框"data"。请注意,这些新行应在"Hours"和"TaskCount"列中具有值0。
我正在寻找的输出应该类似于下面创建的数据框"data_output":
date_output = ['2010-10-12', '2010-10-13', '2010-10-14', '2010-10-15', '2010-10-16', '2010-10-17', '2010-10-18', '2010-10-19', '2010-10-20', '2010-10-21', '2010-10-22', '2012-05-05', '2012-05-06', '2012-05-07', '2012-05-08', '2012-05-09', '2012-05-10', '2018-01-01', '2018-01-02', '2018-01-03', '2018-01-04', '2018-01-05', '2018-01-06', '2019-10-02', '2010-02-02', '2010-02-03', '2010-02-04', '2010-02-05', '2010-02-06', '2010-02-07']
date_output = pd.to_datetime(date_output)
project_output = ['Project 1', 'Project 1', 'Project 1', 'Project 1', 'Project 1', 'Project 1', 'Project 1', 'Project 1', 'Project 1', 'Project 1', 'Project 1', 'Project 2', 'Project 2', 'Project 2', 'Project 2', 'Project 2', 'Project 2', 'Project 3', 'Project 3', 'Project 3', 'Project 3', 'Project 3', 'Project 3', 'Project 4', 'Project 5', 'Project 5', 'Project 5', 'Project 5', 'Project 5', 'Project 5']
hours_output = [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 2, 4, 0, 0, 0, 0, 2, 1, 0, 0, 0, 0, 2, 4, 2, 0, 4, 0, 0, 3]
taskcount_output = [0, 0, 0, 1, 0, 0, 0, 0, 2, 0, 0, 1, 0, 2, 0, 0, 0, 1, 0, 0, 0, 2, 0, 1, 0, 0, 0, 0, 0, 2]
data_output = pd.DataFrame({'Project': project_output, 'Date': date_output, 'Hours': hours_output, 'TaskCount': taskcount_output})
值得注意的是,我正在处理的实际数据框非常大,包含约278,000行,因此我希望找到一种高效的解决方案。我尝试了这个StackOverflow帖子中详细介绍的方法:https://stackoverflow.com/questions/44978196/pandas-filling-missing-dates-and-values-within-group,但它不适用于每个项目的不同开始和结束日期。
英文:
I have a df called data of project records that looks somewhat like this:
project = ['Project 1','Project 1','Project 1','Project 1','Project 2','Project 2','Project 2','Project 3','Project 3','Project 3','Project 4','Project 5','Project 5','Project 5']
date = ['2010-10-12','2010-10-15','2010-10-20','2010-10-22','2012-05-05','2012-05-07','2012-05-10','2018-01-01','2018-01-05','2018-01-06','2019-10-02','2010-02-02','2010-02-04','2010-02-07']
date = pd.to_datetime(date)
hours = [0,1,0,2,4,0,2,1,0,2,4,2,4,3]
taskcount = [0,1,2,0,1,2,0,1,2,0,1,0,0,2]
data = pd.DataFrame({'Project':project, 'Date':date, 'Hours':hours,'TaskCount':taskcount})
The column Hours shows the number of hours worked on the particular project on the given date, while the column TaskCount gives a count of the number of tasks completed for that project on the given date.
I have a second df called project_duration containing info about the duration of each of the projects in the df data:
column_DateFirstRecord = data.groupby('Project').apply(lambda df: df.Date.min())
column_DateLastRecord = data.groupby('Project').apply(lambda df: df.Date.max())
project_duration = pd.concat([column_DateFirstRecord, column_DateLastRecord], axis=1)
project_duration.columns = ['DateFirstRecord', 'DateLastRecord']
project_duration = project_duration.assign(ProjectLength = (project_duration.DateLastRecord - project_duration.DateFirstRecord))
For each project in the df data, I need to append rows to the df data with the missing dates from the date of the first record to the date of the last record for that particular project. For instance, for Project 1, I need to add rows to the df data for Jan 13-16, 17-19, and 21, 2010. Note that these new rows should have the value 0 in the columns Hours and TaskCount.
The output that I'm looking for should look like the df data_output that I've created below:
date_output = ['2010-10-12', '2010-10-13','2010-10-14','2010-10-15','2010-10-16','2010-10-17','2010-10-18','2010-10-19','2010-10-20','2010-10-21','2010-10-22','2012-05-05','2012-05-06','2012-05-07','2012-05-08','2012-05-09','2012-05-10','2018-01-01', '2018-01-02','2018-01-03','2018-01-04','2018-01-05','2018-01-06','2019-10-02','2010-02-02','2010-02-03','2010-02-04','2010-02-05','2010-02-06','2010-02-07']
date_output = pd.to_datetime(date_output)
project_output = ['Project 1','Project 1','Project 1','Project 1','Project 1','Project 1','Project 1','Project 1','Project 1','Project 1','Project 1','Project 2','Project 2','Project 2','Project 2','Project 2','Project 2','Project 3','Project 3','Project 3','Project 3','Project 3','Project 3','Project 4','Project 5','Project 5','Project 5','Project 5','Project 5','Project 5']
hours_output = [0,0,0,1,0,0,0,0,0,0,2,4,0,0,0,0,2,1,0,0,0,0,2,4,2,0,4,0,0,3]
taskcount_output = [0,0,0,1,0,0,0,0,2,0,0,1,0,2,0,0,0,1,0,0,0,2,0,1,0,0,0,0,0,2]
data_output = pd.DataFrame({'Project':project_output, 'Date':date_output, 'Hours':hours_output,'TaskCount':taskcount_output})
I should also note that the real dfs that I'm working with are very large - they comprise about 278,000 rows - so I'm hoping to find an efficient solution. I tried the method detailed in this StackOverflow post: https://stackoverflow.com/questions/44978196/pandas-filling-missing-dates-and-values-within-group , but it didn't allow for the different start and end dates for each project.
答案1
得分: 1
以下是代码的中文翻译:
一种选项:
```python
(data.groupby('Project')
.apply(lambda g: g.set_index('Date')
.reindex(pd.date_range(project_duration.loc[g.name, 'DateFirstRecord'],
project_duration.loc[g.name, 'DateLastRecord']
).rename('Date'),
fill_value=0
)
)[['Hours', 'TaskCount']]
.reset_index()
)
输出:
Project Date Hours TaskCount
0 Project 1 2010-10-12 0 0
1 Project 1 2010-10-13 0 0
2 Project 1 2010-10-14 0 0
3 Project 1 2010-10-15 1 1
4 Project 1 2010-10-16 0 0
5 Project 1 2010-10-17 0 0
6 Project 1 2010-10-18 0 0
7 Project 1 2010-10-19 0 0
8 Project 1 2010-10-20 0 2
9 Project 1 2010-10-21 0 0
10 Project 1 2010-10-22 2 0
11 Project 2 2012-05-05 4 1
12 Project 2 2012-05-06 0 0
13 Project 2 2012-05-07 0 2
14 Project 2 2012-05-08 0 0
15 Project 2 2012-05-09 0 0
16 Project 2 2012-05-10 2 0
17 Project 3 2018-01-01 1 1
18 Project 3 2018-01-02 0 0
19 Project 3 2018-01-03 0 0
20 Project 3 2018-01-04 0 0
21 Project 3 2018-01-05 0 2
22 Project 3 2018-01-06 2 0
23 Project 4 2019-10-02 4 1
24 Project 5 2010-02-02 2 0
25 Project 5 2010-02-03 0 0
26 Project 5 2010-02-04 4 0
27 Project 5 2010-02-05 0 0
28 Project 5 2010-02-06 0 0
29 Project 5 2010-02-07 3 2
希望这对您有帮助。如果您有任何其他问题,可以继续提问。
英文:
One option:
(data.groupby('Project')
.apply(lambda g: g.set_index('Date')
.reindex(pd.date_range(project_duration.loc[g.name, 'DateFirstRecord'],
project_duration.loc[g.name, 'DateLastRecord']
).rename('Date'),
fill_value=0
)
)[['Hours', 'TaskCount']]
.reset_index()
)
Output:
Project Date Hours TaskCount
0 Project 1 2010-10-12 0 0
1 Project 1 2010-10-13 0 0
2 Project 1 2010-10-14 0 0
3 Project 1 2010-10-15 1 1
4 Project 1 2010-10-16 0 0
5 Project 1 2010-10-17 0 0
6 Project 1 2010-10-18 0 0
7 Project 1 2010-10-19 0 0
8 Project 1 2010-10-20 0 2
9 Project 1 2010-10-21 0 0
10 Project 1 2010-10-22 2 0
11 Project 2 2012-05-05 4 1
12 Project 2 2012-05-06 0 0
13 Project 2 2012-05-07 0 2
14 Project 2 2012-05-08 0 0
15 Project 2 2012-05-09 0 0
16 Project 2 2012-05-10 2 0
17 Project 3 2018-01-01 1 1
18 Project 3 2018-01-02 0 0
19 Project 3 2018-01-03 0 0
20 Project 3 2018-01-04 0 0
21 Project 3 2018-01-05 0 2
22 Project 3 2018-01-06 2 0
23 Project 4 2019-10-02 4 1
24 Project 5 2010-02-02 2 0
25 Project 5 2010-02-03 0 0
26 Project 5 2010-02-04 4 0
27 Project 5 2010-02-05 0 0
28 Project 5 2010-02-06 0 0
29 Project 5 2010-02-07 3 2
答案2
得分: 1
以下是代码的翻译部分:
x = data.groupby('Project').apply(
lambda x: (tmp:=x.set_index('Date').asfreq('1D')).assign(
Project=tmp['Project'].ffill(),
Hours=tmp['Hours'].fillna(0).astype(int),
TaskCount=tmp.TaskCount.fillna(0).astype(int))
).droplevel(0).reset_index()
print(x)
打印结果:
Date Project Hours TaskCount
0 2010-10-12 Project 1 0 0
1 2010-10-13 Project 1 0 0
2 2010-10-14 Project 1 0 0
3 2010-10-15 Project 1 1 1
4 2010-10-16 Project 1 0 0
5 2010-10-17 Project 1 0 0
6 2010-10-18 Project 1 0 0
7 2010-10-19 Project 1 0 0
8 2010-10-20 Project 1 0 2
9 2010-10-21 Project 1 0 0
10 2010-10-22 Project 1 2 0
11 2012-05-05 Project 2 4 1
12 2012-05-06 Project 2 0 0
13 2012-05-07 Project 2 0 2
14 2012-05-08 Project 2 0 0
15 2012-05-09 Project 2 0 0
16 2012-05-10 Project 2 2 0
17 2018-01-01 Project 3 1 1
18 2018-01-02 Project 3 0 0
19 2018-01-03 Project 3 0 0
20 2018-01-04 Project 3 0 0
21 2018-01-05 Project 3 0 2
22 2018-01-06 Project 3 2 0
23 2019-10-02 Project 4 4 1
24 2010-02-02 Project 5 2 0
25 2010-02-03 Project 5 0 0
26 2010-02-04 Project 5 4 0
27 2010-02-05 Project 5 0 0
28 2010-02-06 Project 5 0 0
29 2010-02-07 Project 5 3 2
希望这有所帮助。如果您需要更多翻译,请告诉我。
英文:
Try:
x = data.groupby('Project').apply(
lambda x: (tmp:=x.set_index('Date').asfreq('1D')).assign(
Project=tmp['Project'].ffill(),
Hours=tmp['Hours'].fillna(0).astype(int),
TaskCount=tmp.TaskCount.fillna(0).astype(int))
).droplevel(0).reset_index()
print(x)
Prints:
Date Project Hours TaskCount
0 2010-10-12 Project 1 0 0
1 2010-10-13 Project 1 0 0
2 2010-10-14 Project 1 0 0
3 2010-10-15 Project 1 1 1
4 2010-10-16 Project 1 0 0
5 2010-10-17 Project 1 0 0
6 2010-10-18 Project 1 0 0
7 2010-10-19 Project 1 0 0
8 2010-10-20 Project 1 0 2
9 2010-10-21 Project 1 0 0
10 2010-10-22 Project 1 2 0
11 2012-05-05 Project 2 4 1
12 2012-05-06 Project 2 0 0
13 2012-05-07 Project 2 0 2
14 2012-05-08 Project 2 0 0
15 2012-05-09 Project 2 0 0
16 2012-05-10 Project 2 2 0
17 2018-01-01 Project 3 1 1
18 2018-01-02 Project 3 0 0
19 2018-01-03 Project 3 0 0
20 2018-01-04 Project 3 0 0
21 2018-01-05 Project 3 0 2
22 2018-01-06 Project 3 2 0
23 2019-10-02 Project 4 4 1
24 2010-02-02 Project 5 2 0
25 2010-02-03 Project 5 0 0
26 2010-02-04 Project 5 4 0
27 2010-02-05 Project 5 0 0
28 2010-02-06 Project 5 0 0
29 2010-02-07 Project 5 3 2
答案3
得分: 0
以下是您要翻译的内容:
col1=project_duration.apply(lambda ss:pd.date_range(ss.DateFirstRecord,ss.DateLastRecord),axis=1)
col1.explode().to_frame("Date").reset_index().set_index(["Project","Date"])\
.join(data.set_index(["Project","Date"])).fillna(0).reset_index()
输出:
Date Project Hours TaskCount
0 2010-10-12 Project 1 0 0
1 2010-10-13 Project 1 0 0
2 2010-10-14 Project 1 0 0
3 2010-10-15 Project 1 1 1
4 2010-10-16 Project 1 0 0
5 2010-10-17 Project 1 0 0
6 2010-10-18 Project 1 0 0
7 2010-10-19 Project 1 0 0
8 2010-10-20 Project 1 0 2
9 2010-10-21 Project 1 0 0
10 2010-10-22 Project 1 2 0
11 2012-05-05 Project 2 4 1
12 2012-05-06 Project 2 0 0
13 2012-05-07 Project 2 0 2
14 2012-05-08 Project 2 0 0
15 2012-05-09 Project 2 0 0
16 2012-05-10 Project 2 2 0
17 2018-01-01 Project 3 1 1
18 2018-01-02 Project 3 0 0
19 2018-01-03 Project 3 0 0
20 2018-01-04 Project 3 0 0
21 2018-01-05 Project 3 0 2
22 2018-01-06 Project 3 2 0
23 2019-10-02 Project 4 4 1
24 2010-02-02 Project 5 2 0
25 2010-02-03 Project 5 0 0
26 2010-02-04 Project 5 4 0
27 2010-02-05 Project 5 0 0
28 2010-02-06 Project 5 0 0
29 2010-02-07 Project 5 3 2
希望这对您有所帮助。
英文:
col1=project_duration.apply(lambda ss:pd.date_range(ss.DateFirstRecord,ss.DateLastRecord),axis=1)
col1.explode().to_frame("Date").reset_index().set_index(["Project","Date"])\
.join(data.set_index(["Project","Date"])).fillna(0).reset_index()
out:
Date Project Hours TaskCount
0 2010-10-12 Project 1 0 0
1 2010-10-13 Project 1 0 0
2 2010-10-14 Project 1 0 0
3 2010-10-15 Project 1 1 1
4 2010-10-16 Project 1 0 0
5 2010-10-17 Project 1 0 0
6 2010-10-18 Project 1 0 0
7 2010-10-19 Project 1 0 0
8 2010-10-20 Project 1 0 2
9 2010-10-21 Project 1 0 0
10 2010-10-22 Project 1 2 0
11 2012-05-05 Project 2 4 1
12 2012-05-06 Project 2 0 0
13 2012-05-07 Project 2 0 2
14 2012-05-08 Project 2 0 0
15 2012-05-09 Project 2 0 0
16 2012-05-10 Project 2 2 0
17 2018-01-01 Project 3 1 1
18 2018-01-02 Project 3 0 0
19 2018-01-03 Project 3 0 0
20 2018-01-04 Project 3 0 0
21 2018-01-05 Project 3 0 2
22 2018-01-06 Project 3 2 0
23 2019-10-02 Project 4 4 1
24 2010-02-02 Project 5 2 0
25 2010-02-03 Project 5 0 0
26 2010-02-04 Project 5 4 0
27 2010-02-05 Project 5 0 0
28 2010-02-06 Project 5 0 0
29 2010-02-07 Project 5 3 2
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论