Python dataframe: insert a row based on various criteria
Question
In the following dataframe, each row represents a part of a shift for a day.
import pandas as pd
df = pd.DataFrame()
df['number'] = '169', '169', '169', '169', '2129', '2129', '2129', '2129', '2129', '2129', '5645', '5645', '5645', '5645', '5645', '5645', '5645', '5645', '7480', '7480', '7480', '7480', '7480', '7480', '7480', '6713', '6713', '6713', '6713', '6713', '6713', '6713', '6713','4004', '4004', '4004', '4004', '4004', '4004'
df['date'] ='2/03/2023' ,'4/03/2023', '4/03/2023','5/03/2023', '2/03/2023', '2/03/2023', '2/03/2023' ,'4/03/2023', '4/03/2023', '5/03/2023', '27/02/2023', '27/02/2023', '27/02/2023', '28/02/2023', '3/03/2023', '3/03/2023', '3/03/2023', '3/03/2023', '1/03/2023', '2/03/2023', '3/03/2023', '4/03/2023', '4/03/2023', '5/03/2023', '5/03/2023','2/03/2023', '2/03/2023', '2/03/2023' ,'4/03/2023', '4/03/2023', '4/03/2023', '4/03/2023' ,'5/03/2023','28/02/2023', '1/03/2023', '2/03/2023', '3/03/2023', '4/03/2023', '4/03/2023'
df['start'] = '21:00:00' ,'21:00:00', '21:00:00', '21:00:00', '6:00:00', '6:00:00', '6:00:00', '14:30:00', '14:30:00', '14:30:00', '15:00:00', '15:00:00', '15:00:00', '21:00:00', '15:00:00', '15:00:00', '15:00:00', '15:00:00', '7:00:00', '7:00:00','7:00:00' ,'7:00:00', '7:00:00', '7:00:00', '14:30:00', '6:30:00', '6:30:00', '6:30:00', '6:30:00','6:30:00', '6:30:00', '6:30:00','14:30:00','22:45:00', '22:45:00', '22:45:00', '22:45:00', '22:45:00', '22:45:00'
df['end'] = '7:00:00', '7:00:00', '7:00:00', '7:00:00', '21:00:00', '21:00:00', '21:00:00', '21:00:00', '21:00:00', '21:00:00', '7:00:00', '7:00:00', '7:00:00', '7:00:00', '7:00:00', '7:00:00', '7:00:00', '7:00:00', '13:45:00', '13:45:00' ,'15:30:00', '15:15:00', '15:15:00', '13:30:00', '23:00:00', '22:30:00', '22:30:00', '22:30:00', '21:00:00', '21:00:00', '21:00:00', '21:00:00' ,'16:15:00','7:15:00','7:15:00','7:15:00','7:15:00','7:15:00','7:15:00'
df['hours'] = '9.5', '2.5', '7', '9.5', '10', '2', '2', '3.5', '2.5', '6', '10', '2', '3', '9.5', '10', '2', '1.5', '1.5', '6.25', '6.25', '8', '1', '6.75', '6', '8', '10', '2', '3', '2', '8', '2', '1.5', '1.75', '8', '8', '8', '8', '6', '2'
df['day'] = 'Thursday', 'Saturday', 'Saturday', 'Sunday', 'Thursday', 'Thursday', 'Thursday', 'Saturday', 'Saturday', 'Sunday', 'Monday', 'Monday', 'Monday', 'Tuesday', 'Friday', 'Friday', 'Friday', 'Friday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Saturday', 'Sunday', 'Sunday', 'Thursday', 'Thursday', 'Thursday', 'Saturday', 'Saturday', 'Saturday', 'Saturday', 'Sunday','Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Saturday'
print(df)
For each row, if the 'day' column is Friday, Saturday or Sunday AND the 'start' column is before midnight AND the 'end' column is after midnight, I need to do the following to the row whose 'hours' pass through midnight (let's call this row 'old row'). As an example of 'pass through midnight': number '169' starts at 21:00 on Saturday; the first row for Saturday is 2.5 hours (i.e. it ends at 23:30), so it is the second Saturday row, for 7 hours, which passes through midnight.
- Insert a row (let's call this row 'new row') directly underneath 'old row'.
- The 'hours' value of 'old row' needs to be changed to the time the shift still needs to reach midnight (e.g. for 'Aberash', out of the 7 hours only 0.5 hours are required to reach midnight, as the previous row ended at 23:30).
- The new row receives the balance of 'hours' (i.e. 7 hours - 0.5 hours = 6.5 hours).
- The 'day' value of the new row is changed as follows:
  i) If 'old row' was Friday, new row is Friday/Saturday
  ii) If 'old row' was Saturday, new row is Saturday/Sunday
  iii) If 'old row' was Sunday, new row is Sunday/Monday
All other data in 'new row' should be the same as in 'old row'.
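The arithmetic described above can be sketched for a single row as follows. This is only an illustration of the rule, not a full solution: the helper names `hours_until_midnight` and `split_row` and the `NEXT_DAY_LABEL` mapping are hypothetical, and the sketch assumes the hours already worked before 'old row' are known.

```python
from datetime import datetime, timedelta

# Hypothetical lookup for the 'day' value of the new row.
NEXT_DAY_LABEL = {
    "Friday": "Friday/Saturday",
    "Saturday": "Saturday/Sunday",
    "Sunday": "Sunday/Monday",
}


def hours_until_midnight(start: str) -> float:
    """Hours between an 'H:MM:SS' start time and the following midnight."""
    t = datetime.strptime(start, "%H:%M:%S")
    midnight = (t + timedelta(days=1)).replace(hour=0, minute=0, second=0)
    return (midnight - t).total_seconds() / 3600


def split_row(hours: float, already_worked: float, start: str, day: str):
    """Return (old_row_hours, new_row_hours, new_row_day) for the row that
    passes through midnight. `already_worked` is the sum of the hours of the
    earlier rows of the same shift."""
    old_hours = hours_until_midnight(start) - already_worked
    new_hours = hours - old_hours
    return old_hours, new_hours, NEXT_DAY_LABEL[day]


# Number '169' on Saturday: shift starts 21:00, first row is 2.5 hours,
# second row is 7 hours and passes through midnight.
print(split_row(7.0, 2.5, "21:00:00", "Saturday"))
```

For the '169' example this gives 0.5 hours for 'old row', 6.5 hours for 'new row', and the label Saturday/Sunday.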
Below is what the dataframe should look like after applying these steps:
import pandas as pd
df = pd.DataFrame()
df['number'] = '169', '169', '169', '169', '169', '2129', '2129', '2129', '2129', '2129', '2129', '5645', '5645', '5645', '5645', '5645', '5645', '5645', '5645', '5645', '7480', '7480', '7480', '7480', '7480', '7480', '7480', '6713', '6713', '6713', '6713', '6713', '6713', '6713', '6713','4004', '4004','4004', '4004', '4004', '4004', '4004', '4004'
df['date'] = '2/03/2023','4/03/2023','4/03/2023', '4/03/2023','5/03/2023', '2/03/2023', '2/03/2023', '2/03/2023' ,'4/03/2023', '4/03/2023', '5/03/2023', '27/02/2023', '27/02/2023', '27/02/2023', '28/02/2023', '3/03/2023', '3/03/2023', '3/03/2023', '3/03/2023', '3/03/2023', '1/03/2023', '2/03/2023', '3/03/2023', '4/03/2023', '4/03/2023', '5/03/2023', '5/03/2023','2/03/2023', '2/03/2023', '2/03/2023' ,'4/03/2023', '4/03/2023', '4/03/2023', '4/03/2023' ,'5/03/2023','28/02/2023', '1/03/2023', '2/03/2023', '3/03/2023', '3/03/2023', '4/03/2023', '4/03/2023', '4/03/2023'
df['start'] = '21:00:00','21:00:00','21:00:00', '21:00:00', '21:00:00', '6:00:00', '6:00:00', '6:00:00', '14:30:00', '14:30:00', '14:30:00', '15:00:00', '15:00:00', '15:00:00', '21:00:00', '15:00:00', '15:00:00', '15:00:00', '15:00:00','15:00:00', '7:00:00', '7:00:00','7:00:00' ,'7:00:00', '7:00:00', '7:00:00', '14:30:00', '6:30:00', '6:30:00', '6:30:00', '6:30:00','6:30:00', '6:30:00', '6:30:00','14:30:00','22:45:00', '22:45:00', '22:45:00', '22:45:00', '22:45:00', '22:45:00', '22:45:00', '22:45:00'
df['end'] = '7:00:00', '7:00:00','7:00:00', '7:00:00', '7:00:00', '21:00:00', '21:00:00', '21:00:00', '21:00:00', '21:00:00', '21:00:00', '7:00:00', '7:00:00', '7:00:00', '7:00:00', '7:00:00', '7:00:00', '7:00:00', '7:00:00', '7:00:00', '13:45:00', '13:45:00' ,'15:30:00', '15:15:00', '15:15:00', '13:30:00', '23:00:00', '22:30:00', '22:30:00', '22:30:00', '21:00:00', '21:00:00', '21:00:00', '21:00:00' ,'16:15:00','7:15:00','7:15:00','7:15:00','7:15:00','7:15:00','7:15:00','7:15:00','7:15:00'
df['hours'] = '9.5', '2.5', '0.5', '6.5', '9.5', '10', '2', '2', '3.5', '2.5', '6', '10', '2', '3', '9.5', '9', '1', '2', '1.5', '1.5', '6.25', '6.25', '8', '1', '6.75', '6', '8', '10', '2', '3', '2', '8', '2', '1.5', '1.75', '8', '8', '8', '1.25', '6.75', '1.25', '4.75', '2'
df['day'] = 'Thursday', 'Saturday', 'Saturday', 'Saturday/Sunday', 'Sunday', 'Thursday', 'Thursday', 'Thursday', 'Saturday', 'Saturday', 'Sunday', 'Monday', 'Monday', 'Monday', 'Tuesday', 'Friday', 'Friday/Saturday', 'Friday', 'Friday', 'Friday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Saturday', 'Sunday', 'Sunday', 'Thursday', 'Thursday', 'Thursday', 'Saturday', 'Saturday', 'Saturday', 'Saturday', 'Sunday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Friday/Saturday', 'Saturday', 'Saturday/Sunday', 'Saturday'
print(df)
Answer 1
Score: 1
<details>
<summary>English:</summary>
This solution is ugly, slow, and generally unmanageable.
Writing it made me genuinely angry.
This is a testament to why choosing the right data structures leads to cleaner code. Anyway, here it is:
import pandas as pd
from datetime import timedelta
df = pd.DataFrame()
df['number'] = '169', '169', '169', '169', '2129', '2129', '2129', '2129', '2129', '2129', '5645', '5645', '5645', '5645', '5645', '5645', '5645', '5645', '7480', '7480', '7480', '7480', '7480', '7480', '7480', '6713', '6713', '6713', '6713', '6713', '6713', '6713', '6713','4004', '4004', '4004', '4004', '4004', '4004'
df['date'] ='2/03/2023' ,'4/03/2023', '4/03/2023','5/03/2023', '2/03/2023', '2/03/2023', '2/03/2023' ,'4/03/2023', '4/03/2023', '5/03/2023', '27/02/2023', '27/02/2023', '27/02/2023', '28/02/2023', '3/03/2023', '3/03/2023', '3/03/2023', '3/03/2023', '1/03/2023', '2/03/2023', '3/03/2023', '4/03/2023', '4/03/2023', '5/03/2023', '5/03/2023','2/03/2023', '2/03/2023', '2/03/2023' ,'4/03/2023', '4/03/2023', '4/03/2023', '4/03/2023' ,'5/03/2023','28/02/2023', '1/03/2023', '2/03/2023', '3/03/2023', '4/03/2023', '4/03/2023'
df['start'] = '21:00:00' ,'21:00:00', '21:00:00', '21:00:00', '6:00:00', '6:00:00', '6:00:00', '14:30:00', '14:30:00', '14:30:00', '15:00:00', '15:00:00', '15:00:00', '21:00:00', '15:00:00', '15:00:00', '15:00:00', '15:00:00', '7:00:00', '7:00:00','7:00:00' ,'7:00:00', '7:00:00', '7:00:00', '14:30:00', '6:30:00', '6:30:00', '6:30:00', '6:30:00','6:30:00', '6:30:00', '6:30:00','14:30:00','22:45:00', '22:45:00', '22:45:00', '22:45:00', '22:45:00', '22:45:00'
df['end'] = '7:00:00', '7:00:00', '7:00:00', '7:00:00', '21:00:00', '21:00:00', '21:00:00', '21:00:00', '21:00:00', '21:00:00', '7:00:00', '7:00:00', '7:00:00', '7:00:00', '7:00:00', '7:00:00', '7:00:00', '7:00:00', '13:45:00', '13:45:00' ,'15:30:00', '15:15:00', '15:15:00', '13:30:00', '23:00:00', '22:30:00', '22:30:00', '22:30:00', '21:00:00', '21:00:00', '21:00:00', '21:00:00' ,'16:15:00','7:15:00','7:15:00','7:15:00','7:15:00','7:15:00','7:15:00'
df['hours'] = '9.5', '2.5', '7', '9.5', '10', '2', '2', '3.5', '2.5', '6', '10', '2', '3', '9.5', '10', '2', '1.5', '1.5', '6.25', '6.25', '8', '1', '6.75', '6', '8', '10', '2', '3', '2', '8', '2', '1.5', '1.75', '8', '8', '8', '8', '6', '2'
df['day'] = 'Thursday', 'Saturday', 'Saturday', 'Sunday', 'Thursday', 'Thursday', 'Thursday', 'Saturday', 'Saturday', 'Sunday', 'Monday', 'Monday', 'Monday', 'Tuesday', 'Friday', 'Friday', 'Friday', 'Friday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Saturday', 'Sunday', 'Sunday', 'Thursday', 'Thursday', 'Thursday', 'Saturday', 'Saturday', 'Saturday', 'Saturday', 'Sunday','Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Saturday'
# converting to more sensible data types
df['date'] = pd.to_datetime(df['date'], dayfirst=True)
df['hours'] = df['hours'].astype(float)

# df_okay will be the sink for the non-problematic shifts;
# any shift outside of Friday, Saturday and Sunday isn't our concern
df_okay = df[~df['day'].isin(['Friday', 'Saturday', 'Sunday'])].copy()
df = df[df['day'].isin(['Friday', 'Saturday', 'Sunday'])].copy()

# let's use datetime objects to compare dates and calculate durations
df['start_dt'] = pd.to_datetime(df['date'].dt.strftime('%Y-%m-%d ') + df['start'])
df['end_dt'] = pd.to_datetime(df['date'].dt.strftime('%Y-%m-%d ') + df['end'])

# let's add a day to the end datetime of shifts which go over midnight
df.loc[df['start_dt'] > df['end_dt'], 'end_dt'] += timedelta(1)

# any shift which doesn't go over midnight can't be problematic,
# so it goes in the sink
df_okay = pd.concat((df_okay, df[df['start_dt'].dt.date == df['end_dt'].dt.date]))

# these shifts go over midnight so they're at risk
df = df[df['start_dt'].dt.date != df['end_dt'].dt.date]

# let's calculate the total number of hours for each number
# during the shift
df_problem_shifts = df[['number', 'start_dt', 'end_dt', 'hours']].groupby(['number', 'start_dt', 'end_dt']).sum().reset_index().rename(columns={'hours': 'total_hours'})

# let's calculate the number of available hours between the beginning of the shift and midnight
df_problem_shifts['to_midnight'] = (pd.to_datetime(df_problem_shifts['start_dt'].dt.date + timedelta(days=1)) - df_problem_shifts['start_dt']).dt.total_seconds() / 3600

# keep only the shifts whose sum of hours
# is more than the available hours until midnight
df_problem_shifts = df_problem_shifts[df_problem_shifts['to_midnight'] < df_problem_shifts['total_hours']]

# simple dictionary which will be used to replace day
# with day/day+1
days = ['Friday', 'Saturday', 'Sunday', 'Monday']
days = {day: f"{day}/{day2}" for (day, day2) in zip(days, days[1:])}

new_rows = []
for i, row in df_problem_shifts.iterrows():
    # all rows belonging to this shift
    df_hours = df.loc[(df.number == row.number) & (df.start_dt == row.start_dt) & (df.end_dt == row.end_dt)]
    available_hours = row.to_midnight
    for j, row2 in df_hours.iterrows():
        if available_hours == 0:
            # midnight has already been reached: the whole row is on day/day+1
            df.loc[j, 'day'] = days[df.loc[j, 'day']]
        else:
            available_hours -= row2.hours
            if available_hours < 0:
                # this row passes through midnight: keep the part before
                # midnight in place and move the balance to a new row
                df.loc[j, 'hours'] = row2.hours + available_hours
                row2.day = days[row2.day]
                row2.hours = - available_hours
                available_hours = 0
                new_rows.append(row2)
new_rows = pd.DataFrame(new_rows)

# at this point df contains the shift rows for all
# shifts which had to be modified
df = pd.concat((df, new_rows))

# putting back all the shifts which were
# previously declared non-problematic
df = pd.concat((df_okay, df)).drop(['start_dt', 'end_dt'], axis=1)

# sort_values returns a new dataframe, so keep the result
df = df.sort_values(['number', 'date'])
print(df)
</details>
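The inner loop above walks the rows of one shift while counting down the hours left until midnight. The same idea can be sketched without pandas by comparing running totals; this is a minimal illustration under that assumption, and `find_crossing_row` is a hypothetical helper, not part of the answer's code. In pandas, the running totals could come from `Series.cumsum()` instead.

```python
from itertools import accumulate


def find_crossing_row(row_hours, to_midnight):
    """Return the index of the row whose hours span midnight, or None.

    row_hours   -- hours of each row of one shift, in order
    to_midnight -- hours between the shift start and midnight
    """
    ends = list(accumulate(row_hours))  # running total after each row
    starts = [end - hours for end, hours in zip(ends, row_hours)]
    for i, (row_start, row_end) in enumerate(zip(starts, ends)):
        # strictly inside the row: a row ending exactly at midnight
        # does not need to be split
        if row_start < to_midnight < row_end:
            return i
    return None


# Number '169' on Saturday: 21:00 start, so 3 hours to midnight;
# rows of 2.5 and 7 hours.
print(find_crossing_row([2.5, 7.0], 3.0))

# Hypothetical shift whose first row ends exactly at midnight.
print(find_crossing_row([3.0, 4.0], 3.0))
```

The first call identifies the second row (index 1) as the one to split; the second call returns `None`, which corresponds to the `available_hours == 0` branch above, where later rows are only relabelled.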