Construct panel data/time series in Python when only the start and end dates are known
Question
Suppose I have the information in Table 1 and want to use a pandas DataFrame to expand it into Table 2. How can I produce Table 2 automatically with pandas, without manual manipulation?
Any suggestions are welcome.
Dates are in MM/YYYY format.
Table 1
| Person | Company | Begin date | End Date |
|---|---|---|---|
| Mr. Bun | Company A | 8/1984 | 10/1984 |
| | Company B | 1/1985 | 3/1985 |
The expected results look like this.
Table 2
| Person | Year | Company |
|---|---|---|
| Mr. Bun | 8/1984 | Company A |
| | 9/1984 | Company A |
| | 10/1984 | Company A |
| | 11/1984 | Unemployed |
| | 12/1984 | Unemployed |
| | 1/1985 | Company B |
| | 2/1985 | Company B |
| | 3/1985 | Company B |
Answer 1
Score: 1
A solution that also handles multiple `Person`s in the `Person` column:
import numpy as np
import pandas as pd

df = pd.DataFrame([{'Person': 'Mr. Bun', 'Company': 'Company A',
                    'Begin date': '8/1984', 'End Date': '10/1984'},
                   {'Person': np.nan, 'Company': 'Company B',
                    'Begin date': '1/1985', 'End Date': '3/1985'}])
print(df)
    Person    Company Begin date End Date
0  Mr. Bun  Company A     8/1984  10/1984
1      NaN  Company B     1/1985   3/1985
# forward-fill the missing Person values
df['Person'] = df['Person'].ffill()
# convert the MM/YYYY strings to monthly periods
df['Begin date'] = pd.to_datetime(df['Begin date'], format='%m/%Y').dt.to_period('M')
df['End Date'] = pd.to_datetime(df['End Date'], format='%m/%Y').dt.to_period('M')
# repeat each row once per month between Begin date and End Date (inclusive);
# 'int64' is spelled out so the period ordinals convert the same way on every platform
df1 = df.loc[df.index.repeat(df['End Date'].astype('int64')
                             .sub(df['Begin date'].astype('int64')).add(1))].copy()
# add a per-row month counter to Begin date
df1['Year'] = df1['Begin date'].add(df1.groupby(level=0).cumcount())
# fill the months missing between two jobs with 'Unemployed'
f = lambda x: x.reindex(pd.period_range(x.index.min(), x.index.max(),
                                        freq='M', name='Year'), fill_value='Unemployed')
df1 = df1.set_index('Year').groupby('Person')['Company'].apply(f).reset_index()
# back to the original MM/YYYY format
df1['Year'] = df1['Year'].dt.strftime('%m/%Y')
print(df1)
    Person     Year     Company
0  Mr. Bun  08/1984   Company A
1  Mr. Bun  09/1984   Company A
2  Mr. Bun  10/1984   Company A
3  Mr. Bun  11/1984  Unemployed
4  Mr. Bun  12/1984  Unemployed
5  Mr. Bun  01/1985   Company B
6  Mr. Bun  02/1985   Company B
7  Mr. Bun  03/1985   Company B
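The row-repeating step is the least obvious part of this answer, so here is a minimal sketch of it in isolation. The `spells` frame and its `months` and `offset` columns are made up for illustration and are not part of the answer's data.

```python
import pandas as pd

# Toy frame (hypothetical): two jobs lasting 3 and 2 months.
spells = pd.DataFrame({'Company': ['A', 'B'], 'months': [3, 2]})

# Index.repeat duplicates each row label `months` times: 0, 0, 0, 1, 1.
# Selecting those labels with .loc yields one row per month of each job.
expanded = spells.loc[spells.index.repeat(spells['months'])]

# cumcount numbers the copies of each original row: 0, 1, 2, 0, 1.
# Adding this offset to a monthly start period gives the month of each row.
expanded = expanded.assign(offset=expanded.groupby(level=0).cumcount())

print(expanded)
```

In the answer, the repeat count is the number of months between `Begin date` and `End Date` plus one, and the offset is added to `Begin date` to produce the `Year` column.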
Answer 2
Score: 1
Basically you can use something like this for faster processing:
Data:
import pandas as pd

df = pd.DataFrame({'name': ['Bun', 'Bun'],
                   'comp': ['A', 'B'],
                   'start': pd.to_datetime(['1984-08-31', '1985-01-31']),
                   'end': pd.to_datetime(['1984-10-31', '1985-03-31'])})
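Create a new DataFrame that covers every month-end date for each name:
new = df[['name', 'start']].set_index(['name', 'start'])
# pd.offsets.MonthEnd() is used instead of the 'M' alias,
# which newer pandas versions deprecate in favour of 'ME'
mux = pd.MultiIndex.from_product(
    [new.index.levels[0],
     pd.date_range(start='1984-08-31', end='1985-03-31', freq=pd.offsets.MonthEnd())],
    names=['name', 'date'])
new = new.reindex(mux).reset_index()
print(new)
This should give the following output: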
  name       date
0  Bun 1984-08-31
1  Bun 1984-09-30
2  Bun 1984-10-31
3  Bun 1984-11-30
4  Bun 1984-12-31
5  Bun 1985-01-31
6  Bun 1985-02-28
7  Bun 1985-03-31
After that, merge the result with `df` and drop the rows you don't need.
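The answer stops before the merge, so the following is only one possible way to finish it, not necessarily what the author had in mind. It assumes a person's jobs never overlap, and the names `grid`, `pairs`, `inside` and `panel` are introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({'name': ['Bun', 'Bun'],
                   'comp': ['A', 'B'],
                   'start': pd.to_datetime(['1984-08-31', '1985-01-31']),
                   'end': pd.to_datetime(['1984-10-31', '1985-03-31'])})

# Month-end grid for every name, from the earliest start to the latest end.
dates = pd.date_range(df['start'].min(), df['end'].max(),
                      freq=pd.offsets.MonthEnd())
mux = pd.MultiIndex.from_product([df['name'].unique(), dates],
                                 names=['name', 'date'])
grid = mux.to_frame(index=False)

# Pair each grid row with every job of the same person,
# then keep only the pairs whose date falls inside the job.
pairs = grid.merge(df, on='name', how='left')
inside = pairs[(pairs['date'] >= pairs['start']) & (pairs['date'] <= pairs['end'])]

# Attach the matched company to the full grid; months with no match are gaps.
panel = grid.merge(inside[['name', 'date', 'comp']],
                   on=['name', 'date'], how='left')
panel['comp'] = panel['comp'].fillna('Unemployed')

print(panel)
```

Because the grid spans the overall earliest and latest dates, a person whose own history is shorter would also get `Unemployed` rows outside their first and last job; those are the rows to drop if they are not wanted.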