Construct a panel data/time series when knowing only start and end date in Python


Question


Suppose I have the information in Table 1 and want to use a pandas DataFrame to expand it into Table 2. How can I obtain Table 2 automatically with pandas, without manual manipulation? Could you explain the procedure?

Any suggestions are welcome.

Dates are in MM/YYYY format.

Table 1

Person    Company    Begin date  End Date
Mr. Bun   Company A  8/1984      10/1984
          Company B  1/1985      3/1985

The expected results look like this.

Table 2

Person    Year     Company
Mr. Bun   8/1984   Company A
          9/1984   Company A
          10/1984  Company A
          11/1984  Unemployed
          12/1984  Unemployed
          1/1985   Company B
          2/1985   Company B
          3/1985   Company B

Answer 1

Score: 1


This solution also works when the Person column contains several people:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame([{'Person': 'Mr. Bun', 'Company': 'Company A',
                        'Begin date': '8/1984', 'End Date': '10/1984'},
                       {'Person': np.nan, 'Company': 'Company B',
                        'Begin date': '1/1985', 'End Date': '3/1985'}])
    print(df)

        Person    Company Begin date End Date
    0  Mr. Bun  Company A     8/1984  10/1984
    1      NaN  Company B     1/1985   3/1985

    # forward-fill the missing Person values
    df['Person'] = df['Person'].ffill()
    # convert the MM/YYYY strings to monthly periods
    df['Begin date'] = pd.to_datetime(df['Begin date'], format='%m/%Y').dt.to_period('M')
    df['End Date'] = pd.to_datetime(df['End Date'], format='%m/%Y').dt.to_period('M')
    # repeat each row once per month between Begin date and End Date (inclusive)
    df1 = df.loc[df.index.repeat(df['End Date'].astype('int64')
                                 .sub(df['Begin date'].astype('int64')).add(1))]
    # add a per-row counter to Begin date to get each month
    df1['Year'] = df1['Begin date'].add(df1.groupby(level=0).cumcount())
    # fill the months missing between jobs with 'Unemployed'
    f = lambda x: x.reindex(pd.period_range(x.index.min(), x.index.max(),
                                            freq='M', name='Year'), fill_value='Unemployed')
    df1 = df1.set_index('Year').groupby('Person')['Company'].apply(f).reset_index()
    # back to the original MM/YYYY format
    df1['Year'] = df1['Year'].dt.strftime('%m/%Y')
    print(df1)

        Person     Year     Company
    0  Mr. Bun  08/1984   Company A
    1  Mr. Bun  09/1984   Company A
    2  Mr. Bun  10/1984   Company A
    3  Mr. Bun  11/1984  Unemployed
    4  Mr. Bun  12/1984  Unemployed
    5  Mr. Bun  01/1985   Company B
    6  Mr. Bun  02/1985   Company B
    7  Mr. Bun  03/1985   Company B
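
If the vectorized steps are hard to follow, the same procedure can be spelled out as a plain loop. This is only a sketch, not the answerer's code: it assumes one person's employment spells do not overlap, and the names `spells`, `employed`, `rows`, and `result` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Table 1, with the blank Person cell represented as NaN
spells = pd.DataFrame({
    "Person": ["Mr. Bun", np.nan],
    "Company": ["Company A", "Company B"],
    "Begin date": ["8/1984", "1/1985"],
    "End Date": ["10/1984", "3/1985"],
})
spells["Person"] = spells["Person"].ffill()

# Parse the MM/YYYY strings into monthly periods
spells["begin"] = pd.to_datetime(spells["Begin date"], format="%m/%Y").dt.to_period("M")
spells["end"] = pd.to_datetime(spells["End Date"], format="%m/%Y").dt.to_period("M")

rows = []
for person, group in spells.groupby("Person", sort=False):
    # Map every employed month to its company
    employed = {}
    for company, begin, end in zip(group["Company"], group["begin"], group["end"]):
        for month in pd.period_range(begin, end, freq="M"):
            employed[month] = company

    # Walk the person's full span; months with no company are gaps
    first, last = group["begin"].min(), group["end"].max()
    for month in pd.period_range(first, last, freq="M"):
        rows.append({
            "Person": person,
            "Year": month.strftime("%m/%Y"),
            "Company": employed.get(month, "Unemployed"),
        })

result = pd.DataFrame(rows)
print(result)
```

The loop is slower than the vectorized version on large tables, but it makes the two stages explicit: expand each spell into months, then fill the gaps between spells.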

Answer 2

Score: 1


Basically you can use something like this for faster processing:

Data:

    import pandas as pd

    df = pd.DataFrame({'name': ['Bun', 'Bun'],
                       'comp': ['A', 'B'],
                       'start': pd.to_datetime(['1984-08-31', '1985-01-31']),
                       'end': pd.to_datetime(['1984-10-31', '1985-03-31'])})

Create a new DataFrame that covers every month in the date range:

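    new = df[['name', 'start']].set_index(['name', 'start'])
    # month-end dates from the earliest start to the latest end
    # ('M' is the month-end alias; pandas 2.2+ prefers 'ME')
    dates = pd.date_range(start=df['start'].min(), end=df['end'].max(), freq='M')
    mux = pd.MultiIndex.from_product([new.index.levels[0], dates], names=['name', 'date'])
    new.reindex(mux).reset_index()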

This should produce the following output:

      name       date
    0  Bun 1984-08-31
    1  Bun 1984-09-30
    2  Bun 1984-10-31
    3  Bun 1984-11-30
    4  Bun 1984-12-31
    5  Bun 1985-01-31
    6  Bun 1985-02-28
    7  Bun 1985-03-31

After that, you can merge the result with df and then drop the unnecessary rows.
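
The answer leaves the merge step to the reader. One possible way to finish it is sketched below. This is an assumption about what the answerer had in mind, not their code: it keeps the grid rows whose date falls inside an employment spell and labels the rest 'Unemployed' (a label borrowed from the question). The names `grid`, `merged`, `employed`, and `result` are illustrative, and `pd.offsets.MonthEnd()` is passed as the frequency so the sketch does not depend on the 'M' versus 'ME' alias.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Bun", "Bun"],
    "comp": ["A", "B"],
    "start": pd.to_datetime(["1984-08-31", "1985-01-31"]),
    "end": pd.to_datetime(["1984-10-31", "1985-03-31"]),
})

# Month-end grid covering the whole observed range, one row per (name, date)
dates = pd.date_range(df["start"].min(), df["end"].max(), freq=pd.offsets.MonthEnd())
grid = pd.MultiIndex.from_product(
    [df["name"].unique(), dates], names=["name", "date"]
).to_frame(index=False)

# Pair every grid row with every spell of the same person,
# then keep only the pairs where the date lies inside the spell
merged = grid.merge(df, on="name", how="left")
inside = merged["date"].between(merged["start"], merged["end"])
employed = merged.loc[inside, ["name", "date", "comp"]]

# Attach the company back to the full grid; unmatched months are gaps
result = grid.merge(employed, on=["name", "date"], how="left")
result["comp"] = result["comp"].fillna("Unemployed")
print(result)
```

The intermediate `merged` frame has one row per grid date per spell, so for people with many spells over long ranges it can grow large before the filter is applied.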
