Posted: 2023-02-06 14:46:10

Construct a panel data/time series when knowing only start and end date in Python

Question

Suppose I have the information in Table 1 and want to use a pandas DataFrame to expand it into Table 2. How can I obtain Table 2 automatically with pandas, without manual manipulation? Can you explain the procedure?

Any suggestions are welcome.

Dates are in MM/YYYY format.

Table 1

Person	Company	Begin date	End Date
Mr. Bun	Company A	8/1984	10/1984
	Company B	1/1985	3/1985

The expected results look like this.

Table 2

Person	Year	Company
Mr. Bun	8/1984	Company A
	9/1984	Company A
	10/1984	Company A
	11/1984	Unemployed
	12/1984	Unemployed
	1/1985	Company B
	2/1985	Company B
	3/1985	Company B

Answer 1

Score: 1

A solution that also handles multiple people in the Person column:

import numpy as np
import pandas as pd

df = pd.DataFrame([{'Person': 'Mr. Bun', 'Company': 'Company A',
                    'Begin date': '8/1984', 'End Date': '10/1984'},
                   {'Person': np.nan, 'Company': 'Company B',
                    'Begin date': '1/1985', 'End Date': '3/1985'}])

print(df)
    Person    Company Begin date End Date
0  Mr. Bun  Company A     8/1984  10/1984
1      NaN  Company B     1/1985   3/1985

# forward-fill the Person column so every row names its person
df['Person'] = df['Person'].ffill()
# parse the MM/YYYY strings and convert them to monthly periods
df['Begin date'] = pd.to_datetime(df['Begin date'], format='%m/%Y').dt.to_period('M')
df['End Date'] = pd.to_datetime(df['End Date'], format='%m/%Y').dt.to_period('M')
# repeat each row once per month between Begin date and End Date (inclusive)
df1 = df.loc[df.index.repeat(df['End Date'].astype('int64')
                     .sub(df['Begin date'].astype('int64')).add(1))].copy()
# add a per-row counter to Begin date to get the month of each repeated row
df1['Year'] = df1['Begin date'].add(df1.groupby(level=0).cumcount())
# per person, reindex over the full month range and mark the gaps as Unemployed
f = lambda x: x.reindex(pd.period_range(x.index.min(), x.index.max(),
                                        freq='M', name='Year'), fill_value='Unemployed')
df1 = df1.set_index('Year').groupby('Person')['Company'].apply(f).reset_index()
# convert back to the original MM/YYYY format
df1['Year'] = df1['Year'].dt.strftime('%m/%Y')
print(df1)
    Person     Year     Company
0  Mr. Bun  08/1984   Company A
1  Mr. Bun  09/1984   Company A
2  Mr. Bun  10/1984   Company A
3  Mr. Bun  11/1984  Unemployed
4  Mr. Bun  12/1984  Unemployed
5  Mr. Bun  01/1985   Company B
6  Mr. Bun  02/1985   Company B
7  Mr. Bun  03/1985   Company B
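
If the index-repeat step above is hard to follow, the same expansion can be sketched with explicit loops. This is only an illustrative variant, not the answer's own method: `to_month` and `expand_spells` are hypothetical helper names, and the loops will be slower than the vectorized version on large tables.

```python
import pandas as pd


def to_month(text):
    """Parse an 'MM/YYYY' string into a monthly pandas Period."""
    return pd.to_datetime(text, format="%m/%Y").to_period("M")


def expand_spells(df, fill="Unemployed"):
    """Return one row per person and month; months without a job get `fill`."""
    data = df.copy()
    data["Person"] = data["Person"].ffill()
    rows = []
    for person, grp in data.groupby("Person", sort=False):
        # Map every employed month to the company held in that month.
        employed = {}
        for company, begin, end in zip(grp["Company"], grp["Begin date"], grp["End Date"]):
            for month in pd.period_range(to_month(begin), to_month(end), freq="M"):
                employed[month] = company
        # Walk the person's full month range and fill the gaps.
        for month in pd.period_range(min(employed), max(employed), freq="M"):
            rows.append({"Person": person,
                         "Year": month.strftime("%m/%Y"),
                         "Company": employed.get(month, fill)})
    return pd.DataFrame(rows, columns=["Person", "Year", "Company"])


table1 = pd.DataFrame({
    "Person": ["Mr. Bun", None],
    "Company": ["Company A", "Company B"],
    "Begin date": ["8/1984", "1/1985"],
    "End Date": ["10/1984", "3/1985"],
})
table2 = expand_spells(table1)
print(table2)
```

As in the answer, the month range is computed per person, so each person's panel starts at their first job and ends at their last.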

Answer 2

Score: 1

You can use something like this for faster processing:

Data:

import pandas as pd

df = pd.DataFrame({'name': ['Bun', 'Bun'],
                   'comp': ['A', 'B'],
                   'start': pd.to_datetime(['1984-08-31', '1985-01-31']),
                   'end': pd.to_datetime(['1984-10-31', '1985-03-31'])})

Create a new DataFrame that contains every month-end date for each name:

new = df[['name', 'start']].set_index(['name', 'start'])
# freq='M' means month end; pandas 2.2 and later prefer the alias 'ME'
mux = pd.MultiIndex.from_product([new.index.levels[0],
                                  pd.date_range(start='1984-08-31', end='1985-03-31', freq='M')],
                                 names=['name', 'date'])
new.reindex(mux).reset_index()

This should give the following output:

  name       date
0  Bun 1984-08-31
1  Bun 1984-09-30
2  Bun 1984-10-31
3  Bun 1984-11-30
4  Bun 1984-12-31
5  Bun 1985-01-31
6  Bun 1985-02-28
7  Bun 1985-03-31

After that, you can merge the result with df and then drop the rows you do not need.
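
The merge step is not spelled out in the answer, so here is one possible sketch of it. It assumes the same `name`, `comp`, `start`, and `end` columns; the intermediate names `grid`, `merged`, `inside`, and `panel` are made up for illustration. The date grid is built with `pd.offsets.MonthEnd()` rather than a string alias, and it spans the overall minimum and maximum dates rather than a per-person range.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Bun", "Bun"],
    "comp": ["A", "B"],
    "start": pd.to_datetime(["1984-08-31", "1985-01-31"]),
    "end": pd.to_datetime(["1984-10-31", "1985-03-31"]),
})

# One row per (name, month-end date) over the whole observed range.
dates = pd.date_range(df["start"].min(), df["end"].max(), freq=pd.offsets.MonthEnd())
grid = pd.MultiIndex.from_product(
    [df["name"].unique(), dates], names=["name", "date"]
).to_frame(index=False)

# Attach every job spell to every date, then keep only dates inside a spell.
merged = grid.merge(df, on="name", how="left")
inside = merged[(merged["date"] >= merged["start"]) & (merged["date"] <= merged["end"])]

# Merge back onto the grid so uncovered months survive, and label them.
panel = grid.merge(inside[["name", "date", "comp"]], on=["name", "date"], how="left")
panel["comp"] = panel["comp"].fillna("Unemployed")
print(panel)
```

The intermediate `merged` frame has one row per date and spell, so for people with many spells a per-person loop or an interval-based join may use less memory.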
