Convert lists of different lengths in a pandas DataFrame into rows of one column
Question
I have a pandas table like the one below. The date is always a Friday, but the dates may not be consecutive because of holidays or other reasons. The Performance column holds a list with the performance of each day of the following week. The list in the last row can have fewer than 5 entries: today is Wednesday, so for this week I only have Monday and Tuesday data.
| Date | Performance |
| 2022/01/27 | [0.1,0.1,0.2,0.1,0.3] |
| 2022/02/10 | [0.1,0.1,0.2,0.1,0.3] |
| 2022/02/17 | [0.1,0.1,0.2,0.1,0.3] |
| 2022/02/24 | [0.1,0.1] |
I want to convert this table into a two-column date/performance table that holds the date of each actual performance day and the performance of that day:
| Date | Performance |
| 2022/01/30 | 0.1 |
| 2022/01/31 | 0.1 |
| 2022/02/01 | 0.2 |
| 2022/02/02 | 0.1 |
| 2022/02/03 | 0.3 |
| 2022/02/13 | 0.1 |
| 2022/02/14 | 0.1 |
| 2022/02/15 | 0.2 |
| ... | ... |
| 2022/02/27 | 0.1 |
| 2022/02/28 | 0.1 |
How can I do this in Python?
I tried using sum to join all the lists into one 1-D array, but then it is a problem to attach the result to the date column.
One way to do this in Python is with an explicit loop:
import pandas as pd
from datetime import datetime, timedelta
# Build the original table
data = {
    'Date': ['2022/01/27', '2022/02/10', '2022/02/17', '2022/02/24'],
    'Performance': [[0.1, 0.1, 0.2, 0.1, 0.3], [0.1, 0.1, 0.2, 0.1, 0.3], [0.1, 0.1, 0.2, 0.1, 0.3], [0.1, 0.1]]
}
df = pd.DataFrame(data)
# Collect the per-day dates and performances
new_data = {'Date': [], 'Performance': []}
for index, row in df.iterrows():
    date = datetime.strptime(row['Date'], '%Y/%m/%d')
    performance = row['Performance']
    # The first value belongs to the day 3 days after the given date;
    # each later value is one day further on
    for i, value in enumerate(performance):
        new_date = date + timedelta(days=3 + i)
        new_data['Date'].append(new_date.strftime('%Y/%m/%d'))
        new_data['Performance'].append(value)
# Build the new DataFrame
new_df = pd.DataFrame(new_data)
# Print the result
print(new_df)
This code converts the original table into the date/performance table described above.
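The difficulty with flattening first is that the dates must be repeated once per list element before the two columns line up. A minimal sketch of that idea follows, assuming the column names from the question; `flat_values`, `lengths` and `flat_df` are illustrative names, and the per-day date offset still has to be added afterwards.

```python
from itertools import chain

import pandas as pd

df = pd.DataFrame({
    "Date": ["2022/01/27", "2022/02/10", "2022/02/17", "2022/02/24"],
    "Performance": [
        [0.1, 0.1, 0.2, 0.1, 0.3],
        [0.1, 0.1, 0.2, 0.1, 0.3],
        [0.1, 0.1, 0.2, 0.1, 0.3],
        [0.1, 0.1],
    ],
})

# Flatten all lists into one sequence (the same result as sum(lists, [])).
flat_values = list(chain.from_iterable(df["Performance"]))

# Repeat each date once per element of its list so both columns have equal length.
lengths = df["Performance"].map(len).to_numpy()
repeated_dates = df["Date"].repeat(lengths).reset_index(drop=True)

flat_df = pd.DataFrame({"Date": repeated_dates, "Performance": flat_values})
print(flat_df)
```

Each date now appears as many times as its list had values, so a short last list needs no special handling.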
Answer 1
Score: 4
Here is an approach using df.explode() and df.groupby().cumcount():
df = df.explode('Performance')
df['Date'] = pd.to_datetime(df['Date']) + pd.to_timedelta(
    df.groupby(level=0).cumcount(), unit='D')
df = df.reset_index(drop=True)
print(df)
Date Performance
0 2022-01-27 0.1
1 2022-01-28 0.1
2 2022-01-29 0.2
3 2022-01-30 0.1
4 2022-01-31 0.3
5 2022-02-10 0.1
6 2022-02-11 0.1
7 2022-02-12 0.2
8 2022-02-13 0.1
9 2022-02-14 0.3
10 2022-02-17 0.1
11 2022-02-18 0.1
12 2022-02-19 0.2
13 2022-02-20 0.1
14 2022-02-21 0.3
15 2022-02-24 0.1
16 2022-02-25 0.1
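To see why this works, it may help to look at the intermediate values on a smaller table. The sketch below uses a shortened, made-up input; `exploded` and `position` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    "Date": ["2022/01/27", "2022/02/24"],
    "Performance": [[0.1, 0.1, 0.2], [0.1, 0.1]],
})

exploded = df.explode("Performance")
# explode() keeps the original row label for every element of a list.
print(exploded.index.tolist())   # [0, 0, 0, 1, 1]

# Grouping by that label and counting gives each element's position in its list.
position = exploded.groupby(level=0).cumcount()
print(position.tolist())         # [0, 1, 2, 0, 1]
```

That position is the number of days added to each row's date.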
Answer 2
Score: 1
<details>
<summary>English:</summary>
From what I understand about your description of the DataFrame, its columns represent the following:
* `date`: contains dates that are all **Fridays** (not necessarily consecutive ones).
* `performance`: contains lists of performances for consecutive days in the following week (from Monday up to at most Friday), starting `3` days after the value in `date`.
And the problem is how to form a DataFrame that has each performance and its corresponding date on a separate row.
---
## Input data
```python
import pandas as pd

data = {
    'date': ['2022/01/27', '2022/02/10', '2022/02/17', '2022/02/24'],
    'performance': [
        [0.1, 0.1, 0.2, 0.1, 0.3],
        [0.1, 0.1, 0.2, 0.1, 0.3],
        [0.1, 0.1, 0.2, 0.1, 0.3],
        [0.1, 0.1]
    ]
}
df = pd.DataFrame(data)
print(df)
```

```
         date                performance
0  2022/01/27  [0.1, 0.1, 0.2, 0.1, 0.3]
1  2022/02/10  [0.1, 0.1, 0.2, 0.1, 0.3]
2  2022/02/17  [0.1, 0.1, 0.2, 0.1, 0.3]
3  2022/02/24                 [0.1, 0.1]
```
## Simple solution

Jamiu S. provided a much more compact solution than my original one, so I've included it here first, with the addition of `pd.DateOffset()` to fully answer the question.

```python
df = df.explode('performance')
df['date'] = pd.to_datetime(df['date']) + pd.DateOffset(days=3) + pd.to_timedelta(
    df.groupby(level=0).cumcount(), unit='D')
df = df.reset_index(drop=True)
print(df)
```
Output:

```
         date performance
0  2022-01-30         0.1
1  2022-01-31         0.1
2  2022-02-01         0.2
3  2022-02-02         0.1
4  2022-02-03         0.3
5  2022-02-13         0.1
6  2022-02-14         0.1
7  2022-02-15         0.2
8  2022-02-16         0.1
9  2022-02-17         0.3
10 2022-02-20         0.1
11 2022-02-21         0.1
12 2022-02-22         0.2
13 2022-02-23         0.1
14 2022-02-24         0.3
15 2022-02-27         0.1
16 2022-02-28         0.1
```
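The fixed 3-day shift counts calendar days. If the input dates are real calendar Fridays and the lists hold weekday values only, stepping by business days may be closer to the intent. The following is only a sketch under that assumption: it swaps in NumPy's `np.busday_offset` for the `pd.DateOffset()` step, and the two input dates are hypothetical Fridays rather than the question's sample values.

```python
import numpy as np
import pandas as pd

# Hypothetical input: real calendar Fridays (not the question's sample dates).
df = pd.DataFrame({
    "date": ["2022/01/28", "2022/02/11"],
    "performance": [[0.1, 0.1, 0.2, 0.1, 0.3], [0.1, 0.1]],
})

out = df.explode("performance")
# Position of each value within its list: 0 for Monday, 1 for Tuesday, ...
position = out.groupby(level=0).cumcount().to_numpy()
fridays = (
    pd.to_datetime(out["date"], format="%Y/%m/%d")
    .to_numpy()
    .astype("datetime64[D]")
)

# Step forward by business days: Friday + 1 business day is the next Monday.
out["date"] = np.busday_offset(fridays, position + 1)
out = out.reset_index(drop=True)
print(out)
```

`np.busday_offset` also accepts a `holidays` argument, which could be used to skip known non-trading days.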
## Original solution

Consider the following steps:

### Step 1: Converting dates to datetime

If not done so already, ensure the `date` values are represented as `datetime` objects rather than strings. The `pd.to_datetime()` function can be used to accomplish this.

```python
# Convert the date column to datetime objects, so it can be manipulated later.
df['date'] = pd.to_datetime(df['date'])
print(df)
```

```
        date                performance
0 2022-01-27  [0.1, 0.1, 0.2, 0.1, 0.3]
1 2022-02-10  [0.1, 0.1, 0.2, 0.1, 0.3]
2 2022-02-17  [0.1, 0.1, 0.2, 0.1, 0.3]
3 2022-02-24                 [0.1, 0.1]
```

Output of `df.info()`:

```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 2 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   date         4 non-null      datetime64[ns]
 1   performance  4 non-null      object
dtypes: datetime64[ns](1), object(1)
memory usage: 192.0+ bytes
```
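The later steps rely on every date being a Friday, so it may be worth checking that once the column is a datetime. A small sketch using `dt.day_name()` on the sample values from the question:

```python
import pandas as pd

dates = pd.to_datetime(
    pd.Series(["2022/01/27", "2022/02/10", "2022/02/17", "2022/02/24"]),
    format="%Y/%m/%d",
)

# Name of the weekday for every date in the column.
day_names = dates.dt.day_name()
print(day_names.tolist())
```

For these sample values the check reports Thursdays, which suggests the dates in the question are illustrative rather than taken from a real calendar. The 3-day shift still reproduces the requested output exactly.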
### Step 2: Adding the date of next week

Add a new column `'start_of_week'`, representing the Monday of the next week (`3` days after Friday).

To calculate these dates, `pd.DateOffset()` can be used to advance the original dates by a given number of days.

```python
# Create a column representing the start of the next week (Monday) - 3 days after the current date (Friday)
df['start_of_week'] = df['date'] + pd.DateOffset(days=3)
print(df)
```

```
        date                performance start_of_week
0 2022-01-27  [0.1, 0.1, 0.2, 0.1, 0.3]    2022-01-30
1 2022-02-10  [0.1, 0.1, 0.2, 0.1, 0.3]    2022-02-13
2 2022-02-17  [0.1, 0.1, 0.2, 0.1, 0.3]    2022-02-20
3 2022-02-24                 [0.1, 0.1]    2022-02-27
```
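For a plain shift by whole days on timezone-naive dates, `pd.Timedelta` should give the same result as `pd.DateOffset`. A quick sketch comparing the two on a couple of the sample dates (the variable names are illustrative):

```python
import pandas as pd

dates = pd.to_datetime(pd.Series(["2022/01/27", "2022/02/24"]), format="%Y/%m/%d")

with_offset = dates + pd.DateOffset(days=3)
with_timedelta = dates + pd.Timedelta(days=3)

# Element-wise comparison of the two shifted columns.
print((with_offset == with_timedelta).all())
```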
### Step 3: Creating a performance table generator

Create a function that can be applied to each row, to form a two-dimensional "performance table" out of it.

The `pd.date_range()` function can be used to form a sequence of consecutive dates, one for each performance value.

```python
# Generates a sub-DataFrame out of a row containing a week-date and performances.
def create_performance_table(r):
    # Extract the list of performances.
    perfs = r['performance']
    # Construct the range of dates corresponding to each of these performances
    dates = pd.date_range(r['start_of_week'], periods=len(perfs))
    # Create a DataFrame out of these values and return it.
    return pd.DataFrame({"date": dates, "performance": perfs})
```
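To get a feel for what the generator returns, it can be called on a single row by hand. In the sketch below a plain dict stands in for a DataFrame row, which works only because the function does nothing but index by key; `row` and `table` are illustrative names.

```python
import pandas as pd

def create_performance_table(r):
    perfs = r["performance"]
    dates = pd.date_range(r["start_of_week"], periods=len(perfs))
    return pd.DataFrame({"date": dates, "performance": perfs})

# A hand-built stand-in for the last row of the table (two values only).
row = {
    "performance": [0.1, 0.1],
    "start_of_week": pd.Timestamp("2022-02-27"),
}

table = create_performance_table(row)
print(table)
```

Because `periods` is taken from the list length, the short final week produces a two-row table without any special casing.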
### Step 4: Creating the sub-tables and combining them

Use the newly defined `create_performance_table()` function to construct the DataFrame representing the whole performance table.

* The `.apply()` method applies the function to each row of the DataFrame and collects the resulting sub-tables.
* Since the resulting sub-tables are returned as a single `Series` object, they need to be joined together to form a single `DataFrame`. The `pd.concat()` function can do just that (but the `Series` must first be converted to a list).

```python
# Apply the performance table generator to every row, storing the results as a Series of sub-DataFrames.
tables = df[['performance', 'start_of_week']].apply(create_performance_table, axis=1)
# Concatenate each of these sub-DataFrames to form the final performance table
out_df = pd.concat(tables.tolist(), ignore_index=True)
print(out_df)
```
Final output:

```
         date  performance
0  2022-01-30          0.1
1  2022-01-31          0.1
2  2022-02-01          0.2
3  2022-02-02          0.1
4  2022-02-03          0.3
5  2022-02-13          0.1
6  2022-02-14          0.1
7  2022-02-15          0.2
8  2022-02-16          0.1
9  2022-02-17          0.3
10 2022-02-20          0.1
11 2022-02-21          0.1
12 2022-02-22          0.2
13 2022-02-23          0.1
14 2022-02-24          0.3
15 2022-02-27          0.1
16 2022-02-28          0.1
```
## Full code

```python
import pandas as pd

# --- Input data
data = {
    'date': ['2022/01/27', '2022/02/10', '2022/02/17', '2022/02/24'],
    'performance': [
        [0.1, 0.1, 0.2, 0.1, 0.3],
        [0.1, 0.1, 0.2, 0.1, 0.3],
        [0.1, 0.1, 0.2, 0.1, 0.3],
        [0.1, 0.1]
    ]
}
df = pd.DataFrame(data)

# --- Convert dates to datetime
df['date'] = pd.to_datetime(df['date'])

# --- Add the date of next week
df['start_of_week'] = df['date'] + pd.DateOffset(days=3)

# --- Performance table generator
def create_performance_table(r):
    perfs = r['performance']
    dates = pd.date_range(r['start_of_week'], periods=len(perfs))
    return pd.DataFrame({"date": dates, "performance": perfs})

# --- Create the sub-tables and combine them
tables = df[['performance', 'start_of_week']].apply(create_performance_table, axis=1)

# The final output
out_df = pd.concat(tables.tolist(), ignore_index=True)
```
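As a sanity check, the compact and the step-by-step approaches can be compared on the sample data. The sketch below is not the code from either answer verbatim: it uses `pd.Timedelta` for the 3-day shift and a list comprehension over the rows in place of `.apply()`, and `compact` and `stepwise` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    "date": ["2022/01/27", "2022/02/10", "2022/02/17", "2022/02/24"],
    "performance": [
        [0.1, 0.1, 0.2, 0.1, 0.3],
        [0.1, 0.1, 0.2, 0.1, 0.3],
        [0.1, 0.1, 0.2, 0.1, 0.3],
        [0.1, 0.1],
    ],
})
df["date"] = pd.to_datetime(df["date"], format="%Y/%m/%d")

# Compact approach: explode, then add 3 days plus the position within each list.
compact = df.explode("performance")
compact["date"] = (
    compact["date"]
    + pd.Timedelta(days=3)
    + pd.to_timedelta(compact.groupby(level=0).cumcount(), unit="D")
)
compact = compact.reset_index(drop=True)

# Step-by-step approach: one sub-table per row, then concatenate them.
tables = [
    pd.DataFrame({
        "date": pd.date_range(d + pd.Timedelta(days=3), periods=len(p)),
        "performance": p,
    })
    for d, p in zip(df["date"], df["performance"])
]
stepwise = pd.concat(tables, ignore_index=True)

# Both approaches should produce the same dates and values, row for row.
print((compact["date"] == stepwise["date"]).all())
print(compact["performance"].tolist() == stepwise["performance"].tolist())
```

One difference to keep in mind is that `explode()` leaves the `performance` column with `object` dtype, so a cast such as `astype(float)` may be wanted before numeric work.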