Convert different-length lists in a pandas DataFrame to rows in one column

Question


I have a table like this in pandas. The date is always a Friday, but the dates may not be consecutive because of holidays or other reasons. The `Performance` column holds a list with the performance for the following week. The list in the last row can have fewer than 5 entries: today is Wednesday, so for this week I only have Monday and Tuesday data:

| Date       | Performance           |
|------------|-----------------------|
| 2022/01/27 | [0.1,0.1,0.2,0.1,0.3] |
| 2022/02/10 | [0.1,0.1,0.2,0.1,0.3] |
| 2022/02/17 | [0.1,0.1,0.2,0.1,0.3] |
| 2022/02/24 | [0.1,0.1]             |

I want to convert this table to a two-column date/performance table that holds the actual date of each performance day and the performance for that day:

| Date       | Performance |
|------------|-------------|
| 2022/01/30 | 0.1         |
| 2022/01/31 | 0.1         |
| 2022/02/01 | 0.2         |
| 2022/02/02 | 0.1         |
| 2022/02/03 | 0.3         |
| 2022/02/13 | 0.1         |
| 2022/02/14 | 0.1         |
| 2022/02/15 | 0.2         |
| ...        | ...         |
| 2022/02/27 | 0.1         |
| 2022/02/28 | 0.1         |

How can I do this in Python?

I tried using `sum` on the lists to join them into a single 1-D list, but I could not work out how to attach the result to the date column.
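For context, the flattening attempt described above might look like the sketch below. It assumes the column names `Date` and `Performance` from the table; the names `flat`, `lengths` and `flat_df` are illustrative only. It shows that flattening with `sum()` drops the link between a value and its row, and that repeating each date once per list element restores it. Each value still carries only its row's base date here; the per-day offset is what the answers add.

```python
import pandas as pd

df = pd.DataFrame({
    "Date": ["2022/01/27", "2022/02/10", "2022/02/17", "2022/02/24"],
    "Performance": [
        [0.1, 0.1, 0.2, 0.1, 0.3],
        [0.1, 0.1, 0.2, 0.1, 0.3],
        [0.1, 0.1, 0.2, 0.1, 0.3],
        [0.1, 0.1],
    ],
})

# Flattening with sum() gives one long list, but the row each value
# came from is no longer known.
flat = sum(df["Performance"].tolist(), [])

# One way to keep the link: repeat each date once per element of its list.
lengths = df["Performance"].map(len).to_numpy()
dates = df["Date"].repeat(lengths).reset_index(drop=True)

flat_df = pd.DataFrame({"Date": dates, "Performance": flat})
print(flat_df)
```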

Answer 1

Score: 4

Here is an approach using `df.explode()` and `df.groupby().cumcount()`:

```python
df = df.explode('Performance')
df['Date'] = pd.to_datetime(df['Date']) + pd.to_timedelta(
    df.groupby(level=0).cumcount(), unit='D')

df = df.reset_index(drop=True)
print(df)
```

Output:

```
         Date Performance
0  2022-01-27         0.1
1  2022-01-28         0.1
2  2022-01-29         0.2
3  2022-01-30         0.1
4  2022-01-31         0.3
5  2022-02-10         0.1
6  2022-02-11         0.1
7  2022-02-12         0.2
8  2022-02-13         0.1
9  2022-02-14         0.3
10 2022-02-17         0.1
11 2022-02-18         0.1
12 2022-02-19         0.2
13 2022-02-20         0.1
14 2022-02-21         0.3
15 2022-02-24         0.1
16 2022-02-25         0.1
```
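To see why this works, it may help to look at the intermediate values. The sketch below uses a shortened two-row frame made up for illustration. `explode()` repeats the original index label for every list element, so `groupby(level=0).cumcount()` numbers the elements within each original row starting at 0, and that counter becomes the day offset.

```python
import pandas as pd

# Shortened, illustrative input: two rows with lists of different lengths.
df = pd.DataFrame({
    "Date": ["2022/01/27", "2022/02/24"],
    "Performance": [[0.1, 0.1, 0.2], [0.1, 0.1]],
})

exploded = df.explode("Performance")
# The original row labels are repeated once per list element.
print(exploded.index.tolist())  # [0, 0, 0, 1, 1]

# Position of each element within its original row.
position = exploded.groupby(level=0).cumcount()
print(position.tolist())  # [0, 1, 2, 0, 1]

# Turning the position into a day offset gives one date per element.
dates = pd.to_datetime(exploded["Date"]) + pd.to_timedelta(position, unit="D")
print(dates.dt.strftime("%Y-%m-%d").tolist())
```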

Answer 2

Score: 1

From what I understand about your description of the DataFrame, its columns represent the following:

* `date`: contains dates which are all consecutive **Fridays**.

* `performance`: contains lists of performances corresponding to consecutive days in the next week (from Monday up to at most Friday), i.e. `3` days after the value in `date`.

And the problem is how to form a DataFrame that has each performance and its corresponding date on a separate row.

---

## Input data

```python
import pandas as pd

data = {
    'date': ['2022/01/27', '2022/02/10', '2022/02/17', '2022/02/24'],
    'performance': [
        [0.1, 0.1, 0.2, 0.1, 0.3],
        [0.1, 0.1, 0.2, 0.1, 0.3],
        [0.1, 0.1, 0.2, 0.1, 0.3],
        [0.1, 0.1]
    ]
}

df = pd.DataFrame(data)

print(df)
```

```
         date                performance
0  2022/01/27  [0.1, 0.1, 0.2, 0.1, 0.3]
1  2022/02/10  [0.1, 0.1, 0.2, 0.1, 0.3]
2  2022/02/17  [0.1, 0.1, 0.2, 0.1, 0.3]
3  2022/02/24                 [0.1, 0.1]
```

## Simple solution

Jamiu S. provided a much more compact solution than my original one, so I've included it here first, with the addition of `pd.DateOffset()` to fully answer the question.

```python
df = df.explode('performance')

# Shift each date by 3 days, then add the position within the week.
df['date'] = pd.to_datetime(df['date']) + pd.DateOffset(days=3) + pd.to_timedelta(
    df.groupby(level=0).cumcount(), unit='D')

df = df.reset_index(drop=True)
print(df)
```

Output:

```
         date performance
0  2022-01-30         0.1
1  2022-01-31         0.1
2  2022-02-01         0.2
3  2022-02-02         0.1
4  2022-02-03         0.3
5  2022-02-13         0.1
6  2022-02-14         0.1
7  2022-02-15         0.2
8  2022-02-16         0.1
9  2022-02-17         0.3
10 2022-02-20         0.1
11 2022-02-21         0.1
12 2022-02-22         0.2
13 2022-02-23         0.1
14 2022-02-24         0.3
15 2022-02-27         0.1
16 2022-02-28         0.1
```
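If the same conversion is needed in more than one place, the steps above could be wrapped in a small helper. The sketch below is one possible packaging and is not part of either answer: the function name `explode_with_dates` and its parameters are hypothetical, and it assumes list-valued cells and date strings that `pd.to_datetime()` can parse.

```python
import pandas as pd


def explode_with_dates(df, date_col="date", list_col="performance", offset_days=3):
    """Return one row per list element.

    The first element of each list is dated `offset_days` after the row's
    date, and each following element one day later. (Hypothetical helper.)
    """
    out = df.explode(list_col)  # returns a new frame; the input is not modified
    position = out.groupby(level=0).cumcount()
    out[date_col] = (
        pd.to_datetime(out[date_col])
        + pd.Timedelta(days=offset_days)
        + pd.to_timedelta(position, unit="D")
    )
    return out.reset_index(drop=True)


df = pd.DataFrame({
    "date": ["2022/01/27", "2022/02/10", "2022/02/17", "2022/02/24"],
    "performance": [
        [0.1, 0.1, 0.2, 0.1, 0.3],
        [0.1, 0.1, 0.2, 0.1, 0.3],
        [0.1, 0.1, 0.2, 0.1, 0.3],
        [0.1, 0.1],
    ],
})

result = explode_with_dates(df)
print(result)
```

Keeping the offset as a parameter makes the "3 days after the recorded date" assumption explicit and easy to change.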

## Original solution

Consider the following steps:

### Step 1: Converting dates to datetime

If not done already, make sure the `date` values are represented as `datetime` objects rather than strings. The `pd.to_datetime()` function can be used to accomplish this.

```python
# Convert the date column to datetime, so it can be manipulated later.
df['date'] = pd.to_datetime(df['date'])

print(df)
```

```
        date                performance
0 2022-01-27  [0.1, 0.1, 0.2, 0.1, 0.3]
1 2022-02-10  [0.1, 0.1, 0.2, 0.1, 0.3]
2 2022-02-17  [0.1, 0.1, 0.2, 0.1, 0.3]
3 2022-02-24                 [0.1, 0.1]
```

Output of `df.info()`:

```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 2 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   date         4 non-null      datetime64[ns]
 1   performance  4 non-null      object        
dtypes: datetime64[ns](1), object(1)
memory usage: 192.0+ bytes
```

### Step 2: Adding the date of next week

Add a new column `'start_of_week'`, representing the Monday of the next week (3 days after Friday).

To calculate these dates, `pd.DateOffset()` can be used to advance the original dates by a certain number of days.

```python
# Create a column representing the start of the next week (Monday) - 3 days after the current date (Friday)
df['start_of_week'] = df['date'] + pd.DateOffset(days=3)

print(df)
```

```
        date                performance start_of_week
0 2022-01-27  [0.1, 0.1, 0.2, 0.1, 0.3]    2022-01-30
1 2022-02-10  [0.1, 0.1, 0.2, 0.1, 0.3]    2022-02-13
2 2022-02-17  [0.1, 0.1, 0.2, 0.1, 0.3]    2022-02-20
3 2022-02-24                 [0.1, 0.1]    2022-02-27
```
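As a quick check of the offset arithmetic, the sketch below applies `pd.DateOffset(days=3)` to a single timestamp and to a whole column. For a fixed number of days, `pd.Timedelta(days=3)` gives the same dates here. The variable names are illustrative.

```python
import pandas as pd

# A single timestamp shifted by three days.
base = pd.Timestamp("2022-01-27")
shifted = base + pd.DateOffset(days=3)
print(shifted)  # 2022-01-30 00:00:00

# The same shift applied to a whole datetime column.
dates = pd.to_datetime(pd.Series(["2022/01/27", "2022/02/24"]))
with_offset = dates + pd.DateOffset(days=3)
with_timedelta = dates + pd.Timedelta(days=3)

# Both ways of adding three days produce the same dates.
print((with_offset == with_timedelta).all())  # True
```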

### Step 3: Creating a performance table generator

Create a function that can be applied to each row, to form a two-dimensional "performance table" out of it.

The `pd.date_range()` function can be used to form a sequence of consecutive dates, one for each performance value.

```python
# Generates a sub-DataFrame out of a row containing a week-date and performances.
def create_performance_table(r):

    # Extract the performance values.
    perfs = r['performance']

    # Construct the range of dates corresponding to each of these performances
    dates = pd.date_range(r['start_of_week'], periods=len(perfs))

    # Create a DataFrame out of these values and return it.
    return pd.DataFrame({"date": dates, "performance": perfs})
```
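To make the role of `periods` concrete, the sketch below calls `pd.date_range()` directly for a full week and for the shorter last week. The start dates are taken from the `start_of_week` column above; the variable names are illustrative.

```python
import pandas as pd

# Five performances -> five consecutive calendar days from the start date.
full_week = pd.date_range("2022-01-30", periods=5)
print(full_week.strftime("%Y-%m-%d").tolist())
# ['2022-01-30', '2022-01-31', '2022-02-01', '2022-02-02', '2022-02-03']

# Two performances -> only two days, so the short last row needs no special case.
short_week = pd.date_range("2022-02-27", periods=2)
print(short_week.strftime("%Y-%m-%d").tolist())
# ['2022-02-27', '2022-02-28']
```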

### Step 4: Creating the sub-tables and combining them

Use the newly defined `create_performance_table()` function to construct the DataFrame representing the whole performance table.

* The `.apply()` method calls the function on each row of the DataFrame (`axis=1`) and collects the returned sub-tables in a Series.

* Since the sub-tables come back as a single Series object, they need to be joined to form one DataFrame. `pd.concat()` can do just that (but the Series must first be converted to a list).

```python
# Apply the performance table generator to every row, storing the results as a Series of sub-DataFrames.
tables = df[['performance', 'start_of_week']].apply(create_performance_table, axis=1)

# Concatenate each of these sub-DataFrames to form the final performance table
out_df = pd.concat(tables.tolist(), ignore_index=True)

print(out_df)
```

Final output:

```
         date  performance
0  2022-01-30          0.1
1  2022-01-31          0.1
2  2022-02-01          0.2
3  2022-02-02          0.1
4  2022-02-03          0.3
5  2022-02-13          0.1
6  2022-02-14          0.1
7  2022-02-15          0.2
8  2022-02-16          0.1
9  2022-02-17          0.3
10 2022-02-20          0.1
11 2022-02-21          0.1
12 2022-02-22          0.2
13 2022-02-23          0.1
14 2022-02-24          0.3
15 2022-02-27          0.1
16 2022-02-28          0.1
```
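The same combine step can also be written without `.apply()`. The sketch below is an alternative phrasing rather than the answer's own method: it builds the sub-tables in a list comprehension over `df.itertuples()` and passes that list straight to `pd.concat()`. With `ignore_index=True` the result gets a fresh 0..n-1 index instead of each sub-table's own 0-based index.

```python
import pandas as pd

df = pd.DataFrame({
    "date": ["2022/01/27", "2022/02/10", "2022/02/17", "2022/02/24"],
    "performance": [
        [0.1, 0.1, 0.2, 0.1, 0.3],
        [0.1, 0.1, 0.2, 0.1, 0.3],
        [0.1, 0.1, 0.2, 0.1, 0.3],
        [0.1, 0.1],
    ],
})
df["start_of_week"] = pd.to_datetime(df["date"]) + pd.DateOffset(days=3)

# Build one small DataFrame per input row.
tables = [
    pd.DataFrame({
        "date": pd.date_range(row.start_of_week, periods=len(row.performance)),
        "performance": row.performance,
    })
    for row in df.itertuples(index=False)
]

# Stack the sub-tables and renumber the rows.
out_df = pd.concat(tables, ignore_index=True)
print(out_df)
```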

## Full code

```python
import pandas as pd

# --- Input data

data = {
    'date': ['2022/01/27', '2022/02/10', '2022/02/17', '2022/02/24'],
    'performance': [
        [0.1, 0.1, 0.2, 0.1, 0.3],
        [0.1, 0.1, 0.2, 0.1, 0.3],
        [0.1, 0.1, 0.2, 0.1, 0.3],
        [0.1, 0.1]
    ]
}

df = pd.DataFrame(data)


# --- Convert dates to datetime

df['date'] = pd.to_datetime(df['date'])


# --- Add the date of next week

df['start_of_week'] = df['date'] + pd.DateOffset(days=3)


# --- Performance table generator

def create_performance_table(r):

    perfs = r['performance']

    dates = pd.date_range(r['start_of_week'], periods=len(perfs))

    return pd.DataFrame({"date": dates, "performance": perfs})


# --- Create the sub-tables and combine them

tables = df[['performance', 'start_of_week']].apply(create_performance_table, axis=1)

# The final output
out_df = pd.concat(tables.tolist(), ignore_index=True)
```

huangapple
  • Posted on March 1, 2023 at 09:23:31
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75598798.html