2023-05-26 12:01:39

Repeat rows in DataFrame with respect to column

Question

I have a Pandas DataFrame that looks like this:

df = pd.DataFrame({'col1': [1, 2, 3],
                   'col2': [4, 5, 6],
                   'col3': [7, 8, 9]})
df
    col1    col2    col3
0      1       4       7
1      2       5       8
2      3       6       9

I would like to create a Pandas DataFrame like this:

df_new
    col1    col2    col3
0      1       4       7
1      1       5       8
2      1       6       9
3      2       4       7
4      2       5       8
5      2       6       9
6      3       4       7
7      3       5       8
8      3       6       9

Is there a built-in Pandas method, or a combination of built-in methods, that can achieve this?

Even if there are duplicates in df, I would like the output to have the same format. In other words:

df
    col1    col2    col3
0      1       4       7
1      2       5       8
2      2       6       8
df_new
    col1    col2    col3
0      1       4       7
1      1       5       8
2      1       6       8
3      2       4       7
4      2       5       8
5      2       6       8
6      2       4       7
7      2       5       8
8      2       6       8

Answer 1

Score: 8

I would love to see a more Pythonic or pandas-only answer, but this one also works well!

import pandas as pd
import numpy as np

df = pd.DataFrame({'col1': [1, 2, 3],
                   'col2': [4, 5, 6],
                   'col3': [7, 8, 9]})
n = len(df)  # 3 here: each col1 value is paired with every row

# Repeat each value of the first column n times: 1, 1, 1, 2, 2, 2, ...
df2 = pd.DataFrame({df.columns[0]: np.repeat(df[df.columns[0]].values, n)})
# Tile the remaining columns n times: 4, 5, 6, 4, 5, 6, ...
df2[df.columns[1:]] = df.iloc[:, 1:].apply(lambda x: np.tile(x, n))

# Older, equivalent version with an explicit loop:
# for col in df.columns[1:]:
#     df2[col] = np.tile(df[col].values, n)
print(df2)
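
For readers after a pandas-only variant of the same repeat/tile idea, here is a minimal sketch. It assumes df has a unique index and that every value of the chosen column should be paired with every row of the remaining columns. The helper name repeat_and_tile is hypothetical and not part of pandas.

```python
import pandas as pd


def repeat_and_tile(df: pd.DataFrame, col: str) -> pd.DataFrame:
    """Pair every value of `col` with every row of the other columns.

    Hypothetical helper (not a pandas API); assumes df has a unique index.
    """
    n = len(df)
    # Repeat each row of `col` n times in place: 1, 1, 1, 2, 2, 2, ...
    repeated = df[[col]].loc[df.index.repeat(n)].reset_index(drop=True)
    # Stack n copies of the remaining columns: 4, 5, 6, 4, 5, 6, ...
    tiled = pd.concat([df.drop(columns=col)] * n, ignore_index=True)
    # Glue both parts side by side and restore the original column order.
    return pd.concat([repeated, tiled], axis=1)[df.columns]


df = pd.DataFrame({'col1': [1, 2, 3],
                   'col2': [4, 5, 6],
                   'col3': [7, 8, 9]})
out = repeat_and_tile(df, 'col1')
print(out)
```

Because the repetition is positional rather than value-based, duplicated values in col1 should be kept as they are.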

Answer 2

Score: 7

I would also have gone for a cross merge as suggested by @Henry in comments:

out = df[['col1']].merge(df[['col2', 'col3']], how='cross').reset_index(drop=True)

Output:

   col1  col2  col3
0     1     4     7
1     1     5     8
2     1     6     9
3     2     4     7
4     2     5     8
5     2     6     9
6     3     4     7
7     3     5     8
8     3     6     9

Comparison of the different approaches:

Note that @sammywemmy's approach behaves differently when rows are duplicated, so its timing is not directly comparable.
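
To check the duplicate-row case from the question, the cross merge can be run on the second example frame. This is a small sketch that assumes pandas 1.2 or later, where how='cross' is available. The names df_dup and expected are illustrative only.

```python
import pandas as pd

# Second example from the question: col1 and col3 contain duplicates.
df_dup = pd.DataFrame({'col1': [1, 2, 2],
                       'col2': [4, 5, 6],
                       'col3': [7, 8, 8]})

# A cross merge pairs every row of the left frame with every row of the
# right frame, so duplicated values are kept rather than collapsed.
out = df_dup[['col1']].merge(df_dup[['col2', 'col3']], how='cross')

# The df_new requested in the question for this input.
expected = pd.DataFrame({'col1': [1, 1, 1, 2, 2, 2, 2, 2, 2],
                         'col2': [4, 5, 6, 4, 5, 6, 4, 5, 6],
                         'col3': [7, 8, 8, 7, 8, 8, 7, 8, 8]})
print(out)
```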

Answer 3

Score: 6


You could concatenate copies of the dataframe, with col1 replaced in each copy by each of the values in col1:

out = df.drop('col1', axis=1)
out = pd.concat([out.assign(col1=c1) for c1 in df['col1']]).reset_index(drop=True)

Output:

   col2  col3  col1
0     4     7     1
1     5     8     1
2     6     9     1
3     4     7     2
4     5     8     2
5     6     9     2
6     4     7     3
7     5     8     3
8     6     9     3

If you prefer, you can then reorder the columns back to the original order using:

out = out[['col1', 'col2', 'col3']]

Answer 4

Score: 6


You can use np.repeat and np.tile to get the expected output:

import numpy as np
N = 3
cols_to_repeat = ['col1']  # 1, 1, 1, 2, 2, 2
cols_to_tile = ['col2', 'col3']  # 1, 2, 1, 2, 1, 2
data = np.concatenate([np.tile(df[cols_to_tile].values.T, N).T,
                       np.repeat(df[cols_to_repeat].values, N, axis=0)], axis=1)
out = pd.DataFrame(data, columns=cols_to_tile + cols_to_repeat)[df.columns]

Output:

>>> out
   col1  col2  col3
0     1     4     7
1     1     5     8
2     1     6     9
3     2     4     7
4     2     5     8
5     2     6     9
6     3     4     7
7     3     5     8
8     3     6     9
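
The difference between the two NumPy functions is easiest to see on a small 1-D array. A quick illustration follows (the array a is made up for this example):

```python
import numpy as np

a = np.array([1, 2, 3])

# np.repeat repeats each element in place before moving to the next one.
repeated = np.repeat(a, 2)   # [1 1 2 2 3 3]
# np.tile repeats the whole array end to end.
tiled = np.tile(a, 2)        # [1 2 3 1 2 3]

print(repeated)
print(tiled)
```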

You can create a generic function:

def repeat(df: pd.DataFrame, to_repeat: list[str], to_tile: list[str] = None) -> pd.DataFrame:
    n = len(df)  # one repetition per row, instead of relying on the global N
    # By default, tile every column that is not repeated.
    to_tile = to_tile if to_tile else df.columns.difference(to_repeat).tolist()
    assert df.columns.difference(to_repeat + to_tile).empty, "all columns should be repeated or tiled"
    data = np.concatenate([np.tile(df[to_tile].values.T, n).T,
                           np.repeat(df[to_repeat].values, n, axis=0)], axis=1)
    # Restore the original column order.
    return pd.DataFrame(data, columns=to_tile + to_repeat)[df.columns]

Usage:

>>> repeat(df, ['col1'])
   col1  col2  col3
0     1     4     7
1     1     5     8
2     1     6     9
3     2     4     7
4     2     5     8
5     2     6     9
6     3     4     7
7     3     5     8
8     3     6     9

Answer 5

Score: 6

Another possible solution, which is based on itertools.product:

from itertools import product
pd.DataFrame([[x, y[0], y[1]] for x, y in
              product(df['col1'], zip(df['col2'], df['col3']))],
             columns=df.columns)

Output:

   col1  col2  col3
0     1     4     7
1     1     5     8
2     1     6     9
3     2     4     7
4     2     5     8
5     2     6     9
6     3     4     7
7     3     5     8
8     3     6     9
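
The same idea can be extended to any number of remaining columns. The sketch below is one possible generalization, not the answer author's code. The helper name cross_with_column is hypothetical, and it assumes the chosen column exists in df.

```python
from itertools import product

import pandas as pd


def cross_with_column(df: pd.DataFrame, col: str) -> pd.DataFrame:
    """Pair each value of `col` with each row of the remaining columns.

    Hypothetical helper for illustration; not part of pandas or itertools.
    """
    others = df.drop(columns=col)
    # itertuples(index=False, name=None) yields each row as a plain tuple.
    rows = [(x, *rest) for x, rest in
            product(df[col], others.itertuples(index=False, name=None))]
    # Build with `col` first, then restore the original column order.
    return pd.DataFrame(rows, columns=[col, *others.columns])[df.columns]


df = pd.DataFrame({'col1': [1, 2, 3],
                   'col2': [4, 5, 6],
                   'col3': [7, 8, 9]})
out = cross_with_column(df, 'col1')
print(out)
```

Since product iterates the first argument in the outer loop, each value of the chosen column is held fixed while all rows of the other columns are emitted, which matches the requested row order.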

Answer 6

Score: 6

One option is with complete from pyjanitor:

# pip install pyjanitor
import janitor 
import pandas as pd
df.complete('col1', ('col2', 'col3'))
   col1  col2  col3
0     1     4     7
1     1     5     8
2     1     6     9
3     2     4     7
4     2     5     8
5     2     6     9
6     3     4     7
7     3     5     8
8     3     6     9

complete is primarily for exposing missing rows; the output above just happens to be a nice side effect. A more appropriate, albeit quite verbose, option is expand_grid:

# pip install pyjanitor
import janitor as jn
import pandas as pd
others = {'df1': df.col1, 'df2': df[['col2', 'col3']]}
jn.expand_grid(others=others).droplevel(axis=1, level=0)
   col1  col2  col3
0     1     4     7
1     1     5     8
2     1     6     8
3     2     4     7
4     2     5     8
5     2     6     8
6     2     4     7
7     2     5     8
8     2     6     8

Answer 7

Score: 0

Here is a way using concat with keys. Note that df.pop removes col1 from df in place:

pd.concat([df] * len(df), keys=df.pop('col1')).reset_index(level=0)

Output:

   col1  col2  col3
0     1     4     7
1     1     5     8
2     1     6     9
0     2     4     7
1     2     5     8
2     2     6     9
0     3     4     7
1     3     5     8
2     3     6     9
