英文:
Repeat rows in DataFrame with respect to column
问题
我有一个 Pandas DataFrame,看起来像这样:
df = pd.DataFrame({'col1': [1, 2, 3],
'col2': [4, 5, 6],
'col3': [7, 8, 9]})
df
col1 col2 col3
0 1 4 7
1 2 5 8
2 3 6 9
我想创建一个 Pandas DataFrame,如下所示:
df_new
col1 col2 col3
0 1 4 7
1 1 5 8
2 1 6 9
3 2 4 7
4 2 5 8
5 2 6 9
6 3 4 7
7 3 5 8
8 3 6 9
是否有内置的或组合的 Pandas 方法可以实现这个目标?
即使在 df
中有重复值,我希望输出的格式仍然相同。换句话说:
df
col1 col2 col3
0 1 4 7
1 2 5 8
2 2 6 8
df_new
col1 col2 col3
0 1 4 7
1 1 5 8
2 1 6 8
3 2 4 7
4 2 5 8
5 2 6 8
6 2 4 7
7 2 5 8
8 2 6 8
英文:
I have a Pandas DataFrame that looks like this:
df = pd.DataFrame({'col1': [1, 2, 3],
'col2': [4, 5, 6],
'col3': [7, 8, 9]})
df
col1 col2 col3
0 1 4 7
1 2 5 8
2 3 6 9
I would like to create a Pandas DataFrame like this:
df_new
col1 col2 col3
0 1 4 7
1 1 5 8
2 1 6 9
3 2 4 7
4 2 5 8
5 2 6 9
6 3 4 7
7 3 5 8
8 3 6 9
Is there built-in or combination of built-in Pandas methods that can achieve this?
Even if there are duplicates in df
, I would like the output to be the same format. In other words:
df
col1 col2 col3
0 1 4 7
1 2 5 8
2 2 6 8
df_new
col1 col2 col3
0 1 4 7
1 1 5 8
2 1 6 8
3 2 4 7
4 2 5 8
5 2 6 8
6 2 4 7
7 2 5 8
8 2 6 8
答案1
得分: 8
import pandas as pd
import numpy as np
n=3
df = pd.DataFrame({'col1': [1, 2, 3],
'col2': [4, 5, 6],
'col3': [7, 8, 9]})
# Edited and added this new method.
df2 = pd.DataFrame({df.columns[0]:np.repeat(df['col1'].values, n)})
df2[df.columns[1:]] = df.iloc[:,1:].apply(lambda x: np.tile(x, n))
""" Old method.
for col in df.columns[1:]:
df2[col] = np.tile(df[col].values, n)
"""
print(df2)
英文:
I would love to see a more pythonic or a 'pandas-exclusive' answer, but this one also works good!
import pandas as pd
import numpy as np
n=3
df = pd.DataFrame({'col1': [1, 2, 3],
'col2': [4, 5, 6],
'col3': [7, 8, 9]})
# Edited and added this new method.
df2 = pd.DataFrame({df.columns[0]:np.repeat(df['col1'].values, n)})
df2[df.columns[1:]] = df.iloc[:,1:].apply(lambda x: np.tile(x, n))
""" Old method.
for col in df.columns[1:]:
df2[col] = np.tile(df[col].values, n)
"""
print(df2)
答案2
得分: 7
以下是翻译好的部分:
"我也会选择使用交叉合并,正如@Henry在评论中建议的那样:
out = df[['col1']].merge(df[['col2', 'col3']], how='cross').reset_index(drop=True)
输出:
col1 col2 col3
0 1 4 7
1 1 5 8
2 1 6 9
3 2 4 7
4 2 5 8
5 2 6 9
6 3 4 7
7 3 5 8
8 3 6 9
不同方法的比较:
请注意,@sammywemmy的方法在存在重复行时表现不同,导致无法进行可比较的时间测试。"
英文:
I would also have gone for a cross merge
as suggested by @Henry in comments:
out = df[['col1']].merge(df[['col2', 'col3']], how='cross').reset_index(drop=True)
Output:
col1 col2 col3
0 1 4 7
1 1 5 8
2 1 6 9
3 2 4 7
4 2 5 8
5 2 6 9
6 3 4 7
7 3 5 8
8 3 6 9
Comparison of the different approaches:
Note that @sammywemmy's approach behaves differently when rows are duplicated, which leads to a non comparable timing.
答案3
得分: 6
以下是您要的翻译:
您可以通过将数据框的副本连接在一起,将其中的 col1
替换为 col1
中的每个值:
out = df.drop('col1', axis=1)
out = pd.concat([out.assign(col1=c1) for c1 in df['col1']]).reset_index(drop=True)
输出结果:
col2 col3 col1
0 4 7 1
1 5 8 1
2 6 9 1
3 4 7 2
4 5 8 2
5 6 9 2
6 4 7 3
7 5 8 3
8 6 9 3
如果您愿意,您可以使用以下方法将列重新排序为原始顺序:
out = out[['col1', 'col2', 'col3']]
英文:
You could concatenate copies of the dataframe, with col1
replaced in each copy by each of the values in col1
:
out = df.drop('col1', axis=1)
out = pd.concat([out.assign(col1=c1) for c1 in df['col1']]).reset_index(drop=True)
Output:
col2 col3 col1
0 4 7 1
1 5 8 1
2 6 9 1
3 4 7 2
4 5 8 2
5 6 9 2
6 4 7 3
7 5 8 3
8 6 9 3
If you prefer, you can then re-order the columns back to the original using
out = out[['col1', 'col2', 'col3']]
答案4
得分: 6
以下是翻译好的内容:
你可以使用 np.repeat
和 np.tile
来获得期望的输出:
import numpy as np
N = 3
cols_to_repeat = ['col1'] # 1, 1, 1, 2, 2, 2
cols_to_tile = ['col2', 'col3'] # 1, 2, 1, 2, 1, 2
data = np.concatenate([np.tile(df[cols_to_tile].values.T, N).T,
np.repeat(df[cols_to_repeat].values, N, axis=0)], axis=1)
out = pd.DataFrame(data, columns=cols_to_tile + cols_to_repeat)[df.columns]
输出:
>>> out
col1 col2 col3
0 1 4 7
1 1 5 8
2 1 6 9
3 2 4 7
4 2 5 8
5 2 6 9
6 3 4 7
7 3 5 8
8 3 6 9
你可以创建一个通用的函数:
def repeat(df: pd.DataFrame, to_repeat: list[str], to_tile: list[str]=None) -> pd.DataFrame:
to_tile = to_tile if to_tile else df.columns.difference(to_repeat).tolist()
assert df.columns.difference(to_repeat + to_tile).empty, "所有列应该被重复或平铺"
data = np.concatenate([np.tile(df[to_tile].values.T, N).T,
np.repeat(df[to_repeat].values, N, axis=0)], axis=1)
return pd.DataFrame(data, columns=to_tile + to_repeat)[df.columns]
repeat(df, ['col1'])
用法:
>>> repeat(df, ['col1'])
col1 col2 col3
0 1 4 7
1 1 5 8
2 1 6 9
3 2 4 7
4 2 5 8
5 2 6 9
6 3 4 7
7 3 5 8
8 3 6 9
英文:
You can use np.repeat
and np.tile
to get the expected output:
import numpy as np
N = 3
cols_to_repeat = ['col1'] # 1, 1, 1, 2, 2, 2
cols_to_tile = ['col2', 'col3'] # 1, 2, 1, 2, 1, 2
data = np.concatenate([np.tile(df[cols_to_tile].values.T, N).T,
np.repeat(df[cols_to_repeat].values, N, axis=0)], axis=1)
out = pd.DataFrame(data, columns=cols_to_tile + cols_to_repeat)[df.columns]
Output:
>>> out
col1 col2 col3
0 1 4 7
1 1 5 8
2 1 6 9
3 2 4 7
4 2 5 8
5 2 6 9
6 3 4 7
7 3 5 8
8 3 6 9
You can create a generic function:
def repeat(df: pd.DataFrame, to_repeat: list[str], to_tile: list[str]=None) -> pd.DataFrame:
to_tile = to_tile if to_tile else df.columns.difference(to_repeat).tolist()
assert df.columns.difference(to_repeat + to_tile).empty, "all columns should be repeated or tiled"
data = np.concatenate([np.tile(df[to_tile].values.T, N).T,
np.repeat(df[to_repeat].values, N, axis=0)], axis=1)
return pd.DataFrame(data, columns=to_tile + to_repeat)[df.columns]
repeat(df, ['col1'])
Usage:
>>> repeat(df, ['col1'])
col1 col2 col3
0 1 4 7
1 1 5 8
2 1 6 9
3 2 4 7
4 2 5 8
5 2 6 9
6 3 4 7
7 3 5 8
8 3 6 9
答案5
得分: 6
另一个可能的解决方案是基于 itertools.product
的:
from itertools import product
pd.DataFrame([[x, y[0], y[1]] for x, y in
product(df['col1'], zip(df['col2'], df['col3']))],
columns=df.columns)
输出:
col1 col2 col3
0 1 4 7
1 1 5 8
2 1 6 9
3 2 4 7
4 2 5 8
5 2 6 9
6 3 4 7
7 3 5 8
8 3 6 9
英文:
Another possible solution, which is based on itertools.product
:
from itertools import product
pd.DataFrame([[x, y[0], y[1]] for x, y in
product(df['col1'], zip(df['col2'], df['col3']))],
columns=df.columns)
Output:
col1 col2 col3
0 1 4 7
1 1 5 8
2 1 6 9
3 2 4 7
4 2 5 8
5 2 6 9
6 3 4 7
7 3 5 8
8 3 6 9
答案6
得分: 6
# pip install pyjanitor
import janitor
import pandas as pd
df.complete('col1', ('col2', 'col3'))
col1 col2 col3
0 1 4 7
1 1 5 8
2 1 6 9
3 2 4 7
4 2 5 8
5 2 6 9
6 3 4 7
7 3 5 8
8 3 6 9
complete主要用于暴露缺失的行 - 上面的输出只是一个不错的附带效果。更合适但相对冗长的选项是使用expand_grid:
# pip install pyjanitor
import janitor as jn
import pandas as pd
others = {'df1': df.col1, 'df2': df[['col2', 'col3']]}
jn.expand_grid(others=others).droplevel(axis=1, level=0)
col1 col2 col3
0 1 4 7
1 1 5 8
2 1 6 8
3 2 4 7
4 2 5 8
5 2 6 8
6 2 4 7
7 2 5 8
8 2 6 8
英文:
One option is with complete from pyjanitor:
# pip install pyjanitor
import janitor
import pandas as pd
df.complete('col1', ('col2','col3'))
col1 col2 col3
0 1 4 7
1 1 5 8
2 1 6 9
3 2 4 7
4 2 5 8
5 2 6 9
6 3 4 7
7 3 5 8
8 3 6 9
complete primarily is for exposing missing rows - the output above just happens to be a nice side effect. A more appropriate, albeit quite verbose option is expand_grid:
# pip install pyjanitor
import janitor as jn
import pandas as pd
others = {'df1':df.col1, 'df2':df[['col2','col3']]}
jn.expand_grid(others=others).droplevel(axis=1,level=0)
col1 col2 col3
0 1 4 7
1 1 5 8
2 1 6 8
3 2 4 7
4 2 5 8
5 2 6 8
6 2 4 7
7 2 5 8
8 2 6 8
答案7
得分: 0
这是使用带有键的concat
方法的一种方式:
pd.concat([df]*len(df), keys=df.pop('col1')).reset_index(level=0)
输出:
col1 col2 col3
0 1 4 7
1 1 5 8
2 1 6 9
0 2 4 7
1 2 5 8
2 2 6 9
0 3 4 7
1 3 5 8
2 3 6 9
英文:
Here is a way using the concat with keys:
pd.concat([df]*len(df),keys = df.pop('col1')).reset_index(level=0)
Output:
col1 col2 col3
0 1 4 7
1 1 5 8
2 1 6 9
0 2 4 7
1 2 5 8
2 2 6 9
0 3 4 7
1 3 5 8
2 3 6 9
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论