Repeat rows in DataFrame with respect to column

Question

I have a Pandas DataFrame that looks like this:

    df = pd.DataFrame({'col1': [1, 2, 3],
                       'col2': [4, 5, 6],
                       'col3': [7, 8, 9]})
    df
       col1  col2  col3
    0     1     4     7
    1     2     5     8
    2     3     6     9

I would like to create a Pandas DataFrame like this:

    df_new
       col1  col2  col3
    0     1     4     7
    1     1     5     8
    2     1     6     9
    3     2     4     7
    4     2     5     8
    5     2     6     9
    6     3     4     7
    7     3     5     8
    8     3     6     9

Is there a built-in Pandas method, or a combination of built-in methods, that can achieve this?

Even if there are duplicates in df, I would like the output to have the same format. In other words:

    df
       col1  col2  col3
    0     1     4     7
    1     2     5     8
    2     2     6     8

    df_new
       col1  col2  col3
    0     1     4     7
    1     1     5     8
    2     1     6     8
    3     2     4     7
    4     2     5     8
    5     2     6     8
    6     2     4     7
    7     2     5     8
    8     2     6     8

Answer 1

Score: 8

I would love to see a more Pythonic or 'pandas-exclusive' answer, but this one also works well!

    import pandas as pd
    import numpy as np

    n = 3  # number of rows in df
    df = pd.DataFrame({'col1': [1, 2, 3],
                       'col2': [4, 5, 6],
                       'col3': [7, 8, 9]})

    # Edited and added this new method:
    # repeat each col1 value n times, then tile the remaining columns n times.
    df2 = pd.DataFrame({df.columns[0]: np.repeat(df['col1'].values, n)})
    df2[df.columns[1:]] = df.iloc[:, 1:].apply(lambda x: np.tile(x, n))

    """ Old method.
    for col in df.columns[1:]:
        df2[col] = np.tile(df[col].values, n)
    """
    print(df2)
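
The same repeat/tile idea can be wrapped in a small helper so that n follows the length of the DataFrame rather than being typed in by hand. This is only a sketch: the function name repeat_first_tile_rest is made up for illustration, and it assumes the column to repeat is the first one. It is run here on the question's second example, the one with duplicate values.

```python
import numpy as np
import pandas as pd


def repeat_first_tile_rest(df: pd.DataFrame) -> pd.DataFrame:
    """Repeat the first column element-wise and tile the remaining columns."""
    n = len(df)  # assumption: every value of the first column pairs with every row
    first = df.columns[0]
    out = pd.DataFrame({first: np.repeat(df[first].to_numpy(), n)})
    for col in df.columns[1:]:
        out[col] = np.tile(df[col].to_numpy(), n)
    return out


# Second example from the question (contains duplicate values).
df = pd.DataFrame({'col1': [1, 2, 2],
                   'col2': [4, 5, 6],
                   'col3': [7, 8, 8]})
df_new = repeat_first_tile_rest(df)
print(df_new)
```

Because the helper works on positions rather than on values, duplicated entries in col1 or col3 are kept as they are.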

Answer 2

Score: 7

I would also have gone for a cross merge as suggested by @Henry in comments:

    out = df[['col1']].merge(df[['col2', 'col3']], how='cross').reset_index(drop=True)

Output:

       col1  col2  col3
    0     1     4     7
    1     1     5     8
    2     1     6     9
    3     2     4     7
    4     2     5     8
    5     2     6     9
    6     3     4     7
    7     3     5     8
    8     3     6     9

Comparison of the different approaches:

[Figure: timing comparison of the different approaches; image not available]

Note that @sammywemmy's approach behaves differently when rows are duplicated, so its timing is not directly comparable.
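
For what it's worth, here is a quick check that the cross merge keeps duplicated values, using the question's second example. It assumes pandas 1.2 or later, where how='cross' is available.

```python
import pandas as pd

# Second example from the question (contains duplicate values).
df = pd.DataFrame({'col1': [1, 2, 2],
                   'col2': [4, 5, 6],
                   'col3': [7, 8, 8]})

# Every row of the left frame is paired with every row of the right frame.
out = df[['col1']].merge(df[['col2', 'col3']], how='cross')
print(out)
```

The result has len(df) * len(df) rows, with the left-hand rows varying slowest, which is the layout asked for in the question.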

Answer 3

Score: 6

You could concatenate copies of the dataframe, with col1 replaced in each copy by each of the values in col1:

    out = df.drop('col1', axis=1)
    out = pd.concat([out.assign(col1=c1) for c1 in df['col1']]).reset_index(drop=True)

Output:

       col2  col3  col1
    0     4     7     1
    1     5     8     1
    2     6     9     1
    3     4     7     2
    4     5     8     2
    5     6     9     2
    6     4     7     3
    7     5     8     3
    8     6     9     3

If you prefer, you can then restore the original column order using:

    out = out[['col1', 'col2', 'col3']]

Answer 4

Score: 6

You can use np.repeat and np.tile to get the expected output:

    import numpy as np
    import pandas as pd

    N = 3  # number of rows in df
    cols_to_repeat = ['col1']        # 1, 1, 1, 2, 2, 2
    cols_to_tile = ['col2', 'col3']  # 1, 2, 1, 2, 1, 2

    data = np.concatenate([np.tile(df[cols_to_tile].values.T, N).T,
                           np.repeat(df[cols_to_repeat].values, N, axis=0)], axis=1)
    out = pd.DataFrame(data, columns=cols_to_tile + cols_to_repeat)[df.columns]

Output:

    >>> out
       col1  col2  col3
    0     1     4     7
    1     1     5     8
    2     1     6     9
    3     2     4     7
    4     2     5     8
    5     2     6     9
    6     3     4     7
    7     3     5     8
    8     3     6     9

You can create a generic function:

    def repeat(df: pd.DataFrame, to_repeat: list[str], to_tile: list[str] = None) -> pd.DataFrame:
        n = len(df)  # one repetition per row, instead of relying on a global N
        to_tile = to_tile if to_tile else df.columns.difference(to_repeat).tolist()
        assert df.columns.difference(to_repeat + to_tile).empty, "all columns should be repeated or tiled"
        data = np.concatenate([np.tile(df[to_tile].values.T, n).T,
                               np.repeat(df[to_repeat].values, n, axis=0)], axis=1)
        return pd.DataFrame(data, columns=to_tile + to_repeat)[df.columns]

    repeat(df, ['col1'])

Usage:

    >>> repeat(df, ['col1'])
       col1  col2  col3
    0     1     4     7
    1     1     5     8
    2     1     6     9
    3     2     4     7
    4     2     5     8
    5     2     6     9
    6     3     4     7
    7     3     5     8
    8     3     6     9
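
One thing to keep in mind with np.concatenate on .values is that everything ends up in a single array, so columns of different types can be upcast to a common dtype. A possible variant, sketched below, applies np.repeat and np.tile to row positions and lets iloc pick the rows, which keeps each column's dtype. The function name repeat_by_position and the mixed-type sample data are made up for illustration.

```python
import numpy as np
import pandas as pd


def repeat_by_position(df: pd.DataFrame, to_repeat: list) -> pd.DataFrame:
    """Repeat `to_repeat` columns row by row and tile all other columns."""
    n = len(df)
    to_tile = [c for c in df.columns if c not in to_repeat]
    rep_idx = np.repeat(np.arange(n), n)   # 0, 0, 0, 1, 1, 1, ...
    tile_idx = np.tile(np.arange(n), n)    # 0, 1, 2, 0, 1, 2, ...
    repeated = df[to_repeat].iloc[rep_idx].reset_index(drop=True)
    tiled = df[to_tile].iloc[tile_idx].reset_index(drop=True)
    # Put the two halves side by side and restore the original column order.
    return pd.concat([repeated, tiled], axis=1)[df.columns]


# Hypothetical sample with mixed dtypes (int, str, float).
mixed = pd.DataFrame({'col1': [1, 2, 3],
                      'col2': ['a', 'b', 'c'],
                      'col3': [7.5, 8.5, 9.5]})
out = repeat_by_position(mixed, ['col1'])
print(out.dtypes)
```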

Answer 5

Score: 6

Another possible solution, which is based on itertools.product:

    from itertools import product

    pd.DataFrame([[x, y[0], y[1]] for x, y in
                  product(df['col1'], zip(df['col2'], df['col3']))],
                 columns=df.columns)

Output:

       col1  col2  col3
    0     1     4     7
    1     1     5     8
    2     1     6     9
    3     2     4     7
    4     2     5     8
    5     2     6     9
    6     3     4     7
    7     3     5     8
    8     3     6     9
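
If there are more than two columns to tile, spelling out y[0], y[1], ... gets tedious. A possible generalisation, sketched here under the assumption that col1 is the first column of df, feeds itertuples into product and unpacks each row tuple:

```python
from itertools import product

import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3],
                   'col2': [4, 5, 6],
                   'col3': [7, 8, 9]})

rest = df.columns[1:]  # assumption: col1 comes first, everything else is tiled
rows = [(x, *ys) for x, ys in
        product(df['col1'], df[rest].itertuples(index=False, name=None))]
out = pd.DataFrame(rows, columns=df.columns)
print(out)
```

Like the original, this builds the rows in pure Python, so it is likely to be slower than the NumPy or merge based answers on large frames.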

Answer 6

Score: 6

One option is to use complete from pyjanitor:

    # pip install pyjanitor
    import janitor
    import pandas as pd

    df.complete('col1', ('col2', 'col3'))

       col1  col2  col3
    0     1     4     7
    1     1     5     8
    2     1     6     9
    3     2     4     7
    4     2     5     8
    5     2     6     9
    6     3     4     7
    7     3     5     8
    8     3     6     9

complete is primarily for exposing missing rows; the output above just happens to be a nice side effect. A more appropriate, albeit quite verbose, option is expand_grid:

    # pip install pyjanitor
    import janitor as jn
    import pandas as pd

    # The output below matches the question's second df (the one with duplicate values).
    others = {'df1': df.col1, 'df2': df[['col2', 'col3']]}
    jn.expand_grid(others=others).droplevel(axis=1, level=0)

       col1  col2  col3
    0     1     4     7
    1     1     5     8
    2     1     6     8
    3     2     4     7
    4     2     5     8
    5     2     6     8
    6     2     4     7
    7     2     5     8
    8     2     6     8

Answer 7

Score: 0

Here is a way using concat with keys:

    # Note: df.pop removes 'col1' from df in place.
    pd.concat([df] * len(df), keys=df.pop('col1')).reset_index(level=0)

Output:

       col1  col2  col3
    0     1     4     7
    1     1     5     8
    2     1     6     9
    0     2     4     7
    1     2     5     8
    2     2     6     9
    0     3     4     7
    1     3     5     8
    2     3     6     9
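
Two side effects of the one-liner are worth knowing about: df.pop modifies df itself, and the result keeps the repeated 0, 1, 2 index. A possible variant that avoids both is sketched below; it uses drop instead of pop and resets the remaining index level. The variable name rest is only for illustration.

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3],
                   'col2': [4, 5, 6],
                   'col3': [7, 8, 9]})

rest = df.drop(columns='col1')  # leaves df untouched, unlike df.pop
out = (pd.concat([rest] * len(df), keys=df['col1'])  # keys become an outer index level named col1
         .reset_index(level=0)      # move that level back into a col1 column
         .reset_index(drop=True))   # replace the repeated 0, 1, 2 labels with 0..8
print(out)
```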

huangapple
  • Published on May 26, 2023 at 12:01:39
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76337589.html