Create a stacked NumPy array from a grouped DataFrame

Question

I need a fast function that builds a single stacked NumPy array from a pandas DataFrame after grouping it, with rows added for any group-key combinations that are missing. The output array should have shape (n_unique_values_1, n_unique_values_2, ..., n_ungrouped_columns) for group-by columns 1, 2, .... Missing values should be filled with NaN; you may assume that all values can safely be handled as numeric.

Example:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({
        'a': [1, 1, 2],
        'b': [0, 1, 0],
        'c': [1, 1, 1],
        'd': [0, 0, 0]
    })
    # Sum the ungrouped columns ('c', 'd') within each (a, b) group
    grouped = df.groupby(['a', 'b']).sum()

I need a function on grouped that returns a NumPy array of shape (df.a.nunique(), df.b.nunique(), n_ungrouped_cols) (in this case, (2, 2, 2)). The function should work with an arbitrary number of group-by columns, and the axes of the returned array should be in the same order as the groupby columns. I need to run this on many millions of values in a pipeline that already has a lot to do, so speed matters a great deal. Pandas groupby returns the unique key values in ascending order, and that ordering should not be lost. If you can write this without using a grouped DataFrame, go for it. Any imports (numba, etc.) that make this quick are acceptable as long as they come from well-maintained code bases.
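
To make the "missing rows" concrete, here is a minimal sketch on the example above. The names full_index, missing and target_shape are illustrative only and not part of the required interface. It assumes the levels of the grouped index contain only observed key values, which is the case for a fresh groupby on non-categorical columns.

```python
import pandas as pd

df = pd.DataFrame({
    'a': [1, 1, 2],
    'b': [0, 1, 0],
    'c': [1, 1, 1],
    'd': [0, 0, 0],
})
grouped = df.groupby(['a', 'b']).sum()

# Every (a, b) combination the dense output needs a slot for.
full_index = pd.MultiIndex.from_product(
    grouped.index.levels, names=grouped.index.names
)

# Combinations that never occur in df; these must become NaN rows.
missing = full_index.difference(grouped.index)

# One axis per group-by column, plus a trailing axis for the value columns.
target_shape = (
    *(len(level) for level in grouped.index.levels),
    grouped.shape[1],
)
```

Here grouped has three rows, the full grid has four cells, and the one absent combination is (a=2, b=1), which is the NaN row in the expected output further down.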

The following can be used to create test cases; test_case in this example has missing rows:

    def create_synthetic_df(len_df, n_uniques: list[int]):
        rng = np.random.default_rng(seed=2)
        if len(n_uniques) > 10:
            n_uniques = n_uniques[:10]
        dct = {}
        for col, n_unique in zip('abcdefghij', n_uniques):
            dct[col] = rng.integers(0, n_unique, size=len_df)
        return pd.DataFrame(dct)

    n_uniques = (50, 3, 10, 10, 3)
    test_case = create_synthetic_df(1000, n_uniques).groupby(['a', 'b', 'c']).agg(sum)

    def my_func(grouped_df) -> np.ndarray:
        """Call the solution 'my_func'."""
        ...

    # additional test cases, maybe not exhaustive
    simple_case = my_func(grouped)
    expected = np.array([
        [[1, 0],
         [1, 0]],
        [[1, 0],
         [np.nan, np.nan]]
    ])
    assert simple_case.shape == (2, 2, 2)
    assert np.allclose(simple_case, expected, equal_nan=True)
    assert my_func(test_case).shape == (50, 3, 10, 2)

Answer 1

Score: 1

Fundamentally this is a reindexing operation. There are trickier ways to do this, and I have not profiled this code; the first function (mi_reindex) is the "unsurprising" approach.

The second function (np_unique) uses lower-level NumPy, but I don't know which one will be faster. They are tested to be equivalent.

    import string
    from typing import Sequence

    import numpy as np
    import pandas as pd


    def create_synthetic_df(len_df: int, n_uniques: Sequence[int], seed: int = 2) -> pd.DataFrame:
        rng = np.random.default_rng(seed=seed)
        df = pd.DataFrame(
            # One column per entry of n_uniques; column j holds integers in [1, n_uniques[j]]
            data=1 + rng.integers(low=0, high=n_uniques, size=(len_df, len(n_uniques))),
            columns=tuple(string.ascii_lowercase[:len(n_uniques)]),
        )
        return df


    def mi_reindex(df: pd.DataFrame, group_cols: list[str]) -> np.ndarray:
        totals: pd.DataFrame = df.groupby(group_cols).sum()

        # Sorted unique keys for each grouping level
        uindex = [
            totals.index.unique(level=level).sort_values()
            for level in group_cols
        ]

        # Reindex against the full product of keys; absent combinations become NaN rows
        full_index = pd.MultiIndex.from_product(iterables=uindex)
        aligned = totals.reindex(full_index)

        reshaped = aligned.values.reshape((
            *(u.size for u in uindex),
            totals.columns.size,
        ))
        return reshaped


    def np_unique(df: pd.DataFrame, group_cols: list[str]) -> np.ndarray:
        totals = df.groupby(group_cols).sum()

        # For each level: (sorted unique keys, position of each row's key among them)
        uniques = [
            np.unique(
                ar=totals.index.get_level_values(col),
                return_inverse=True,
            )
            for col in group_cols
        ]

        # Start from an all-NaN grid so cells with no group stay NaN
        dest = np.full(
            shape=(
                *(u.size for u, _ in uniques),
                totals.columns.size,
            ),
            fill_value=np.nan,
        )

        # Scatter each group's totals into its cell
        idx = tuple(i for _, i in uniques) + (slice(None),)
        dest[idx] = totals
        return dest


    def test() -> None:
        simple_outputs = []
        big_outputs = []

        big_uniques = (50, 3, 10, 10, 3)
        big_input = create_synthetic_df(1000, big_uniques)

        simple_input = pd.DataFrame({
            'a': [1, 1, 2],
            'b': [0, 1, 0],
            'c': [1, 1, 1],
            'd': [0, 0, 0]
        })
        simple_output = np.array([
            [[1, 0],
             [1, 0]],  # this is not 2
            [[1, 0],
             [np.nan, np.nan]]
        ])

        for my_func in (mi_reindex, np_unique):
            actual = my_func(simple_input, ['a', 'b'])
            assert actual.shape == (2, 2, 2)
            assert np.allclose(actual, simple_output, equal_nan=True)
            simple_outputs.append(actual)

            actual = my_func(big_input, ['a', 'b', 'c'])
            assert actual.shape == (50, 3, 10, 2)
            big_outputs.append(actual)

        assert np.allclose(*simple_outputs, equal_nan=True)
        assert np.allclose(*big_outputs, equal_nan=True)


    if __name__ == '__main__':
        test()
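
Since the question also allows skipping the grouped DataFrame entirely, here is a rough, unprofiled sketch of that idea. It is not one of the two approaches above: the function name stacked_sum_no_groupby is made up for illustration, and it swaps the groupby for np.unique codes plus an unbuffered np.add.at scatter. It assumes every non-group column is numeric and that the key columns contain no NaN (groupby drops NaN keys by default, while this sketch does not).

```python
import numpy as np
import pandas as pd


def stacked_sum_no_groupby(df: pd.DataFrame, group_cols: list[str]) -> np.ndarray:
    """Sum the non-group columns into a dense grid; NaN where a key combination is absent.

    Hypothetical helper for illustration only.
    """
    value_cols = [c for c in df.columns if c not in group_cols]

    # Sorted unique keys and an integer code per row, for each grouping column.
    codes = []
    sizes = []
    for col in group_cols:
        uniq, inv = np.unique(df[col].to_numpy(), return_inverse=True)
        codes.append(inv.reshape(-1))
        sizes.append(uniq.size)

    # Collapse the per-column codes into a single flat cell number per row.
    flat = np.ravel_multi_index(tuple(codes), dims=tuple(sizes))
    n_cells = int(np.prod(sizes))

    # Accumulate row values into their cells; add.at handles repeated cells.
    values = df[value_cols].to_numpy(dtype=float)
    out = np.zeros((n_cells, len(value_cols)))
    np.add.at(out, flat, values)

    # Cells that never received a row become NaN.
    counts = np.bincount(flat, minlength=n_cells)
    out[counts == 0] = np.nan

    return out.reshape(*sizes, len(value_cols))
```

Calling stacked_sum_no_groupby(simple_input, ['a', 'b']) on the simple input from the test above should give the same (2, 2, 2) array as mi_reindex and np_unique. Whether it is actually faster is untested; np.add.at in particular is known to be slow on some NumPy versions, so it would need profiling against the two functions above before being relied on.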

huangapple
  • Posted on August 5, 2023, 09:33:18
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76839804.html