长度不匹配的值与分组分类列的索引长度不匹配

huangapple go评论86阅读模式
英文:

Length of values does not match length of index for groupby on categorical columns

问题

以下是翻译的部分:

  1. 假设我们有一个包含4列的DataFrame
  2. import pandas as pd
  3. df_test = pd.DataFrame({
  4. 'date': ['2022-12-01', '2022-12-01', '2022-12-01', '2022-12-02', '2022-12-02', '2022-12-02'],
  5. 'id': ['id1', 'id1', 'id2', 'id3', 'id4', 'id4'],
  6. 'element': ['ip1', 'ip2', 'ip3', 'ip4', 'ip5', 'ip6'],
  7. 'related_id': ['rid1', 'rid2', 'rid3', 'rid4', 'rid5', 'rid6'],
  8. })
  9. df_test

这个示例DataFrame如下所示:

0 2022-12-01 id1 ip1 rid1
1 2022-12-01 id1 ip2 rid2
2 2022-12-01 id2 ip3 rid3
3 2022-12-02 id3 ip4 rid4
4 2022-12-02 id4 ip5 rid5
5 2022-12-02 id4 ip6 rid6

我需要:

  • 按照前两列:dateid 进行分组
  • 聚合其余列如下:
    • element 列为元素列表
    • related_id 列的不同值计数
  1. (df_test
  2. .groupby(['date', 'id'], as_index=False)
  3. .agg(elements=('element', pd.Series.to_list),
  4. distinct_related_ids=('related_id', 'nunique')))

到目前为止,一切正常,结果正是我所期望的:

0 2022-12-01 id1 ['ip1', 'ip2'] 2
1 2022-12-01 id2 ['ip3'] 1
2 2022-12-02 id3 ['ip4'] 1
3 2022-12-02 id4 ['ip5', 'ip6'] 2

不幸的是,当我在一个更大的DataFrame上运行完全相同的代码时,我遇到了错误:

  1. ValueError: Length of values (263128) does not match length of index (8156968)

顺便提一下,这是StackTrace的详细信息:

  1. File "/Users/fvitale/.pyenv/versions/adversary/lib/python3.10/site-packages/pandas/core/groupby/generic.py", line 950, in aggregate
  2. self._insert_inaxis_grouper_inplace(result)
  3. File "/Users/fvitale/.pyenv/versions/adversary/lib/python3.10/site-packages/pandas/core/groupby/generic.py", line 1485, in _insert_inaxis_grouper_inplace
  4. result.insert(0, name, lev)
  5. File "/Users/fvitale/.pyenv/versions/adversary/lib/python3.10/site-packages/pandas/core/frame.py", line 4821, in insert
  6. value = self._sanitize_column(value)
  7. File "/Users/fvitale/.pyenv/versions/adversary/lib/python3.10/site-packages/pandas/core/frame.py", line 4915, in _sanitize_column
  8. com.require_length_match(value, self.index)
  9. File "/Users/fvitale/.pyenv/versions/adversary/lib/python3.10/site-packages/pandas/core/common.py", line 571, in require_length_match
  10. raise ValueError(
  11. ValueError: Length of values (263128) does not match length of index (8156968)

这里发生了什么,我该如何调试或重写这个实现以使其在更大的DataFrame上工作?


英文:

Assuming we have a DataFrame with 4 columns:

  1. import pandas as pd
  2. df_test = pd.DataFrame({
  3. 'date': ['2022-12-01', '2022-12-01', '2022-12-01', '2022-12-02', '2022-12-02', '2022-12-02'],
  4. 'id': ['id1', 'id1', 'id2', 'id3', 'id4', 'id4'],
  5. 'element': ['ip1', 'ip2', 'ip3', 'ip4', 'ip5', 'ip6'],
  6. 'related_id': ['rid1', 'rid2', 'rid3', 'rid4', 'rid5', 'rid6'],
  7. })
  8. df_test

This sample DataFrame looks like:

  1. 0 2022-12-01 id1 e1 rid1
  2. 1 2022-12-01 id1 e2 rid2
  3. 2 2022-12-01 id2 e3 rid3
  4. 3 2022-12-02 id3 e4 rid4
  5. 4 2022-12-02 id4 e5 rid5
  6. 5 2022-12-02 id4 e6 rid6

I need to:

  • Group by the first 2 columns: date and id
  • Aggregating the remaining columns as:
    • list of elements
    • distinct count of related_id
  1. (df_test
  2. .groupby(['date', 'id'], as_index=False)
  3. .agg(elements=('element', pd.Series.to_list),
  4. distinct_related_ids=('related_id', 'nunique')))

So far so good, the result is exactly what I was looking for:

  1. 0 2022-12-01 id1 [e1, e2] 2
  2. 1 2022-12-01 id2 [e3] 1
  3. 2 2022-12-02 id3 [e4] 1
  4. 3 2022-12-02 id4 [e5, e6] 2

Unfortunately when I run the exact same code on a much bigger DataFrame, I get the Error:

  1. ValueError: Length of values (263128) does not match length of index (8156968)

BTW the details for the StackTrace:

  1. File "/Users/fvitale/.pyenv/versions/adversary/lib/python3.10/site-packages/pandas/core/groupby/generic.py", line 950, in aggregate
  2. self._insert_inaxis_grouper_inplace(result)
  3. File "/Users/fvitale/.pyenv/versions/adversary/lib/python3.10/site-packages/pandas/core/groupby/generic.py", line 1485, in _insert_inaxis_grouper_inplace
  4. result.insert(0, name, lev)
  5. File "/Users/fvitale/.pyenv/versions/adversary/lib/python3.10/site-packages/pandas/core/frame.py", line 4821, in insert
  6. value = self._sanitize_column(value)
  7. File "/Users/fvitale/.pyenv/versions/adversary/lib/python3.10/site-packages/pandas/core/frame.py", line 4915, in _sanitize_column
  8. com.require_length_match(value, self.index)
  9. File "/Users/fvitale/.pyenv/versions/adversary/lib/python3.10/site-packages/pandas/core/common.py", line 571, in require_length_match
  10. raise ValueError(
  11. ValueError: Length of values (263128) does not match length of index (8156968)

What is going on here and how can I debug or rewrite this implementation to work with a bigger DataFrame?


EDIT

If I split that big DataFrame in single days, that groupby and agg combination works as expected. That error manifests itself only when multiple days are involved.


EDIT 2

While bisecting the "big DataFrame" looking for a smaller example as suggested by @mozway in the comments, I noticed that the problem seems related with groupby as changing the agg for the nunique gives the same error, with the same details:

  1. (df_big
  2. .groupby(['date', 'id'], as_index=False)
  3. .nunique())
  1. ValueError: Length of values (263128) does not match length of index (8156968)

Where 263128 is the number of unique id across the whole multi-day df_big.

Same error if using query on a single day:

  1. import datetime
  2. (df_big
  3. .query('date == @datetime.date(2022, 12, 1)')
  4. .groupby(['date', 'id'], as_index=False)
  5. .nunique())
  1. ValueError: Length of values (8682) does not match length of index (263128)

Where 8682 are the unique ids for that day:

  1. df_big.query('date == @datetime.date(2022, 12, 1)')['id'].nunique()

EDIT 3

In the attempt to rule out the assumption this problem is related to the size of the DataFrame, I followed the suggestion from the first link mentioned in the comments by @wjandrea and generated many synthetic DataFrames:

  1. import numpy as np
  2. rows = 10_000_000
  3. np.random.seed()
  4. df_test = pd.DataFrame({
  5. 'date': np.random.choice(pd.date_range('2022-12-01', periods=31, freq='D'), rows),
  6. 'id': np.random.choice(range(100), rows),
  7. 'element': np.random.randn(rows),
  8. 'related_id': np.random.choice(range(10), rows),
  9. })
  10. df_test_compressed = df_test_3.groupby(['date', 'id'], as_index=False).agg(elements=('element', pd.Series.to_list),distinct_related_ids=('related_id', 'nunique'))
  11. df_test_compressed['elements'].map(len).value_counts()

Unfortunately not a single instance tested threw that ValueError.


EDIT 4

"minimal reproducible example" with only 2 rows:

  1. import pandas as pd
  2. import numpy as np
  3. rows = 2
  4. np.random.seed(42)
  5. df = pd.DataFrame({
  6. 'date': np.random.choice(pd.date_range('2022-12-01', periods=31, freq='D'), rows),
  7. 'id': np.random.choice(range(100), rows),
  8. 'element': np.random.randn(rows),
  9. 'related_id': np.random.choice(range(10), rows),
  10. }).astype({
  11. 'id': 'category',
  12. 'element': 'category',
  13. 'related_id': 'category',
  14. })

Group by and aggregating:

  1. (df
  2. .groupby(['date', 'id'], as_index=False)
  3. .agg(elements=('element', pd.Series.to_list),distinct_related_ids=('related_id', 'nunique')))

Triggers the Error:

  1. ValueError: Length of values (2) does not match length of index (4)

EDIT 5

As of March 2023 it is still an unresolved bug.

答案1

得分: 1

问题在于索引中的一个列的“category”类型,这种情况下是“id”。

避免这个错误的一种方法是为“id”列使用字面类型,比如“string”:

  1. (df.astype({'id': 'string'})
  2. .groupby(['date', 'id'], as_index=False)
  3. .agg(elements=('element', pd.Series.to_list), distinct_related_ids=('related_id', 'nunique')))
英文:

The problem here is the category type of one of the columns in the index, id in this case.

One way to avoid that error is to use a literal type for the id column, for instance string:

  1. (df.astype({'id': 'string'})
  2. .groupby(['date', 'id'], as_index=False)
  3. .agg(elements=('element', pd.Series.to_list),distinct_related_ids=('related_id', 'nunique')))

huangapple
  • 本文由 发表于 2023年3月9日 23:12:45
  • 转载请务必保留本文链接:https://go.coder-hub.com/75686525.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定