
Length of values does not match length of index for groupby on categorical columns

Question


Assuming we have a DataFrame with 4 columns:

import pandas as pd

df_test = pd.DataFrame({
    'date': ['2022-12-01', '2022-12-01', '2022-12-01', '2022-12-02', '2022-12-02', '2022-12-02'],
    'id': ['id1', 'id1', 'id2', 'id3', 'id4', 'id4'],
    'element': ['ip1', 'ip2', 'ip3', 'ip4', 'ip5', 'ip6'],
    'related_id': ['rid1', 'rid2', 'rid3', 'rid4', 'rid5', 'rid6'],
})
df_test

This sample DataFrame looks like:

0	2022-12-01	id1	ip1	rid1
1	2022-12-01	id1	ip2	rid2
2	2022-12-01	id2	ip3	rid3
3	2022-12-02	id3	ip4	rid4
4	2022-12-02	id4	ip5	rid5
5	2022-12-02	id4	ip6	rid6

I need to:

  • Group by the first 2 columns: date and id
  • Aggregate the remaining columns as:
    • element: a list of its values
    • related_id: the count of distinct values
(df_test
.groupby(['date', 'id'], as_index=False)
.agg(elements=('element', pd.Series.to_list),
     distinct_related_ids=('related_id', 'nunique')))

So far so good, the result is exactly what I was looking for:

0	2022-12-01	id1	[ip1, ip2]	2
1	2022-12-01	id2	[ip3]		1
2	2022-12-02	id3	[ip4]		1
3	2022-12-02	id4	[ip5, ip6]	2

Unfortunately when I run the exact same code on a much bigger DataFrame, I get the Error:

ValueError: Length of values (263128) does not match length of index (8156968)

For reference, the details of the stack trace:

  File "/Users/fvitale/.pyenv/versions/adversary/lib/python3.10/site-packages/pandas/core/groupby/generic.py", line 950, in aggregate
    self._insert_inaxis_grouper_inplace(result)
  File "/Users/fvitale/.pyenv/versions/adversary/lib/python3.10/site-packages/pandas/core/groupby/generic.py", line 1485, in _insert_inaxis_grouper_inplace
    result.insert(0, name, lev)
  File "/Users/fvitale/.pyenv/versions/adversary/lib/python3.10/site-packages/pandas/core/frame.py", line 4821, in insert
    value = self._sanitize_column(value)
  File "/Users/fvitale/.pyenv/versions/adversary/lib/python3.10/site-packages/pandas/core/frame.py", line 4915, in _sanitize_column
    com.require_length_match(value, self.index)
  File "/Users/fvitale/.pyenv/versions/adversary/lib/python3.10/site-packages/pandas/core/common.py", line 571, in require_length_match
    raise ValueError(
ValueError: Length of values (263128) does not match length of index (8156968)

What is going on here and how can I debug or rewrite this implementation to work with a bigger DataFrame?


EDIT

If I split that big DataFrame into single days, the groupby and agg combination works as expected. The error manifests itself only when multiple days are involved.


EDIT 2

While bisecting the "big DataFrame" looking for a smaller example, as suggested by @mozway in the comments, I noticed that the problem seems related to groupby itself: replacing the agg with a plain nunique gives the same error, with the same details:

(df_big
.groupby(['date', 'id'], as_index=False)
.nunique())
ValueError: Length of values (263128) does not match length of index (8156968)

Where 263128 is the number of unique ids across the whole multi-day df_big.

The same error occurs when selecting a single day with query:

import datetime

(df_big
.query('date == @datetime.date(2022, 12, 1)')
.groupby(['date', 'id'], as_index=False)
.nunique())
ValueError: Length of values (8682) does not match length of index (263128)

Where 8682 is the number of unique ids for that day:

df_big.query('date == @datetime.date(2022, 12, 1)')['id'].nunique()
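Those numbers suggest checking the dtypes of the grouping columns. If id is categorical, pandas (with observed=False, the default at the time of this question) builds one group per declared category rather than per observed value, so the result length tracks the category set of the whole frame instead of the values present in the selection. A toy sketch of that behavior (hypothetical data, not df_big):

```python
import pandas as pd

# Toy sketch: with a categorical key, groupby enumerates the *declared*
# categories, not just the *observed* values, when observed=False.
df = pd.DataFrame({
    'id': pd.Categorical(['a', 'a', 'b'], categories=['a', 'b', 'c']),
    'x': [1, 2, 3],
})

print(df['id'].nunique())            # 2 observed values
print(len(df['id'].cat.categories))  # 3 declared categories

# The result has a row for the never-observed category 'c' as well:
print(df.groupby('id', observed=False).size())
```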

EDIT 3

To rule out the assumption that this problem is related to the size of the DataFrame, I followed the suggestion from the first link mentioned in the comments by @wjandrea and generated many synthetic DataFrames:

import numpy as np

rows = 10_000_000
np.random.seed()

df_test = pd.DataFrame({
    'date': np.random.choice(pd.date_range('2022-12-01', periods=31, freq='D'), rows),
    'id': np.random.choice(range(100), rows),
    'element': np.random.randn(rows),
    'related_id': np.random.choice(range(10), rows),
})
df_test_compressed = df_test.groupby(['date', 'id'], as_index=False).agg(
    elements=('element', pd.Series.to_list),
    distinct_related_ids=('related_id', 'nunique'))
df_test_compressed['elements'].map(len).value_counts()

Unfortunately, not one of the tested instances threw that ValueError.


EDIT 4

"minimal reproducible example" with only 2 rows:

import pandas as pd
import numpy as np


rows = 2
np.random.seed(42)

df = pd.DataFrame({
    'date': np.random.choice(pd.date_range('2022-12-01', periods=31, freq='D'), rows),
    'id': np.random.choice(range(100), rows),
    'element': np.random.randn(rows),
    'related_id': np.random.choice(range(10), rows),
}).astype({
    'id': 'category',
    'element': 'category',
    'related_id': 'category',
})

Grouping and aggregating:

(df
 .groupby(['date', 'id'], as_index=False)
 .agg(elements=('element', pd.Series.to_list),
      distinct_related_ids=('related_id', 'nunique')))

triggers the error:

ValueError: Length of values (2) does not match length of index (4)
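One way to probe the minimal example above is to group with observed=True, which restricts the output to category combinations that actually occur in the data instead of the cartesian product of the declared categories. This is a sketch of a possible workaround, not a fix for the underlying bug; whether it avoids the error may depend on the pandas version:

```python
import numpy as np
import pandas as pd

# Rebuild the 2-row example from above.
rows = 2
np.random.seed(42)

df = pd.DataFrame({
    'date': np.random.choice(pd.date_range('2022-12-01', periods=31, freq='D'), rows),
    'id': np.random.choice(range(100), rows),
    'element': np.random.randn(rows),
    'related_id': np.random.choice(range(10), rows),
}).astype({
    'id': 'category',
    'element': 'category',
    'related_id': 'category',
})

# observed=True emits only the (date, id) combinations that actually occur,
# instead of expanding over all declared categories.
out = (df
       .groupby(['date', 'id'], as_index=False, observed=True)
       .agg(elements=('element', pd.Series.to_list),
            distinct_related_ids=('related_id', 'nunique')))
print(out)
```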

EDIT 5

As of March 2023 it is still an unresolved bug.

Answer 1

Score: 1

The problem here is the category dtype of one of the grouping columns, id in this case.

One way to avoid that error is to cast the id column to a non-categorical dtype, for instance string:

(df.astype({'id': 'string'})
 .groupby(['date', 'id'], as_index=False)
 .agg(elements=('element', pd.Series.to_list),
      distinct_related_ids=('related_id', 'nunique')))
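For completeness, a self-contained run of this workaround on the 2-row example from the question's EDIT 4 (a sketch; the exact values depend on the random seed):

```python
import numpy as np
import pandas as pd

# Rebuild the 2-row example, then apply the workaround: cast id to 'string'
# before grouping so no grouping key is categorical.
rows = 2
np.random.seed(42)

df = pd.DataFrame({
    'date': np.random.choice(pd.date_range('2022-12-01', periods=31, freq='D'), rows),
    'id': np.random.choice(range(100), rows),
    'element': np.random.randn(rows),
    'related_id': np.random.choice(range(10), rows),
}).astype({
    'id': 'category',
    'element': 'category',
    'related_id': 'category',
})

out = (df.astype({'id': 'string'})
       .groupby(['date', 'id'], as_index=False)
       .agg(elements=('element', pd.Series.to_list),
            distinct_related_ids=('related_id', 'nunique')))
print(out)
```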

huangapple
  • Posted on 2023-03-09 23:12:45
  • Please keep the original link when reposting: https://go.coder-hub.com/75686525.html