Length of values does not match length of index for groupby on categorical columns
Assuming we have a DataFrame with 4 columns:
import pandas as pd
df_test = pd.DataFrame({
'date': ['2022-12-01', '2022-12-01', '2022-12-01', '2022-12-02', '2022-12-02', '2022-12-02'],
'id': ['id1', 'id1', 'id2', 'id3', 'id4', 'id4'],
'element': ['ip1', 'ip2', 'ip3', 'ip4', 'ip5', 'ip6'],
'related_id': ['rid1', 'rid2', 'rid3', 'rid4', 'rid5', 'rid6'],
})
df_test
This sample DataFrame looks like:
0 2022-12-01 id1 ip1 rid1
1 2022-12-01 id1 ip2 rid2
2 2022-12-01 id2 ip3 rid3
3 2022-12-02 id3 ip4 rid4
4 2022-12-02 id4 ip5 rid5
5 2022-12-02 id4 ip6 rid6
I need to:
- Group by the first 2 columns: date and id
- Aggregate the remaining columns as:
  - a list of element values
  - a distinct count of related_id values
(df_test
.groupby(['date', 'id'], as_index=False)
.agg(elements=('element', pd.Series.to_list),
distinct_related_ids=('related_id', 'nunique')))
So far so good, the result is exactly what I was looking for:
0 2022-12-01 id1 [ip1, ip2] 2
1 2022-12-01 id2 [ip3] 1
2 2022-12-02 id3 [ip4] 1
3 2022-12-02 id4 [ip5, ip6] 2
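For reference, the example above is easy to re-run end to end; this just repeats the code from the post in one self-contained block, as a sanity check of the expected output:

```python
import pandas as pd

# Re-running the small example end to end, as a self-contained
# sanity check of the expected output shown above.
df_test = pd.DataFrame({
    'date': ['2022-12-01', '2022-12-01', '2022-12-01',
             '2022-12-02', '2022-12-02', '2022-12-02'],
    'id': ['id1', 'id1', 'id2', 'id3', 'id4', 'id4'],
    'element': ['ip1', 'ip2', 'ip3', 'ip4', 'ip5', 'ip6'],
    'related_id': ['rid1', 'rid2', 'rid3', 'rid4', 'rid5', 'rid6'],
})

out = (df_test
       .groupby(['date', 'id'], as_index=False)
       .agg(elements=('element', pd.Series.to_list),
            distinct_related_ids=('related_id', 'nunique')))
print(out)
```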
Unfortunately, when I run the exact same code on a much bigger DataFrame, I get this error:
ValueError: Length of values (263128) does not match length of index (8156968)
For reference, the details from the stack trace:
File "/Users/fvitale/.pyenv/versions/adversary/lib/python3.10/site-packages/pandas/core/groupby/generic.py", line 950, in aggregate
self._insert_inaxis_grouper_inplace(result)
File "/Users/fvitale/.pyenv/versions/adversary/lib/python3.10/site-packages/pandas/core/groupby/generic.py", line 1485, in _insert_inaxis_grouper_inplace
result.insert(0, name, lev)
File "/Users/fvitale/.pyenv/versions/adversary/lib/python3.10/site-packages/pandas/core/frame.py", line 4821, in insert
value = self._sanitize_column(value)
File "/Users/fvitale/.pyenv/versions/adversary/lib/python3.10/site-packages/pandas/core/frame.py", line 4915, in _sanitize_column
com.require_length_match(value, self.index)
File "/Users/fvitale/.pyenv/versions/adversary/lib/python3.10/site-packages/pandas/core/common.py", line 571, in require_length_match
raise ValueError(
ValueError: Length of values (263128) does not match length of index (8156968)
What is going on here and how can I debug or rewrite this implementation to work with a bigger DataFrame?
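One generic first step for debugging a length mismatch like this (my suggestion, not something from the post) is to check the dtypes of the grouping columns: with a categorical grouper, pandas by default builds a group entry for every category combination, not only the observed ones, so the group index can be much longer than the number of groups that actually occur. A minimal sketch on a toy frame:

```python
import pandas as pd

# Toy frame where 'id' is categorical and only 3 of the possible
# (date, id) combinations actually occur in the data.
df = pd.DataFrame({
    'date': ['2022-12-01', '2022-12-01', '2022-12-02'],
    'id': pd.Categorical(['id1', 'id2', 'id3']),
})
print(df.dtypes)  # 'id' shows up as category

# With observed=False (the historical default) the group index is the
# cartesian product: 2 dates x 3 categories = 6 entries, even though
# only 3 combinations occur in the data.
n_all = len(df.groupby(['date', 'id'], observed=False).size())
n_obs = len(df.groupby(['date', 'id'], observed=True).size())
print(n_all, n_obs)
```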
EDIT
If I split that big DataFrame into single days, that groupby and agg combination works as expected. The error manifests itself only when multiple days are involved.
EDIT 2
While bisecting the "big DataFrame" looking for a smaller example, as suggested by @mozway in the comments, I noticed that the problem seems related to groupby itself, since swapping the agg for a plain nunique gives the same error, with the same details:
(df_big
.groupby(['date', 'id'], as_index=False)
.nunique())
ValueError: Length of values (263128) does not match length of index (8156968)
Where 263128 is the number of unique ids across the whole multi-day df_big.
The same error occurs when using query to select a single day:
import datetime
(df_big
.query('date == @datetime.date(2022, 12, 1)')
.groupby(['date', 'id'], as_index=False)
.nunique())
ValueError: Length of values (8682) does not match length of index (263128)
Where 8682 is the number of unique ids for that day:
df_big.query('date == @datetime.date(2022, 12, 1)')['id'].nunique()
EDIT 3
In an attempt to rule out the assumption that this problem is related to the size of the DataFrame, I followed the suggestion from the first link @wjandrea mentioned in the comments and generated many synthetic DataFrames:
import numpy as np
rows = 10_000_000
np.random.seed()
df_test = pd.DataFrame({
'date': np.random.choice(pd.date_range('2022-12-01', periods=31, freq='D'), rows),
'id': np.random.choice(range(100), rows),
'element': np.random.randn(rows),
'related_id': np.random.choice(range(10), rows),
})
df_test_compressed = df_test.groupby(['date', 'id'], as_index=False).agg(elements=('element', pd.Series.to_list), distinct_related_ids=('related_id', 'nunique'))
df_test_compressed['elements'].map(len).value_counts()
Unfortunately, not a single instance tested threw that ValueError.
EDIT 4
A "minimal reproducible example" with only 2 rows:
import pandas as pd
import numpy as np
rows = 2
np.random.seed(42)
df = pd.DataFrame({
'date': np.random.choice(pd.date_range('2022-12-01', periods=31, freq='D'), rows),
'id': np.random.choice(range(100), rows),
'element': np.random.randn(rows),
'related_id': np.random.choice(range(10), rows),
}).astype({
'id': 'category',
'element': 'category',
'related_id': 'category',
})
Grouping and aggregating:
(df
.groupby(['date', 'id'], as_index=False)
.agg(elements=('element', pd.Series.to_list), distinct_related_ids=('related_id', 'nunique')))
This triggers the error:
ValueError: Length of values (2) does not match length of index (4)
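For what it's worth, the error path depends on unobserved category combinations being added to the group index, so passing observed=True is one way to sidestep it. A sketch on a hand-built, deterministic variant of the repro above (hand-picked values, since the random ones depend on the seed):

```python
import pandas as pd

# Deterministic 2-row variant of the repro: same categorical dtypes,
# hand-picked values instead of np.random.
df = pd.DataFrame({
    'date': pd.to_datetime(['2022-12-07', '2022-12-20']),
    'id': [51, 92],
    'element': [0.496, -0.138],
    'related_id': [3, 7],
}).astype({
    'id': 'category',
    'element': 'category',
    'related_id': 'category',
})

# observed=True restricts the result to category combinations that
# actually occur in the data, so the lengths line up again.
result = (df
          .groupby(['date', 'id'], as_index=False, observed=True)
          .agg(elements=('element', pd.Series.to_list),
               distinct_related_ids=('related_id', 'nunique')))
print(result)
```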
EDIT 5
As of March 2023 it is still an unresolved bug.
Answer 1 (score: 1)
The problem here is the category type of one of the columns used in the group index, id in this case.
One way to avoid the error is to use a non-categorical dtype for the id column, for instance string:
(df.astype({'id': 'string'})
.groupby(['date', 'id'], as_index=False)
.agg(elements=('element', pd.Series.to_list), distinct_related_ids=('related_id', 'nunique')))
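As a quick check that the cast alone is enough, here is the same fix applied to a hand-built two-row frame in the shape of the minimal repro (deterministic values of my own choosing, instead of the seeded random ones):

```python
import pandas as pd

# Two rows, two (date, id) groups, with the same categorical dtypes
# as the minimal reproducible example.
df = pd.DataFrame({
    'date': pd.to_datetime(['2022-12-07', '2022-12-20']),
    'id': ['id51', 'id92'],
    'element': ['ip1', 'ip2'],
    'related_id': ['rid3', 'rid7'],
}).astype({
    'id': 'category',
    'element': 'category',
    'related_id': 'category',
})

# Casting the grouping column away from 'category' removes the
# unobserved-combinations behaviour, so the lengths match again.
fixed = (df.astype({'id': 'string'})
         .groupby(['date', 'id'], as_index=False)
         .agg(elements=('element', pd.Series.to_list),
              distinct_related_ids=('related_id', 'nunique')))
print(fixed)
```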