
Length of values does not match length of index for groupby on categorical columns

Question


Assuming we have a DataFrame with 4 columns:

import pandas as pd

df_test = pd.DataFrame({
    'date': ['2022-12-01', '2022-12-01', '2022-12-01', '2022-12-02', '2022-12-02', '2022-12-02'],
    'id': ['id1', 'id1', 'id2', 'id3', 'id4', 'id4'],
    'element': ['ip1', 'ip2', 'ip3', 'ip4', 'ip5', 'ip6'],
    'related_id': ['rid1', 'rid2', 'rid3', 'rid4', 'rid5', 'rid6'],
})
df_test

This sample DataFrame looks like:

0	2022-12-01	id1	ip1	rid1
1	2022-12-01	id1	ip2	rid2
2	2022-12-01	id2	ip3	rid3
3	2022-12-02	id3	ip4	rid4
4	2022-12-02	id4	ip5	rid5
5	2022-12-02	id4	ip6	rid6

I need to:

  • Group by the first 2 columns: date and id
  • Aggregate the remaining columns as:
    • element: a list of its values
    • related_id: the count of distinct values
(df_test
.groupby(['date', 'id'], as_index=False)
.agg(elements=('element', pd.Series.to_list),
     distinct_related_ids=('related_id', 'nunique')))

So far so good, the result is exactly what I was looking for:

0	2022-12-01	id1	[ip1, ip2]	2
1	2022-12-01	id2	[ip3]		1
2	2022-12-02	id3	[ip4]		1
3	2022-12-02	id4	[ip5, ip6]	2

Unfortunately when I run the exact same code on a much bigger DataFrame, I get the Error:

ValueError: Length of values (263128) does not match length of index (8156968)

For reference, the details of the stack trace:

  File "/Users/fvitale/.pyenv/versions/adversary/lib/python3.10/site-packages/pandas/core/groupby/generic.py", line 950, in aggregate
    self._insert_inaxis_grouper_inplace(result)
  File "/Users/fvitale/.pyenv/versions/adversary/lib/python3.10/site-packages/pandas/core/groupby/generic.py", line 1485, in _insert_inaxis_grouper_inplace
    result.insert(0, name, lev)
  File "/Users/fvitale/.pyenv/versions/adversary/lib/python3.10/site-packages/pandas/core/frame.py", line 4821, in insert
    value = self._sanitize_column(value)
  File "/Users/fvitale/.pyenv/versions/adversary/lib/python3.10/site-packages/pandas/core/frame.py", line 4915, in _sanitize_column
    com.require_length_match(value, self.index)
  File "/Users/fvitale/.pyenv/versions/adversary/lib/python3.10/site-packages/pandas/core/common.py", line 571, in require_length_match
    raise ValueError(
ValueError: Length of values (263128) does not match length of index (8156968)

What is going on here and how can I debug or rewrite this implementation to work with a bigger DataFrame?


EDIT

If I split that big DataFrame into single days, the groupby and agg combination works as expected. The error manifests itself only when multiple days are involved.


EDIT 2

While bisecting the "big DataFrame" looking for a smaller example, as suggested by @mozway in the comments, I noticed that the problem seems related to groupby itself: replacing the agg with a plain nunique gives the same error, with the same details:

(df_big
.groupby(['date', 'id'], as_index=False)
.nunique())
ValueError: Length of values (263128) does not match length of index (8156968)

Where 263128 is the number of unique ids across the whole multi-day df_big.

The same error occurs when selecting a single day with query:

import datetime

(df_big
.query('date == @datetime.date(2022, 12, 1)')
.groupby(['date', 'id'], as_index=False)
.nunique())
ValueError: Length of values (8682) does not match length of index (263128)

Where 8682 is the number of unique ids for that day:

df_big.query('date == @datetime.date(2022, 12, 1)')['id'].nunique()
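Those numbers suggest checking the dtypes of the grouping columns. If id is categorical, pandas (with observed=False, the default at the time of this question) builds one group per declared category rather than per observed value, so the result length tracks the category set of the whole frame instead of the values present in the selection. A toy sketch of that behavior (hypothetical data, not df_big):

```python
import pandas as pd

# Toy sketch: with a categorical key, groupby enumerates the *declared*
# categories, not just the *observed* values, when observed=False.
df = pd.DataFrame({
    'id': pd.Categorical(['a', 'a', 'b'], categories=['a', 'b', 'c']),
    'x': [1, 2, 3],
})

print(df['id'].nunique())            # 2 observed values
print(len(df['id'].cat.categories))  # 3 declared categories

# The result has a row for the never-observed category 'c' as well:
print(df.groupby('id', observed=False).size())
```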

EDIT 3

To rule out the assumption that this problem is related to the size of the DataFrame, I followed the suggestion from the first link mentioned in the comments by @wjandrea and generated many synthetic DataFrames:

import numpy as np

rows = 10_000_000
np.random.seed()

df_test = pd.DataFrame({
    'date': np.random.choice(pd.date_range('2022-12-01', periods=31, freq='D'), rows),
    'id': np.random.choice(range(100), rows),
    'element': np.random.randn(rows),
    'related_id': np.random.choice(range(10), rows),
})
df_test_compressed = df_test.groupby(['date', 'id'], as_index=False).agg(
    elements=('element', pd.Series.to_list),
    distinct_related_ids=('related_id', 'nunique'))
df_test_compressed['elements'].map(len).value_counts()

Unfortunately, not one of the tested instances threw that ValueError.


EDIT 4

"minimal reproducible example" with only 2 rows:

import pandas as pd
import numpy as np


rows = 2
np.random.seed(42)

df = pd.DataFrame({
    'date': np.random.choice(pd.date_range('2022-12-01', periods=31, freq='D'), rows),
    'id': np.random.choice(range(100), rows),
    'element': np.random.randn(rows),
    'related_id': np.random.choice(range(10), rows),
}).astype({
    'id': 'category',
    'element': 'category',
    'related_id': 'category',
})

Grouping and aggregating:

(df
 .groupby(['date', 'id'], as_index=False)
 .agg(elements=('element', pd.Series.to_list),
      distinct_related_ids=('related_id', 'nunique')))

triggers the error:

ValueError: Length of values (2) does not match length of index (4)
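One way to probe the minimal example above is to group with observed=True, which restricts the output to category combinations that actually occur in the data instead of the cartesian product of the declared categories. This is a sketch of a possible workaround, not a fix for the underlying bug; whether it avoids the error may depend on the pandas version:

```python
import numpy as np
import pandas as pd

# Rebuild the 2-row example from above.
rows = 2
np.random.seed(42)

df = pd.DataFrame({
    'date': np.random.choice(pd.date_range('2022-12-01', periods=31, freq='D'), rows),
    'id': np.random.choice(range(100), rows),
    'element': np.random.randn(rows),
    'related_id': np.random.choice(range(10), rows),
}).astype({
    'id': 'category',
    'element': 'category',
    'related_id': 'category',
})

# observed=True emits only the (date, id) combinations that actually occur,
# instead of expanding over all declared categories.
out = (df
       .groupby(['date', 'id'], as_index=False, observed=True)
       .agg(elements=('element', pd.Series.to_list),
            distinct_related_ids=('related_id', 'nunique')))
print(out)
```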

EDIT 5

As of March 2023 it is still an unresolved bug.

Answer 1

Score: 1

The problem here is the category dtype of one of the grouping columns, id in this case.

One way to avoid that error is to cast the id column to a non-categorical dtype, for instance string:

(df.astype({'id': 'string'})
 .groupby(['date', 'id'], as_index=False)
 .agg(elements=('element', pd.Series.to_list),
      distinct_related_ids=('related_id', 'nunique')))
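For completeness, a self-contained run of this workaround on the 2-row example from the question's EDIT 4 (a sketch; the exact values depend on the random seed):

```python
import numpy as np
import pandas as pd

# Rebuild the 2-row example, then apply the workaround: cast id to 'string'
# before grouping so no grouping key is categorical.
rows = 2
np.random.seed(42)

df = pd.DataFrame({
    'date': np.random.choice(pd.date_range('2022-12-01', periods=31, freq='D'), rows),
    'id': np.random.choice(range(100), rows),
    'element': np.random.randn(rows),
    'related_id': np.random.choice(range(10), rows),
}).astype({
    'id': 'category',
    'element': 'category',
    'related_id': 'category',
})

out = (df.astype({'id': 'string'})
       .groupby(['date', 'id'], as_index=False)
       .agg(elements=('element', pd.Series.to_list),
            distinct_related_ids=('related_id', 'nunique')))
print(out)
```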

huangapple
  • Posted on 2023-03-09 23:12:45
  • Please keep the original link when reposting: https://go.coder-hub.com/75686525.html