Question

Let's say I have this:

>>> import numpy, polars
>>> df = polars.DataFrame(dict(j=numpy.random.randint(10, 99, 20)))
>>> df
shape: (20, 1)
┌─────┐
│ j   │
│ --- │
│ i64 │
╞═════╡
│ 47  │
│ 22  │
│ 82  │
│ 19  │
│ …   │
│ 28  │
│ 94  │
│ 21  │
│ 38  │
└─────┘
>>> df.get_column('j').hist([10, 20, 30, 50])
shape: (5, 3)
┌─────────────┬──────────────┬─────────┐
│ break_point ┆ category     ┆ j_count │
│ ---         ┆ ---          ┆ ---     │
│ f64         ┆ cat          ┆ u32     │
╞═════════════╪══════════════╪═════════╡
│ 10.0        ┆ (-inf, 10.0] ┆ 0       │
│ 20.0        ┆ (10.0, 20.0] ┆ 4       │
│ 30.0        ┆ (20.0, 30.0] ┆ 5       │
│ 50.0        ┆ (30.0, 50.0] ┆ 3       │
│ inf         ┆ (50.0, inf]  ┆ 8       │
└─────────────┴──────────────┴─────────┘

How would I go about working with the category column? For example, how would I filter rows where category contains -inf, or where the upper bound is between 10.0 and 30.0, or something along those lines?

Answer 1

Score: 1

Perhaps there is a better way, but you could perform the manual steps to split the result and turn it into float columns:

import polars as pl

hist = df.get_column('j').hist([10, 20, 30, 50])

hist.with_columns(
   pl.col('category')
     .cast(pl.Utf8)                # categorical -> string, e.g. "(10.0, 20.0]"
     .str.strip('(]')              # drop the brackets (renamed str.strip_chars in newer Polars releases)
     .str.split(', ')              # e.g. ["10.0", "20.0"]
     .cast(pl.List(pl.Float64))    # "-inf" and "inf" become float values
     .list.to_struct(fields=['lower', 'upper'])
).unnest('category')

shape: (5, 4)
┌─────────────┬───────┬───────┬─────────┐
│ break_point ┆ lower ┆ upper ┆ j_count │
│ ---         ┆ ---   ┆ ---   ┆ ---     │
│ f64         ┆ f64   ┆ f64   ┆ u32     │
╞═════════════╪═══════╪═══════╪═════════╡
│ 10.0        ┆ -inf  ┆ 10.0  ┆ 0       │
│ 20.0        ┆ 10.0  ┆ 20.0  ┆ 1       │
│ 30.0        ┆ 20.0  ┆ 30.0  ┆ 1       │
│ 50.0        ┆ 30.0  ┆ 50.0  ┆ 6       │
│ inf         ┆ 50.0  ┆ inf   ┆ 12      │
└─────────────┴───────┴───────┴─────────┘

Update: Maybe you could emulate the counts using expressions.

You could wrap this in a function, but something like the following also works:

bins = [10, 20, 30, 50]

# One when/then per bin; coalesce keeps the first non-null struct for each row.
# Empty bins (here (-inf, 10.0]) produce no group, so they are missing from the result.
# groupby was renamed group_by in newer Polars releases.
df.with_columns(hist =
   pl.coalesce(
      pl.when(pl.col('j').is_between(lower, upper, closed='right'))
        .then(pl.struct(break_point=upper, lower=lower, upper=upper))
      for bins in [[float('-inf')] + bins + [float('inf')]]
      for idx  in range(len(bins) - 1)
      for lower, upper in [[bins[idx], bins[idx + 1]]]
   )
).groupby('hist').count().unnest('hist')

shape: (4, 4)
┌─────────────┬───────┬───────┬───────┐
│ break_point ┆ lower ┆ upper ┆ count │
│ ---         ┆ ---   ┆ ---   ┆ ---   │
│ f64         ┆ f64   ┆ f64   ┆ u32   │
╞═════════════╪═══════╪═══════╪═══════╡
│ 50.0        ┆ 30.0  ┆ 50.0  ┆ 6     │
│ inf         ┆ 50.0  ┆ inf   ┆ 12    │
│ 30.0        ┆ 20.0  ┆ 30.0  ┆ 1     │
│ 20.0        ┆ 10.0  ┆ 20.0  ┆ 1     │
└─────────────┴───────┴───────┴───────┘
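
As a cross-check on the right-closed bin logic used above, the same assignment can be sketched in plain Python with the standard library. This is only an illustration and not part of the original answer: the helper `right_closed_bin` and the sample `values` are hypothetical, chosen to exercise values that sit exactly on a break point.

```python
from bisect import bisect_left
from collections import Counter

bins = [10, 20, 30, 50]
edges = [float("-inf")] + bins + [float("inf")]

def right_closed_bin(value):
    """Return the (lower, upper] interval that contains value."""
    # bisect_left counts the break points strictly below value,
    # so a value equal to a break point falls in the bin ending there.
    idx = bisect_left(bins, value)
    return edges[idx], edges[idx + 1]

# Hypothetical sample values, including ones that equal a break point.
values = [10, 11, 20, 21, 30, 50, 51, 98]
counts = Counter(right_closed_bin(v) for v in values)

for (lower, upper), n in sorted(counts.items()):
    print(f"({lower}, {upper}]: {n}")
```

As with the group-by version, a bin that receives no values simply does not appear in `counts`.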
