Working with categorical columns in polars

Question

Let's say I have this:

>>> import numpy
>>> import polars
>>> df = polars.DataFrame(dict(j=numpy.random.randint(10, 99, 20)))
>>> df
shape: (20, 1)
┌─────┐
│ j   │
│ --- │
│ i64 │
╞═════╡
│ 47  │
│ 22  │
│ 82  │
│ 19  │
│ …   │
│ 28  │
│ 94  │
│ 21  │
│ 38  │
└─────┘
>>> df.get_column('j').hist([10, 20, 30, 50])
shape: (5, 3)
┌─────────────┬──────────────┬─────────┐
│ break_point ┆ category     ┆ j_count │
│ ---         ┆ ---          ┆ ---     │
│ f64         ┆ cat          ┆ u32     │
╞═════════════╪══════════════╪═════════╡
│ 10.0        ┆ (-inf, 10.0] ┆ 0       │
│ 20.0        ┆ (10.0, 20.0] ┆ 4       │
│ 30.0        ┆ (20.0, 30.0] ┆ 5       │
│ 50.0        ┆ (30.0, 50.0] ┆ 3       │
│ inf         ┆ (50.0, inf]  ┆ 8       │
└─────────────┴──────────────┴─────────┘

How would I go about working with the category column? For example, how would I filter rows where category contains -inf, or where the upper bound is between 10.0 and 30.0, or something along those lines?

Answer 1

Score: 1

There may be a better way, but you can manually split the category strings and convert the two bounds into float columns:

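import polars as pl

hist = df.get_column('j').hist([10, 20, 30, 50])

hist.with_columns(
   pl.col('category')
     .cast(pl.Utf8)              # categorical -> string, e.g. "(10.0, 20.0]"
     .str.strip('(]')            # drop the brackets (named str.strip_chars in newer Polars releases)
     .str.split(', ')            # -> ["10.0", "20.0"]
     .cast(pl.List(pl.Float64))  # parse both bounds as floats
     .list.to_struct(fields=['lower', 'upper'])
).unnest('category')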
shape: (5, 4)
┌─────────────┬───────┬───────┬─────────┐
│ break_point ┆ lower ┆ upper ┆ j_count │
│ ---         ┆ ---   ┆ ---   ┆ ---     │
│ f64         ┆ f64   ┆ f64   ┆ u32     │
╞═════════════╪═══════╪═══════╪═════════╡
│ 10.0        ┆ -inf  ┆ 10.0  ┆ 0       │
│ 20.0        ┆ 10.0  ┆ 20.0  ┆ 1       │
│ 30.0        ┆ 20.0  ┆ 30.0  ┆ 1       │
│ 50.0        ┆ 30.0  ┆ 50.0  ┆ 6       │
│ inf         ┆ 50.0  ┆ inf   ┆ 12      │
└─────────────┴───────┴───────┴─────────┘

Update: You could also emulate the counts using expressions.

You could wrap this in a function, but something like the following also works:

bins = [10, 20, 30, 50]
edges = [float('-inf')] + bins + [float('inf')]

df.with_columns(
   hist=pl.coalesce(
      # each value falls into exactly one interval; the other branches yield null
      pl.when(pl.col('j').is_between(lower, upper, closed='right'))
        .then(pl.struct(break_point=upper, lower=lower, upper=upper))
      for lower, upper in zip(edges, edges[1:])
   )
).groupby('hist').count().unnest('hist')  # groupby is spelled group_by in newer Polars releases
shape: (4, 4)
┌─────────────┬───────┬───────┬───────┐
│ break_point ┆ lower ┆ upper ┆ count │
│ ---         ┆ ---   ┆ ---   ┆ ---   │
│ f64         ┆ f64   ┆ f64   ┆ u32   │
╞═════════════╪═══════╪═══════╪═══════╡
│ 50.0        ┆ 30.0  ┆ 50.0  ┆ 6     │
│ inf         ┆ 50.0  ┆ inf   ┆ 12    │
│ 30.0        ┆ 20.0  ┆ 30.0  ┆ 1     │
│ 20.0        ┆ 10.0  ┆ 20.0  ┆ 1     │
└─────────────┴───────┴───────┴───────┘

huangapple
  • Published on June 9, 2023, 09:22:41
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76436620.html