Working with categorical columns in polars
Question
Let's say I have this:
>>> import numpy, polars
>>> df = polars.DataFrame(dict(j=numpy.random.randint(10, 99, 20)))
>>> df
shape: (20, 1)
┌─────┐
│ j   │
│ --- │
│ i64 │
╞═════╡
│ 47  │
│ 22  │
│ 82  │
│ 19  │
│ …   │
│ 28  │
│ 94  │
│ 21  │
│ 38  │
└─────┘
>>> df.get_column('j').hist([10, 20, 30, 50])
shape: (5, 3)
┌─────────────┬──────────────┬─────────┐
│ break_point ┆ category     ┆ j_count │
│ ---         ┆ ---          ┆ ---     │
│ f64         ┆ cat          ┆ u32     │
╞═════════════╪══════════════╪═════════╡
│ 10.0        ┆ (-inf, 10.0] ┆ 0       │
│ 20.0        ┆ (10.0, 20.0] ┆ 4       │
│ 30.0        ┆ (20.0, 30.0] ┆ 5       │
│ 50.0        ┆ (30.0, 50.0] ┆ 3       │
│ inf         ┆ (50.0, inf]  ┆ 8       │
└─────────────┴──────────────┴─────────┘
How would I go about working with the category column? For example, how would I filter the rows where category contains -inf, or where the upper bound is between 10.0 and 30.0, or something along those lines?
Answer 1
Score: 1
There may be a better way, but you can manually split the category labels and cast the pieces to float columns:
import polars as pl

hist = df.get_column('j').hist([10, 20, 30, 50])
hist.with_columns(
    pl.col('category')
      .cast(pl.Utf8)                # categorical label -> string, e.g. "(10.0, 20.0]"
      .str.strip('(]')              # drop the interval brackets
      .str.split(', ')              # -> ["10.0", "20.0"]
      .cast(pl.List(pl.Float64))    # -> [10.0, 20.0]
      .list.to_struct(fields=['lower', 'upper'])
).unnest('category')                # replace category with lower/upper columns
shape: (5, 4)
┌─────────────┬───────┬───────┬─────────┐
│ break_point ┆ lower ┆ upper ┆ j_count │
│ ---         ┆ ---   ┆ ---   ┆ ---     │
│ f64         ┆ f64   ┆ f64   ┆ u32     │
╞═════════════╪═══════╪═══════╪═════════╡
│ 10.0        ┆ -inf  ┆ 10.0  ┆ 0       │
│ 20.0        ┆ 10.0  ┆ 20.0  ┆ 1       │
│ 30.0        ┆ 20.0  ┆ 30.0  ┆ 1       │
│ 50.0        ┆ 30.0  ┆ 50.0  ┆ 6       │
│ inf         ┆ 50.0  ┆ inf   ┆ 12      │
└─────────────┴───────┴───────┴─────────┘
Update: You could also emulate the counts using expressions.
This could be wrapped in a function, but something like the following works:
bins = [10, 20, 30, 50]
edges = [float('-inf')] + bins + [float('inf')]
df.with_columns(hist =
    pl.coalesce(
        # one when/then per (lower, upper] interval; rows outside it yield null
        pl.when(pl.col('j').is_between(lower, upper, closed='right'))
          .then(pl.struct(break_point=upper, lower=lower, upper=upper))
        for lower, upper in zip(edges, edges[1:])
    )
).groupby('hist').count().unnest('hist')
shape: (4, 4)
┌─────────────┬───────┬───────┬───────┐
│ break_point ┆ lower ┆ upper ┆ count │
│ ---         ┆ ---   ┆ ---   ┆ ---   │
│ f64         ┆ f64   ┆ f64   ┆ u32   │
╞═════════════╪═══════╪═══════╪═══════╡
│ 50.0        ┆ 30.0  ┆ 50.0  ┆ 6     │
│ inf         ┆ 50.0  ┆ inf   ┆ 12    │
│ 30.0        ┆ 20.0  ┆ 30.0  ┆ 1     │
│ 20.0        ┆ 10.0  ┆ 20.0  ┆ 1     │
└─────────────┴───────┴───────┴───────┘
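To make it clearer what the closed='right' binning computes, here is a rough pure-Python sketch of the same counting logic using the standard library's bisect. It is not how polars does it internally; right_closed_counts is a made-up helper name, and the sample values are just the eight rows visible in the question's printout.

```python
from bisect import bisect_left
from collections import Counter


def right_closed_counts(values, bins):
    """Count values per (lower, upper] interval; edges are padded with -inf and inf."""
    edges = [float("-inf"), *bins, float("inf")]
    counts = Counter()
    for v in values:
        # bisect_left gives the index of the first break >= v, which is also the
        # index i of the interval (edges[i], edges[i + 1]] that contains v.
        i = bisect_left(bins, v)
        counts[(edges[i], edges[i + 1])] += 1
    return counts


sample = [47, 22, 82, 19, 28, 94, 21, 38]  # rows visible in the question's output
counts = right_closed_counts(sample, [10, 20, 30, 50])
for (lower, upper), n in sorted(counts.items()):
    print(f"({lower}, {upper}] -> {n}")
```

As with the groupby version, intervals that receive no values are simply absent from the result, which is why the table above has four rows instead of five.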