Polars Convert Back From Dummies

Question

In pandas I can use the [`from_dummies`](https://pandas.pydata.org/docs/reference/api/pandas.from_dummies.html) method to reverse one-hot encoding. There doesn't seem to be a built-in method for this in polars. Here is a basic example:

    import polars as pl

    df = pl.DataFrame({
        "col1_hi": [0,0,0,1,1],
        "col1_med": [0,0,1,0,0],
        "col1_lo": [1,1,0,0,0],
        "col2_yes": [1,1,0,1,0],
        "col2_no": [0,0,1,0,1],
    })

    ┌─────────┬──────────┬─────────┬──────────┬─────────┐
    │ col1_hi ┆ col1_med ┆ col1_lo ┆ col2_yes ┆ col2_no │
    │ ---     ┆ ---      ┆ ---     ┆ ---      ┆ ---     │
    │ i64     ┆ i64      ┆ i64     ┆ i64      ┆ i64     │
    ╞═════════╪══════════╪═════════╪══════════╪═════════╡
    │ 0       ┆ 0        ┆ 1       ┆ 1        ┆ 0       │
    │ 0       ┆ 0        ┆ 1       ┆ 1        ┆ 0       │
    │ 0       ┆ 1        ┆ 0       ┆ 0        ┆ 1       │
    │ 1       ┆ 0        ┆ 0       ┆ 1        ┆ 0       │
    │ 1       ┆ 0        ┆ 0       ┆ 0        ┆ 1       │
    └─────────┴──────────┴─────────┴──────────┴─────────┘

Reversing the to_dummies operation should result in something like this:

    ┌──────┬──────┐
    │ col1 ┆ col2 │
    │ ---  ┆ ---  │
    │ str  ┆ str  │
    ╞══════╪══════╡
    │ lo   ┆ yes  │
    │ lo   ┆ yes  │
    │ med  ┆ no   │
    │ hi   ┆ yes  │
    │ hi   ┆ no   │
    └──────┴──────┘

My first thought was to use a pivot. How could I go about implementing this functionality?

Answer 1

Score: 4

You could utilize `pl.coalesce`:

    (df
     .with_columns(
        # Replace each 1 with the category name (the part after the last "_");
        # every other cell becomes null.
        pl.when(pl.col(col) == 1)
          .then(pl.lit(col).str.extract(r"([^_]+$)"))
          .alias(col)
        for col in df.columns)
     .select(
        # For each prefix, take the first non-null value across its columns.
        pl.coalesce(pl.col(f"^{prefix}_.+$")).alias(prefix)
        for prefix in dict.fromkeys(
           col.rsplit("_", maxsplit=1)[0]
           for col in df.columns
        )
     ))

    shape: (5, 2)
    ┌──────┬──────┐
    │ col1 ┆ col2 │
    │ ---  ┆ ---  │
    │ str  ┆ str  │
    ╞══════╪══════╡
    │ lo   ┆ yes  │
    │ lo   ┆ yes  │
    │ med  ┆ no   │
    │ hi   ┆ yes  │
    │ hi   ┆ no   │
    └──────┴──────┘

Update: @Rodalm's approach is much neater:

    def from_dummies(df, separator="_"):
        col_exprs = {}
        for col in df.columns:
            name, value = col.rsplit(separator, maxsplit=1)
            # pl.lit is needed because recent polars versions read a bare
            # string passed to .then() as a column name.
            expr = pl.when(pl.col(col) == 1).then(pl.lit(value))
            col_exprs.setdefault(name, []).append(expr)
        return df.select(
            pl.coalesce(exprs)  # keep the first non-null expression value by row
            .alias(name)
            for name, exprs in col_exprs.items()
        )

Answer 2

Score: 2

A similar approach to @jqurious's answer using `pl.coalesce`:

    from collections import defaultdict
    import polars as pl

    df = pl.DataFrame({
        "col1_hi": [0,0,0,1,1],
        "col1_med": [0,0,1,0,0],
        "col1_lo": [1,1,0,0,0],
        "col2_yes": [1,1,0,1,0],
        "col2_no": [0,0,1,0,1],
    })

    def from_dummies(df, sep="_"):
        col_exprs = defaultdict(list)
        for col in df.columns:
            # Split at the last separator so prefixes may contain it too.
            name, value = col.rsplit(sep, maxsplit=1)
            expr = pl.when(pl.col(col) == 1).then(pl.lit(value))  # null otherwise
            col_exprs[name].append(expr)
        res = df.select(**{
            name: pl.coalesce(exprs)  # keep the first non-null expression value by row
            for name, exprs in col_exprs.items()
        })
        return res

Or, generalizing @warwick12's approach, by chaining multiple when/then clauses:

    def from_dummies(df, sep="_"):
        col_exprs = {}
        for col in df.columns:
            name, value = col.rsplit(sep, maxsplit=1)
            if name not in col_exprs:
                # First dummy column for this prefix starts the chain.
                col_exprs[name] = pl.when(pl.col(col) == 1).then(pl.lit(value))
            else:
                # Later columns extend the chain with another when/then pair.
                col_exprs[name] = col_exprs[name].when(pl.col(col) == 1).then(pl.lit(value))
        return df.select(**col_exprs)

Output:

    >>> from_dummies(df)
    shape: (5, 2)
    ┌──────┬──────┐
    │ col1 ┆ col2 │
    │ ---  ┆ ---  │
    │ str  ┆ str  │
    ╞══════╪══════╡
    │ lo   ┆ yes  │
    │ lo   ┆ yes  │
    │ med  ┆ no   │
    │ hi   ┆ yes  │
    │ hi   ┆ no   │
    └──────┴──────┘
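
Both functions depend on splitting each column name into a prefix and a category. As a small sketch of that parsing step on its own (the `split_dummy_name` helper and the `blood_type_A` column name are made up for illustration), splitting once from the right is the safer choice when the prefix itself may contain the separator:

```python
def split_dummy_name(column: str, sep: str = "_") -> tuple[str, str]:
    """Split a dummy column name into (prefix, category) at the last separator."""
    prefix, category = column.rsplit(sep, maxsplit=1)
    return prefix, category


print(split_dummy_name("col1_hi"))       # ('col1', 'hi')
print(split_dummy_name("blood_type_A"))  # ('blood_type', 'A')

# A plain split yields three parts here, so unpacking into two names fails.
try:
    name, value = "blood_type_A".split("_")
except ValueError:
    print("plain split cannot unpack 'blood_type_A' into two parts")
```

This still assumes the category value contains no separator; a category such as `very_high` would need a different separator at encoding time.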

Answer 3

Score: 1

You can use `pl.when()`, `pl.col()` and `pl.lit()` to convert a polars DataFrame with dummy variables back to the original format. This maps each column's dummies back to their original values, with the column and category names written out by hand.

    import polars as pl

    # Create dummy variable DataFrame
    df = pl.DataFrame({
        "col1_hi": [0,0,0,1,1],
        "col1_med": [0,0,1,0,0],
        "col1_lo": [1,1,0,0,0],
        "col2_yes": [1,1,0,1,0],
        "col2_no": [0,0,1,0,1],
    })

    # Map dummies back to original values.
    # Literals are wrapped in pl.lit because recent polars versions read a
    # bare string in .then()/.otherwise() as a column name.
    df = df.select([
        pl.when(pl.col("col1_hi") == 1).then(pl.lit("hi"))
        .when(pl.col("col1_med") == 1).then(pl.lit("med"))
        .otherwise(pl.lit("lo")).alias("col1"),
        pl.when(pl.col("col2_yes") == 1).then(pl.lit("yes"))
        .otherwise(pl.lit("no")).alias("col2"),
    ])

    # Display original DataFrame
    print(df)

Output:

    shape: (5, 2)
    ┌──────┬──────┐
    │ col1 ┆ col2 │
    │ ---  ┆ ---  │
    │ str  ┆ str  │
    ╞══════╪══════╡
    │ lo   ┆ yes  │
    │ lo   ┆ yes  │
    │ med  ┆ no   │
    │ hi   ┆ yes  │
    │ hi   ┆ no   │
    └──────┴──────┘

Answer 4

Score: 1

You can do a melt/split/filter/pivot like this:

    # Method names below follow polars 1.x; older releases spelled them
    # with_row_count, melt, .arr, and pivot(values, index, columns).
    (
        df
        .with_row_index("i")
        .unpivot(index="i")
        .with_columns(pl.col("variable").str.split("_"))
        .with_columns(
            col=pl.col("variable").list.first(),
            val=pl.col("variable").list.last(),
        )
        .filter(pl.col("value") == 1)
        .pivot(on="col", index="i", values="val")
        .sort("i").drop("i")
    )

huangapple
  • Published on April 6, 2023, 18:57:48
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75948718.html