How to fill a column with random values in Polars

Question

I would like to know how to fill a column of a polars dataframe with random values.
The idea is that I have a dataframe with a given number of columns, and I want to add a column to this dataframe which is filled with different random values (obtained from a random.random() function for example).

This is what I have tried so far:

df = df.with_columns(pl.when((pl.col('Q') > 0)).then(random.random()).otherwise(pl.lit(1)).alias('Prob'))

With this method, I get a column filled with a single random value, i.e. all the rows have the same value.

Is there a way to fill the column with different random values?

Thanks in advance.

Answer 1

Score: 3

You need a "column" of random numbers the same height as your dataframe.

np.random.rand is useful for this:

>>> df.with_columns(random = pl.lit(np.random.rand(df.height)))
shape: (3, 2)
┌─────┬──────────┐
│ foo ┆ random   │
│ --- ┆ ---      │
│ i64 ┆ f64      │
╞═════╪══════════╡
│ 1   ┆ 0.51566  │
│ 2   ┆ 0.009299 │
│ 3   ┆ 0.519169 │
└─────┴──────────┘
>>> df.with_columns(random = pl.when(pl.col("foo") > 2).then(pl.lit(np.random.rand(df.height))))
shape: (3, 2)
┌─────┬──────────┐
│ foo ┆ random   │
│ --- ┆ ---      │
│ i64 ┆ f64      │
╞═════╪══════════╡
│ 1   ┆ null     │
│ 2   ┆ null     │
│ 3   ┆ 0.926295 │
└─────┴──────────┘

Answer 2

Score: 2

First get the number of rows of your dataframe:

row_n = df.height

Then create a random list of that size using random:

# choices draws with replacement, so it works for any row_n
to_add = random.choices(range(0, 10), k=row_n)

And finally add it to your dataframe:

df.with_columns(pl.Series(name="new_col", values=to_add))
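
One thing to watch with random.sample: it draws without replacement, so it raises ValueError once the requested count exceeds the population size. Below is a small sketch of that, assuming a hypothetical row count of 25; random.choices is shown as one alternative that draws with replacement.

```python
import random

population = range(0, 10)
row_n = 25  # hypothetical row count, larger than the 10-element population

# sample() cannot return 25 distinct items out of 10, so it raises ValueError.
try:
    random.sample(population, row_n)
    sample_failed = False
except ValueError:
    sample_failed = True

# choices() draws with replacement, so any length is fine (repeats are expected).
to_add = random.choices(population, k=row_n)

print(sample_failed, len(to_add))
```

If distinct values are required, the population has to be at least as large as the number of rows.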

Answer 3

Score: 2

Create a sample polars dataframe

df = pl.DataFrame({
    'Q': [1, -1, -3, 4],
})

One liner vectorized calculation

df = df.with_columns(
    pl.when(pl.col('Q') > 0)
    .then(pl.lit(np.random.uniform(0, 1, len(df))))
    .otherwise(1)
    .alias('Prob')
)

Result

Q	Prob
1	0.922802
-1	1.0
-3	1.0
4	0.182397

Answer 4

Score: 1

First, if you're still using with_column rather than with_columns, you're on a relatively old version of Polars, so I'd recommend upgrading: there have been new features and performance improvements. There have also been breaking changes, such as the removal of with_column, which was redundant because it was only ever a more limited version of with_columns.

Setting that aside, and to your issue, the reason it isn't working is that when you run

df.with_columns(pl.when((pl.col('Q') > 0)).then(random.random()).otherwise(pl.lit(1)).alias('Prob'))

Python calls random.random() only once, and since it returns a single value, Polars broadcasts (i.e. copies) it to all the rows. What you need to do is tell Python to run it as many times as you actually "need". I put "need" in quotes because Polars will complain if you give it fewer values than the full height of the df, even though you only need as many random values as there are rows with Q > 0.
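
The evaluate-once behaviour can be seen without Polars at all. In this rough sketch, `draw` and `broadcast` are hypothetical helpers (not Polars functions): `broadcast` simply repeats whatever single value it is handed, which is roughly what happens to a scalar passed to then().

```python
import random

calls = 0  # counts how many times draw() actually runs


def draw():
    """Hypothetical wrapper around random.random() that records each call."""
    global calls
    calls += 1
    return random.random()


def broadcast(value, n):
    """Hypothetical stand-in for broadcasting: repeat one value n times."""
    return [value] * n


# Python evaluates draw() first, then passes the resulting float to broadcast().
column = broadcast(draw(), 4)

print(calls, column)
```

The argument is evaluated before the function ever sees it, so there is only one draw to copy.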

The easiest way to do that is with a list comprehension that uses the height of the df:

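df.with_columns(
    pl.when(pl.col('Q') > 0)
        # wrap the list in a Series so each row gets its own value;
        # newer Polars versions may treat a bare list as a single list literal
        .then(pl.lit(pl.Series([random.random() for _ in range(df.height)])))
        .otherwise(pl.lit(1))
        .alias('Prob'))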

Using a list comprehension over random.random() isn't as efficient as having numpy create an array of random numbers, since numpy uses optimized C code whereas a list comprehension is just a Python loop. My aim was to answer the overall question of why it wasn't working rather than to prescribe the quickest method of random number generation.

Answer 5

Score: 0

You could do:

df.with_columns(
    pl.Series(
        # 1.0 rather than 1 so every element is a float
        [random.random() if q > 0 else 1.0 for q in df["Q"]]
    ).alias("Prob")
)

huangapple
  • Posted on April 19, 2023, 16:13:04
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76052153.html