April 19, 2023, 16:13:04

How to fill a column with random values in polars

Question

I would like to know how to fill a column of a polars dataframe with random values.
The idea is that I have a dataframe with a given number of columns, and I want to add a column to this dataframe which is filled with different random values (obtained from a random.random() function for example).

This is what I have tried so far:

df = df.with_columns(pl.when((pl.col('Q') > 0)).then(random.random()).otherwise(pl.lit(1)).alias('Prob'))

With this method, the result that I obtain is a column filled with one random value i.e. all the rows have the same value.

Is there a way to fill the column with different random values?

Thanks in advance.

Answer 1

Score: 3

You need a "column" of random numbers the same height as your dataframe.

np.random.rand is useful for this:

>>> df.with_columns(random = pl.lit(np.random.rand(df.height)))
shape: (3, 2)
┌─────┬──────────┐
│ foo ┆ random   │
│ --- ┆ ---      │
│ i64 ┆ f64      │
╞═════╪══════════╡
│ 1   ┆ 0.51566  │
│ 2   ┆ 0.009299 │
│ 3   ┆ 0.519169 │
└─────┴──────────┘

>>> df.with_columns(random = pl.when(pl.col("foo") > 2).then(pl.lit(np.random.rand(df.height))))
shape: (3, 2)
┌─────┬──────────┐
│ foo ┆ random   │
│ --- ┆ ---      │
│ i64 ┆ f64      │
╞═════╪══════════╡
│ 1   ┆ null     │
│ 2   ┆ null     │
│ 3   ┆ 0.926295 │
└─────┴──────────┘

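Answer 2

Score: 2

First get the number of rows of your dataframe (for a LazyFrame you would have to collect it first):

row_n = df.height

Then create a list of random values of that size using random. random.choices draws with replacement, so it still works when the dataframe has more than 10 rows, whereas random.sample would raise a ValueError in that case:

to_add = random.choices(range(0, 10), k=row_n)

And finally add it to your dataframe:

df.with_columns(pl.Series(name="new_col", values=to_add))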
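
One caveat when building the list of values: random.sample draws without replacement, so it cannot return more items than the population holds. The sketch below uses only the standard library to show the difference; the row count of 12 is a made-up value chosen to exceed the 10 candidates.

```python
import random

row_n = 12  # hypothetical row count, larger than the 10 candidate values

# random.sample draws without replacement, so asking for more items
# than the population contains raises ValueError.
try:
    random.sample(range(0, 10), row_n)
    sample_failed = False
except ValueError:
    sample_failed = True

# random.choices draws with replacement and always returns k items.
to_add = random.choices(range(0, 10), k=row_n)

print(sample_failed)  # True
print(len(to_add))    # 12
```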

Answer 3

Score: 2

Create a sample polars dataframe

import numpy as np
import polars as pl

df = pl.DataFrame({
    'Q': [1, -1, -3, 4],
})

One liner vectorized calculation

df = df.with_columns(
    pl.when(pl.col('Q') > 0)
    .then(pl.lit(np.random.uniform(0, 1, len(df))))
    .otherwise(1)
    .alias('Prob')
)

Result

Q	Prob
1	0.922802
-1	1.0
-3	1.0
4	0.182397

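Answer 4

Score: 1

Firstly, if you're still using with_column rather than with_columns, you're on a relatively old version of polars, so I'd recommend upgrading: there have been new features and performance enhancements. There are also breaking changes, such as the removal of with_column, which was redundant because it was only ever a more limited version of with_columns.

Setting that aside, and to your issue, the reason it isn't working is that when you run

df.with_columns(pl.when((pl.col('Q') > 0)).then(random.random()).otherwise(pl.lit(1)).alias('Prob'))

Python calls random.random() only once and, since it returns a single value, polars broadcasts (i.e. copies) it to all the rows. What you need to do is tell Python to run it as many times as you actually "need". I put "need" in quotes because polars will complain if you give it fewer values than the full height of the df, even though you only need as many random values as there are rows with Q > 0.

The easiest way to do that is with a list comprehension that plugs in the height of the df: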

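df.with_columns(
    pl.when((pl.col('Q') > 0))
        # pl.Series gives one value per row; newer polars versions may treat pl.lit(list) as a single list value
        .then(pl.Series([random.random() for _ in range(df.height)]))
        .otherwise(pl.lit(1))
        .alias('Prob'))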

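Using a list comprehension with random.random() isn't as efficient as having numpy create an array of random numbers, since numpy uses optimized C code whereas a list comprehension is just a Python loop. I was aiming to answer the overall question of why it wasn't working rather than to prescribe the quickest method of random number generation.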

Answer 5

Score: 0

You could do:

df.with_columns(
    pl.Series(
        [random.random() if q > 0 else 1 for q in df["Q"]]
    ).alias("Prob")
)
