Question

I have generated a large simulated population as a Polars DataFrame built from NumPy arrays. I want to draw random samples from this population DataFrame many times. However, when I do, every sample is exactly the same. I know there must be an easy fix; any recommendations? I suspect the repeat function is the cause. Does anyone have ideas for how I can simulate multiple independent random samples?

Here's my code:

import numpy as np
import polars as pl
from itertools import repeat

N = 1000000 # population size
samples = 1000 # number of samples
num_obs = 100 # size of each sample

# Generate population data
a = np.random.gamma(2, 2, N)
b = np.random.binomial(1, 0.6, N)
x = 0.2 * a + 0.5 * b + np.random.normal(0, 10, N)
z = 0.9 * a * b + np.random.normal(0, 10, N)
y = 0.6 * x + 0.9 * z + np.random.normal(0, 10, N)
# Store this in a population dataframe
pop_data_frame = pl.DataFrame({
    'A': a,
    'B': b,
    'X': x,
    'Z': z,
    'Y': y,
    'id': range(1, N+1)
})

# Get 1000 samples from this pop_data_frame...
#... with 100 observations each sample.
sample_list = list(
    repeat(
        pop_data_frame.sample(n=num_obs), samples
    )
)

Answer 1

Score: 2

With repeat(), .sample() is called only once, and that single result is then repeated 1000 times.

You want to call .sample() 1000 times:

sample_list = [ pop_data_frame.sample(num_obs) for _ in range(samples) ]

Or, you could use Polars' lazy API to build a list of LazyFrames and pass it to pl.collect_all(), which should be faster because Polars can run the queries in parallel:

sample_list = pl.collect_all(
   [
      pop_data_frame.lazy().select(
         row = pl.struct(pl.all()).sample(num_obs)
      ).unnest("row")
      for _ in range(samples)
   ]
)
