Bootstrapping multiple random samples with Polars in Python

Question

I have generated a large simulated population as a Polars DataFrame built from NumPy arrays, and I want to draw many random samples from it. However, when I do that, every sample is exactly the same. I suspect the repeat function is the cause. Is there a simple fix, or some other way to draw multiple independent random samples?

Here's my code:

from itertools import repeat

import numpy as np
import polars as pl

N = 1000000 # population size
samples = 1000 # number of samples
num_obs = 100 # size of each sample

# Generate population data
a = np.random.gamma(2, 2, N)
b = np.random.binomial(1, 0.6, N)
x = 0.2 * a + 0.5 * b + np.random.normal(0, 10, N)
z = 0.9 * a * b + np.random.normal(0, 10, N)
y = 0.6 * x + 0.9 * z + np.random.normal(0, 10, N)
# Store this in a population dataframe
pop_data_frame = pl.DataFrame({
    'A': a,
    'B': b,
    'X': x,
    'Z': z,
    'Y': y,
    'id': range(1, N+1)
})

# Get 1000 samples from this pop_data_frame...
#... with 100 observations each sample.
sample_list = list(
    repeat(
        pop_data_frame.sample(n=num_obs), samples
    )
)

Answer 1

Score: 2

With repeat(), .sample() is called only once, and that single result is then repeated 1000 times.
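To see why this happens, here is a minimal sketch that uses only the standard library. It swaps in random.sample on a plain list as a stand-in for the DataFrame's .sample(), so the names population, repeated and fresh are illustrative and not part of the question's code. Python evaluates the argument before repeat() ever sees it, so repeat() hands back the same object each time.

```python
import random
from itertools import repeat

# Hypothetical stand-in population (a plain list instead of a DataFrame)
population = list(range(1_000))

# random.sample(...) runs once here; repeat() only re-yields its result
repeated = list(repeat(random.sample(population, 5), 3))
same_object = all(s is repeated[0] for s in repeated)

# A list comprehension re-evaluates the expression on every iteration,
# so each entry is a separately drawn sample
fresh = [random.sample(population, 5) for _ in range(3)]
distinct_objects = len({id(s) for s in fresh}) == 3

print(same_object, distinct_objects)
```

The same reasoning should carry over to any expression passed to repeat(), including a DataFrame sample.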

Instead, call .sample() 1000 times:

sample_list = [pop_data_frame.sample(n=num_obs) for _ in range(samples)]

Alternatively, you could use the Polars lazy API to build a list of LazyFrames and pass it to pl.collect_all(), which should be faster because Polars can run the queries in parallel:

sample_list = pl.collect_all(
    [
        # Pack each row into a struct, sample the structs, then unpack them
        pop_data_frame.lazy().select(
            row=pl.struct(pl.all()).sample(n=num_obs)
        ).unnest("row")
        for _ in range(samples)
    ]
)

Posted by huangapple on May 29, 2023, 23:30:14
When reposting, please keep the link to this article: https://go.coder-hub.com/76358564.html