Bootstrapping multiple random samples with Polars in Python
Question
I have generated a large simulated population as a Polars DataFrame built from NumPy arrays. I want to draw random samples from this population DataFrame many times. However, when I do, every sample is exactly the same. I suspect the repeat function is the cause. Is there an easy fix, or a creative way to draw multiple independent (orthogonal) random samples?
Here's my code:
from itertools import repeat

import numpy as np
import polars as pl

N = 1000000  # population size
samples = 1000  # number of samples
num_obs = 100  # size of each sample
# Generate population data
a = np.random.gamma(2, 2, N)
b = np.random.binomial(1, 0.6, N)
x = 0.2 * a + 0.5 * b + np.random.normal(0, 10, N)
z = 0.9 * a * b + np.random.normal(0, 10, N)
y = 0.6 * x + 0.9 * z + np.random.normal(0, 10, N)
# Store this in a population dataframe
pop_data_frame = pl.DataFrame({
    'A': a,
    'B': b,
    'X': x,
    'Z': z,
    'Y': y,
    'id': range(1, N + 1)
})
# Get 1000 samples from this pop_data_frame...
# ... with 100 observations each sample.
sample_list = list(
    repeat(
        pop_data_frame.sample(n=num_obs), samples
    )
)
Answer 1
Score: 2
With repeat(), you're calling .sample() once and repeating that one result 1000 times.
You want to call .sample() 1000 times:
sample_list = [pop_data_frame.sample(num_obs) for _ in range(samples)]
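The difference is easy to see without Polars at all. Below is a minimal sketch using only the standard library; the toy population and the names repeated and independent are made up for illustration and are not from the question. Arguments are evaluated before repeat() is called, so it only ever receives one finished sample.

```python
import random
from itertools import repeat

# Hypothetical toy population standing in for the DataFrame.
population = list(range(1_000))

# random.sample() runs once; repeat() then yields that same list object 3 times.
repeated = list(repeat(random.sample(population, 5), 3))

# The comprehension re-evaluates random.sample() on every iteration,
# so each element is a separate, independently drawn list.
independent = [random.sample(population, 5) for _ in range(3)]

# True: every entry is literally the same object as the first one.
all_same_object = all(s is repeated[0] for s in repeated)

# 3: each draw produced its own list object.
distinct_objects = len({id(s) for s in independent})
```

The same reasoning applies to pop_data_frame.sample(...): it has to sit inside the loop or comprehension to be called more than once.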
Or you could use the Polars lazy API to build a list of LazyFrames and pass it to pl.collect_all(), which should be faster because Polars can run the queries in parallel:
sample_list = pl.collect_all(
    [
        pop_data_frame.lazy().select(
            row = pl.struct(pl.all()).sample(n=num_obs)
        ).unnest("row")
        for _ in range(samples)
    ]
)