How to generate data using default_rng choice so that the generated data is imbalanced?
Question
I'm trying to generate some fake experimental data to play around with. What I'm struggling with is determining the balance of the data for some variables/features.
Most importantly, the 'response' variable would ideally be imbalanced so that I can play around with the treatment effect in the analysis later. But also in general so that the data is not super uniform.
All I can think of is to maybe not generate the data from a normal distribution, but this doesn't give me proper control.
This is my code so far:
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=42)

# Sample size
size = 10000

df = pd.DataFrame({
    'treatment': rng.integers(0, 2, size=size),
    'response': rng.integers(0, 2, size=size)
})
When I run this I get:
df.groupby('treatment')['response'].mean()
treatment
0    0.501696
1    0.513530
But I would like to have control over the difference between treatment 0 and treatment 1, and generally just more control over all the variables.
Edit for solution:
Thanks to @noob and @lezaf for pointing me in the right direction.
The issue is that while you can get rng.choice to bias the probability of the choice, it applies the same bias to everyone and doesn't distinguish between the treatment and control groups.
My workaround is to create two response variables, one with each bias, and then create a new response variable that takes its value from one of the two depending on the treatment assignment.
rng = np.random.default_rng(seed=42)

# Sample size
size = 10000

df = pd.DataFrame({
    'treatment': rng.integers(0, 2, size=size),
    'response1': rng.choice([0, 1], size=size, p=[0.1, 0.9]),  # biased towards 1
    'response2': rng.choice([0, 1], size=size, p=[0.9, 0.1])   # biased towards 0
})
df['response'] = np.where(df['treatment']==1, df['response1'], df['response2'])
The alternative solution would be to create two separate DFs for treatment and control with their own probabilities, then concat.
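A minimal sketch of that concat-based alternative (added for illustration; the even 50/50 group split and the 0.9/0.1 probabilities are assumptions, not from the original post):

rng = np.random.default_rng(seed=42)
size = 10000

# Build each group separately with its own response probability
treated = pd.DataFrame({
    'treatment': 1,
    'response': rng.choice([0, 1], size=size // 2, p=[0.1, 0.9])
})
control = pd.DataFrame({
    'treatment': 0,
    'response': rng.choice([0, 1], size=size // 2, p=[0.9, 0.1])
})

# Stack the two groups and shuffle the rows so they are interleaved
df = pd.concat([treated, control], ignore_index=True)
df = df.sample(frac=1, random_state=42).reset_index(drop=True)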
Answer 1
Score: 1
Try this:

# You can set the probabilities as you like
probs = [0.9, 0.1]
df['response'] = rng.choice([0, 1], size=size, p=probs)
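As a quick sanity check (added for illustration, not part of the original answer): because a single p applies to every row, both treatment groups end up with roughly the same response rate, about 0.1 with probs = [0.9, 0.1].

# Both groups draw from the same distribution, so their means are nearly equal
df['response'].mean()                       # roughly 0.1 overall
df.groupby('treatment')['response'].mean()  # roughly 0.1 in each group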
Answer 2
Score: 1
Firstly, response is a binary attribute since you create it with values from low=0 (inclusive) to high=2 (exclusive). Therefore, it is not possible to get a mean > 1 as you mention in both cases. So, how did you get these values?
Anyway, you can control the bias of the data generated with numpy.random.Generator.choice by changing the parameter p. From the docs:
> p: 1-D array_like, optional
> The probabilities associated with each entry in a. If not given, the sample assumes a uniform distribution over all entries in a.
Simplified example:
In [1]: rng = np.random.default_rng(seed=42)
In [2]: df = pd.DataFrame({'treatment': rng.choice([0,1], size=10000, p=[0.5, 0.5]),
   ...:                    'response': rng.choice([0,1], size=10000, p=[0.3, 0.7])})
In [3]: df.groupby('treatment')['response'].mean()
Out[3]:
treatment
0 0.702729
1 0.705992
Which makes sense, since response takes value 1 with a bias=0.7. Also, the probabilities for the treatment attribute are irrelevant to the mean of response, thus I made it uniform in the example.
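For completeness, here is a sketch of another way to get group-dependent response rates in a single draw (my addition, not part of either answer): numpy.random.Generator.binomial accepts an array of per-row probabilities, so the response probability can depend directly on the treatment assignment. The 0.9/0.1 values below mirror the question's edit.

import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=42)
size = 10000

treatment = rng.integers(0, 2, size=size)
# Each row's probability of response == 1 depends on its treatment assignment
p_response = np.where(treatment == 1, 0.9, 0.1)
response = rng.binomial(n=1, p=p_response)

df = pd.DataFrame({'treatment': treatment, 'response': response})
df.groupby('treatment')['response'].mean()  # close to 0.1 for control, 0.9 for treated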