How to generate data using default_rng choice so that the generated data is imbalanced?
Question
I'm trying to generate some fake experimental data to play around with. What I'm struggling with is determining the balance of the data for some variables/features.
Most importantly, the 'response' variable would ideally be imbalanced so that I can play around with the treatment effect in the analysis later. But also in general so that the data is not super uniform.
All I can think of is to maybe not generate the data from a normal distribution, but this doesn't give me proper control.
This is my code so far:
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=42)

# Sample size
size = 10000

df = pd.DataFrame({
    'treatment': rng.integers(0, 2, size=size),
    'response': rng.integers(0, 2, size=size)
})
When I run this I get:
df.groupby('treatment')['response'].mean()
treatment
0    0.501696
1    0.513530
But I would like to have control over the difference between treatment 0 and treatment 1, and generally just more control over all the variables.
Edit for solution:
Thanks to @noob and @lezaf for pointing me in the right direction.
The issue is that while you can get rng.choice to bias the probability of the choice, it applies the same bias to everyone and doesn't distinguish between the treatment and control groups.
My workaround is to create two response variables, one with each bias, and then create a new response variable that takes its value from one of the two depending on the treatment assignment.
rng = np.random.default_rng(seed=42)

# Sample size
size = 10000

df = pd.DataFrame({
    'treatment': rng.integers(0, 2, size=size),
    'response1': rng.choice([0, 1], size=size, p=[0.1, 0.9]),  # biased towards 1
    'response2': rng.choice([0, 1], size=size, p=[0.9, 0.1])   # biased towards 0
})
df['response'] = np.where(df['treatment']==1, df['response1'], df['response2'])
The alternative solution would be to create two separate DFs for treatment and control with their own probabilities, then concat.
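A minimal sketch of that concat-based alternative (added for illustration; the even 50/50 group split and the 0.9/0.1 probabilities are assumptions, not from the original post):

rng = np.random.default_rng(seed=42)
size = 10000

# Build each group separately with its own response probability
treated = pd.DataFrame({
    'treatment': 1,
    'response': rng.choice([0, 1], size=size // 2, p=[0.1, 0.9])
})
control = pd.DataFrame({
    'treatment': 0,
    'response': rng.choice([0, 1], size=size // 2, p=[0.9, 0.1])
})

# Stack the two groups and shuffle the rows so they are interleaved
df = pd.concat([treated, control], ignore_index=True)
df = df.sample(frac=1, random_state=42).reset_index(drop=True)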
Answer 1
Score: 1
Try this:

# You can set the probabilities as you like
probs = [0.9, 0.1]
df['response'] = rng.choice([0, 1], size=size, p=probs)
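As a quick sanity check (added for illustration, not part of the original answer): because a single p applies to every row, both treatment groups end up with roughly the same response rate, about 0.1 with probs = [0.9, 0.1].

# Both groups draw from the same distribution, so their means are nearly equal
df['response'].mean()                       # roughly 0.1 overall
df.groupby('treatment')['response'].mean()  # roughly 0.1 in each group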
Answer 2
Score: 1
Firstly, response is a binary attribute since you create it with values from low=0 (inclusive) to high=2 (exclusive). Therefore, it is not possible to get a mean > 1 as you mention in both cases. So, how did you get these values?
Anyway, you can control the bias of the data generated with numpy.random.Generator.choice by changing the parameter p. From the docs:
> p: 1-D array_like, optional
> The probabilities associated with each entry in a. If not given, the sample assumes a uniform distribution over all entries in a.
Simplified example:
In [1]: rng = np.random.default_rng(seed=42)
In [2]: df = pd.DataFrame({'treatment': rng.choice([0,1], size=10000, p=[0.5, 0.5]),
   ...:                    'response': rng.choice([0,1], size=10000, p=[0.3, 0.7])})
In [3]: df.groupby('treatment')['response'].mean()
Out[3]:
treatment
0 0.702729
1 0.705992
Which makes sense, since response takes value 1 with a bias=0.7. Also, the probabilities for the treatment attribute are irrelevant to the mean of response, thus I made it uniform in the example.
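For completeness, here is a sketch of another way to get group-dependent response rates in a single draw (my addition, not part of either answer): numpy.random.Generator.binomial accepts an array of per-row probabilities, so the response probability can depend directly on the treatment assignment. The 0.9/0.1 values below mirror the question's edit.

import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=42)
size = 10000

treatment = rng.integers(0, 2, size=size)
# Each row's probability of response == 1 depends on its treatment assignment
p_response = np.where(treatment == 1, 0.9, 0.1)
response = rng.binomial(n=1, p=p_response)

df = pd.DataFrame({'treatment': treatment, 'response': response})
df.groupby('treatment')['response'].mean()  # close to 0.1 for control, 0.9 for treated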