How to generate data using default_rng choice so that the generated data is imbalanced?

Question

I'm trying to generate some fake experimental data to play around with. What I'm struggling with is to determine the balance of the data for some variables/features.

Most importantly, the 'response' variable should ideally be imbalanced so that I can play around with the treatment effect in the analysis later. More generally, I'd like the data not to be perfectly uniform.

All I can think of is to maybe not generate the data from a normal distribution, but that doesn't give me proper control.

This is my code so far:

import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=42)

# Sample size
size = 10000

df = pd.DataFrame({
    'treatment': rng.integers(0, 2, size=size),
    'response': rng.integers(0, 2, size=size),
})

When I run this I get:

df.groupby('treatment')['response'].mean()

0: 0.501696

1: 0.513530

But I would like to control the difference in response between treatment 0 and treatment 1, and more generally to have more control over all the variables.

Edit for solution:
Thanks to @noob and @lezaf for pointing me in the right direction.

The issue is that while rng.choice lets you bias the probabilities, it applies the same bias to every row and does not distinguish between the treatment and control groups.

My workaround is to create two response variables, one per bias, and then build a final response variable that takes its value from one of the two depending on the row's treatment assignment.

rng = np.random.default_rng(seed=42)

# Sample size
size = 10000

df = pd.DataFrame({
    'treatment': rng.integers(0, 2, size=size),
    'response1': rng.choice([0, 1], size=size, p=[0.1, 0.9]),
    'response2': rng.choice([0, 1], size=size, p=[0.9, 0.1]),
})

# Treated rows take response1, control rows take response2
df['response'] = np.where(df['treatment'] == 1, df['response1'], df['response2'])

The alternative solution would be to create two separate DFs for treatment and control with their own probabilities, then concat.
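
For reference, the concat alternative could look roughly like the sketch below. The group sizes (`n_control`, `n_treated`) and the shuffle step are my own assumptions, not part of the original workaround; the probabilities are the same ones used above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=42)

# Assumed group sizes (hypothetical; pick whatever split you need)
n_control, n_treated = 5000, 5000

# Each group gets its own response probabilities
control = pd.DataFrame({
    "treatment": 0,
    "response": rng.choice([0, 1], size=n_control, p=[0.9, 0.1]),
})
treated = pd.DataFrame({
    "treatment": 1,
    "response": rng.choice([0, 1], size=n_treated, p=[0.1, 0.9]),
})

# Stack the two groups, then shuffle rows so the groups are interleaved
df = (
    pd.concat([control, treated], ignore_index=True)
    .sample(frac=1, random_state=42)
    .reset_index(drop=True)
)

group_means = df.groupby("treatment")["response"].mean()
print(group_means)
```

One side effect of this version is that the group sizes are fixed exactly rather than drawn at random, which may or may not be what you want.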

Answer 1

Score: 1

Try this:

# You can set the probabilities as you like
probs = [0.9, 0.1]

df['response'] = rng.choice([0, 1], size=size, p=probs)
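
The same `p` argument should also work for features with more than two levels, which may help with making the other variables less uniform. A minimal sketch, where the `levels` names and their probabilities are made up for illustration:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=42)
size = 10_000

# Hypothetical feature with three unevenly represented levels
levels = ["low", "medium", "high"]
probs = [0.7, 0.2, 0.1]  # one probability per level; must sum to 1

segment = rng.choice(levels, size=size, p=probs)

# Observed share of each level in the sample
shares = pd.Series(segment).value_counts(normalize=True)
print(shares)
```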

Answer 2

Score: 1

Firstly, response is a binary attribute since you create it with values from low=0 (inclusive) to high=2 (exclusive). Therefore, it is not possible to get a mean > 1 as you mention in both cases. So, how did you get these values?

Anyway, you can control the bias of the data generated with numpy.random.Generator.choice by changing the parameter p. From the docs:

> p: 1-D array_like, optional
> The probabilities associated with each entry in a. If not given, the sample assumes a uniform distribution over all entries in a.

Simplified example:

In [1]: rng = np.random.default_rng(seed=42)

In [2]: df = pd.DataFrame({'treatment': rng.choice([0,1], size=10000, p=[0.5, 0.5]),
                           'response': rng.choice([0,1], size=10000, p=[0.3, 0.7])})

In [3]: df.groupby('treatment')['response'].mean()
Out[3]:
treatment
0    0.702729
1    0.705992

This makes sense, since response takes the value 1 with probability 0.7. Also, the probabilities for the treatment attribute are irrelevant to the mean of response, so I made them uniform in the example.
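
Because a single `p` gives both groups the same mean, a treatment effect needs a probability that depends on the row's treatment. One possible way to do that, sketched here with assumed rates (`p_control` and `p_treated` are hypothetical values, not from the question), is to build a per-row probability and compare it against uniform draws:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=42)
size = 10_000

treatment = rng.integers(0, 2, size=size)

# Assumed response rates per group (hypothetical values)
p_control, p_treated = 0.3, 0.7

# Each row gets the success probability of its own group
p_row = np.where(treatment == 1, p_treated, p_control)

# One Bernoulli draw per row: 1 when the uniform draw falls below p_row
response = (rng.random(size) < p_row).astype(int)

df = pd.DataFrame({"treatment": treatment, "response": response})
group_means = df.groupby("treatment")["response"].mean()
print(group_means)
```

The gap between `p_treated` and `p_control` is then the treatment effect you are simulating, so it can be tuned directly.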
