Why is random.choice removing rows that it shouldn't?

Question

I am working on a dataset called 'gen' that has a 'churn?' column with binary values, either 1 or 0. The values are stored as integers.

The breakdown of value_counts() for the 'churn?' column is 1125 rows with a value of 0, and 154 rows with a value of 1.

I would like to randomly remove ~200 rows with a value of '0' so that the dataset is more balanced.

I have used the following code to do this:

# pick 200 random index labels among the rows where 'churn?' is 0
rows_to_remove = np.random.choice(gen[gen['churn?']==0].index, size=200, replace=False)

# remove the selected rows from the original dataset
gen_new = gen.drop(rows_to_remove)

After doing this, I expected to have 925 rows with a value of 0, and 154 rows with a value of 1. However, I am only getting back 130 rows with a value of 1. For some reason, 24 rows are being removed.

So it seems that the correct number of rows with value 0 are being removed, but for some reason, some rows with value 1 are also being removed, which shouldn't be the case.

Can anyone help?

Thanks!!

Answer 1

Score: 1

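My guess is that you have duplicated index labels. drop works on labels, so it removes every row that carries a selected label; if a row with value 1 shares its label with a selected row with value 0, it is dropped as well.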
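A minimal sketch of that effect on a toy frame. The name `toy` and its data are made up for illustration and are not the asker's dataset; the repeated labels mimic what can happen when frames are concatenated without resetting the index.

```python
import pandas as pd

# Hypothetical toy frame: labels 0, 1 and 2 each appear twice.
toy = pd.DataFrame({"churn?": [0, 0, 0, 1, 1, 0]}, index=[0, 1, 2, 0, 1, 2])

# Quick diagnostic: False means at least one label is repeated.
print(toy.index.is_unique)

# Label 0 belongs to one churn?==0 row and one churn?==1 row,
# so dropping that single label removes both rows.
dropped = toy.drop([0])
print(dropped["churn?"].value_counts().to_dict())  # {0: 3, 1: 1}
```

Running `gen.index.is_unique` on the real data should confirm or rule out this explanation.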

Either reset your index (gen = gen.reset_index(drop=True)), then use your approach.
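A sketch of the reset-then-drop route, assuming a stand-in `gen` built here with the class counts from the question and deliberately duplicated labels. The construction of `gen` and the seed are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for `gen`: 1125 zeros, 154 ones, repeated index labels.
labels = np.array([0] * 1125 + [1] * 154)
gen = pd.DataFrame({"churn?": labels}, index=np.arange(len(labels)) % 700)

# After the reset every row has its own label (0..1278).
gen = gen.reset_index(drop=True)

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility
zero_labels = gen[gen["churn?"] == 0].index.to_numpy()
rows_to_remove = rng.choice(zero_labels, size=200, replace=False)

gen_new = gen.drop(rows_to_remove)
print(gen_new["churn?"].value_counts().to_dict())  # {0: 925, 1: 154}
```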

Or, use sample:

m = gen['churn?']==0

out = pd.concat([gen[~m], gen[m].sample(n=925)])
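The sample route can be checked the same way. This is a sketch on a made-up stand-in for `gen` with duplicated labels; `random_state=0` is an added assumption to make the draw reproducible. Because the boolean mask and `sample` select rows rather than labels, the duplicated index does not affect the result.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for `gen`: 1125 zeros, 154 ones, repeated index labels.
labels = np.array([0] * 1125 + [1] * 154)
gen = pd.DataFrame({"churn?": labels}, index=np.arange(len(labels)) % 700)

m = gen["churn?"] == 0

# Keep all rows with value 1 and a random 925 of the rows with value 0.
out = pd.concat([gen[~m], gen[m].sample(n=925, random_state=0)])

counts = out["churn?"].value_counts()
print(counts[0], counts[1])  # 925 154
```

Note that `out` still carries the duplicated labels; `out.reset_index(drop=True)` gives a clean index if later steps rely on labels.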

huangapple
  • Posted on May 14, 2023, 01:34:58
  • When reposting, please keep the link to this post: https://go.coder-hub.com/76244097.html