Why is random.choice removing rows that it shouldn't?
Question
I am working on a dataset called 'gen' that has a 'churn?' column with binary values, either 1 or 0. The values are stored as integers.
The breakdown of value_counts() for the 'churn?' column is 1125 rows with a value of 0, and 154 rows with a value of 1.
I would like to randomly remove ~200 rows with a value of '0' so that the dataset is more balanced.
I have used the following code to do this:
import numpy as np
# randomly pick 200 index labels from the rows where 'churn?' is 0
rows_to_remove = np.random.choice(gen[gen['churn?']==0].index, size=200, replace=False)
# remove the selected rows from the original dataset
gen_new = gen.drop(rows_to_remove)
After doing this, I expected to have 925 rows with a value of 0 and 154 rows with a value of 1. However, I am only getting back 130 rows with a value of 1, so 24 of those rows are being removed as well.
So it seems that the correct number of rows with value 0 is being removed, but some rows with value 1 are also being removed, which shouldn't be the case.
Can anyone help?
Thanks!!
Answer 1
Score: 1
My guess is that you have duplicated index labels. drop removes rows by label, so every row that shares a label with one of the selected rows is removed, including rows where 'churn?' is 1.
Either reset your index (gen = gen.reset_index(drop=True)) and then use your approach, or use sample:
# boolean mask of the rows where 'churn?' is 0
m = gen['churn?']==0
# keep every other row plus a random 925 of the masked rows
out = pd.concat([gen[~m], gen[m].sample(n=925)])
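To make the suspected cause concrete, here is a minimal sketch on hypothetical toy data (the 10/4 row counts, the hand-picked labels, and the seed are assumptions for illustration, not the asker's dataset). It assumes the duplicated labels came from concatenating two frames without resetting the index, which is one common way to end up in this situation.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 10 rows with churn? == 0 and 4 rows with churn? == 1.
# Concatenating without ignore_index=True leaves labels 0..3 appearing twice.
zeros = pd.DataFrame({"churn?": [0] * 10})
ones = pd.DataFrame({"churn?": [1] * 4})
gen = pd.concat([zeros, ones])  # index: 0..9 followed by 0..3

print(gen.index.is_unique)  # False

# drop() matches labels, so dropping label 0 removes both rows labelled 0.
# Labels 0, 1 and 2 are chosen by hand to keep the demonstration deterministic.
broken = gen.drop([0, 1, 2])
print(broken["churn?"].value_counts().to_dict())  # {0: 7, 1: 1}

# After resetting the index every label is unique, so the original
# approach removes only the rows that were actually selected.
gen_fixed = gen.reset_index(drop=True)
rng = np.random.default_rng(0)  # seeded only for reproducibility
rows_to_remove = rng.choice(
    gen_fixed[gen_fixed["churn?"] == 0].index, size=3, replace=False
)
gen_new = gen_fixed.drop(rows_to_remove)
counts = gen_new["churn?"].value_counts().to_dict()
print(counts)  # {0: 7, 1: 4}
```

Checking gen.index.is_unique on the real dataset is a quick way to confirm or rule out this explanation before changing anything else.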