May 14, 2023, 01:34:58

Why is random.choice removing rows that it shouldn't?

Question

I am working on a dataset called 'gen' that has a 'churn?' column with binary values, either 1 or 0, stored as integers.

The breakdown of value_counts() for the 'churn?' column is 1125 rows with a value of 0, and 154 rows with a value of 1.

I would like to randomly remove ~200 rows with a value of '0' so that the dataset is more balanced.

I have used the following code to do this:

import numpy as np

# pick 200 index labels at random from the rows where 'churn?' is 0
rows_to_remove = np.random.choice(gen[gen['churn?']==0].index, size=200, replace=False)

# remove the selected rows from the original dataset (drop matches by index label)
gen_new = gen.drop(rows_to_remove)

After doing this, I expected to have 925 rows with a value of 0, and 154 rows with a value of 1. However, I am only getting back 130 rows with a value of 1. For some reason, 24 rows with a value of 1 are being removed as well.

So it seems that the correct number of rows with value 0 are being removed, but for some reason, some rows with value 1 are also being removed, which shouldn't be the case.

Can anyone help?

Thanks!!

Answer 1

Score: 1

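My guess is that your index contains duplicate labels. gen.drop(rows_to_remove) drops by label, so every row that shares a selected label is removed, including rows where 'churn?' is 1.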
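To check whether this is what is happening, here is a minimal sketch on a toy frame. The frames part_a and part_b are made up for illustration; the assumption is that the duplicate labels come from something like pd.concat without ignore_index=True, which the question does not confirm.

```python
import pandas as pd

# Two hypothetical pieces that each carry the default index 0, 1, 2.
part_a = pd.DataFrame({"churn?": [0, 0, 1]})
part_b = pd.DataFrame({"churn?": [1, 0, 1]})

# Concatenating without ignore_index=True keeps both sets of labels,
# so the result has the index 0, 1, 2, 0, 1, 2.
gen = pd.concat([part_a, part_b])

# Diagnostics: is every label unique, and how many labels are repeats?
print(gen.index.is_unique)
print(gen.index.duplicated().sum())

# Label 0 points at a churn?==0 row (from part_a) and a churn?==1 row (from part_b).
# drop() works on labels, so it removes both rows even though only
# the churn?==0 row was meant to go.
gen_new = gen.drop([0])
print(gen_new["churn?"].value_counts())
```

If gen.index.is_unique is False on the real dataset, the extra removed rows are most likely explained by this.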

Either reset the index first (gen = gen.reset_index(drop=True)) and then use your approach.

Or use sample:

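# boolean mask for the rows where 'churn?' is 0
m = gen['churn?'] == 0
# keep every other row, plus a random sample of 925 of the rows where 'churn?' is 0
out = pd.concat([gen[~m], gen[m].sample(n=925)])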
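Both options can be tried side by side on synthetic data. This is only a sketch: the frame below is a stand-in with the class counts from the question (1125 and 154) and an index that is deliberately made to repeat, and the seed and the names gen_fixed and zero_labels are assumptions for the example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in for `gen`: 1125 zeros and 154 ones in shuffled order,
# with index labels that wrap around at 700 so many labels occur twice.
values = np.array([0] * 1125 + [1] * 154)
rng.shuffle(values)
gen = pd.DataFrame({"churn?": values}, index=np.arange(len(values)) % 700)

# Option 1: make the labels unique, then drop 200 randomly chosen labels
# that belong to rows where churn? == 0.
gen_fixed = gen.reset_index(drop=True)
zero_labels = gen_fixed.index[gen_fixed["churn?"] == 0]
rows_to_remove = rng.choice(zero_labels, size=200, replace=False)
gen_new = gen_fixed.drop(rows_to_remove)

# Option 2: sample the rows to keep. sample() picks rows by position,
# so duplicate labels in the original index do not matter here.
m = gen["churn?"] == 0
out = pd.concat([gen[~m], gen[m].sample(n=925, random_state=0)])

print(gen_new["churn?"].value_counts())
print(out["churn?"].value_counts())
```

Both results should contain 925 rows with a value of 0 and all 154 rows with a value of 1. Note that out keeps the original, still duplicated index, so a reset_index(drop=True) afterwards may be worthwhile if later steps rely on labels.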
