June 29, 2023 05:14:01


How to create a smaller dataframe from an existing dataframe with the same numbers per label

Question


I have a dataframe with 50k rows and two columns, item and labels. I want to reduce the number of rows while keeping the same number of rows for every label.
The result should look like this:

Label "notebook": 1000 rows
Label "ballpoint": 1000 rows
Label "pencil": 1000 rows
Label "eraser": 1000 rows
Label "pencil sharpener": 1000 rows

So the 50k rows are reduced to 5000 rows, with the same number of rows for each label.

Answer 1

Score: 1


You need to perform stratified sampling, which simply means splitting your data into groups and then sampling from each group.

The sampling can be proportionate (each group contributes the same fraction of its rows) or disproportionate (each group contributes a fixed number of rows, regardless of its size). Since you want 1000 rows for each label, use disproportionate sampling. The sample code is below:

import pandas as pd

data = {
    "item": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "label": ['A', 'B', 'A', 'C', 'B', 'B', 'A', 'C', 'A', 'B'],
}
df = pd.DataFrame(data)

# Sample two rows from each label. groupby().sample() returns a new
# DataFrame and leaves df unchanged, so the result must be assigned.
sampled = df.groupby("label").sample(n=2).reset_index(drop=True)
print(sampled)

One possible output (the rows are picked at random, so they differ between runs):

   item label
0     3     A
1     7     A
2     6     B
3     5     B
4     4     C
5     8     C
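
Scaled up to the question's numbers, the same call might look like the sketch below. The 50k-row frame here is synthetic stand-in data built only for illustration (the real data is not shown in the question), the column is named "labels" as in the question, and random_state is an optional addition that makes the draw repeatable. It assumes pandas 1.1 or later, where groupby().sample() is available.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the 50k-row frame described in the question.
rng = np.random.default_rng(0)
label_names = ["notebook", "ballpoint", "pencil", "eraser", "pencil sharpener"]
df = pd.DataFrame({
    "item": np.arange(50_000),
    "labels": rng.choice(label_names, size=50_000),
})

# Disproportionate sampling: a fixed 1000 rows per label.
balanced = df.groupby("labels").sample(n=1000, random_state=42)

# Proportionate sampling, for comparison: 10% of each label's rows.
proportional = df.groupby("labels").sample(frac=0.1, random_state=42)

print(balanced["labels"].value_counts())  # every label appears 1000 times
print(len(balanced))                      # 5000
```

Note that sample(n=1000) raises a ValueError if any label has fewer than 1000 rows; in that case either pick a smaller n or pass replace=True to allow rows to be drawn more than once.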
