May 7, 2023 02:26:58

Remove the observations beyond the i-th duplicated observation in pandas

Question

Say I have a dataframe with columns a and b,

and I want to allow at most, say, 100 rows for each duplicated (a, b) pair. For example, if there are 200 rows with a=1 and b=2, I want to keep 100 of them.

I cannot call duplicated on a GroupBy object, so I'm rather lost on how to solve this.

Answer 1

Score: 2

# n: maximum number of rows to keep per (a, b) pair
df.groupby(['a', 'b'], as_index=False).head(n)
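A minimal runnable sketch of the one-liner above may help. The sample data and the value n = 3 are assumptions made for illustration and are not taken from the question.

```python
import pandas as pd

# Hypothetical sample data: the pair (1, 2) occurs four times, (2, 2) twice.
df = pd.DataFrame({
    "a": [1, 1, 1, 1, 2, 2],
    "b": [2, 2, 2, 2, 2, 2],
    "c": [1, 2, 3, 4, 1, 2],
})

n = 3  # assumed cap: maximum number of rows to keep per (a, b) pair

# head(n) on a GroupBy keeps the first n rows of every group.
# The rows stay in their original order and keep their original index labels.
capped = df.groupby(["a", "b"]).head(n)

print(capped)
```

Only the fourth (1, 2) row is dropped here, because every other row is among the first three of its group.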

Answer 2

Score: 1

I believe you can do it this way:

import pandas as pd

max_duplicates = 200  # repeats allowed per pair, in addition to the first occurrence
group_cols = ['a', 'b']

# mark every row whose (a, b) pair has already appeared earlier
duplicates = df.duplicated(subset=group_cols, keep='first')

# group the duplicated rows by pair
groups = df[duplicates].groupby(group_cols)

# combine the allowed number of duplicated rows from each group with the first occurrences
df_clean = pd.concat([groups.head(max_duplicates), df[~duplicates]])
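One detail of this approach is easy to miss: the first occurrence of each pair is kept in addition to the allowed repeats, so a group can end up with max_duplicates + 1 rows. The sketch below illustrates this on hypothetical sample data with an assumed max_duplicates of 2. It also sorts by index afterwards, a small addition to the answer's code, so the rows come back in their original order.

```python
import pandas as pd

# Hypothetical sample data: the pair (1, 2) occurs four times, (2, 2) twice.
df = pd.DataFrame({
    "a": [1, 1, 1, 1, 2, 2],
    "b": [2, 2, 2, 2, 2, 2],
    "c": [1, 2, 3, 4, 1, 2],
})

max_duplicates = 2  # assumed value for the demo
group_cols = ["a", "b"]

# True for every row whose (a, b) pair has already appeared earlier.
duplicates = df.duplicated(subset=group_cols, keep="first")

# Up to max_duplicates repeated rows per pair.
kept_duplicates = df[duplicates].groupby(group_cols).head(max_duplicates)

# First occurrences plus the allowed repeats, restored to the original order.
df_clean = pd.concat([df[~duplicates], kept_duplicates]).sort_index()

# Rows kept per pair: the first occurrence counts on top of max_duplicates.
sizes = df_clean.groupby(group_cols).size()
print(df_clean)
print(sizes)
```

If the goal is a hard cap of N rows per pair, as in the question, passing N - 1 to head in this approach, or using groupby(...).head(N) directly, would give that result.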

Answer 3

Score: 1

One option is to group by a and b, take a cumcount, and then filter. For example, to keep the first 3 rows of each group:

df[df.groupby(['a', 'b']).cumcount() <= 2]
   a  b  c
0  1  2  1
1  1  2  2
2  1  2  3
4  2  2  1
5  2  2  2
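The input frame for this example is not shown, so the sketch below uses hypothetical data chosen to be consistent with the printed output (a fourth (1, 2) row at index 3). It makes the intermediate cumcount values visible and writes the filter as `< n` for an assumed cap n.

```python
import pandas as pd

# Hypothetical input, chosen to be consistent with the output shown above.
df = pd.DataFrame({
    "a": [1, 1, 1, 1, 2, 2],
    "b": [2, 2, 2, 2, 2, 2],
    "c": [1, 2, 3, 4, 1, 2],
})

n = 3  # assumed cap: maximum number of rows to keep per (a, b) pair

# cumcount numbers the rows inside each group 0, 1, 2, ... in order of appearance.
position = df.groupby(["a", "b"]).cumcount()

# Keep a row only if it is among the first n rows of its group.
result = df[position < n]

print(position.tolist())
print(result)
```

Because the filter is an ordinary boolean mask, it can be combined with other conditions using `&` or `|`, which is the main practical difference from calling head on the GroupBy.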

Answer 4

Score: 0

You can achieve this with the groupby and head methods in pandas. Here is a solution that keeps only the first 100 rows for each pair of 'a' and 'b':

import pandas as pd

# Example DataFrame
data = {'a': [1, 1, 2, 2, 1], 'b': [2, 2, 3, 3, 2], 'c': [3, 3, 4, 4, 3]}
df = pd.DataFrame(data)

# Maximum number of rows to keep per ('a', 'b') pair
num_dups_to_keep = 100

# Group by columns 'a' and 'b' and keep only the first num_dups_to_keep rows of each group
result = df.groupby(['a', 'b']).head(num_dups_to_keep)

# Reset the index
result = result.reset_index(drop=True)

print(result)

This snippet groups the DataFrame by the 'a' and 'b' columns and keeps only the first 100 rows of each group. If a pair has fewer than 100 rows, all of them are kept.
