Remove observations beyond the i-th duplicate in pandas

Question

Say I have a dataframe like

a  b  c
1  2  3
1  2  3
.
.

and I want to allow at most, say, 100 rows per (a, b) pair. For example, if there are 200 rows with a=1 and b=2, I want to keep 100 of them.

I cannot use duplicated on a GroupBy object, so I'm rather lost on how to solve this.

Answer 1

Score: 2

n = 100  # number of rows to keep for each (a, b) pair
df.groupby(['a', 'b'], as_index=False).head(n)
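
As a minimal sketch of how head behaves per group, here is a self-contained run. The sample DataFrame and the cap of 2 are made up for illustration and are not from the question.

```python
import pandas as pd

# Hypothetical sample data: the pair (1, 2) appears four times, (2, 2) twice.
df = pd.DataFrame({
    "a": [1, 1, 1, 1, 2, 2],
    "b": [2, 2, 2, 2, 2, 2],
    "c": [1, 2, 3, 4, 1, 2],
})

n = 2  # illustrative cap on rows per (a, b) pair

# head(n) returns the first n rows of every group and keeps
# the original index and row order.
capped = df.groupby(["a", "b"]).head(n)
print(capped)
```

Because the original index is preserved, the rows that were dropped can be identified afterwards with df.index.difference(capped.index).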

Answer 2

Score: 1

I believe you can do it this way:

import pandas as pd

max_duplicates = 200
group_cols = ['a', 'b']

# mark every repeat of an (a, b) pair; the first occurrence is not marked
duplicates = df.duplicated(subset=group_cols, keep='first')

# group only the rows that repeat an earlier pair
groups = df[duplicates].groupby(group_cols)

# combine the first occurrences with up to max_duplicates repeats per group,
# then sort by index (restores the original order for a default RangeIndex)
df_clean = pd.concat([groups.head(max_duplicates), df[~duplicates]]).sort_index()
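
With this approach it is worth checking how many rows per pair actually survive. The sketch below uses made-up data and a small max_duplicates. Since duplicated(keep='first') does not flag the first occurrence of a pair, each pair can end up with max_duplicates + 1 rows in total.

```python
import pandas as pd

# Hypothetical data: five rows for the pair (1, 2), one row for (3, 4).
df = pd.DataFrame({
    "a": [1, 1, 1, 1, 1, 3],
    "b": [2, 2, 2, 2, 2, 4],
    "c": [0, 1, 2, 3, 4, 5],
})

max_duplicates = 2  # illustrative value
group_cols = ["a", "b"]

# True for every repeat of a pair, False for its first occurrence.
duplicates = df.duplicated(subset=group_cols, keep="first")

kept_repeats = df[duplicates].groupby(group_cols).head(max_duplicates)
df_clean = pd.concat([kept_repeats, df[~duplicates]]).sort_index()

# Rows kept per pair: the first occurrence plus up to max_duplicates repeats.
counts = df_clean.groupby(group_cols).size()
print(counts)
```

If the goal is a hard cap on the total number of rows per pair, applying head directly to the grouped DataFrame (without the duplicated step) avoids the extra row.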

Answer 3

Score: 1

One option is to group by a and b, compute cumcount, and then filter. Example:

df
   a  b  c
0  1  2  1
1  1  2  2
2  1  2  3
3  1  2  4
4  2  2  1
5  2  2  2

To keep the first 3 rows of each group:

df[df.groupby(['a', 'b']).cumcount() <= 2]
   a  b  c
0  1  2  1
1  1  2  2
2  1  2  3
4  2  2  1
5  2  2  2
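
The same filter can be wrapped in a small helper. This is only a sketch: the function name cap_rows_per_group and its parameters are hypothetical, and it assumes "first" means first in the current row order.

```python
import pandas as pd

def cap_rows_per_group(df, keys, limit):
    """Keep at most `limit` rows for each combination of values in `keys`."""
    # cumcount() numbers the rows inside each group 0, 1, 2, ...
    position = df.groupby(keys).cumcount()
    # Rows whose position is below the limit are kept; index and order are unchanged.
    return df[position < limit]

df = pd.DataFrame({
    "a": [1, 1, 1, 1, 2, 2],
    "b": [2, 2, 2, 2, 2, 2],
    "c": [1, 2, 3, 4, 1, 2],
})

result = cap_rows_per_group(df, ["a", "b"], limit=3)
print(result)
```

Using a boolean mask rather than head makes it easy to invert the condition, for example df[position >= limit] to inspect the rows that would be dropped.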

Answer 4

Score: 0

You can achieve this by combining the groupby and head methods in pandas. Here's a solution that keeps only the first 100 rows for each pair of 'a' and 'b':

import pandas as pd

# Your example DataFrame
data = {'a': [1, 1, 2, 2, 1], 'b': [2, 2, 3, 3, 2], 'c': [3, 3, 4, 4, 3]}
df = pd.DataFrame(data)

# Set the number of duplicates you want to keep
num_dups_to_keep = 100

# Group the DataFrame by columns 'a' and 'b', and keep only the first 'num_dups_to_keep' rows for each group
result = df.groupby(['a', 'b']).head(num_dups_to_keep)

# Reset the index
result = result.reset_index(drop=True)

print(result)

This code snippet groups the DataFrame by the 'a' and 'b' columns and keeps only the first 100 rows of each group. If a pair has fewer than 100 rows, all of them are kept.
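
A quick way to confirm that the cap was applied is to count the rows per pair afterwards. The sketch below is illustrative only: the data is synthetic (200 copies of one pair and 50 of another), not the asker's.

```python
import pandas as pd

# Synthetic data: 200 rows with (a, b) = (1, 2) and 50 rows with (a, b) = (5, 6).
df = pd.DataFrame({
    "a": [1] * 200 + [5] * 50,
    "b": [2] * 200 + [6] * 50,
    "c": list(range(250)),
})

num_dups_to_keep = 100
result = df.groupby(["a", "b"]).head(num_dups_to_keep).reset_index(drop=True)

# Number of rows left for each (a, b) pair.
sizes = result.groupby(["a", "b"]).size()
print(sizes)
```

The pair with 200 rows is trimmed to 100, while the pair with only 50 rows is left untouched.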

huangapple
  • Published on May 7, 2023, 02:26:58
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76190481.html