Remove observations beyond the i-th duplicate in pandas
Question
Say I have a dataframe like
a b c
1 2 3
1 2 3
.
.
and I want to allow at most, say, 100 rows for each duplicated (a, b) pair. For example, if there are 200 rows with a=1 and b=2, I want to keep 100 of them.
I cannot use duplicated on a GroupBy object, so I'm rather lost on how to solve this.
Answer 1
Score: 2
# n: maximum number of rows to keep for each (a, b) pair
df.groupby(['a', 'b'], as_index=False).head(n)
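To make the one-liner above concrete, here is a minimal sketch on a small made-up frame. The sample data and the name `capped` are assumptions for illustration, and `n` is set to 2 rather than 100 so the effect is visible.

```python
import pandas as pd

# Hypothetical sample data: the pair (1, 2) occurs four times, (2, 2) twice.
df = pd.DataFrame({
    "a": [1, 1, 2, 1, 1, 2],
    "b": [2, 2, 2, 2, 2, 2],
    "c": [10, 20, 30, 40, 50, 60],
})

n = 2  # maximum number of rows to keep per (a, b) pair

# GroupBy.head(n) returns the first n rows of each group,
# keeping the original row order and the original index labels.
capped = df.groupby(["a", "b"]).head(n)
print(capped)
```

Groups with fewer than `n` rows are returned in full, so no separate handling of non-duplicated pairs should be needed.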
Answer 2
Score: 1
I believe you can do it this way:
import pandas as pd
max_rows = 100  # total rows to keep for each (a, b) pair
group_cols = ['a', 'b']
# True for every row except the first occurrence of each (a, b) pair
duplicates = df.duplicated(subset=group_cols, keep='first')
# group only the rows flagged as duplicates
groups = df[duplicates].groupby(group_cols)
# first occurrences plus at most max_rows - 1 duplicates per pair, sorted back into index order
df_clean = pd.concat([df[~duplicates], groups.head(max_rows - 1)]).sort_index()
Answer 3
Score: 1
One option is to group by a and b, do a cumcount, and then filter. Example:
df
   a  b  c
0  1  2  1
1  1  2  2
2  1  2  3
3  1  2  4
4  2  2  1
5  2  2  2
To keep the first 3 rows of each group:
df[df.groupby(['a', 'b']).cumcount() <= 2]
   a  b  c
0  1  2  1
1  1  2  2
2  1  2  3
4  2  2  1
5  2  2  2
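The cumcount mask is also easy to adapt. As a sketch of one variation, not something the answer itself proposes, passing `ascending=False` numbers each group's rows from the end, so the same filter keeps the last rows instead of the first. The names `from_end` and `last_n` are illustrative.

```python
import pandas as pd

# Same example frame as in the answer above.
df = pd.DataFrame({
    "a": [1, 1, 1, 1, 2, 2],
    "b": [2, 2, 2, 2, 2, 2],
    "c": [1, 2, 3, 4, 1, 2],
})

n = 3  # rows to keep per (a, b) pair

# With ascending=False the last row of each group gets 0,
# the second-to-last gets 1, and so on.
from_end = df.groupby(["a", "b"]).cumcount(ascending=False)

# Keep only the last n rows of every group.
last_n = df[from_end < n]
print(last_n)
```

Because the result is an ordinary boolean mask, it can be combined with other conditions using `&` before indexing.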
Answer 4
Score: 0
You can achieve this by combining the groupby and head methods in pandas. Here's a solution that keeps only the first 100 rows for each pair of 'a' and 'b':
import pandas as pd
# Your example DataFrame
data = {'a': [1, 1, 2, 2, 1], 'b': [2, 2, 3, 3, 2], 'c': [3, 3, 4, 4, 3]}
df = pd.DataFrame(data)
# Set the number of duplicates you want to keep
num_dups_to_keep = 100
# Group the DataFrame by columns 'a' and 'b', and keep only the first 'num_dups_to_keep' rows of each group
result = df.groupby(['a', 'b']).head(num_dups_to_keep)
# Reset the index
result = result.reset_index(drop=True)
print(result)
This snippet groups the DataFrame by the 'a' and 'b' columns and keeps only the first 100 rows of each group. If a pair occurs fewer than 100 times, all of its rows are kept.
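One way to check that the cap was applied is to count the rows per pair afterwards. The sketch below uses a made-up frame and a small `limit` so that trimming actually happens. The data and the names `limit` and `sizes` are assumptions.

```python
import pandas as pd

# Hypothetical data: (1, 2) occurs five times, (2, 2) twice.
df = pd.DataFrame({
    "a": [1, 1, 1, 1, 1, 2, 2],
    "b": [2, 2, 2, 2, 2, 2, 2],
    "c": list(range(7)),
})

limit = 3  # maximum rows allowed per (a, b) pair

result = df.groupby(["a", "b"]).head(limit).reset_index(drop=True)

# Count the remaining rows for each pair; none should exceed the limit.
sizes = result.groupby(["a", "b"]).size()
print(sizes)
print((sizes <= limit).all())
```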