Subsample after GroupBy
Question
I have a dataframe that looks like this:
df =
id tweet
1 a
1 b
1 b
1 a
2 d
2 a
2 a
3 b
3 b
4 a
4 b
Now I want to group by id and count each user's tweets:
df.groupby(["id"]).count()
Which gives me:
df =
id count
1 4
2 3
3 2
4 2
However, I'd like to subsample the data so that users with fewer than n tweets stay in the dataframe unchanged, while users with more than n samples (tweets) have their tweets randomly subsampled. How do I do this? I have tried the following, but they only return n samples for the entire row...
n=3
print(data.groupby(["user_id"]).apply(lambda x: x.sample(min(n,len(x)), replace=False)).reset_index(drop=True))
print(data.groupby('user_id').sample(n, random_state=1))
Answer 1
Score: 1
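Shuffle the rows, then take the first N rows of each group with groupby().head() (N is the per-user cap, called n in the question):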
df.sample(frac=1).groupby('id').head(N)
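For illustration, here is a minimal sketch of the shuffle-then-head approach applied to the sample data from the question. The helper name cap_per_group, its seed parameter, and the final sort_index() call are assumptions added for this example and are not part of the original answer.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "id":    [1, 1, 1, 1, 2, 2, 2, 3, 3, 4, 4],
    "tweet": ["a", "b", "b", "a", "d", "a", "a", "b", "b", "a", "b"],
})


def cap_per_group(frame, key, n, seed=None):
    """Keep at most n randomly chosen rows per group (hypothetical helper)."""
    # Shuffle every row; random_state makes the shuffle reproducible.
    shuffled = frame.sample(frac=1, random_state=seed)
    # head(n) keeps the first n rows of each group in shuffled order,
    # so groups with n or fewer rows are kept in full.
    capped = shuffled.groupby(key).head(n)
    # Optional: restore the original row order.
    return capped.sort_index()


result = cap_per_group(df, "id", n=3, seed=0)
counts = result.groupby("id").size()
print(counts)
```

With n=3, user 1 is reduced from 4 tweets to 3 at random, while users 2, 3 and 4 keep all of their rows. Unlike groupby(...).sample(n), this should not fail when a group has fewer than n rows, because head(n) simply returns whatever the group contains.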