Subsample after GroupBy

Question

I have a dataframe that looks like this:

df =

id   tweet
1    a
1    b
1    b
1    a
2    d
2    a
2    a
3    b
3    b
4    a
4    b

Now I want to group by id and count each user's tweets:

df.groupby(["id"]).count()

Which gives:

df =

id   count
1    4
2    3
3    2
4    2

However, I'd like to subsample the data so that users with fewer than n tweets are kept in the dataframe as they are, and users with more than n tweets have their tweets randomly subsampled down to n. How do I do this? I have tried the following, but they only return n samples for the entire row...

n=3
print(data.groupby(["user_id"]).apply(lambda x: x.sample(min(n,len(x)), replace=False)).reset_index(drop=True))
print(data.groupby('user_id').sample(n, random_state=1))

Answer 1

Score: 1

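Shuffle the rows, then keep the first N rows of each group with groupby().head() (N is the per-user cap, called n in the question):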

df.sample(frac=1).groupby('id').head(N)
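
For reference, here is a minimal runnable sketch of that one-liner on the sample data from the question. The random_state argument and the variable names subsampled and counts are my own additions for illustration and are not part of the original answer; drop random_state if you want a different draw on every run.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "id":    [1, 1, 1, 1, 2, 2, 2, 3, 3, 4, 4],
    "tweet": ["a", "b", "b", "a", "d", "a", "a", "b", "b", "a", "b"],
})

N = 3  # per-user cap (the question calls this n)

# Shuffle all rows, then keep the first N rows of each id.
# random_state is an illustrative choice that makes the result repeatable.
subsampled = df.sample(frac=1, random_state=0).groupby("id").head(N)

# Tweets per user after subsampling: never more than N.
counts = subsampled.groupby("id").size()
print(counts)
```

Users with N or fewer tweets keep all of their rows, because head(N) simply returns everything a smaller group has. Users with more tweets keep N rows picked by the shuffle. The result keeps the shuffled row order and the original index labels, so add .sort_index() at the end if you want the original ordering back.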

huangapple
  • Posted on May 11, 2023, 03:48:00
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76222106.html