Random sampling of data based on other columns using Python

Question

I have a dataframe with 100,000 rows that contains the columns Country, State, bill_ID, item_id, dates, etc.
I want to draw a random sample of 5,000 rows out of the 100,000, and the sample should contain at least one bill_ID from every country and state.
In short, it should cover all countries and states with at least one bill_ID each.

Note: a bill_ID contains multiple item_id values.

I am running tests on the sampled data, so it should cover all unique countries and states along with their bill_IDs.

Answer 1

Score: 1

You could use pandas' .sample method. With df as your dataframe, try:

import pandas as pd

sample_size = 5_000
# One random row per Country-State group, so every combination is covered
df_sample_1 = df.groupby(["Country", "State"]).sample(1)
# Number of rows still needed to reach sample_size
sample_size_2 = max(sample_size - df_sample_1.shape[0], 0)
# Draw the rest from the rows that are not already in the first sample
df_sample_2 = df.loc[df.index.difference(df_sample_1.index)].sample(sample_size_2)
df_sample = pd.concat([df_sample_1, df_sample_2]).sort_index()

First, group by the columns Country and State and draw a sample of size 1 from each group. This gives you a sample, df_sample_1, in which each Country-State combination appears exactly once. Then draw the remaining rows, df_sample_2, from the part of the dataframe that excludes the first sample. Finally, concatenate the two samples (and sort the result if needed). Note that the second step excludes rows by index label, so it relies on the index of df being unique.
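As a rough illustration of the steps above, here is a minimal sketch that wraps them in a function and checks coverage on a small made-up dataframe. The function name sample_with_coverage, its keys and seed parameters, and the toy data are assumptions introduced for this example; they are not part of the original answer. It also assumes df has a unique index.

```python
import pandas as pd


def sample_with_coverage(df, sample_size, keys=("Country", "State"), seed=0):
    """Draw sample_size rows so that every combination of `keys` appears at least once.

    Hypothetical helper for illustration; assumes df has a unique index.
    """
    # One random row per group guarantees that every key combination is covered.
    covered = df.groupby(list(keys)).sample(n=1, random_state=seed)
    # Fill the remaining quota from the rows that have not been drawn yet.
    n_rest = max(sample_size - len(covered), 0)
    rest = df.drop(index=covered.index).sample(n=n_rest, random_state=seed)
    return pd.concat([covered, rest]).sort_index()


# Toy data (made up): 3 Country-State pairs, 12 rows, 2 items per bill.
toy = pd.DataFrame({
    "Country": ["US"] * 6 + ["IN"] * 4 + ["DE"] * 2,
    "State": ["CA"] * 6 + ["KA"] * 4 + ["BE"] * 2,
    "bill_ID": [1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6],
    "item_id": list(range(12)),
})

sample = sample_with_coverage(toy, sample_size=5)
pairs_all = set(zip(toy["Country"], toy["State"]))
pairs_sampled = set(zip(sample["Country"], sample["State"]))
print(len(sample), pairs_sampled == pairs_all)  # prints: 5 True
```

Passing a fixed seed makes the sample reproducible, which can be convenient when the sample feeds a test suite. Note that this draws individual rows, so a sampled bill_ID may appear with only some of its item_id rows.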

  • Posted by huangapple on June 13, 2023 at 15:40:52
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76462666.html