Randomly sample data based on other columns using Python
Question
I have a dataframe with 100,000 rows that contains the columns Country, State, bill_ID, item_id, dates, etc.
I want to randomly sample 5k rows out of the 100k such that the sample has at least one bill_ID from every country and state.
In short, it should cover all countries and states with at least one bill_ID each.
Note: each bill_ID contains multiple item_id values.
I am testing on the sampled data, which should cover all unique countries and states along with their bill_IDs.
Answer 1
Score: 1
You could use pandas' .sample method. With df as your dataframe, try:
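import pandas as pd

sample_size = 5_000
# One random row per (Country, State) pair guarantees full coverage.
df_sample_1 = df.groupby(["Country", "State"]).sample(n=1)
# Fill the remaining quota from rows that were not already drawn.
sample_size_2 = max(sample_size - df_sample_1.shape[0], 0)
df_sample_2 = df.loc[df.index.difference(df_sample_1.index)].sample(n=sample_size_2)
df_sample = pd.concat([df_sample_1, df_sample_2]).sort_index()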
First, group by the columns Country and State and draw a sample of size 1 from each group. This gives you a sample, df_sample_1, that covers each Country-State combination exactly once. Then draw the rest from the part of the dataframe that doesn't contain the first sample, which gives df_sample_2. Finally, concatenate both samples (and sort the result if needed).
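Since the question notes that each bill_ID spans several item_id rows, a variant that samples whole bills rather than individual rows may fit the testing use case better. The following is only a sketch under assumptions: the helper name sample_bills, its parameters, and the toy data are hypothetical and not part of the original answer; it assumes each bill_ID belongs to exactly one Country-State pair, and its quota counts bills, not rows, so the resulting row count is only approximately controllable.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical toy data: 300 bills, each tied to one (Country, State), 1-5 items per bill.
locations = [("US", "CA"), ("US", "NY"), ("IN", "KA"), ("IN", "MH"), ("DE", "BY")]
bills = pd.DataFrame({"bill_ID": np.arange(300)})
bills["Country"] = [locations[i % len(locations)][0] for i in bills["bill_ID"]]
bills["State"] = [locations[i % len(locations)][1] for i in bills["bill_ID"]]
items_per_bill = rng.integers(1, 6, size=len(bills))
df = bills.loc[bills.index.repeat(items_per_bill)].reset_index(drop=True)
df["item_id"] = np.arange(len(df))


def sample_bills(df, n_bills, keys=("Country", "State"), bill_col="bill_ID", seed=None):
    """Return all rows of n_bills randomly chosen bills, covering every key group."""
    keys = list(keys)
    # Collapse to one row per bill so that bills, not items, are the sampling unit.
    bill_table = df[keys + [bill_col]].drop_duplicates(subset=bill_col)
    # Coverage step: one random bill from every (Country, State) group.
    must_have = bill_table.groupby(keys).sample(n=1, random_state=seed)
    # Fill step: draw the remaining quota from bills not picked yet.
    rest = bill_table[~bill_table[bill_col].isin(must_have[bill_col])]
    n_extra = min(max(n_bills - len(must_have), 0), len(rest))
    extra = rest.sample(n=n_extra, random_state=seed)
    chosen = pd.concat([must_have, extra])[bill_col]
    # Keep every item row that belongs to a chosen bill.
    return df[df[bill_col].isin(chosen)]


sample = sample_bills(df, n_bills=50, seed=1)

# Coverage check: every (Country, State) pair of the full data appears in the sample.
all_pairs = set(map(tuple, df[["Country", "State"]].drop_duplicates().to_numpy()))
sample_pairs = set(map(tuple, sample[["Country", "State"]].drop_duplicates().to_numpy()))
print(sample_pairs == all_pairs, sample["bill_ID"].nunique(), len(sample))
```

To approximate a row target such as 5,000, one option is to divide it by the average number of rows per bill and pass the result as n_bills. The same coverage check can be run on df_sample from the row-level approach above.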