Question

I have a dataframe with 100,000 rows that contains the columns Country, State, bill_ID, item_id, dates, and so on.
I want to randomly sample 5,000 rows out of the 100,000 so that the sample has at least one bill_ID from every country and state.
In short, it should cover all countries and states with at least one bill_ID each.

Note: each bill_ID contains multiple item_id values.

I am running tests on the sampled data, so it should cover all unique countries and states along with their bill_IDs.

Answer 1

Score: 1

You could use pandas' .sample method. With df as your dataframe, try:

import pandas as pd

sample_size = 5_000
# One random row per (Country, State) pair guarantees that every pair is covered.
df_sample_1 = df.groupby(["Country", "State"]).sample(1)
# Number of rows still needed to reach the target size.
sample_size_2 = max(sample_size - df_sample_1.shape[0], 0)
# Draw the remainder from the rows that were not picked in the first step.
df_sample_2 = df.loc[df.index.difference(df_sample_1.index)].sample(sample_size_2)
df_sample = pd.concat([df_sample_1, df_sample_2]).sort_index()

First, group by the Country and State columns and draw a sample of size 1 from each group. This gives a sample, df_sample_1, that covers each Country-State combination exactly once. Then draw the remaining rows from the part of the dataframe that is not in the first sample (this relies on df having a unique index); this is df_sample_2. Finally, concatenate both samples (and sort the result if needed).
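To check that the approach really covers every Country-State combination, here is a minimal, self-contained sketch on made-up data. The helper name sample_covering, the toy frame, the fixed seed, and the sample size of 50 are all assumptions for illustration and are not part of the original answer.

```python
import numpy as np
import pandas as pd


def sample_covering(df, sample_size, keys=("Country", "State"), seed=0):
    """Row-level sample containing every combination of `keys` at least once.

    Hypothetical helper: it wraps the two-step idea from the answer and
    assumes `df` has a unique index.
    """
    keys = list(keys)
    # Step 1: one random row per group guarantees coverage of every group.
    first = df.groupby(keys).sample(n=1, random_state=seed)
    # Step 2: fill up to the target size with rows not chosen in step 1.
    n_rest = max(sample_size - len(first), 0)
    remaining = df.loc[df.index.difference(first.index)]
    rest = remaining.sample(n=n_rest, random_state=seed)
    return pd.concat([first, rest]).sort_index()


# Made-up data standing in for the real 100,000-row dataframe:
# 300 bills, each tied to one Country/State and holding 1 to 5 items.
rng = np.random.default_rng(0)
n_bills = 300
bills = pd.DataFrame({
    "bill_ID": np.arange(n_bills),
    "Country": rng.choice(["A", "B", "C"], size=n_bills),
    "State": rng.choice(["s1", "s2", "s3", "s4"], size=n_bills),
})
items_per_bill = rng.integers(1, 6, size=n_bills)
toy = bills.loc[bills.index.repeat(items_per_bill)].reset_index(drop=True)
toy["item_id"] = np.arange(len(toy))

sample = sample_covering(toy, sample_size=50)

# Compare the Country-State pairs in the sample with those in the full data.
all_pairs = set(zip(toy["Country"], toy["State"]))
sampled_pairs = set(zip(sample["Country"], sample["State"]))
print(len(sample), sampled_pairs == all_pairs)
```

Note that this samples rows, not whole bills, so a sampled bill_ID may appear with only some of its item_id rows. Passing a seed makes the draw repeatable, which can be convenient when the sample feeds a test suite.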
