March 9, 2023, 20:21:08

Plotting a scatter plot of a binary outcome variable over time

Question

I have a large data frame of extracted forum posts that looks something like this:

| user_name | thread_name | post_text | time_of_post | date_joined | binary_target |
| -------- | -------- | -------- | -------- | -------- | -------- |
| BoxCutter | It's Okay When We Do It. | ..... | 2022-08-09 19:39:00 | 2022-05-26 | 1 |
| Docket_33 | It's Okay When We Do It. | ..... | 2022-08-09 19:54:00 | 2022-06-10 | 1 |
| Hearmeout | It's Okay When We Do It. | ..... | 2022-08-09 19:58:00 | 2021-10-07 | 0 |

I have run a binary classifier on the post_text column, giving instances of 1 or 0 in the binary_target column. I wish to plot, for each date present in the data frame (time_of_post), the number of posts classified as a 1. However, since the total number of posts for each day varies, I would like to first calculate the total number of 1s for each date as a percentage of the total number of posts for each date.

The output I desire will be a scatter plot, with date on the x axis, and the y axis being 'instances of 1s as a % of total posts'.

I am able to easily obtain the value counts of the binary targets using:

df_combined.groupby('date_joined')['binary_target'].value_counts()

Though I am struggling with calculating them in percentage terms.

Answer 1

Score: 1

You can get the number of 1s on each date by summing the observations, then divide that by the total count of observations on that date to get the proportion. Both values are available from groupby and agg.

df_agg = df_combined.groupby('date_joined')['binary_target'].agg(['sum', 'count'])
print(df_agg)
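To see what the aggregation returns, here is a minimal sketch on a tiny made-up frame. The five rows are illustrative only and are not taken from the real data. Because the target holds only 0s and 1s, `sum` gives the number of 1s in each group and `count` gives the number of posts.

```python
import pandas as pd

# Toy data, made up for illustration: two dates, five posts in total.
df_combined = pd.DataFrame({
    "date_joined": ["2022-05-26", "2022-05-26", "2022-05-26",
                    "2022-06-10", "2022-06-10"],
    "binary_target": [1, 0, 1, 0, 1],
})

# 'sum' adds up the 1s in each group; 'count' counts all rows in the group.
df_agg = df_combined.groupby("date_joined")["binary_target"].agg(["sum", "count"])
print(df_agg)
```

The result is indexed by the grouping column and has one `sum` and one `count` column per date.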

From there you can create a new column to compute the proportion of 1s on each date.

df_agg['prop'] = df_agg['sum'] / df_agg['count']
print(df_agg)
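The question asks for a scatter plot per posting date, while the snippets above group by date_joined. The sketch below assumes the calendar day of time_of_post is the grouping that is wanted, and it uses made-up rows. It relies on the fact that the mean of a 0/1 column equals sum divided by count, so it gives the same proportion in one step. The axis labels and the use of matplotlib are my choices, not part of the original answer.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Toy data, made up for illustration: posts on two calendar days.
df_combined = pd.DataFrame({
    "time_of_post": pd.to_datetime([
        "2022-08-09 19:39:00", "2022-08-09 19:54:00", "2022-08-09 19:58:00",
        "2022-08-10 08:15:00", "2022-08-10 09:30:00",
    ]),
    "binary_target": [1, 1, 0, 0, 1],
})

# Drop the time of day so posts from the same day share one group key.
post_date = df_combined["time_of_post"].dt.normalize()

# Mean of a 0/1 column is the share of 1s; multiply by 100 for a percentage.
pct_ones = df_combined.groupby(post_date)["binary_target"].mean() * 100

fig, ax = plt.subplots()
ax.scatter(pct_ones.index, pct_ones.values)
ax.set_xlabel("date")
ax.set_ylabel("instances of 1s as a % of total posts")
fig.autofmt_xdate()  # tilt the date labels so they do not overlap
# plt.show()  # uncomment when running in an interactive session
```

If time_of_post is stored as strings, convert it with `pd.to_datetime` first. To group by date_joined instead, pass that column name to `groupby` in place of `post_date`.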
