Plotting a scatter plot of a binary outcome variable over time

Question

I have a large data frame consisting of extracted forum posts that looks something like this:

| user_name | thread_name | post_text | time_of_post | date_joined | binary_target |
| -------- | -------- | -------- | -------- | -------- | -------- |
| BoxCutter | It's Okay When We Do It. | ..... | 2022-08-09 19:39:00 | 2022-05-26 | 1 |
| Docket_33 | It's Okay When We Do It. | ..... | 2022-08-09 19:54:00 | 2022-06-10 | 1 |
| Hearmeout | It's Okay When We Do It. | ..... | 2022-08-09 19:58:00 | 2021-10-07 | 0 |

I have run a binary classifier on the post_text column, giving instances of 1 or 0 in the binary_target column. I wish to plot, for each date present in the data frame (time_of_post), the number of posts classified as a 1. However, since the total number of posts for each day varies, I would like to first calculate the total number of 1s for each date as a percentage of the total number of posts for each date.

The output I desire will be a scatter plot, with date on the x axis, and the y axis being 'instances of 1s as a % of total posts'.

I am able to easily obtain the value counts of the binary targets using:

df_combined.groupby('date_joined')['binary_target'].value_counts()

Though I am struggling with calculating them in percentage terms.

Answer 1

Score: 1

Because binary_target contains only 0s and 1s, its sum on each date is the number of 1s. Dividing that by the total count of observations on each date gives the proportion of 1s. You can get both values using groupby and agg.

df_agg = df_combined.groupby('date_joined')['binary_target'].agg(['sum', 'count'])
print(df_agg)
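
To see why the sum works as a count of 1s, here is a minimal sketch on made-up data. The rows below are hypothetical and only include the two columns the aggregation touches; they are not taken from the real data frame.

```python
import pandas as pd

# Hypothetical sample data (an assumption for illustration only).
df_combined = pd.DataFrame({
    "date_joined": ["2022-05-26", "2022-05-26",
                    "2022-06-10", "2022-06-10", "2022-06-10"],
    "binary_target": [1, 0, 1, 1, 0],
})

# 'sum' adds up the 0/1 values, so it equals the number of 1s per group;
# 'count' is the number of non-null rows per group.
df_agg = df_combined.groupby("date_joined")["binary_target"].agg(["sum", "count"])
print(df_agg)
```

On this sample, 2022-05-26 has one 1 out of two rows and 2022-06-10 has two 1s out of three rows.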

From there, you can create a new column with the proportion of 1s on each date (multiply it by 100 if you want a percentage).

df_agg['prop'] = df_agg['sum'] / df_agg['count']
print(df_agg)
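
Note that the snippets above group by date_joined, while the question asks for a percentage per posting date (time_of_post). The following is a rough sketch of how that variant and the scatter plot might look, not a definitive implementation. It assumes time_of_post holds strings that pd.to_datetime can parse; the sample rows and the names post_date, pct_ones and daily are made up for illustration. It also uses the fact that the mean of a 0/1 column equals sum / count.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical sample data standing in for the real df_combined.
df_combined = pd.DataFrame({
    "time_of_post": [
        "2022-08-09 19:39:00", "2022-08-09 19:54:00", "2022-08-09 19:58:00",
        "2022-08-10 08:15:00", "2022-08-10 09:30:00",
    ],
    "binary_target": [1, 1, 0, 0, 1],
})

# Reduce each timestamp to its calendar day (midnight) so posts group per day.
df_combined["post_date"] = pd.to_datetime(df_combined["time_of_post"]).dt.normalize()

# The mean of a 0/1 column is the share of 1s; multiply by 100 for a percentage.
daily = (
    df_combined.groupby("post_date")["binary_target"]
    .mean()
    .mul(100)
    .rename("pct_ones")
    .reset_index()
)

# Scatter plot: one point per posting date.
fig, ax = plt.subplots()
ax.scatter(daily["post_date"], daily["pct_ones"])
ax.set_xlabel("Date")
ax.set_ylabel("Instances of 1s as a % of total posts")
fig.autofmt_xdate()  # tilt the date labels so they do not overlap
```

Call plt.show() (or fig.savefig with a file name of your choice) to display or save the figure. If you really do want to group by the join date instead, swap time_of_post for date_joined in the conversion step.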

huangapple
  • Posted on March 9, 2023, 20:21:08
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75684543.html