Pandas按季度和公司统计员工人数

huangapple go评论64阅读模式
英文:

Pandas Count of Employees by Quarter and Company

问题

我有两个非常非常大的pandas数据框

df_A每个YearQuarter和Company有一行

df_B每个不同的员工有一行

我想计算每个季度每个公司雇佣的员工人数如果员工的StartYearQuarter <= YearQuarter且EndYearQuarter >= YearQuarter则将员工计为该季度雇佣

我迄今为止尝试了各种不同的方法但它们都遇到了内存问题或者返回了不正确的结果因为数据框太大

这里是我运行的一段代码的示例它告诉我在Jupyter中运行时需要160GBi的空闲RAM而且它刚刚崩溃了我的Azure Python内核

```python
合并 = pd.merge(df_A df_B on =Company Name”)
雇佣=合并df[(merged_df[StartYearQuarter]<= merged_df[YearQuarter])& merged_df[EndYearQuarter]>= merged_df[YearQuarter]]
结果=雇佣.groupby([YearQuarter’,‘Company Name].size()。reset_indexname=Employee Count’)

有没有更节省内存的方法来计算每个YearQuarter的每个公司的员工人数?

非常感谢任何帮助!


<details>
<summary>英文:</summary>

I have two very, very large pandas dataframes.

[![df_A][1]][1]

df_A has one row for each YearQuarter and Company

[![df_B][2]][2]

df_B has one row for each distinct employee

I want to count the number of employees that are employed for each company each quarter. An employee is counted as employed in a quarter if their StartYearQuarter &lt;= YearQuarter and EndYearQuarter &gt;= YearQuarter.

I have tried a variety of different approaches so far but they have all ran into memory issues, or returned incorrect results, as the dataframes are so large.

Here is an example of one bit of code I ran which told me I would need 160GBi of free RAM when ran in Jupyter and just crashed my Azure Python Kernel:

```python
merged = pd.merge(df_A, df_B, on=&quot;Company Name&quot;)
employed = merged_df[(merged_df[&#39;StartYearQuarter&#39;] &lt;= merged_df[&#39;YearQuarter&#39;]) &amp; (merged_df[&#39;EndYearQuarter&#39;] &gt;= merged_df[&#39;YearQuarter&#39;])]
result = employed.groupby([&#39;YearQuarter&#39;, &#39;Company Name&#39;]).size().reset_index(name=&#39;Employee Count&#39;)

Is there a more memory efficient way of counting the number of employees for each Company by YearQuarter?

Many thanks for any help!

答案1

得分: 1

如果您使用pyspark,可以使用以下代码:

data1 = [['1997Q3', 'test1'], ['1997Q4', 'test1']]
data2 = [['test1', '1997Q2', '1998Q1', 1], ['test1', '1997Q3', '1997Q3', 2]]

df1 = spark.createDataFrame(data1, ['YearQuarter', 'Company Name'])
df2 = spark.createDataFrame(data2, ['Company Name2', 'StartYearQuarter', 'EndYearQuarter', 'ID'])

df1.show()
df2.show()

df1.join(df2, (f.col('Company Name') == f.col('Company Name2')) & f.col('YearQuarter').between(f.col('StartYearQuarter'), f.col('EndYearQuarter')), 'inner') \
  .groupBy('Company Name', 'YearQuarter') \
  .count() \
  .show()

这段代码用于使用pyspark处理数据,包括数据框的创建、连接和分组等操作。

英文:

If you use the pyspark,

data1 = [[&#39;1997Q3&#39;, &#39;test1&#39;], [&#39;1997Q4&#39;, &#39;test1&#39;]]
data2 = [[&#39;test1&#39;, &#39;1997Q2&#39;, &#39;1998Q1&#39;, 1], [&#39;test1&#39;, &#39;1997Q3&#39;, &#39;1997Q3&#39;, 2]]

df1 = spark.createDataFrame(data1, [&#39;YearQuarter&#39;, &#39;Company Name&#39;])
df2 = spark.createDataFrame(data2, [&#39;Company Name2&#39;, &#39;StartYearQuarter&#39;, &#39;EndYearQuarter&#39;, &#39;ID&#39;])

df1.show()
df2.show()

df1.join(df2, (f.col(&#39;Company Name&#39;) == f.col(&#39;Company Name2&#39;)) &amp; f.col(&#39;YearQuarter&#39;).between(f.col(&#39;StartYearQuarter&#39;), f.col(&#39;EndYearQuarter&#39;)), &#39;inner&#39;) \
  .groupBy(&#39;Company Name&#39;, &#39;YearQuarter&#39;) \
  .count() \
  .show()

+-----------+------------+
|YearQuarter|Company Name|
+-----------+------------+
|     1997Q3|       test1|
|     1997Q4|       test1|
+-----------+------------+

+-------------+----------------+--------------+---+
|Company Name2|StartYearQuarter|EndYearQuarter| ID|
+-------------+----------------+--------------+---+
|        test1|          1997Q2|        1998Q1|  1|
|        test1|          1997Q3|        1997Q3|  2|
+-------------+----------------+--------------+---+

+------------+-----------+-----+
|Company Name|YearQuarter|count|
+------------+-----------+-----+
|       test1|     1997Q3|    2|
|       test1|     1997Q4|    1|
+------------+-----------+-----+

huangapple
  • 本文由 发表于 2023年2月16日 19:04:12
  • 转载请务必保留本文链接:https://go.coder-hub.com/75471340.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定