Pandas Count of Employees by Quarter and Company
Question
I have two very, very large pandas dataframes.
(screenshot of df_A)
df_A has one row for each YearQuarter and Company
(screenshot of df_B)
df_B has one row for each distinct employee
I want to count the number of employees that are employed for each company each quarter. An employee is counted as employed in a quarter if their StartYearQuarter <= YearQuarter and EndYearQuarter >= YearQuarter.
I have tried a variety of different approaches so far, but they have all run into memory issues or returned incorrect results because the dataframes are so large.
Here is an example of one bit of code I ran, which told me it would need 160 GiB of free RAM when run in Jupyter and crashed my Azure Python kernel:
```python
import pandas as pd

merged = pd.merge(df_A, df_B, on="Company Name")
# Keep only the rows where the employee was employed during that quarter
employed = merged[(merged['StartYearQuarter'] <= merged['YearQuarter']) & (merged['EndYearQuarter'] >= merged['YearQuarter'])]
result = employed.groupby(['YearQuarter', 'Company Name']).size().reset_index(name='Employee Count')
```
Is there a more memory-efficient way of counting the number of employees for each Company by YearQuarter?
Many thanks for any help!
Answer 1
Score: 1
If you can use PySpark, the whole computation can be expressed as a single range join followed by a count:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as f

spark = SparkSession.builder.getOrCreate()

data1 = [['1997Q3', 'test1'], ['1997Q4', 'test1']]
data2 = [['test1', '1997Q2', '1998Q1', 1], ['test1', '1997Q3', '1997Q3', 2]]
df1 = spark.createDataFrame(data1, ['YearQuarter', 'Company Name'])
df2 = spark.createDataFrame(data2, ['Company Name2', 'StartYearQuarter', 'EndYearQuarter', 'ID'])
df1.show()
df2.show()

# Join each (YearQuarter, Company) row to the employees whose employment
# interval contains that quarter, then count employees per company and quarter.
df1.join(df2, (f.col('Company Name') == f.col('Company Name2')) & f.col('YearQuarter').between(f.col('StartYearQuarter'), f.col('EndYearQuarter')), 'inner') \
    .groupBy('Company Name', 'YearQuarter') \
    .count() \
    .show()
```
These `show()` calls print:

```
+-----------+------------+
|YearQuarter|Company Name|
+-----------+------------+
|     1997Q3|       test1|
|     1997Q4|       test1|
+-----------+------------+

+-------------+----------------+--------------+---+
|Company Name2|StartYearQuarter|EndYearQuarter| ID|
+-------------+----------------+--------------+---+
|        test1|          1997Q2|        1998Q1|  1|
|        test1|          1997Q3|        1997Q3|  2|
+-------------+----------------+--------------+---+

+------------+-----------+-----+
|Company Name|YearQuarter|count|
+------------+-----------+-----+
|       test1|     1997Q3|    2|
|       test1|     1997Q4|    1|
+------------+-----------+-----+
```
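If Spark is not an option, a similar result can usually be reached in pandas without materializing the full merge, by processing one quarter at a time so that only a boolean mask over df_B and a small per-company count live in memory at once. This is a minimal sketch under that assumption, with made-up sample data mirroring the screenshots (the per-quarter loop is an illustration, not the answerer's code):

```python
import pandas as pd

# Hypothetical sample data mirroring the screenshots in the question.
df_A = pd.DataFrame({"YearQuarter": ["1997Q3", "1997Q4"],
                     "Company Name": ["test1", "test1"]})
df_B = pd.DataFrame({"Company Name": ["test1", "test1"],
                     "StartYearQuarter": ["1997Q2", "1997Q3"],
                     "EndYearQuarter": ["1998Q1", "1997Q3"]})

counts = []
for quarter in df_A["YearQuarter"].unique():
    # 'YYYYQn' strings compare correctly in lexicographic order.
    active = df_B[(df_B["StartYearQuarter"] <= quarter)
                  & (df_B["EndYearQuarter"] >= quarter)]
    per_quarter = (active.groupby("Company Name").size()
                         .reset_index(name="Employee Count"))
    per_quarter.insert(0, "YearQuarter", quarter)
    counts.append(per_quarter)

# Keep only the (YearQuarter, Company) pairs that appear in df_A.
result = df_A.merge(pd.concat(counts, ignore_index=True),
                    on=["YearQuarter", "Company Name"], how="left")
print(result)
```

Because each iteration touches only df_B plus one per-company count, peak memory stays close to the size of df_B rather than the size of the merged frame, and the string comparisons implement the same employment test as the original filter.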