Attrition Calculation Performance


Question


TLDR:

  1. How can I speed up the SQL calculation for attrition, given that it is O(n^2)?
  2. How can I distribute the Python implementation for attrition, given that it maintains running counters (cumulative leavers and cumulative joiners)?
  3. Any other ideas? (InfluxDB? Kinetica?)

I need to calculate employee attrition for every company, every month, using a big table that looks like the one below. startdate and enddate are always the first day of a month to keep things simple, since a monthly attrition rate is sufficient.

peopleid | companyid | startdate  | enddate
Joe      | MSFT      | 2010-01-01 | 2015-02-01
...

I have a SQL solution (pseudocode below) that is taking a very long time, mostly due to the cross join between the calendar table and the employment history table, which is big.

For example, if the history table has 10B rows and 50 years covers 600 months in total, the cross join will yield 10e9 * 6e2 = 6e12 rows, i.e. 6 trillion, which is very expensive.

with cal as (
    select yearmonth
    from calendar
),
people_left as (
    select cal.yearmonth,
        eh.companyid,
        count(distinct eh.peopleid) as peopleid_left_count
    from cal
        inner join employment_history eh on cal.yearmonth = eh.enddate_yearmonth
    group by cal.yearmonth,
        eh.companyid
),
people_avg as (
    select cal.yearmonth,
        eh.companyid,
        count(distinct eh.peopleid) as peopleid_count
    from cal
        inner join employment_history eh on cal.yearmonth >= eh.startdate_yearmonth
        and cal.yearmonth <= eh.enddate_yearmonth
    group by cal.yearmonth,
        eh.companyid
)
select n.yearmonth,
    n.companyid,
    1.0 * n.peopleid_left_count / nullif(d.peopleid_count, 0) as attrition -- 1.0 and nullif avoid integer division and division by zero
from people_left n
    inner join people_avg d on n.yearmonth = d.yearmonth
    and n.companyid = d.companyid;

I also implemented a Python solution; here are the key steps (a sketch follows the list):

  1. Restructure the employment history table as an employment events table: each job, previously one row with a start date and an end date, is split into two records, one for joining and one for leaving, indexed by month for O(1) lookup.
  2. Maintain one big O(1) lookup dictionary per company holding the counters for cumulative joiners and cumulative leavers.
  3. Starting from month 1, go through each month one by one until the end.
  4. Within each month, find all joining and leaving events, update the counters for that company and month, and calculate the churn as that month's leavers divided by the difference between cumulative joiners and cumulative leavers.
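
For concreteness, here is a minimal sketch of steps 1-4, assuming the history has already been loaded into memory and months are encoded as integer indexes (for example, year * 12 + month) so they can be iterated in order. The names events_by_month, add_history_row, and monthly_attrition are illustrative, not the actual implementation.

from collections import defaultdict

# Step 1: per-month event index, yearmonth -> [(companyid, event), ...]
events_by_month = defaultdict(list)

def add_history_row(companyid, start_month, end_month):
    # Split one employment_history row into a joining and a leaving event.
    events_by_month[start_month].append((companyid, "join"))
    events_by_month[end_month].append((companyid, "leave"))

def monthly_attrition(first_month, last_month):
    joined = defaultdict(int)   # cumulative joiners per company (step 2)
    left = defaultdict(int)     # cumulative leavers per company (step 2)
    results = []                # (yearmonth, companyid, attrition)

    for month in range(first_month, last_month + 1):              # step 3
        leavers_this_month = defaultdict(int)
        for companyid, event in events_by_month.get(month, []):   # step 4: O(1) lookup
            if event == "join":
                joined[companyid] += 1
            else:
                left[companyid] += 1
                leavers_this_month[companyid] += 1
        for companyid, n_left in leavers_this_month.items():
            denom = joined[companyid] - left[companyid]   # cumulative joiners - cumulative leavers
            if denom > 0:                                 # guard against an empty denominator
                results.append((month, companyid, n_left / denom))
    return results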

This is significantly faster than the SQL implementation, but is it even possible to distribute the computation to speed it up? The accumulation step depends on all previous snapshots, so I cannot wrap my head around how to think about this in a MapReduce way.

Answer 1

Score: 1


>This is significantly faster than the SQL implementation but is it even possible to distribute the computing to speed up? the accumulation step is dependent on all previously snapshots so I cannot wrap my head around how to think in a mapreduce way.

Here are two ideas for how to approach this. The first idea is simpler, and the second idea accommodates more parallelism; a sketch of both follows the list.

  1. Different companies are independent. You could process MSFT on one core, and employees for a different company on a different core.

    This will work as long as the number of companies you need to process is larger than the number of cores you want to distribute over. I don't know whether that's true in your dataset.

  2. You could split the cumulative sum step into two steps:

    1. Take in tuples of the form (company, yearmonth, join_or_leave). For each combination of company and yearmonth, sum up all tuples within one company and yearmonth and output a tuple of the form (company, yearmonth, num_joins, num_leaves). This is a non-cumulative sum, so there is no dependence between months.

      This lets you parallelize across both company and yearmonth.

      As a back of the envelope calculation, if you have 1000 companies and 600 months, you'd end up with 600,000 tuples from this step. If you started with 20B rows (10B input rows * 2 events each), this is a roughly 30,000x reduction in data size.

    2. Take in the output of the previous step and do a cumulative sum. This step is less parallel, but doesn't need to deal with as much data. (It only has number of companies * number of months tuples to handle.)

      This can also be parallelized across companies, but that may not really be useful, given the reduced data size.
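
To make this concrete, here is a minimal sketch combining both ideas, assuming the raw data has already been split into partitions of (company, yearmonth, join_or_leave) tuples. The partitioning, the multiprocessing setup, and the names aggregate_partition, merge, and cumulative_attrition are illustrative assumptions, not a prescription.

from collections import defaultdict
from multiprocessing import Pool

def aggregate_partition(events):
    # Step 2a: non-cumulative sums; no dependence between months,
    # so every partition can be processed in parallel.
    counts = defaultdict(lambda: [0, 0])  # (company, yearmonth) -> [joins, leaves]
    for company, yearmonth, kind in events:
        counts[(company, yearmonth)][0 if kind == "join" else 1] += 1
    return dict(counts)

def merge(partials):
    # Combine per-partition results; still no cross-month dependence.
    merged = defaultdict(lambda: [0, 0])
    for partial in partials:
        for key, (joins, leaves) in partial.items():
            merged[key][0] += joins
            merged[key][1] += leaves
    return merged

def cumulative_attrition(merged):
    # Step 2b: cumulative sums over the small aggregated table, done per
    # company (idea 1), serially here because the reduced data is tiny.
    by_company = defaultdict(list)
    for (company, yearmonth), (joins, leaves) in merged.items():
        by_company[company].append((yearmonth, joins, leaves))
    results = []
    for company, rows in by_company.items():
        rows.sort()  # chronological order within one company
        cum_joins = cum_leaves = 0
        for yearmonth, joins, leaves in rows:
            cum_joins += joins
            cum_leaves += leaves
            headcount = cum_joins - cum_leaves
            if headcount > 0:
                results.append((company, yearmonth, leaves / headcount))
    return results

if __name__ == "__main__":
    # Hypothetical tiny example: two partitions of event tuples.
    partitions = [
        [("MSFT", 202001, "join"), ("MSFT", 202003, "leave")],
        [("MSFT", 202001, "join"), ("GOOG", 202002, "join")],
    ]
    with Pool() as pool:
        partials = pool.map(aggregate_partition, partitions)
    print(cumulative_attrition(merge(partials)))

The same split maps directly onto a MapReduce or Spark job: the map/combine phase is aggregate_partition, the shuffle key is (company, yearmonth), and only the final per-company cumulative pass is sequential.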
