How to find the number of occurrences of sub-strings in another string in BigQuery with conditionals as a cumulative sum?

huangapple go评论58阅读模式
英文:

How to find the number of occurrences of sub-strings in another string in BigQuery with conditionals as a cumulative sum?

问题

所以我的SQL/BigQuery技能仍然需要一些完善,我在这个问题上卡住了一段时间。我有一个标记了一些应用程序计划修改的表ST,其中包含它们的起始日期,这些修改仅适用于具有特定规格的应用程序:

Mod_ID App_ID Specs Start_Date
P1 App1 S1 2022-01-24
P1 App2 S2 2022-01-24
P2 App2 S3 S4 2022-03-15
P3 App3 S4 S5 2022-04-10

我有另一张表AQ,其中包含一些设备的生产日期,以及它们对应的应用程序和规格的字符串形式:

Device_ID App_ID Spec_string Prod_Date
ID0001 App1 S1 S4 S6 T0 U7 2022-02-03
ID0002 App2 S2 S5 U9 2022-02-05
ID0003 App2 S1 S2 S3 2022-03-12
ID0004 App3 S5 S6 T0 U7 2022-04-18

我想要计算特定设备的不同规格修改的发生次数,这些修改适用于其App_ID,并且仅适用于AQ.Prod_Date >= ST.Start_Date的情况。

我首先对表ST进行了展开(ST_UNN),最终得到了这样的结果:

Mod_ID App_ID Specs_ind Start_Date
P1 App1 S1 2022-01-24
P1 App2 S2 2022-01-24
P2 App2 S3 2022-03-15
P2 App2 S4 2022-03-15
P3 App3 S4 2022-04-10
P3 App3 S5 2022-04-10

然后我尝试了以下操作:

SELECT
  a.* EXCEPT(Spec_String),
  SUM(
    CASE
      WHEN s.App_ID = a.App_ID AND s.Start_Date <= a.Prod_Date THEN ARRAY_LENGTH(REGEXP_EXTRACT_ALL(Spec_String, Specs_ind))
    ELSE
    0
  END
    ) OVER(PARTITION BY s.Specs_ind ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS num_changes,
FROM
  AQ AS a
LEFT JOIN
  ST_UNN AS s
ON
  a.App_ID  = s.App_ID 

但它并没有按预期工作。我期望得到类似于这样的结果:

Device_ID App_ID Spec_string Prod_Date num_changes
ID0001 App1 S1 S4 S6 T0 U7 2022-02-03 1
ID0002 App2 S2 S5 U9 2022-02-05 1
ID0003 App2 S1 S3 U7 2022-03-12 0
ID0004 App3 S5 S6 T0 U7 2022-04-18 1

但我对如何实现最后部分作为累积总和感到有些迷茫,以便设备将计算其生产日期之前发生的所有修改,并基于其规格应用。

英文:

So my SQL/BiqQuery skills still need some refining and I have been stuck on this for some time now. I got a table ST that marks the planned mods to some applications with their starting date, these mods only apply to those with some specific specs:

Mod_ID App_ID Specs Start_Date
P1 App1 S1 2022-01-24
P1 App2 S2 2022-01-24
P2 App2 S3 S4 2022-03-15
P3 App3 S4 S5 2022-04-10

I got another table AQ that includes the production date for some devices, with their corresponding app and their specifications in the form of a string:

Device_ID App_ID Spec_string Prod_Date
ID0001 App1 S1 S4 S6 T0 U7 2022-02-03
ID0002 App2 S2 S5 U9 2022-02-05
ID0003 App2 S1 S2 S3 2022-03-12
ID0004 App3 S5 S6 T0 U7 2022-04-18

I want to count the occurrences of the different spec mods for a particuar device that apply to its App_ID and only for those that AQ.Prod_Date>=ST.Start_Date.

I first unnested table ST (ST_UNN) and ended with something like this:

Mod_ID App_ID Specs_ind Start_Date
P1 App1 S1 2022-01-24
P1 App2 S2 2022-01-24
P2 App2 S3 2022-03-15
P2 App2 S4 2022-03-15
P3 App3 S4 2022-04-10
P3 App3 S5 2022-04-10

and tried the following:

SELECT
  a.* EXCEPT(Spec_String),
  SUM(
    CASE
      WHEN s.App_ID = a.App_ID AND s.Start_Date &lt;= a.Prod_Date THEN ARRAY_LENGTH(REGEXP_EXTRACT_ALL(Spec_String, Specs_ind))
    ELSE
    0
  END
    ) OVER(PARTITION BY s.Specs_ind ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS num_changes,
FROM
  AQ AS a
LEFT JOIN
  ST_UNN AS s
ON
  a.App_ID  = s.App_ID 

But it didn't work as intended. I expect something like this:

Device_ID App_ID Spec_string Prod_Date num_changes
ID0001 App1 S1 S4 S6 T0 U7 2022-02-03 1
ID0002 App2 S2 S5 U9 2022-02-05 1
ID0003 App2 S1 S3 U7 2022-03-12 0
ID0004 App3 S5 S6 T0 U7 2022-04-18 1

But I am rather lost on how to achieve this last part as a cumulative sum, so that a device will count all mod implementations that have happened before its production date and that apply based on its specifications.

Any feedback is greatly appreciated!

答案1

得分: 1

如果我理解正确,您可以简化一下,只需按device_id等进行分组,而不使用窗口函数。我稍微修改了您的输入,只是为了显示它按预期工作。

WITH
  st AS (
  SELECT
    'P1' AS mod_id,
    'App1' AS app_id,
    'S1 S4' AS specs,
    '2022-01-24' AS start_date
  UNION ALL
  SELECT
    'P1',
    'App2',
    'S2',
    '2022-01-24'
  UNION ALL
  SELECT
    'P2',
    'App2',
    'S3 S4',
    '2022-03-15'
  UNION ALL
  SELECT
    'P3',
    'App3',
    'S4 S5',
    '2022-04-10'
  UNION ALL
  SELECT
    'P4',
    'App3',
    'S4 S5',
    '2022-04-11' ),
  st_unn AS (
  SELECT
    * EXCEPT(specs)
  FROM
    st,
    UNNEST(SPLIT(specs, ' ')) AS spec ),
  aq AS (
    SELECT
    -- 应该为2
    'ID0001' AS device_id,
    'App1' AS app_id,
    'S1 S4 S6 T0 U7' AS spec_string,
    '2022-02-03' AS prod_date
  UNION ALL
  SELECT
    -- 应该为1
    'ID0002',
    'App2',
    'S2 S5 U9',
    '2022-02-05'
  UNION ALL
  SELECT
    -- 应该为0
    'ID0003',
    'App2',
    'S2 S5 U9',
    '2021-03-12'
  UNION ALL
  SELECT
    -- 应该为3
    'ID0004',
    'App3',
    'S4 S5 S6 T0 U7',
    '2022-04-18' )
SELECT
  device_id,
  aq.app_id,
  spec_string,
  prod_date,
  SUM(COALESCE(ARRAY_LENGTH(REGEXP_EXTRACT_ALL(spec_string, spec)), 0))
FROM
  aq
LEFT JOIN
  st_unn st
ON
  aq.prod_date >= st.start_date
  AND aq.app_id = st.app_id
GROUP BY
  device_id,
  aq.app_id,
  spec_string,
  prod_date
英文:

If I am understanding this correctly, you can simplify a little and just group by device_id, etc. instead of using a window function. I modified your input slightly just to show that this was working as intended

WITH
  st AS (
  SELECT
    &#39;P1&#39; AS mod_id,
    &#39;App1&#39; AS app_id,
    &#39;S1 S4&#39; AS specs,
    &#39;2022-01-24&#39; AS start_date
  UNION ALL
  SELECT
    &#39;P1&#39;,
    &#39;App2&#39;,
    &#39;S2&#39;,
    &#39;2022-01-24&#39;
  UNION ALL
  SELECT
    &#39;P2&#39;,
    &#39;App2&#39;,
    &#39;S3 S4&#39;,
    &#39;2022-03-15&#39;
  UNION ALL
  SELECT
    &#39;P3&#39;,
    &#39;App3&#39;,
    &#39;S4 S5&#39;,
    &#39;2022-04-10&#39;
  UNION ALL
  SELECT
    &#39;P4&#39;,
    &#39;App3&#39;,
    &#39;S4 S5&#39;,
    &#39;2022-04-11&#39; ),
  st_unn AS (
  SELECT
    * EXCEPT(specs)
  FROM
    st,
    UNNEST(SPLIT(specs, &#39; &#39;)) AS spec ),
  aq AS (
  SELECT
    -- should be 2
    &#39;ID0001&#39; AS device_id,
    &#39;App1&#39; AS app_id,
    &#39;S1 S4 S6 T0 U7&#39; AS spec_string,
    &#39;2022-02-03&#39; AS prod_date
  UNION ALL
  SELECT
    -- should be 1
    &#39;ID0002&#39;,
    &#39;App2&#39;,
    &#39;S2 S5 U9&#39;,
    &#39;2022-02-05&#39;
  UNION ALL
  SELECT
  -- should be 0
    &#39;ID0003&#39;,
    &#39;App2&#39;,
    &#39;S2 S5 U9&#39;,
    &#39;2021-03-12&#39;
  UNION ALL
  SELECT
    -- should be 3
    &#39;ID0004&#39;,
    &#39;App3&#39;,
    &#39;S4 S5 S6 T0 U7&#39;,
    &#39;2022-04-18&#39; )
SELECT
  device_id,
  aq.app_id,
  spec_string,
  prod_date,
  SUM(COALESCE(ARRAY_LENGTH(REGEXP_EXTRACT_ALL(spec_string, spec)), 0))
FROM
  aq
LEFT JOIN
  st_unn st
ON
  aq.prod_date &gt;= st.start_date
  AND aq.app_id = st.app_id
GROUP BY
  device_id,
  aq.app_id,
  spec_string,
  prod_date

huangapple
  • 本文由 发表于 2023年3月10日 01:38:23
  • 转载请务必保留本文链接:https://go.coder-hub.com/75688188.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定