How to find the number of occurrences of sub-strings in another string in BigQuery with conditionals as a cumulative sum?

huangapple go评论99阅读模式
英文:

How to find the number of occurrences of sub-strings in another string in BigQuery with conditionals as a cumulative sum?

问题

所以我的SQL/BigQuery技能仍然需要一些完善,我在这个问题上卡住了一段时间。我有一个标记了一些应用程序计划修改的表ST,其中包含它们的起始日期,这些修改仅适用于具有特定规格的应用程序:

Mod_ID App_ID Specs Start_Date
P1 App1 S1 2022-01-24
P1 App2 S2 2022-01-24
P2 App2 S3 S4 2022-03-15
P3 App3 S4 S5 2022-04-10

我有另一张表AQ,其中包含一些设备的生产日期,以及它们对应的应用程序和规格的字符串形式:

Device_ID App_ID Spec_string Prod_Date
ID0001 App1 S1 S4 S6 T0 U7 2022-02-03
ID0002 App2 S2 S5 U9 2022-02-05
ID0003 App2 S1 S2 S3 2022-03-12
ID0004 App3 S5 S6 T0 U7 2022-04-18

我想要计算特定设备的不同规格修改的发生次数,这些修改适用于其App_ID,并且仅适用于AQ.Prod_Date >= ST.Start_Date的情况。

我首先对表ST进行了展开(ST_UNN),最终得到了这样的结果:

Mod_ID App_ID Specs_ind Start_Date
P1 App1 S1 2022-01-24
P1 App2 S2 2022-01-24
P2 App2 S3 2022-03-15
P2 App2 S4 2022-03-15
P3 App3 S4 2022-04-10
P3 App3 S5 2022-04-10

然后我尝试了以下操作:

  1. SELECT
  2. a.* EXCEPT(Spec_String),
  3. SUM(
  4. CASE
  5. WHEN s.App_ID = a.App_ID AND s.Start_Date <= a.Prod_Date THEN ARRAY_LENGTH(REGEXP_EXTRACT_ALL(Spec_String, Specs_ind))
  6. ELSE
  7. 0
  8. END
  9. ) OVER(PARTITION BY s.Specs_ind ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS num_changes,
  10. FROM
  11. AQ AS a
  12. LEFT JOIN
  13. ST_UNN AS s
  14. ON
  15. a.App_ID = s.App_ID

但它并没有按预期工作。我期望得到类似于这样的结果:

Device_ID App_ID Spec_string Prod_Date num_changes
ID0001 App1 S1 S4 S6 T0 U7 2022-02-03 1
ID0002 App2 S2 S5 U9 2022-02-05 1
ID0003 App2 S1 S3 U7 2022-03-12 0
ID0004 App3 S5 S6 T0 U7 2022-04-18 1

但我对如何实现最后部分作为累积总和感到有些迷茫,以便设备将计算其生产日期之前发生的所有修改,并基于其规格应用。

英文:

So my SQL/BiqQuery skills still need some refining and I have been stuck on this for some time now. I got a table ST that marks the planned mods to some applications with their starting date, these mods only apply to those with some specific specs:

Mod_ID App_ID Specs Start_Date
P1 App1 S1 2022-01-24
P1 App2 S2 2022-01-24
P2 App2 S3 S4 2022-03-15
P3 App3 S4 S5 2022-04-10

I got another table AQ that includes the production date for some devices, with their corresponding app and their specifications in the form of a string:

Device_ID App_ID Spec_string Prod_Date
ID0001 App1 S1 S4 S6 T0 U7 2022-02-03
ID0002 App2 S2 S5 U9 2022-02-05
ID0003 App2 S1 S2 S3 2022-03-12
ID0004 App3 S5 S6 T0 U7 2022-04-18

I want to count the occurrences of the different spec mods for a particuar device that apply to its App_ID and only for those that AQ.Prod_Date>=ST.Start_Date.

I first unnested table ST (ST_UNN) and ended with something like this:

Mod_ID App_ID Specs_ind Start_Date
P1 App1 S1 2022-01-24
P1 App2 S2 2022-01-24
P2 App2 S3 2022-03-15
P2 App2 S4 2022-03-15
P3 App3 S4 2022-04-10
P3 App3 S5 2022-04-10

and tried the following:

  1. SELECT
  2. a.* EXCEPT(Spec_String),
  3. SUM(
  4. CASE
  5. WHEN s.App_ID = a.App_ID AND s.Start_Date &lt;= a.Prod_Date THEN ARRAY_LENGTH(REGEXP_EXTRACT_ALL(Spec_String, Specs_ind))
  6. ELSE
  7. 0
  8. END
  9. ) OVER(PARTITION BY s.Specs_ind ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS num_changes,
  10. FROM
  11. AQ AS a
  12. LEFT JOIN
  13. ST_UNN AS s
  14. ON
  15. a.App_ID = s.App_ID

But it didn't work as intended. I expect something like this:

Device_ID App_ID Spec_string Prod_Date num_changes
ID0001 App1 S1 S4 S6 T0 U7 2022-02-03 1
ID0002 App2 S2 S5 U9 2022-02-05 1
ID0003 App2 S1 S3 U7 2022-03-12 0
ID0004 App3 S5 S6 T0 U7 2022-04-18 1

But I am rather lost on how to achieve this last part as a cumulative sum, so that a device will count all mod implementations that have happened before its production date and that apply based on its specifications.

Any feedback is greatly appreciated!

答案1

得分: 1

如果我理解正确,您可以简化一下,只需按device_id等进行分组,而不使用窗口函数。我稍微修改了您的输入,只是为了显示它按预期工作。

  1. WITH
  2. st AS (
  3. SELECT
  4. 'P1' AS mod_id,
  5. 'App1' AS app_id,
  6. 'S1 S4' AS specs,
  7. '2022-01-24' AS start_date
  8. UNION ALL
  9. SELECT
  10. 'P1',
  11. 'App2',
  12. 'S2',
  13. '2022-01-24'
  14. UNION ALL
  15. SELECT
  16. 'P2',
  17. 'App2',
  18. 'S3 S4',
  19. '2022-03-15'
  20. UNION ALL
  21. SELECT
  22. 'P3',
  23. 'App3',
  24. 'S4 S5',
  25. '2022-04-10'
  26. UNION ALL
  27. SELECT
  28. 'P4',
  29. 'App3',
  30. 'S4 S5',
  31. '2022-04-11' ),
  32. st_unn AS (
  33. SELECT
  34. * EXCEPT(specs)
  35. FROM
  36. st,
  37. UNNEST(SPLIT(specs, ' ')) AS spec ),
  38. aq AS (
  39. SELECT
  40. -- 应该为2
  41. 'ID0001' AS device_id,
  42. 'App1' AS app_id,
  43. 'S1 S4 S6 T0 U7' AS spec_string,
  44. '2022-02-03' AS prod_date
  45. UNION ALL
  46. SELECT
  47. -- 应该为1
  48. 'ID0002',
  49. 'App2',
  50. 'S2 S5 U9',
  51. '2022-02-05'
  52. UNION ALL
  53. SELECT
  54. -- 应该为0
  55. 'ID0003',
  56. 'App2',
  57. 'S2 S5 U9',
  58. '2021-03-12'
  59. UNION ALL
  60. SELECT
  61. -- 应该为3
  62. 'ID0004',
  63. 'App3',
  64. 'S4 S5 S6 T0 U7',
  65. '2022-04-18' )
  66. SELECT
  67. device_id,
  68. aq.app_id,
  69. spec_string,
  70. prod_date,
  71. SUM(COALESCE(ARRAY_LENGTH(REGEXP_EXTRACT_ALL(spec_string, spec)), 0))
  72. FROM
  73. aq
  74. LEFT JOIN
  75. st_unn st
  76. ON
  77. aq.prod_date >= st.start_date
  78. AND aq.app_id = st.app_id
  79. GROUP BY
  80. device_id,
  81. aq.app_id,
  82. spec_string,
  83. prod_date
英文:

If I am understanding this correctly, you can simplify a little and just group by device_id, etc. instead of using a window function. I modified your input slightly just to show that this was working as intended

  1. WITH
  2. st AS (
  3. SELECT
  4. &#39;P1&#39; AS mod_id,
  5. &#39;App1&#39; AS app_id,
  6. &#39;S1 S4&#39; AS specs,
  7. &#39;2022-01-24&#39; AS start_date
  8. UNION ALL
  9. SELECT
  10. &#39;P1&#39;,
  11. &#39;App2&#39;,
  12. &#39;S2&#39;,
  13. &#39;2022-01-24&#39;
  14. UNION ALL
  15. SELECT
  16. &#39;P2&#39;,
  17. &#39;App2&#39;,
  18. &#39;S3 S4&#39;,
  19. &#39;2022-03-15&#39;
  20. UNION ALL
  21. SELECT
  22. &#39;P3&#39;,
  23. &#39;App3&#39;,
  24. &#39;S4 S5&#39;,
  25. &#39;2022-04-10&#39;
  26. UNION ALL
  27. SELECT
  28. &#39;P4&#39;,
  29. &#39;App3&#39;,
  30. &#39;S4 S5&#39;,
  31. &#39;2022-04-11&#39; ),
  32. st_unn AS (
  33. SELECT
  34. * EXCEPT(specs)
  35. FROM
  36. st,
  37. UNNEST(SPLIT(specs, &#39; &#39;)) AS spec ),
  38. aq AS (
  39. SELECT
  40. -- should be 2
  41. &#39;ID0001&#39; AS device_id,
  42. &#39;App1&#39; AS app_id,
  43. &#39;S1 S4 S6 T0 U7&#39; AS spec_string,
  44. &#39;2022-02-03&#39; AS prod_date
  45. UNION ALL
  46. SELECT
  47. -- should be 1
  48. &#39;ID0002&#39;,
  49. &#39;App2&#39;,
  50. &#39;S2 S5 U9&#39;,
  51. &#39;2022-02-05&#39;
  52. UNION ALL
  53. SELECT
  54. -- should be 0
  55. &#39;ID0003&#39;,
  56. &#39;App2&#39;,
  57. &#39;S2 S5 U9&#39;,
  58. &#39;2021-03-12&#39;
  59. UNION ALL
  60. SELECT
  61. -- should be 3
  62. &#39;ID0004&#39;,
  63. &#39;App3&#39;,
  64. &#39;S4 S5 S6 T0 U7&#39;,
  65. &#39;2022-04-18&#39; )
  66. SELECT
  67. device_id,
  68. aq.app_id,
  69. spec_string,
  70. prod_date,
  71. SUM(COALESCE(ARRAY_LENGTH(REGEXP_EXTRACT_ALL(spec_string, spec)), 0))
  72. FROM
  73. aq
  74. LEFT JOIN
  75. st_unn st
  76. ON
  77. aq.prod_date &gt;= st.start_date
  78. AND aq.app_id = st.app_id
  79. GROUP BY
  80. device_id,
  81. aq.app_id,
  82. spec_string,
  83. prod_date

huangapple
  • 本文由 发表于 2023年3月10日 01:38:23
  • 转载请务必保留本文链接:https://go.coder-hub.com/75688188.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定