BigQuery将一行拆分为多个列。

huangapple go评论67阅读模式
英文:

BigQuery splitting a row into multiple columns

问题

我有一个存储日志的表,它将所有日志存储在一行中。我已经对其进行了展开操作,一个示例行如下:

TestObserver(2) TestFragment(1) TestView(1) TestNotifications(2) TestActivity(1) 等等。

我想将TestObserver、TestFragment等内容转换为列,将其值(2)、(1)等转换为行。这些测试日志有很多,它们的数量不是固定的。它们之间没有分隔符。我还想在每一行中对每个TestObserver、TestFragment等的值进行求和。

我该如何实现这一目标?

我尝试过修剪和分割,但结果不符合预期。

英文:

I have a table that stores logs into one row. I have unnested it and one example row looks like this:

TestObserver(2) TestFragment(1) TestView(1) TestNotifications(2) TestActivity(1) and so on.

I would like to conveert TestObserver, TestFragment and so on into columns and its values (2), (1), etc into its rows. There are plenty of these Test logs, it is not a fixated number of them. There is no delimiter between them. I would also like to SUM each TestObserver, TestFragment and so on values found in each row.

How can I achieve this?

I tried trimming and splitting but the result is not what it is expected.

答案1

得分: 0

假设在您的日志中没有一致的分隔符(甚至没有空格)分隔术语,我们仍然可以使用 REGEXP_EXTRACT_ALL() 函数和 BigQuery 中的 UNNEST 运算符来获得您想要的结果。

步骤 1. 使用 regex_replace() 将日志中的所有空格移除。

步骤 2. 使用 REGEXP_EXTRACT_ALL() 将每个(术语(数量))标记提取到一个数组中。

步骤 3. 对数组进行 UNNEST 操作,使每行只包含一个(术语(数量))标记。

步骤 4. 使用 regex_extract() 分别提取术语和数量。请注意,我们需要使用 负向先行正则表达式(?= 和 ?<=) 来获取我们需要的确切数据,用于术语和数量。

步骤 5. 按术语分组并对数量进行求和以获得结果。

请查看以下示例代码:

WITH log_extracted AS (
  SELECT
    REGEXP_EXTRACT_ALL(
      REGEXP_REPLACE(
        'TestObserver(2) TestFragment(1) TestView(1) TestNotifications(2) TestActivity(1)',
        '_',
        ''
      ),
      '\w+\(\d+\)'
    ) AS log_tokens
)
SELECT
  REGEXP_EXTRACT(log_token, '(\w+)(?=\(\d+\))') AS term,
  SUM(CAST(REGEXP_EXTRACT(log_token, '(?<=\w\()\d+') AS INT64)) AS quantity
FROM log_extracted
CROSS JOIN UNNEST(log_tokens) AS log_token
GROUP BY
  1

示例输出:

term quantity
TestFragment 1
TestView 1
TestObserver 2
TestNotifications 2
TestActivity 1
英文:

Assume there is no consistent delimiter(not even a space) between terms in your log, we can still get your desired results by using REGEXP_EXTRACT_ALL() function and UNNEST operator in BigQuery.

Step 1. Use regex_replace() to move all spaces in the log

step 2. Use REGEXP_EXTRACT_ALL() to fetch each (term(quantity)) token into an array.

Step 3. UNNEST the array to one (term(quantity)) token per row

Step 4. Use regex_extract() to fetch term and quantity separately. Please, we need to use negative lookaround regex(?= and ?<=) to get the exact data we need for term and quantity.

Step 5. Group by term and sum() on quantity to get the result.

See the following code as example:

WITH log_extracted AS (
  SELECT
    REGEXP_EXTRACT_ALL(
      REGEXP_REPLACE(
        &#39;TestObserver(2) TestFragment(1) TestView(1) TestNotifications(2) TestActivity(1)&#39;,
        &#39;_&#39;,
        &#39;&#39;
      ),
      &#39;\w+\(\d+\)&#39;
    ) AS log_tokens
)
SELECT
  REGEXP_EXTRACT(log_token, &#39;(\w+)(?=\(\d+\))&#39;) AS term,
  SUM(CAST(REGEXP_EXTRACT(log_token, &#39;(?&lt;=\w\()\d+&#39;) AS INT64)) AS quantity
FROM log_extracted
CROSS JOIN UNNEST(log_tokens) AS log_token
GROUP BY
  1

example output:

term quantity
TestFragment 1
TestView 1
TestObserver 2
TestNotifications 2
TestActivity 1

huangapple
  • 本文由 发表于 2023年2月23日 19:53:37
  • 转载请务必保留本文链接:https://go.coder-hub.com/75544482.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定