英文:
BigQuery splitting a row into multiple columns
问题
我有一个存储日志的表,它将所有日志存储在一行中。我已经对其进行了展开操作,一个示例行如下:
TestObserver(2) TestFragment(1) TestView(1) TestNotifications(2) TestActivity(1) 等等。
我想将TestObserver、TestFragment等内容转换为列,将其值(2)、(1)等转换为行。这些测试日志有很多,它们的数量不是固定的。它们之间没有分隔符。我还想在每一行中对每个TestObserver、TestFragment等的值进行求和。
我该如何实现这一目标?
我尝试过修剪和分割,但结果不符合预期。
英文:
I have a table that stores logs into one row. I have unnested it and one example row looks like this:
TestObserver(2) TestFragment(1) TestView(1) TestNotifications(2) TestActivity(1) and so on.
I would like to conveert TestObserver, TestFragment and so on into columns and its values (2), (1), etc into its rows. There are plenty of these Test logs, it is not a fixated number of them. There is no delimiter between them. I would also like to SUM each TestObserver, TestFragment and so on values found in each row.
How can I achieve this?
I tried trimming and splitting but the result is not what it is expected.
答案1
得分: 0
假设在您的日志中没有一致的分隔符(甚至没有空格)分隔术语,我们仍然可以使用 REGEXP_EXTRACT_ALL() 函数和 BigQuery 中的 UNNEST 运算符来获得您想要的结果。
步骤 1. 使用 regex_replace() 将日志中的所有空格移除。
步骤 2. 使用 REGEXP_EXTRACT_ALL() 将每个(术语(数量))标记提取到一个数组中。
步骤 3. 对数组进行 UNNEST 操作,使每行只包含一个(术语(数量))标记。
步骤 4. 使用 regex_extract() 分别提取术语和数量。请注意,我们需要使用 负向先行正则表达式(?= 和 ?<=) 来获取我们需要的确切数据,用于术语和数量。
步骤 5. 按术语分组并对数量进行求和以获得结果。
请查看以下示例代码:
WITH log_extracted AS (
SELECT
REGEXP_EXTRACT_ALL(
REGEXP_REPLACE(
'TestObserver(2) TestFragment(1) TestView(1) TestNotifications(2) TestActivity(1)',
'_',
''
),
'\w+\(\d+\)'
) AS log_tokens
)
SELECT
REGEXP_EXTRACT(log_token, '(\w+)(?=\(\d+\))') AS term,
SUM(CAST(REGEXP_EXTRACT(log_token, '(?<=\w\()\d+') AS INT64)) AS quantity
FROM log_extracted
CROSS JOIN UNNEST(log_tokens) AS log_token
GROUP BY
1
示例输出:
term | quantity |
---|---|
TestFragment | 1 |
TestView | 1 |
TestObserver | 2 |
TestNotifications | 2 |
TestActivity | 1 |
英文:
Assume there is no consistent delimiter(not even a space) between terms in your log, we can still get your desired results by using REGEXP_EXTRACT_ALL() function and UNNEST operator in BigQuery.
Step 1. Use regex_replace() to move all spaces in the log
step 2. Use REGEXP_EXTRACT_ALL() to fetch each (term(quantity)) token into an array.
Step 3. UNNEST the array to one (term(quantity)) token per row
Step 4. Use regex_extract() to fetch term and quantity separately. Please, we need to use negative lookaround regex(?= and ?<=) to get the exact data we need for term and quantity.
Step 5. Group by term and sum() on quantity to get the result.
See the following code as example:
WITH log_extracted AS (
SELECT
REGEXP_EXTRACT_ALL(
REGEXP_REPLACE(
'TestObserver(2) TestFragment(1) TestView(1) TestNotifications(2) TestActivity(1)',
'_',
''
),
'\w+\(\d+\)'
) AS log_tokens
)
SELECT
REGEXP_EXTRACT(log_token, '(\w+)(?=\(\d+\))') AS term,
SUM(CAST(REGEXP_EXTRACT(log_token, '(?<=\w\()\d+') AS INT64)) AS quantity
FROM log_extracted
CROSS JOIN UNNEST(log_tokens) AS log_token
GROUP BY
1
example output:
term | quantity |
---|---|
TestFragment | 1 |
TestView | 1 |
TestObserver | 2 |
TestNotifications | 2 |
TestActivity | 1 |
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论