使用正则表达式检查BigQuery上日期的正确格式。

huangapple go评论59阅读模式
英文:

Checking with regex right format of a date on BigQuery

问题

我需要查找表的一列是否存在一些日期格式错误,所以我编写了这个查询:

SELECT *
FROM db_logs.FACT_LOG_WEBSITE
WHERE NOT REGEXP_CONTAINS(CAST(logged_in_at AS STRING), r'^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{6} UTC$')

尽管我使用了 NOT,但这个查询仍然返回日期格式为 "2021-07-27 22:11:53.666330 UTC" 的数据,它符合正则表达式,即使我检查了正则表达式验证器,也没有问题。

英文:

I have to find for a column of a table if there are some date bad formatted, so I wrote this query:

SELECT *
FROM db_logs.FACT_LOG_WEBSITE
WHERE NOT REGEXP_CONTAINS(CAST(logged_in_at AS STRING), r'^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{6} UTC$')

This query still return date like "2021-07-27 22:11:53.666330 UTC" that respect the regex even if I'm using NOT.

I also check on the regex validator to understand if it's a problem of regex but it's working good.

答案1

得分: 0

我能理解你的困惑 - 你有一些时间戳,看着它你正在创建正则表达式,但这将解释给你真正发生了什么以及对你有效的是什么。

SELECT logged_in_at as logged_in_at_as_timestamp, cast(logged_in_at as string) logged_in_at_as_string, length(cast(logged_in_at as string)) as length_of_date_string,
REGEXP_CONTAINS(CAST(logged_in_at AS STRING), r'^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}.\d{6} UTC$') as existing_regex,
REGEXP_CONTAINS(CAST(logged_in_at AS STRING), r'^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}.\d{6}$') as new_regex,
safe_cast(logged_in_at as DATE) as parsed_logged_in_at_date,
safe_cast(bad_logged_in_at as DATE) as parsed_bad_logged_in_at_date
FROM (select current_timestamp() as logged_in_at, current_timestamp()||"00" as bad_logged_in_at)

你可以检查 safe_cast 验证以确保你的时间戳字符串是真实的日期,或者你可以改变你的正则表达式,或者你可以做一些其他的事情。但如果你看到前两列,你就会明白为什么你的正则表达式失败了。

英文:

I can see your confusion - you have some timestamps and looking at it you are creating the regex, but this will explain you what really happens and what will work for you.

SELECT logged_in_at as logged_in_at_as_timestamp, cast(logged_in_at as string) logged_in_at_as_string, length(cast(logged_in_at as string)) as length_of_date_string, 
REGEXP_CONTAINS(CAST(logged_in_at AS STRING), r'^\d{4}\-\d{2}\-\d{2} \d{2}\:\d{2}\:\d{2}\.\d{6} UTC$') as existing_regex, 
REGEXP_CONTAINS(CAST(logged_in_at AS STRING), r'^\d{4}\-\d{2}\-\d{2} \d{2}\:\d{2}\:\d{2}\.\d{6}') as new_regex,
safe_cast(logged_in_at as DATE) as parsed_logged_in_at_date,
safe_cast(bad_logged_in_at as DATE) as parsed_bad_logged_in_at_date
 FROM (select current_timestamp() as logged_in_at, current_timestamp()||"00" as bad_logged_in_at)  

使用正则表达式检查BigQuery上日期的正确格式。

You can check either safe_cast validation to make sure your timestamp string is real date or not Or you can change your regex, or you can do some other thing. But if you see the first 2 columns you will see why your regex was failing.

huangapple
  • 本文由 发表于 2023年7月17日 21:16:48
  • 转载请务必保留本文链接:https://go.coder-hub.com/76704869.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定