英文:
Checking with regex right format of a date on BigQuery
问题
我需要查找表的一列是否存在一些日期格式错误,所以我编写了这个查询:
SELECT *
FROM db_logs.FACT_LOG_WEBSITE
WHERE NOT REGEXP_CONTAINS(CAST(logged_in_at AS STRING), r'^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{6} UTC$')
尽管我使用了 NOT
,但这个查询仍然返回日期格式为 "2021-07-27 22:11:53.666330 UTC" 的数据,它符合正则表达式,即使我检查了正则表达式验证器,也没有问题。
英文:
I have to find for a column of a table if there are some date bad formatted, so I wrote this query:
SELECT *
FROM db_logs.FACT_LOG_WEBSITE
WHERE NOT REGEXP_CONTAINS(CAST(logged_in_at AS STRING), r'^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{6} UTC$')
This query still return date like "2021-07-27 22:11:53.666330 UTC" that respect the regex even if I'm using NOT
.
I also check on the regex validator to understand if it's a problem of regex but it's working good.
答案1
得分: 0
我能理解你的困惑 - 你有一些时间戳,看着它你正在创建正则表达式,但这将解释给你真正发生了什么以及对你有效的是什么。
SELECT logged_in_at as logged_in_at_as_timestamp, cast(logged_in_at as string) logged_in_at_as_string, length(cast(logged_in_at as string)) as length_of_date_string,
REGEXP_CONTAINS(CAST(logged_in_at AS STRING), r'^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}.\d{6} UTC$') as existing_regex,
REGEXP_CONTAINS(CAST(logged_in_at AS STRING), r'^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}.\d{6}$') as new_regex,
safe_cast(logged_in_at as DATE) as parsed_logged_in_at_date,
safe_cast(bad_logged_in_at as DATE) as parsed_bad_logged_in_at_date
FROM (select current_timestamp() as logged_in_at, current_timestamp()||"00" as bad_logged_in_at)
你可以检查 safe_cast 验证以确保你的时间戳字符串是真实的日期,或者你可以改变你的正则表达式,或者你可以做一些其他的事情。但如果你看到前两列,你就会明白为什么你的正则表达式失败了。
英文:
I can see your confusion - you have some timestamps and looking at it you are creating the regex, but this will explain you what really happens and what will work for you.
SELECT logged_in_at as logged_in_at_as_timestamp, cast(logged_in_at as string) logged_in_at_as_string, length(cast(logged_in_at as string)) as length_of_date_string,
REGEXP_CONTAINS(CAST(logged_in_at AS STRING), r'^\d{4}\-\d{2}\-\d{2} \d{2}\:\d{2}\:\d{2}\.\d{6} UTC$') as existing_regex,
REGEXP_CONTAINS(CAST(logged_in_at AS STRING), r'^\d{4}\-\d{2}\-\d{2} \d{2}\:\d{2}\:\d{2}\.\d{6}') as new_regex,
safe_cast(logged_in_at as DATE) as parsed_logged_in_at_date,
safe_cast(bad_logged_in_at as DATE) as parsed_bad_logged_in_at_date
FROM (select current_timestamp() as logged_in_at, current_timestamp()||"00" as bad_logged_in_at)
You can check either safe_cast validation to make sure your timestamp string is real date or not Or you can change your regex, or you can do some other thing. But if you see the first 2 columns you will see why your regex was failing.
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论