How to select only rows where a column contains alpha characters in Spark?
Question
I have a Spark table where text is somehow being entered into a date field. I need to select all the affected rows via regex, but I am struggling to write the SQL for it.
The date should look like this:
2023-02-04T07:01:02Z
so technically a 'Z' is allowed. I was hoping I could do something like this:
select * from db.table
where date like '%[a-zA-Y]%';
But that clearly doesn't work.
Answer 1
Score: 1
You have some options here; let's assume you have this dataset:
+------------------------+
|my_dates                |
+------------------------+
|ASD2023-02-04T07:01:02Z |
|2023-02-04T07:01:02Z    |
|2023-02-04T07:01:02ZTEST|
|TEST                    |
|20A23-B02-04CT07:01:02Z |
+------------------------+
If you want to tolerate extra characters around the date (as in your example, where you are using '%'), you can use:
select * from table where my_dates rlike '\\d{4}-\\d{2}-\\d{2}T\\d{2}:\\d{2}:\\d{2}Z'
which returns:
+------------------------+
|my_dates                |
+------------------------+
|ASD2023-02-04T07:01:02Z |
|2023-02-04T07:01:02Z    |
|2023-02-04T07:01:02ZTEST|
+------------------------+
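The first and third rows pass because rlike succeeds as soon as the pattern is found anywhere in the value. A minimal sketch of that behavior in plain Python, using the sample rows above: re.search is only a stand-in for Spark's rlike (Spark uses Java regular expressions, which should treat this simple pattern the same way), and the names rows, PATTERN and tolerant are illustrative.

```python
import re

# Sample values from the table above (stand-in for the Spark column).
rows = [
    "ASD2023-02-04T07:01:02Z",
    "2023-02-04T07:01:02Z",
    "2023-02-04T07:01:02ZTEST",
    "TEST",
    "20A23-B02-04CT07:01:02Z",
]

# Same pattern as the rlike query; a raw string avoids the doubled
# backslashes that the SQL string literal needs.
PATTERN = r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z"

# re.search looks for the pattern anywhere in the string,
# which is how an unanchored rlike behaves.
tolerant = [value for value in rows if re.search(PATTERN, value)]
print(tolerant)
```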
If you want to do a strict check, you can do a full regex match (checking the beginning and ending of your string):
select * from table where my_dates rlike '^\\d{4}-\\d{2}-\\d{2}T\\d{2}:\\d{2}:\\d{2}Z$'
which returns:
+--------------------+
|my_dates            |
+--------------------+
|2023-02-04T07:01:02Z|
+--------------------+
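Since the question asks for the affected rows rather than the clean ones, the anchored check can also be inverted; in Spark SQL that would be a not rlike with a pattern anchored by ^ and $. The sketch below shows the idea in plain Python under the same assumptions as before: re.search stands in for rlike, and the names STRICT, valid and affected are illustrative.

```python
import re

# Sample values from the table above (stand-in for the Spark column).
rows = [
    "ASD2023-02-04T07:01:02Z",
    "2023-02-04T07:01:02Z",
    "2023-02-04T07:01:02ZTEST",
    "TEST",
    "20A23-B02-04CT07:01:02Z",
]

# ^ and $ force the whole value to be a timestamp, with nothing before or after.
STRICT = r"^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z$"

# Rows that look like a well-formed timestamp.
valid = [value for value in rows if re.search(STRICT, value)]

# Inverting the test selects the rows that need fixing.
affected = [value for value in rows if not re.search(STRICT, value)]

print(valid)
print(affected)
```

Rows whose value is null would need a separate is null check, because a regex comparison against null is neither true nor false.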
If your goal is simply to filter out the values which are not a valid time, you could even use:
select * from table where timestamp(my_dates) is not null
which returns:
+--------------------+
|my_dates            |
+--------------------+
|2023-02-04T07:01:02Z|
+--------------------+
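One reason to prefer parsing over a digit pattern is that \d{2} accepts any two digits, so a value with month 13 still has the right shape. A rough illustration in plain Python, assuming the exact layout from the question: datetime.strptime is only a stand-in for Spark's timestamp conversion (which accepts more input formats than this single one), and is_valid and bad_month are hypothetical names.

```python
import re
from datetime import datetime

# Shape-only check: four digits, two digits, and so on.
STRICT = r"^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z$"


def is_valid(value: str) -> bool:
    """Return True if value parses as YYYY-MM-DDTHH:MM:SSZ (hypothetical helper)."""
    try:
        datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")
        return True
    except ValueError:
        return False


bad_month = "2023-13-04T07:01:02Z"

# The regex only checks the shape, so month 13 slips through.
print(bool(re.search(STRICT, bad_month)))

# Parsing checks the calendar fields as well and rejects the value.
print(is_valid(bad_month))
print(is_valid("2023-02-04T07:01:02Z"))
```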
Hope this helps, good luck!