How to select only rows where a column contains alphabetic characters in Spark?

Question

I have a Spark table where text is somehow being entered into a date field. I need to select all the affected rows via regex, but I am struggling to write the SQL for it.

The date should look like this:

2023-02-04T07:01:02Z

so technically a 'Z' is allowed. I was hoping I could do something like this:

  select * from db.table
  where  date like '%[a-zA-Y]%';

But that clearly doesn't work.

Answer 1

Score: 1

You have some options here; let's assume you have this dataset:

+------------------------+
|my_dates                |
+------------------------+
|ASD2023-02-04T07:01:02Z |
|2023-02-04T07:01:02Z    |
|2023-02-04T07:01:02ZTEST|
|TEST                    |
|20A23-B02-04CT07:01:02Z |
+------------------------+

If you want to tolerate extra characters before or after the timestamp (as in your example, where you use %), you can use an unanchored pattern:

select * from table where my_dates rlike '\\d{4}-\\d{2}-\\d{2}T\\d{2}:\\d{2}:\\d{2}Z'

which returns:

+------------------------+
|my_dates                |
+------------------------+
|ASD2023-02-04T07:01:02Z |
|2023-02-04T07:01:02Z    |
|2023-02-04T07:01:02ZTEST|
+------------------------+
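
To see why those three rows come back, here is a minimal sketch that runs the same pattern over the sample rows in plain Python. It assumes that `rlike` searches for the pattern anywhere in the string (which the output above suggests) and that Python's `re` and Spark's Java regex engine agree on these simple constructs; the `rows` list and variable names are only for illustration.

```python
import re

# Sample rows copied from the dataset above.
rows = [
    "ASD2023-02-04T07:01:02Z",
    "2023-02-04T07:01:02Z",
    "2023-02-04T07:01:02ZTEST",
    "TEST",
    "20A23-B02-04CT07:01:02Z",
]

# Same pattern as in the query, written as a raw string so each \d
# carries a single backslash. re.ASCII restricts \d to the digits 0-9.
pattern = re.compile(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z", re.ASCII)

# search() looks for the pattern anywhere in the string (unanchored),
# so leading or trailing junk does not prevent a match.
matched = [r for r in rows if pattern.search(r)]
print(matched)
```

Rows with extra text around an otherwise valid timestamp still match, which is what "tolerating" means here.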

If you want to do a strict check, you can do a full regex match (checking the beginning and ending of your string):

select * from table where my_dates rlike '^\\d{4}-\\d{2}-\\d{2}T\\d{2}:\\d{2}:\\d{2}Z$'

which returns:

+--------------------+
|my_dates            |
+--------------------+
|2023-02-04T07:01:02Z|
+--------------------+
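
Since the question asks for the affected rows rather than the clean ones, the strict check can also be negated. The sketch below shows this in plain Python on the sample rows, assuming the pattern is anchored with `^` at the start and `$` at the end; the names `strict`, `valid` and `affected` are only for illustration.

```python
import re

# Sample rows copied from the dataset above.
rows = [
    "ASD2023-02-04T07:01:02Z",
    "2023-02-04T07:01:02Z",
    "2023-02-04T07:01:02ZTEST",
    "TEST",
    "20A23-B02-04CT07:01:02Z",
]

# ^ and $ anchor the pattern to the start and end of the string,
# so nothing may precede or follow the timestamp.
strict = re.compile(r"^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z$", re.ASCII)

# Rows that are exactly a timestamp in the expected shape.
valid = [r for r in rows if strict.search(r)]

# Rows that are anything else: these are the "affected" rows.
affected = [r for r in rows if not strict.search(r)]

print(valid)
print(affected)
```

In SQL, the same idea would presumably be expressed by negating the condition, for example `where not (my_dates rlike '...')` with the anchored pattern. Note that rows where the column is null are returned by neither form of the query and may need a separate `is null` check.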

If your goal is simply to filter out the values that are not valid timestamps, you could even use:

select * from table where timestamp(my_dates) is not null

which returns:

+--------------------+
|my_dates            |
+--------------------+
|2023-02-04T07:01:02Z|
+--------------------+
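
The idea behind this last option is "try to parse, and treat a failure as null". Here is a rough Python analogue, offered only as a sketch: `parse_or_none` is a hypothetical helper, and it accepts just the single format shown in the question, whereas Spark's cast recognises more timestamp formats.

```python
from datetime import datetime
from typing import Optional

# Sample rows copied from the dataset above.
rows = [
    "ASD2023-02-04T07:01:02Z",
    "2023-02-04T07:01:02Z",
    "2023-02-04T07:01:02ZTEST",
    "TEST",
    "20A23-B02-04CT07:01:02Z",
]


def parse_or_none(value: str) -> Optional[datetime]:
    """Hypothetical helper: return the parsed timestamp, or None if parsing fails."""
    try:
        # Only the exact shape 2023-02-04T07:01:02Z is accepted here.
        return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")
    except ValueError:
        return None


# Keep the rows that parse, mirroring "timestamp(my_dates) is not null".
parseable = [r for r in rows if parse_or_none(r) is not None]
print(parseable)
```

Depending on the Spark version and its ANSI settings, casting an invalid string may raise an error instead of returning null, so it is worth testing this query on a small sample first.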

Hope this helps, good luck!

huangapple
  • Published on May 11, 2023, 04:19:48
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76222276.html