How to get the min value or a desired value from a string when the string has slashes in between
Question
I have a column whose values contain slashes, for example as given below. Wherever a string contains only numbers I need to get the minimum value, and wherever numbers and alphanumeric tokens are mixed I need to get only the alphanumeric token. This has to be done in a PySpark DataFrame.
Example input:
- 111/112
- 113/PAG
- 801/802/803/804
- 801/62S
Desired output should be
- 111
- PAG
- 801
- 62S
I have tried exploding the DataFrame column but it doesn't work. Please help me with this.
Answer 1
Score: 0
Try the `array_min` function together with the inbuilt `split` function:
- `split` -> splits the string and creates an array
- `array_min` -> gets the minimum value from the array
Example:
from pyspark.sql.functions import array_min, col, split

df = spark.createDataFrame([('111/112',), ('113/PAG',), ('801/802',), ('801/62S',)], ['ip'])
df.withColumn("ip", array_min(split(col("ip"), "/"))).show(10, False)
#+---+
#|ip |
#+---+
#|111|
#|113|
#|801|
#|62S|
#+---+
UPDATE:
from pyspark.sql.functions import *

df = spark.createDataFrame([('111/112',), ('113/PAG/PAZ',), ('801/802',), ('801/62S',)], ['ip'])
df = (df
      # cast the tokens to int; alphanumeric tokens become null
      .withColumn("temp_ip1", split(col("ip"), "/").cast("array<int>"))
      # keep the original string tokens
      .withColumn("temp_ip2", split(col("ip"), "/"))
      # drop the purely numeric tokens, leaving only the alphanumeric ones
      .withColumn("temp", array_except(col("temp_ip2"),
                                       array_except(col("temp_ip1"), array(lit(None))).cast("array<string>")))
      # prefer the alphanumeric tokens when any exist, otherwise use all tokens
      .withColumn("min_ip", array_min(when(size(col("temp")) > 0, col("temp")).otherwise(col("temp_ip2"))))
      .drop("temp_ip1", "temp_ip2", "temp"))
df.show(10, False)
#+-----------+------+
#|ip |min_ip|
#+-----------+------+
#|111/112 |111 |
#|113/PAG/PAZ|PAG |
#|801/802 |801 |
#|801/62S |62S |
#+-----------+------+
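The same selection rule (take the alphanumeric tokens when any exist, otherwise the minimum of the numeric tokens) can also be expressed as a plain Python function and, if needed, wrapped in a UDF. This is a minimal sketch, not part of the original answer; `min_token` is a hypothetical helper name, and note that it compares digit-only tokens numerically, whereas `array_min` on an array of strings compares lexicographically:

```python
def min_token(s):
    # Hypothetical helper illustrating the rule from the question:
    # tokens that are not purely numeric win; otherwise take the
    # numeric minimum of the digit-only tokens.
    tokens = s.split("/")
    non_numeric = [t for t in tokens if not t.isdigit()]
    if non_numeric:
        return min(non_numeric)              # lexicographic min of mixed tokens
    return str(min(int(t) for t in tokens))  # numeric min of digit-only tokens

print(min_token("111/112"))          # 111
print(min_token("113/PAG"))          # PAG
print(min_token("801/802/803/804"))  # 801
print(min_token("801/62S"))          # 62S
```

To apply it to a DataFrame column it could be registered with `pyspark.sql.functions.udf(min_token)`, though the pure-Spark expressions above avoid the serialization cost of a Python UDF.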