PySpark: Split and conditional statements
Question
I am trying to create a column called "w" from the values in column "x". I split the values, and then I need conditional logic: if a value contains the "<" symbol, 0.1 should be subtracted from it; if a value contains "+", the "+" should simply be removed.
I tried the split below, but I still need to write the conditions.
Thank you for your help.
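from pyspark.sql.functions import col, split

# What I tried so far (run after the DataFrame below has been created):
# keep only the part of "x" before the "-"
dataframe = dataframe.withColumn("x", split(col("x"), "-").getItem(0))

# Sample data; the ranges are quoted so they stay strings instead of being evaluated as subtractions
data = [["1", "Amit", "DU", "I", "<25"],
        ["2", "Mohit", "DU", "I", "<25"],
        ["3", "rohith", "BHU", "I", "35-40"],
        ["4", "sridevi", "LPU", "I", "30-35"],
        ["1", "sravan", "KLMP", "M", "25-30"],
        ["5", "gnanesh", "IIT", "M", "40-45"],
        ["5", "gnadesh", "KLM", "c", "+45"]]
columns = ['ID', 'NAME', 'college', 'metric', 'x']
# Assumes an existing SparkSession named spark
dataframe = spark.createDataFrame(data, columns)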
My output is like this:
+---+-------+-------+------+--------+
| ID|   NAME|college|metric|       x|
+---+-------+-------+------+--------+
|  1|   Amit|     DU|     I|     <25|
|  2|  Mohit|     DU|     I|     <25|
|  3| rohith|    BHU|     I| 35 - 40|
|  4|sridevi|    LPU|     I| 30 - 35|
|  1| sravan|   KLMP|     M| 25 - 30|
|  5|gnanesh|    IIT|     M| 40 - 45|
|  5|gnadesh|    KLM|     c|     +45|
+---+-------+-------+------+--------+
My output should look like this:
+---+-------+-------+------+--------+----+
| ID|   NAME|college|metric|       x|   w|
+---+-------+-------+------+--------+----+
|  1|   Amit|     DU|     I|     <25|24.9|
|  2|  Mohit|     DU|     I|     <25|24.9|
|  3| rohith|    BHU|     I| 35 - 40|  35|
|  4|sridevi|    LPU|     I| 30 - 35|  30|
|  1| sravan|   KLMP|     M| 25 - 30|  25|
|  5|gnanesh|    IIT|     M| 40 - 45|  40|
|  5|gnadesh|    KLM|     c|     +45|  45|
+---+-------+-------+------+--------+----+
Answer 1
Score: 2
From what I understood, you have three conditions for the values in column x (let me know if this is not the case):
- If the value is <X, then the new column value will be X-0.1
- If the value is X-Y, then the new column value will be X
- If the value is +X, then the new column value will be X
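To make the three rules concrete, here is a minimal plain-Python sketch that applies them to single values taken from the question. The function name lower_bound is hypothetical and only for illustration, and it returns floats (so 35 comes back as 35.0); it is not the Spark expression itself, just a way to check the logic on a few strings.

```python
def lower_bound(x: str) -> float:
    """Apply the three rules to one value of column x (hypothetical helper)."""
    value = x.strip()
    if value.startswith("<"):
        # "<X": subtract 0.1 from X; round to hide floating-point noise
        return round(float(value[1:]) - 0.1, 1)
    if value.startswith("+"):
        # "+X": drop the plus sign and keep X
        return float(value[1:])
    if "-" in value:
        # "X-Y": keep only the part before the dash (float() ignores stray spaces)
        return float(value.split("-")[0])
    raise ValueError(f"unrecognised format: {x!r}")


# Sample values from the question, with and without spaces around the dash
samples = ["<25", "35 - 40", "30-35", "+45"]
results = [lower_bound(s) for s in samples]
print(results)  # [24.9, 35.0, 30.0, 45.0]
```

Checking the prefix characters first and the dash last keeps the branches from overlapping for the formats shown; values in any other shape raise an error instead of being guessed at.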
Thus this should work:
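from pyspark.sql import functions as F

# Branches are checked in order; the first matching condition supplies the value
df.withColumn(
    "NewColumn",
    F.when(F.col("x").contains("<"), F.split("x", "<").getItem(1) - 0.1)  # "<X"  -> X - 0.1
    .when(F.col("x").contains("-"), F.split("x", "-").getItem(0))         # "X-Y" -> X
    .when(F.col("x").contains("+"), F.split("x", "\\+").getItem(1)),      # "+X"  -> X ("+" is escaped because split takes a regex)
).show()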