February 10, 2023 16:13:01


Pyspark: Split and conditional statements

Question


I am trying to create a column called "w" from the values in column "x". After splitting the values, I need to apply conditions: if a value contains the "<" symbol, 0.1 should be subtracted from it; if a value contains "+", only the "+" should be removed.

I have written the split below, but I still need to write the conditions.

Thank you for your help.


from pyspark.sql.functions import col, split

# My attempt so far: keep only the part before the "-"
dataframe = dataframe.withColumn("x", split(col("x"), "-").getItem(0))

# Sample data (every range is a string)
data = [["1", "Amit", "DU", "I", "<25"],
        ["2", "Mohit", "DU", "I", "<25"],
        ["3", "rohith", "BHU", "I", "35-40"],
        ["4", "sridevi", "LPU", "I", "30-35"],
        ["1", "sravan", "KLMP", "M", "25-30"],
        ["5", "gnanesh", "IIT", "M", "40-45"],
        ["5", "gnadesh", "KLM", "c", "+45"]]

# Column names
columns = ['ID', 'NAME', 'college', 'metric', 'x']

dataframe = spark.createDataFrame(data, columns)

My current output looks like this:

+---+-------+-------+------+--------+
| ID|   NAME|college|metric|       x|
+---+-------+-------+------+--------+
|  1|   Amit|     DU|     I|     <25|
|  2|  Mohit|     DU|     I|     <25|
|  3| rohith|    BHU|     I| 35 - 40|
|  4|sridevi|    LPU|     I| 30 - 35|
|  1| sravan|   KLMP|     M| 25 - 30|
|  5|gnanesh|    IIT|     M| 40 - 45|
|  5|gnadesh|    KLM|     c|     +45|
+---+-------+-------+------+--------+

My output should look like this:

+---+-------+-------+------+--------+----+
| ID|   NAME|college|metric|       x|   w|
+---+-------+-------+------+--------+----+
|  1|   Amit|     DU|     I|     <25|24.9|
|  2|  Mohit|     DU|     I|     <25|24.9|
|  3| rohith|    BHU|     I| 35 - 40|  35|
|  4|sridevi|    LPU|     I| 30 - 35|  30|
|  1| sravan|   KLMP|     M| 25 - 30|  25|
|  5|gnanesh|    IIT|     M| 40 - 45|  40|
|  5|gnadesh|    KLM|     c|     +45|  45|
+---+-------+-------+------+--------+----+

Answer 1

Score: 2


From what I understand, there are three conditions for the values in column x (let me know if this is not the case):

If the value is <X, the new column value is X - 0.1.
If the value is X-Y, the new column value is X.
If the value is +X, the new column value is X.

So this should work:

from pyspark.sql import functions as F

# "<25" -> 24.9; "35-40" -> "35"; "+45" -> "45" ("+" is escaped because split takes a regex)
df.withColumn(
    "NewColumn",
    F.when(F.col("x").contains("<"), F.split("x", "<").getItem(1) - 0.1)
    .when(F.col("x").contains("-"), F.split("x", "-").getItem(0))
    .when(F.col("x").contains("+"), F.split("x", "\\+").getItem(1)),
).show()
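
As a quick sanity check on the three rules, here is a minimal plain-Python sketch. It is not part of the original answer and does not need Spark. The helper name `lower_bound` is hypothetical, and two choices in it are assumptions: it returns floats for every branch, and it strips spaces so that both "35-40" and "35 - 40" are accepted.

```python
def lower_bound(x: str) -> float:
    """Hypothetical helper mirroring the three rules for column x."""
    value = x.replace(" ", "")  # assumption: tolerate "35 - 40" as well as "35-40"
    if value.startswith("<"):
        # "<X" -> X - 0.1 (rounded to one decimal to avoid float noise)
        return round(float(value[1:]) - 0.1, 1)
    if value.startswith("+"):
        # "+X" -> X
        return float(value[1:])
    if "-" in value:
        # "X-Y" -> X
        return float(value.split("-")[0])
    # assumption: a plain number passes through unchanged
    return float(value)


samples = ["<25", "35 - 40", "30-35", "+45"]
results = [lower_bound(s) for s in samples]
print(results)  # [24.9, 35.0, 30.0, 45.0]
```

The values it prints can be compared against the "w" column in the expected output above, which may help when checking the Spark expression on a small sample.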
