How to access a DataFrame column in PySpark and do a string comparison?

Question

I have a Python function that returns True/False depending on the value of a DataFrame column.

def check_name(df):
    if df.name == "ABC":
        return df.Value < 0.80

    return df.Value == 0

And I pass this function into my query as myFunction:

def myQuery(myFunction):
    df.filter(...).groupBy(...).withColumn('Result', when(myFunction(df), 0).otherwise(1))

But it fails with:

Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.

I think the problem is the df.name == "ABC" comparison.

I have tried changing it to F.col('name') == "ABC", but I get the same error.

Can you please tell me how to fix my issue?

Answer 1

Score: 0

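The if/else logic has to be written as a Spark column expression (when(...).otherwise(...)). A comparison such as df.name == "ABC" produces a Column that is evaluated per row, not a Python bool, so a plain Python if cannot branch on it.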

from pyspark.sql import functions as F

def check_name(df):
    # Column names changed to match the question (name, Value)
    return F.when(df.name == "ABC", df.Value < 0.80).otherwise(df.Value == 0)
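
To illustrate why the original if statement raises the error, here is a minimal pure-Python sketch. ToyColumn is a hypothetical stand-in written for this example, not PySpark's actual Column class; it only mimics the two behaviours that matter here: a comparison returns another expression object, and asking for its truth value raises.

```python
class ToyColumn:
    """Hypothetical stand-in for a lazy column expression (not PySpark's real class)."""

    def __init__(self, expr):
        self.expr = expr

    def __eq__(self, other):
        # The comparison builds a new expression instead of returning True/False.
        return ToyColumn(f"({self.expr} = {other!r})")

    def __bool__(self):
        # A whole column of rows has no single truth value.
        raise ValueError("Cannot convert column into bool")


cond = ToyColumn("name") == "ABC"
print(cond.expr)  # an expression object, not a bool

error_message = None
try:
    if cond:  # Python calls bool(cond) here, which raises
        pass
except ValueError as exc:
    error_message = str(exc)
    print(error_message)
```

Switching from df.name to F.col('name') does not help for the same reason: both produce a column expression, and the Python if still tries to convert it to a bool.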

Then, if myFunction returns a boolean column and you want to invert it (true = 0, false = 1), you can simplify myQuery to:

def myQuery(myFunction):
    # Parenthesise the negation: ~ must apply to the boolean column before the cast
    return (df.filter(...)
            .groupBy(...)
            .withColumn('Result', (~myFunction(df)).cast('int')))

huangapple
  • Posted on June 8, 2023, 05:08:02
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76427121.html