How to sort unbound list of PySpark columns by name?

Question

This seems like it should be pretty simple, but I'm stumped for some reason. I have a list of PySpark columns that I would like to sort by name (including aliasing, as that will be how they are displayed/written to disk). Here are some example tests and things I've tried:

def test_col_sorting():
    from pyspark.sql import SparkSession
    import pyspark.sql.functions as f

    # Active spark context needed
    spark = SparkSession.builder.getOrCreate()

    # Data to sort
    cols = [f.col('c'), f.col('a'), f.col('b').alias('z')]

    # Attempt 1
    result = sorted(cols)
    # This fails with ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.

    # Attempt 2
    result = sorted(cols, key=lambda x: x.name())
    # Fails for the same reason: `name()` returns a Column object, not a string

    # Assertion I want to hold true:
    assert result == [f.col('a'), f.col('c'), f.col('b').alias('z')]

Is there any reasonable way to actually get the string back out of the Column object that was used to initialize it (but also respecting aliasing)? If I could get this from the object I could use it as a key.

Note that I am NOT looking to sort the columns on a DataFrame, as answered in this question: https://stackoverflow.com/questions/42912156/python-pyspark-data-frame-rearrange-columns. These Column objects are not bound to any DataFrame. I also do not want to sort the column based on the values of the column.

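For context, a minimal sketch (not part of the original post) of why Attempt 1 raises that ValueError: comparing two unbound Columns just builds another Column expression, and sorted() then tries to coerce that expression to a bool, which PySpark refuses to do.

from pyspark.sql import SparkSession
import pyspark.sql.functions as f

spark = SparkSession.builder.getOrCreate()  # f.col() needs an active session

expr = f.col('a') < f.col('c')
print(type(expr))  # <class 'pyspark.sql.column.Column'> -- an expression, not a bool
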
Answer 1

Score: 1

Answering my own question: it seems that you can't do this without some amount of parsing from the column string representation. You also don't need regex to handle this. These two methods should take care of it:

from typing import List

from pyspark.sql import Column


def get_column_name(col: Column) -> str:
    """
    PySpark doesn't allow you to directly access the column name with respect to aliases
    from an unbound column. We have to parse this out from the string representation.

    This works on columns with one or more aliases as well as unaliased columns.

    Returns:
        Col name as str, with respect to aliasing
    """
    c = str(col).lstrip("Column<'").rstrip("'>")
    return c.split(' AS ')[-1]


def sorted_columns(cols: List[Column]) -> List[Column]:
    """
    Returns a sorted list of columns, with respect to aliases
    Args:
        cols: List of PySpark Columns (e.g., [f.col('a'), f.col('b').alias('c'), ...])

    Returns:
        Sorted list of PySpark Columns by name, with respect to aliasing
    """
    return sorted(cols, key=lambda x: get_column_name(x))

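As a quick sanity check against the columns from the question (a usage sketch added here, assuming the Column<'b AS z'> repr format of recent PySpark versions):

import pyspark.sql.functions as f
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # f.col() needs an active session

cols = [f.col('c'), f.col('a'), f.col('b').alias('z')]
print([get_column_name(c) for c in cols])      # ['c', 'a', 'z']
print([str(c) for c in sorted_columns(cols)])  # ["Column<'a'>", "Column<'c'>", "Column<'b AS z'>"]
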
Some tests to validate behavior:

import pytest
import pyspark.sql.functions as f
from pyspark.sql import SparkSession

@pytest.fixture(scope="session")
def spark() -> SparkSession:
    # Provide a session spark fixture for all tests
    yield SparkSession.builder.getOrCreate()

def test_get_col_name(spark: SparkSession):
    col = f.col('a')
    actual = get_column_name(col)
    assert actual == 'a'


def test_get_col_name_alias(spark: SparkSession):
    col = f.col('a').alias('b')
    actual = get_column_name(col)
    assert actual == 'b'


def test_get_col_name_multiple_alias(spark: SparkSession):
    col = f.col('a').alias('b').alias('c')
    actual = get_column_name(col)
    assert actual == 'c'


def test_sorted_columns(spark: SparkSession):
    cols = [f.col('z').alias('c'), f.col('a'), f.col('d').alias('e').alias('f'), f.col('b')]
    actual = sorted_columns(cols)
    expected = [f.col('a'), f.col('b'), f.col('z').alias('c'), f.col('d').alias('e').alias('f')]

    # We can't directly compare lists of cols, so we zip and check the repr of each element
    for a, b in zip(actual, expected):
        assert str(a) == str(b)

I think it's fair to say that being unable to access this information in a first-class way is a shortcoming of the PySpark API. There are plenty of valid reasons to want to know what name an unbound Column would resolve to, and it shouldn't have to be parsed out in such a hacky way.

Answer 2

Score: 0

If you're only interested in grabbing the column names and sorting those (without any relation to any data), you can use the column object's __repr__ method and a regex to extract the actual name of your column.

So for these columns:

import pyspark.sql.functions as f
cols = [f.col('c'), f.col('a'), f.col('b').alias('z')]

You could do this:

import re

# Making a list of string representation of our columns
col_repr = [x.__repr__() for x in cols]
["Column<'c'>", "Column<'a'>", "Column<'b AS z'>"]

# Using regex to extract the interesting part of the column name
# while making sure we're properly grabbing the alias name. Notice
# that we're grabbing the right part of the column name in `b AS z`
col_names = [re.search('([a-zA-Z])\'>', x).group(1) for x in col_repr]
['c', 'a', 'z']

# Sorting this array
sorted_col_names = sorted(col_names)
['a', 'c', 'z']

NOTE: This example is simple (only accepting lowercase and uppercase letters as column names) but as your column names get more complex, it's just a question of adapting your regex pattern.

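To illustrate that adjustment, here is a sketch (not from the original answer; repr_to_name is a hypothetical helper, and it assumes the Column<'... AS alias'> repr format of recent PySpark versions) that captures the whole quoted expression and keeps the text after the last ` AS `:

import re

def repr_to_name(col_repr: str) -> str:
    # Capture everything between Column<' and '>, then keep the final alias if one is present
    match = re.search(r"Column<'(.+)'>", col_repr)
    expr = match.group(1) if match else col_repr
    return expr.split(' AS ')[-1]

print(repr_to_name("Column<'b AS z'>"))       # z
print(repr_to_name("Column<'my_column_1'>"))  # my_column_1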