May 10, 2023, 23:31:33

Creating a new column in polars by applying a function to a column

Question

I have the following code for manipulating a polars dataframe, but it does not work:

import xml.etree.ElementTree as ET  # assumed import; the original snippet omits it
import polars as pl

# create a sample dataframe
dfpl = pl.DataFrame({
    'A': [1, 2, 3],
    'B': ['<p>some text</p><p>bla</p>', '<p>some text<p><p>foo</p>', '<p>some text<p>']
})

def func(mystring):
    return mystring*2

def func2(xml_string):
    root = ET.fromstring(xml_string)
    text_list = []
    for elem in root.iter():
        text = elem.text.strip() if elem.text else ''
        text_list.append(text)
    return text_list

# create a sample series to add as a new column
dfpl=dfpl.with_columns([(pl.col("A").map(lambda x: func(x)).alias('new_col'))])
dfpl=dfpl.with_columns([(pl.col("B").map(lambda x: func2(x)).alias('new_col2'))])

print(dfpl)

The first with_columns call, which adds new_col, works, but the second one does not.

The error that I get is:
ComputeError: TypeError: a bytes-like object is required, not 'Series'

Basically, my use case is that a column contains XML strings, and I need to parse each one into an XML object and extract information from it.

How can I proceed?

Answer 1

Score: 2

Your first example works because *2 is vectorized: it operates on a whole Series at once.

For example, if you run

func(pl.Series([1,2,3,4,5]))

then you get back a Series with every value of the original multiplied by 2.

Your func2 isn't vectorized. To use map, your function needs to operate on the entire column and return something like a Series.

For instance:

from lxml import etree as ET
def func2_series(xml_strings):
    ret_List=[]
    for xml_string in xml_strings:
        root = ET.fromstring(xml_string, ET.XMLParser(recover=True))
        text_list = []
        for elem in root.iter():
            text = elem.text.strip() if elem.text else ''
            text_list.append(text)
        ret_List.append(text_list)
    return pl.Series(ret_List)

followed by

dfpl.with_columns(pl.col("B").map(func2_series).alias('new_col2'))

will work.

Alternatively, if you have

def func2(xml_string):
    root = ET.fromstring(xml_string, ET.XMLParser(recover=True))
    text_list = []
    for elem in root.iter():
        text = elem.text.strip() if elem.text else ''
        text_list.append(text)
    return text_list

then you can use apply, and polars will do the looping for you:

dfpl.with_columns(pl.col("B").apply(func2))

By the way, you don't need a lambda if the function you're passing already accepts exactly the argument you have. In other words, where you have .map(lambda x: func2(x)) you can just write .map(func2). A lambda is only needed when you have to transform the arguments.
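
The same point can be shown without polars, using Python's built-in map. This is only a sketch; double and scale are hypothetical helper functions.

```python
def double(x):
    return x * 2


def scale(x, factor):
    return x * factor


values = [1, 2, 3]

# The lambda only forwards its argument, so it adds nothing.
with_lambda = list(map(lambda x: double(x), values))
direct = list(map(double, values))

# Here the lambda earns its place: it adapts the call by fixing factor=3.
tripled = list(map(lambda x: scale(x, 3), values))

print(with_lambda, direct, tripled)
```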

Answer 2

Score: 1

As stated in the comments, use .apply instead of .map. Also, if you only want a list of strings, I recommend BeautifulSoup's .stripped_strings generator:

import polars as pl
from bs4 import BeautifulSoup

# create a sample dataframe
dfpl = pl.DataFrame({
    'A': [1, 2, 3],
    'B': ['<p>some text</p><p>bla</p>', '<p>some text<p><p>foo</p>', '<p>some text<p>']
})

def func(mystring):
    return mystring*2

def func2(xml_string):
    soup = BeautifulSoup(xml_string, 'html.parser')
    return list(soup.stripped_strings)

# create a sample series to add as a new column
dfpl=dfpl.with_columns([(pl.col("A").apply(lambda x: func(x)).alias('new_col'))])
dfpl=dfpl.with_columns([(pl.col("B").apply(lambda x: func2(x)).alias('new_col2'))])

print(dfpl)

Prints:

shape: (3, 4)
┌─────┬────────────────────────────┬─────────┬──────────────────────┐
│ A   ┆ B                          ┆ new_col ┆ new_col2             │
│ --- ┆ ---                        ┆ ---     ┆ ---                  │
│ i64 ┆ str                        ┆ i64     ┆ list[str]            │
╞═════╪════════════════════════════╪═════════╪══════════════════════╡
│ 1   ┆ <p>some text</p><p>bla</p> ┆ 2       ┆ ["some text", "bla"] │
│ 2   ┆ <p>some text<p><p>foo</p>  ┆ 4       ┆ ["some text", "foo"] │
│ 3   ┆ <p>some text<p>            ┆ 6       ┆ ["some text"]        │
└─────┴────────────────────────────┴─────────┴──────────────────────┘
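
If adding the bs4 dependency is not an option, a rough approximation can be sketched with the standard library's html.parser. This is a swapped-in technique, not what the answer uses, and it will not match BeautifulSoup in every case (for example, it also collects text inside script and style tags). TextCollector and stripped_texts are hypothetical names.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect the non-empty, stripped text chunks found between tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(text)


def stripped_texts(html_string):
    parser = TextCollector()
    parser.feed(html_string)
    parser.close()  # flush any text still buffered at the end of the input
    return parser.chunks


samples = ["<p>some text</p><p>bla</p>", "<p>some text<p><p>foo</p>", "<p>some text<p>"]
results = [stripped_texts(s) for s in samples]
print(results)
```

Because stripped_texts takes one string and returns a list of strings, it could be passed to .apply in place of func2.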
