Creating a new column in Polars by applying a function to a column
Question
I have the following code for manipulating a Polars DataFrame, and it does not work:

    # ET was not imported in the original snippet; the standard-library ElementTree is assumed
    import xml.etree.ElementTree as ET

    import polars as pl

    # create a sample dataframe
    dfpl = pl.DataFrame({
        'A': [1, 2, 3],
        'B': ['<p>some text</p><p>bla</p>', '<p>some text<p><p>foo</p>', '<p>some text<p>']
    })

    def func(mystring):
        return mystring * 2

    def func2(xml_string):
        root = ET.fromstring(xml_string)
        text_list = []
        for elem in root.iter():
            text = elem.text.strip() if elem.text else ''
            text_list.append(text)
        return text_list

    # add new columns computed from the existing ones
    dfpl = dfpl.with_columns([(pl.col("A").map(lambda x: func(x)).alias('new_col'))])
    dfpl = dfpl.with_columns([(pl.col("B").map(lambda x: func2(x)).alias('new_col2'))])
    print(dfpl)

The first with_columns call works and adds new_col, but the second one fails with:

    ComputeError: TypeError: a bytes-like object is required, not 'Series'

My use case is that a column contains XML strings, which I need to parse into XML objects in order to extract information.

How can I proceed?
Answer 1
Score: 2
Your first example works because `*2` is vectorized. For example, if you call

    func(pl.Series([1, 2, 3, 4, 5]))

you get back a Series with every value of the original multiplied by 2.

Your `func2` isn't vectorized. To use `map`, your function needs to operate on the entire column and return something like a Series. For instance:

    from lxml import etree as ET

    def func2_series(xml_strings):
        ret_list = []
        for xml_string in xml_strings:
            # recover=True lets lxml parse the malformed markup instead of raising
            root = ET.fromstring(xml_string, ET.XMLParser(recover=True))
            text_list = []
            for elem in root.iter():
                text = elem.text.strip() if elem.text else ''
                text_list.append(text)
            ret_list.append(text_list)
        return pl.Series(ret_list)

followed by

    dfpl.with_columns(pl.col("B").map(func2_series).alias('new_col2'))

will work.

Alternatively, if you have

    def func2(xml_string):
        root = ET.fromstring(xml_string, ET.XMLParser(recover=True))
        text_list = []
        for elem in root.iter():
            text = elem.text.strip() if elem.text else ''
            text_list.append(text)
        return text_list

then you can use `apply`, and Polars will do the looping for you:

    dfpl.with_columns(pl.col("B").apply(func2))
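
The difference between the two call styles can be reproduced without Polars. The sketch below is only an illustration under stated assumptions: a plain Python list stands in for the column, the standard-library `xml.etree.ElementTree` replaces lxml, the sample strings are well-formed (the standard parser has no recover mode), and the names `extract_texts` and `extract_texts_batch` are hypothetical. Passing the whole column to the per-value function raises a TypeError much like the one in the question, while the batch wrapper does the looping itself. As far as I know, newer Polars releases rename `map` to `map_batches` and `apply` to `map_elements`.

```python
import xml.etree.ElementTree as ET

# A plain list stands in for the Polars column (assumption for illustration).
column = ["<p>some text</p>", "<div><p>foo</p><p>bar</p></div>"]


def extract_texts(xml_string):
    """Per-value function: parse ONE string and collect each element's stripped text."""
    root = ET.fromstring(xml_string)
    return [elem.text.strip() if elem.text else "" for elem in root.iter()]


def extract_texts_batch(xml_strings):
    """Batch function: receives the WHOLE column and loops over it itself."""
    return [extract_texts(s) for s in xml_strings]


# Handing the whole column to the per-value function fails, as in the question.
try:
    extract_texts(column)
    whole_column_failed = False
except TypeError:
    whole_column_failed = True

per_value = [extract_texts(s) for s in column]  # what an element-wise call does
batch = extract_texts_batch(column)             # what a batch call expects
```

Both `per_value` and `batch` hold one list of strings per input row; the only difference is who owns the loop.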

By the way, you don't need a lambda if the function you're passing accepts exactly the argument you already have. In other words, where you have `.map(lambda x: func2(x))` you can just write `.map(func2)`. The lambda only comes into play if you need to transform the parameters.
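
The same point holds for any higher-order function, so it can be checked with Python's built-in `map`. In this small sketch the names `double` and `scale` are hypothetical and exist only for illustration.

```python
def double(x):
    return x * 2


def scale(x, factor):
    return x * factor


values = [1, 2, 3]

# A lambda that only forwards its argument adds nothing.
with_lambda = list(map(lambda x: double(x), values))
without_lambda = list(map(double, values))

# A lambda earns its place when the call has to be adapted,
# for example to fix an extra argument.
scaled = list(map(lambda x: scale(x, factor=10), values))
```

The first two results are identical; only the last call actually needs the lambda.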
Answer 2
Score: 1
As stated in the comments, use `.apply` instead of `.map`. Also, if you only want a list of strings, I recommend BeautifulSoup's `.stripped_strings`:

    import polars as pl
    from bs4 import BeautifulSoup

    # create a sample dataframe
    dfpl = pl.DataFrame({
        'A': [1, 2, 3],
        'B': ['<p>some text</p><p>bla</p>', '<p>some text<p><p>foo</p>', '<p>some text<p>']
    })

    def func(mystring):
        return mystring * 2

    def func2(xml_string):
        # html.parser tolerates the unclosed <p> tags in the sample data
        soup = BeautifulSoup(xml_string, 'html.parser')
        return list(soup.stripped_strings)

    # add new columns computed from the existing ones, one value at a time
    dfpl = dfpl.with_columns([(pl.col("A").apply(lambda x: func(x)).alias('new_col'))])
    dfpl = dfpl.with_columns([(pl.col("B").apply(lambda x: func2(x)).alias('new_col2'))])
    print(dfpl)

Prints:

    shape: (3, 4)
    ┌─────┬────────────────────────────┬─────────┬──────────────────────┐
    │ A   ┆ B                          ┆ new_col ┆ new_col2             │
    │ --- ┆ ---                        ┆ ---     ┆ ---                  │
    │ i64 ┆ str                        ┆ i64     ┆ list[str]            │
    ╞═════╪════════════════════════════╪═════════╪══════════════════════╡
    │ 1   ┆ <p>some text</p><p>bla</p> ┆ 2       ┆ ["some text", "bla"] │
    │ 2   ┆ <p>some text<p><p>foo</p>  ┆ 4       ┆ ["some text", "foo"] │
    │ 3   ┆ <p>some text<p>            ┆ 6       ┆ ["some text"]        │
    └─────┴────────────────────────────┴─────────┴──────────────────────┘
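
If adding BeautifulSoup as a dependency is not an option, a rough stand-in can be sketched with the standard library's `html.parser`, which also tolerates the unclosed `<p>` tags in the sample data. This is a swapped-in technique, not what the answer above uses; the class `TextCollector` and the function `stripped_texts` are hypothetical names, and behaviour on more complex markup may differ from BeautifulSoup.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects the non-empty, whitespace-stripped text chunks of a document."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(text)


def stripped_texts(markup):
    """Per-value function: returns a list of strings for ONE markup string."""
    parser = TextCollector()
    parser.feed(markup)
    parser.close()  # flush any text still buffered at the end of the input
    return parser.chunks


samples = [
    '<p>some text</p><p>bla</p>',
    '<p>some text<p><p>foo</p>',
    '<p>some text<p>',
]
results = [stripped_texts(s) for s in samples]
```

Because `stripped_texts` takes one string and returns a list, it would be passed to `.apply` in the same way as `func2`.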