Custom decode binary data in polars
Question
When working with binary data, I use a custom function to decode it. This requires apply in polars. Because apply processes the column element by element, the computation time increases significantly on large data sets.
I tried to cast the binary data to List(UInt8), but this is not yet implemented:
exceptions.ArrowErrorException: NotYetImplemented("Casting from LargeBinary to LargeList(Field { name: \"item\", data_type: UInt8, is_nullable: true, metadata: {} }) not supported")
Is there a more efficient way of doing this?
import polars as pl
import struct
import io
data = {"binary": [b'\xFD\x00\xFE\x00\xFF\x00',b'\x10\x00\x20\x00\x30\x00'], "id": [1,2]}
schema = {"binary": pl.Binary, "id":pl.Int16}
df = pl.DataFrame(data, schema)
This returns:
shape: (2, 2)
┌───────────────┬─────┐
│ binary        ┆ id  │
│ ---           ┆ --- │
│ binary        ┆ i16 │
╞═══════════════╪═════╡
│ [binary data] ┆ 1   │
│ [binary data] ┆ 2   │
└───────────────┴─────┘
Now when we apply our function to decode the binary column:
def custom_decode(data):
    # Read the 6-byte payload as three little-endian unsigned 16-bit integers.
    bytestream = io.BytesIO(data)
    lst = []
    while bytestream.tell() < 6:
        lst.append(struct.unpack('<H', bytestream.read(2))[0])
    return lst

df = df.with_columns([
    pl.col('binary').apply(lambda x: custom_decode(x))
])
Result:
shape: (2, 2)
┌─────────────────┬─────┐
│ binary          ┆ id  │
│ ---             ┆ --- │
│ list[i64]       ┆ i16 │
╞═════════════════╪═════╡
│ [253, 254, 255] ┆ 1   │
│ [16, 32, 48]    ┆ 2   │
└─────────────────┴─────┘
Answer 1
Score: 2
I have added the cast upstream. Starting with the next release of polars (polars>=0.18.1), you can do this:
data = {"binary": [b'\xFD\x00\xFE\x00\xFF\x00', b'\x10\x00\x20\x00\x30\x00'], "id": [1, 2]}
schema = {"binary": pl.Binary, "id": pl.Int16}
(
    pl.DataFrame(data, schema)
    .with_columns(
        pl.col("binary").cast(pl.List(pl.UInt8))
    )
)
shape: (2, 2)
┌───────────────┬─────┐
│ binary        ┆ id  │
│ ---           ┆ --- │
│ list[u8]      ┆ i16 │
╞═══════════════╪═════╡
│ [253, 0, … 0] ┆ 1   │
│ [16, 0, … 0]  ┆ 2   │
└───────────────┴─────┘
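Note that the cast yields the raw bytes (six u8 values per row), not the three 16-bit values the question's decoder produces. As a rough illustration of the remaining step, the sketch below combines each low/high byte pair into a little-endian unsigned 16-bit integer. It is plain Python on a list, not a polars expression, and the helper name combine_le_u16 is made up for this example.

```python
def combine_le_u16(byte_values):
    """Combine a flat list of byte values into little-endian unsigned 16-bit integers."""
    if len(byte_values) % 2 != 0:
        raise ValueError("expected an even number of bytes")
    # Even positions hold the low byte, odd positions the high byte.
    return [low | (high << 8) for low, high in zip(byte_values[0::2], byte_values[1::2])]


row = list(b'\xFD\x00\xFE\x00\xFF\x00')  # [253, 0, 254, 0, 255, 0]
print(combine_le_u16(row))  # [253, 254, 255]
```

The same arithmetic (low byte plus 256 times the high byte) is what a vectorized version on the list[u8] column would have to express.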
Answer 2
Score: 1
A couple of ideas that might make marginal improvements:
- Instead of the while loop, use a list comprehension, since you know you are unpacking exactly 3 values. A comprehension avoids the repeated append calls of the loop and is usually somewhat faster.
- Do the looping inside your custom function and have it return a Series, so you can use map instead of apply (I'm not sure what the relative overhead of polars looping versus Python looping is, though).
You could write your function like this:
def custom_decode(data):
    # data is the whole column (a Series); preallocate one result slot per row.
    retL = [None] * len(data)
    for i, datum in enumerate(data):
        bytestream = io.BytesIO(datum)
        retL[i] = [struct.unpack('<H', bytestream.read(2))[0] for _ in range(3)]
    return pl.Series(retL)
and then do:
df.with_columns([
    pl.col('binary').map(custom_decode)
])
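Along the same lines, the per-value BytesIO reads can probably be dropped altogether: struct accepts a repeat count in the format string, so one call can unpack every 16-bit value in a row. A minimal sketch, assuming each payload is a sequence of little-endian unsigned 16-bit integers as in the question; decode_u16_le is a hypothetical name, not part of polars or of the original answer.

```python
import struct


def decode_u16_le(data):
    """Decode a bytes object into a list of little-endian unsigned 16-bit integers."""
    count = len(data) // 2
    # '<3H' means three little-endian unsigned shorts; a trailing odd byte is ignored.
    return list(struct.unpack(f'<{count}H', data[:count * 2]))


print(decode_u16_le(b'\xFD\x00\xFE\x00\xFF\x00'))  # [253, 254, 255]
```

This function takes a single bytes value, so it would slot in where the BytesIO logic sits now, either per element or inside the loop over the column.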