Custom decoding of binary data in Polars

Question

When working with binary data, I use a custom function to decode it. This requires apply in Polars. Because apply processes the column element by element, the computation time increases significantly on large datasets.

I tried to cast the binary data to List(UInt8), but this is not yet implemented:

exceptions.ArrowErrorException: NotYetImplemented("Casting from LargeBinary to LargeList(Field { name: \"item\", data_type: UInt8, is_nullable: true, metadata: {} }) not supported")

Is there a more efficient way of doing this?

import polars as pl
import struct
import io

data = {"binary": [b'\xFD\x00\xFE\x00\xFF\x00', b'\x10\x00\x20\x00\x30\x00'], "id": [1, 2]}
schema = {"binary": pl.Binary, "id": pl.Int16}

df = pl.DataFrame(data, schema)

This returns:

shape: (2, 2)
┌───────────────┬─────┐
│ binary        ┆ id  │
│ ---           ┆ --- │
│ binary        ┆ i16 │
╞═══════════════╪═════╡
│ [binary data] ┆ 1   │
│ [binary data] ┆ 2   │
└───────────────┴─────┘

Now when we apply our function to decode the binary column:

def custom_decode(data):
   bytestream = io.BytesIO(data)
   lst = []

   while bytestream.tell() < 6:
      lst.append(struct.unpack('<H', bytestream.read(2))[0])

   return lst

df = df.with_columns([
      pl.col('binary').apply(lambda x: custom_decode(x))
   ])

Result:

shape: (2, 2)
┌─────────────────┬─────┐
│ binary          ┆ id  │
│ ---             ┆ --- │
│ list[i64]       ┆ i16 │
╞═════════════════╪═════╡
│ [253, 254, 255] ┆ 1   │
│ [16, 32, 48]    ┆ 2   │
└─────────────────┴─────┘

Answer 1

Score: 2

I have added the cast upstream. In the next release of Polars (polars>=0.18.1), you can do this:

data = {"binary": [b'\xFD\x00\xFE\x00\xFF\x00', b'\x10\x00\x20\x00\x30\x00'], "id": [1, 2]}
schema = {"binary": pl.Binary, "id": pl.Int16}

(
    pl.DataFrame(data, schema)
    .with_columns(
        pl.col("binary").cast(pl.List(pl.UInt8))
    )
)

shape: (2, 2)
┌───────────────┬─────┐
│ binary        ┆ id  │
│ ---           ┆ --- │
│ list[u8]      ┆ i16 │
╞═══════════════╪═════╡
│ [253, 0, … 0] ┆ 1   │
│ [16, 0, … 0]  ┆ 2   │
└───────────────┴─────┘

Answer 2

Score: 1

A couple of ideas that might make marginal improvements:

  1. Instead of the while loop, use a list comprehension, since you know you are reading exactly 3 values. A comprehension avoids the repeated append method calls of the explicit loop, which is usually somewhat cheaper.

  2. Do the looping inside your custom function and have it return a Series, so that you can use map instead of apply. (I am not sure what the relative overhead of Polars looping versus Python looping is, though.)

You could write your function like this:

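def custom_decode(data):
    # data is the whole column as a Series, so this runs once per column
    retL = [None] * len(data)
    for i, datum in enumerate(data):
        bytestream = io.BytesIO(datum)
        # read three little-endian unsigned 16-bit integers
        retL[i] = [struct.unpack('<H', bytestream.read(2))[0] for _ in range(3)]
    return pl.Series(retL)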

and then do:

df.with_columns([
    pl.col('binary').map(custom_decode)
])
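
As a further variation on the same idea, the per-value decode can be done with a single struct.unpack call instead of a BytesIO stream read two bytes at a time. This is a minimal sketch using only the standard library; the names decode_u16_le and decode_batch are hypothetical, and it assumes each value holds consecutive little-endian unsigned 16-bit integers, as in the question.

```python
import struct


def decode_u16_le(datum):
    """Decode bytes as consecutive little-endian unsigned 16-bit integers."""
    if datum is None:
        return None  # pass missing values through unchanged
    count = len(datum) // 2  # number of complete 2-byte values
    # One unpack call per value; a trailing odd byte, if any, is ignored.
    return list(struct.unpack(f"<{count}H", datum[: count * 2]))


def decode_batch(values):
    """Decode every value of an iterable of bytes objects."""
    return [decode_u16_le(value) for value in values]


rows = [b"\xFD\x00\xFE\x00\xFF\x00", b"\x10\x00\x20\x00\x30\x00"]
decoded = decode_batch(rows)
print(decoded)
```

Unlike the fixed range(3), the count is derived from the length of each value, so rows of different lengths are handled as well. Wrapping the result of decode_batch in pl.Series inside the mapped function would follow the same pattern as custom_decode above.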

huangapple
  • Posted on June 1, 2023 at 15:41:28
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76379674.html