June 1, 2023, 15:41:28

Custom decode binary data in polars

Question

When working with binary data, I use a custom function to decode it. This requires the use of apply in Polars. Because apply processes the column element by element, the computation time increases significantly on large datasets.

I tried to cast the binary data to List(UInt8), but this is not yet implemented.

exceptions.ArrowErrorException: NotYetImplemented("Casting from LargeBinary to LargeList(Field { name: \"item\", data_type: UInt8, is_nullable: true, metadata: {} }) not supported")

Is there a more efficient way of doing it?

import polars as pl
import struct
import io

data = {"binary": [b'\xFD\x00\xFE\x00\xFF\x00', b'\x10\x00\x20\x00\x30\x00'], "id": [1, 2]}
schema = {"binary": pl.Binary, "id": pl.Int16}

df = pl.DataFrame(data, schema)

This returns:

shape: (2, 2)
┌───────────────┬─────┐
│ binary        ┆ id  │
│ ---           ┆ --- │
│ binary        ┆ i16 │
╞═══════════════╪═════╡
│ [binary data] ┆ 1   │
│ [binary data] ┆ 2   │
└───────────────┴─────┘

Now when we apply our function to decode the binary column:

def custom_decode(data):
    # Read the 6-byte value as three little-endian unsigned 16-bit integers
    bytestream = io.BytesIO(data)
    lst = []

    while bytestream.tell() < 6:
        lst.append(struct.unpack('<H', bytestream.read(2))[0])

    return lst

df = df.with_columns([
    pl.col('binary').apply(lambda x: custom_decode(x))
])

Result:

shape: (2, 2)
┌─────────────────┬─────┐
│ binary          ┆ id  │
│ ---             ┆ --- │
│ list[i64]       ┆ i16 │
╞═════════════════╪═════╡
│ [253, 254, 255] ┆ 1   │
│ [16, 32, 48]    ┆ 2   │
└─────────────────┴─────┘
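
For reference, the per-value decode above reads three little-endian unsigned 16-bit integers from each 6-byte value. A minimal sketch of the same decode using only the standard library's struct.iter_unpack, which walks the whole buffer instead of stopping at a hard-coded offset of 6. The name decode_u16_le is hypothetical, and the sketch assumes every value has an even number of bytes:

```python
import struct

def decode_u16_le(data):
    """Decode a bytes object as consecutive little-endian unsigned 16-bit integers."""
    # iter_unpack yields one 1-tuple per 2-byte chunk; it raises struct.error
    # if len(data) is not a multiple of 2.
    return [value for (value,) in struct.iter_unpack("<H", data)]

print(decode_u16_le(b"\xFD\x00\xFE\x00\xFF\x00"))  # [253, 254, 255]
print(decode_u16_le(b"\x10\x00\x20\x00\x30\x00"))  # [16, 32, 48]
```

This still runs once per element when used with apply, so it mainly removes the fixed-length assumption rather than the Python-level overhead.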

Answer 1

Score: 2

I have added the cast upstream. In the next release of Polars (polars>=0.18.1), you can do this:

data = {"binary": [b'\xFD\x00\xFE\x00\xFF\x00', b'\x10\x00\x20\x00\x30\x00'], "id": [1, 2]}
schema = {"binary": pl.Binary, "id": pl.Int16}

(
    pl.DataFrame(data, schema)
    .with_columns(
        pl.col("binary").cast(pl.List(pl.UInt8))
    )
)

shape: (2, 2)
┌───────────────┬─────┐
│ binary        ┆ id  │
│ ---           ┆ --- │
│ list[u8]      ┆ i16 │
╞═══════════════╪═════╡
│ [253, 0, … 0] ┆ 1   │
│ [16, 0, … 0]  ┆ 2   │
└───────────────┴─────┘
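
Note that the cast exposes the raw bytes (six u8 values per row), not the three decoded 16-bit integers shown in the question, so each pair of bytes still has to be combined. The arithmetic is sketched below in plain Python, assuming little-endian order as in the question's '<H' format. combine_u16_le is a hypothetical helper, not a Polars function:

```python
def combine_u16_le(byte_values):
    """Combine a flat list of byte values (0-255) into little-endian uint16 values."""
    if len(byte_values) % 2 != 0:
        raise ValueError("expected an even number of bytes")
    low = byte_values[0::2]   # bytes at even positions: least significant
    high = byte_values[1::2]  # bytes at odd positions: most significant
    return [lo | (hi << 8) for lo, hi in zip(low, high)]

print(combine_u16_le([253, 0, 254, 0, 255, 0]))  # [253, 254, 255]
print(combine_u16_le([0x34, 0x12]))              # [4660], i.e. 0x1234
```

The same even/odd split and low + 256 * high combination could in principle be applied to the list[u8] column so that the work stays inside Polars, but the exact expressions for that depend on the Polars version and are not shown here.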

Answer 2

Score: 1

A couple of ideas that might make marginal improvements:

Instead of the while loop, use a list comprehension, since you know you are reading exactly 3 values. A comprehension avoids the per-iteration append method call, which is typically somewhat cheaper than appending in an explicit loop.
Do the looping inside your custom function and have it return a Series, so you can use map instead of apply (I'm not sure what the relative overhead of Polars looping vs Python looping is, though).

You could write your function like this:

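def custom_decode(data):
    # data is the whole column as a Series; decode every value in one call
    retL = [None] * len(data)
    for i, datum in enumerate(data):
        bytestream = io.BytesIO(datum)
        # each value holds three little-endian unsigned 16-bit integers
        retL[i] = [struct.unpack('<H', bytestream.read(2))[0] for _ in range(3)]
    return pl.Series(retL)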

and then do:

df.with_columns([
    pl.col('binary').map(custom_decode)
])
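
Taking the second idea further, the per-value Python work can be reduced by decoding the whole column in one bulk operation. A minimal sketch using only the standard library's array module, assuming every value holds exactly n_per_row little-endian 16-bit integers and there are no nulls. decode_batch and n_per_row are hypothetical names:

```python
import sys
from array import array

def decode_batch(values, n_per_row=3):
    """Decode an iterable of bytes objects, each holding n_per_row little-endian uint16 values."""
    flat = array("H")  # unsigned short
    if flat.itemsize != 2:
        raise RuntimeError("this sketch assumes a 2-byte unsigned short")
    # One bulk copy of all rows instead of many small reads
    flat.frombytes(b"".join(values))
    if sys.byteorder == "big":
        flat.byteswap()  # the data is little-endian; swap on big-endian hosts
    # Slice the flat array back into one list per row
    return [flat[i:i + n_per_row].tolist() for i in range(0, len(flat), n_per_row)]

rows = decode_batch([b"\xFD\x00\xFE\x00\xFF\x00", b"\x10\x00\x20\x00\x30\x00"])
print(rows)  # [[253, 254, 255], [16, 32, 48]]
```

Inside the function passed to map, the result could be wrapped as pl.Series(decode_batch(data)), in the same way the function above wraps retL. Whether this is measurably faster than the comprehension version would need to be checked on real data.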
