2023-08-11 03:13:03

Cast binary to any dtype in polars

Question

I am working with binary data and have to cast it into specific dtypes. In this question, casting from binary to list(u8) was made available. At the moment I do the cast to other dtypes as follows:

binary -> uint32

import polars as pl

data = {
    "test": [b'\x00\x0A\x0C\x10', b'\x00\x0A\x0C\x00']
}
schema = {
    "test": pl.Binary
}
df = pl.DataFrame(data, schema).with_row_count('id')
# hex letters mapped to their decimal values; digits 0-9 are kept via the default
hex_dict = {
    "a": "10",
    "b": "11",
    "c": "12",
    "d": "13",
    "e": "14",
    "f": "15"
}
df = df.with_columns([
    # hex-encode each value and split it into single hex digits
    pl.col('test')
    .bin.encode('hex')
    .str.split('')
    .list.slice(1, 8)
]).explode('test').with_columns([
    pl.col('test').map_dict(hex_dict, default=pl.first()).cast(pl.Int8)
]).with_row_count().with_columns([
    # exponent per digit position: 7 for the first digit down to 0 for the last
    ((pl.col('row_nr').mod(8).cast(pl.Int8) - 7).abs()).alias('exp')
]).with_columns([
    (pl.col('test') * pl.lit(16).pow(pl.col('exp'))).alias('value')
]).groupby('id').agg([
    pl.col("value").sum().cast(pl.UInt32)
])

result:

┌─────┬────────┐
│ id  ┆ value  │
│ --- ┆ ---    │
│ u32 ┆ u32    │
╞═════╪════════╡
│ 0   ┆ 658448 │
│ 1   ┆ 658432 │
└─────┴────────┘

Basically, the binary data is hex-encoded and split into a list of single hex digits, and then the df is exploded. The letter digits are mapped to their numeric values, an exponent is assigned to each digit position, and in the end the resulting value is calculated as the sum n1*16⁷ + n2*16⁶ + n3*16⁵ + ...
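
The positional sum described above can be checked in plain Python. This is only a minimal sketch outside Polars; the names raw, digits, value and same are illustrative and do not appear in the original code.

```python
# Sample row taken from the question's data.
raw = b'\x00\x0A\x0C\x10'

# Hex-encode and split into single hex digits, as the Polars pipeline does.
digits = [int(ch, 16) for ch in raw.hex()]

# Weight each digit by a power of 16: the first digit gets 16**7, the last 16**0.
value = sum(d * 16 ** (len(digits) - 1 - i) for i, d in enumerate(digits))

# Reading the same bytes as a big-endian integer should give the same number.
same = int.from_bytes(raw, byteorder='big')

print(value, same)
```

Both numbers should agree with the first row of the result table.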

To handle little-endian data, the order of the bytes would have to be reversed. How would I do that in Polars for a given string?

hex = "FD00FE00"
hex-reversed = "00FE00FD"

Question 1:
We can split hex into ["F","D","0","0","F","E","0","0"] and then use list.reverse(), but that would also swap the two digits inside each byte. Instead, we would need to split hex into ["FD","00","FE","00"] and then reverse. How could this be done?
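
For illustration, the byte-wise reversal asked about here looks as follows in plain Python. This is a sketch of the intended transformation, not a Polars solution; the names pairs, hex_reversed and via_bytes are hypothetical.

```python
hex_str = "FD00FE00"

# Split into two-character chunks so that each chunk represents one byte.
pairs = [hex_str[i:i + 2] for i in range(0, len(hex_str), 2)]

# Reverse the order of the bytes, not the order of the single digits.
hex_reversed = "".join(reversed(pairs))

# Equivalent shortcut: decode to bytes, reverse, and encode again.
via_bytes = bytes.fromhex(hex_str)[::-1].hex().upper()

print(pairs, hex_reversed, via_bytes)
```

Reversing the single characters instead (hex_str[::-1]) would flip the digits inside each byte as well, which is not a byte-order swap.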

So far this only covers casting to integers, and it is very cumbersome.

Question 2:
Will casting from binary to other dtypes (int8, ..., float64) ever be implemented in Polars, or is it out of scope?

Thanks!

Answer 1

Score: 1

Polars is happy to accept the numpy vector
that @juanpa.arrivillaga suggests:

>>> import numpy as np
>>> import polars as pl
>>>
>>> data = { "test": [b'\x00\x0A\x0C\x10',b'\x00\x0A\x0C\x00'] }
>>>
>>> np.frombuffer(b''.join(data['test']), dtype='>u4')
array([658448, 658432], dtype=uint32)
>>>
>>> pl.Series(list(np.frombuffer(b''.join(data['test']), dtype='>u4')), dtype=pl.UInt32)
shape: (2,)
Series: '' [u32]
[
	658448
	658432
]
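
The same np.frombuffer call may also cover the little-endian and float64 cases from the question by changing the dtype string. The following is a sketch under assumptions: the variable names are illustrative, and float_buf is made-up sample data that is not part of the question.

```python
import struct

import numpy as np

rows = [b'\x00\x0A\x0C\x10', b'\x00\x0A\x0C\x00']
buf = b''.join(rows)

# '>' reads big-endian, '<' reads little-endian; 'u4' is a 4-byte unsigned integer.
big = np.frombuffer(buf, dtype='>u4')
little = np.frombuffer(buf, dtype='<u4')

# Hypothetical float64 payload: two big-endian doubles packed with struct.
float_buf = struct.pack('>d', 1.5) + struct.pack('>d', -2.25)
floats = np.frombuffer(float_buf, dtype='>f8')

print(big, little, floats)
```

This assumes every row has the same fixed width, since the rows are joined into one buffer before decoding. The resulting arrays can then be handed to pl.Series as shown in the answer.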
