Cast binary to any dtype in polars
Question
I am working with binary data and need to cast it to specific dtypes. A related question covered casting from binary to list(u8). At the moment I do the cast for other dtypes as follows:
binary -> uint32
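import polars as pl

# Note: with_row_count, map_dict and groupby are method names from older
# Polars releases; newer releases have renamed them.
data = {
    "test": [b'\x00\x0A\x0C\x10', b'\x00\x0A\x0C\x00']
}
schema = {
    "test": pl.Binary
}
df = pl.DataFrame(data, schema).with_row_count('id')

# Map the hex letters to their decimal values; digits 0-9 fall through unchanged.
hex_dict = {
    "a": "10",
    "b": "11",
    "c": "12",
    "d": "13",
    "e": "14",
    "f": "15",
}

df = df.with_columns([
    # Hex-encode each 4-byte value and split it into its 8 single hex digits.
    pl.col('test')
    .bin.encode('hex')
    .str.split('')
    .list.slice(1, 8)
]).explode('test').with_columns([
    # Convert each hex digit to its numeric value (0-15).
    pl.col('test').map_dict(hex_dict, default=pl.first()).cast(pl.Int8)
]).with_row_count().with_columns([
    # Exponent per digit position: 7 for the leftmost digit down to 0.
    ((pl.col('row_nr').mod(8).cast(pl.Int8) - 7).abs()).alias('exp')
]).with_columns([
    (pl.col('test') * pl.lit(16).pow(pl.col('exp'))).alias('value')
]).groupby('id').agg([
    # Sum the weighted digits back into one integer per original row.
    pl.col("value").sum().cast(pl.UInt32)
])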
result:
┌─────┬────────┐
│ id  ┆ value  │
│ --- ┆ ---    │
│ u32 ┆ u32    │
╞═════╪════════╡
│ 0   ┆ 658448 │
│ 1   ┆ 658432 │
└─────┴────────┘
Basically, the binary data is hex-encoded and split into a list of single hex digits, and then the DataFrame is exploded. The hex letters a-f are mapped to their numeric values, an exponent is derived from each digit's position, and the resulting value is the sum n1*16⁷ + n2*16⁶ + n3*16⁵ + ... + n8*16⁰.
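The positional sum described above can be sketched in plain Python, outside of Polars, to check the arithmetic. The helper name `hex_digits_to_uint` is made up for this illustration and is not part of any library.

```python
def hex_digits_to_uint(hex_str: str) -> int:
    """Sum digit * 16**exponent, weighting the leftmost digit highest."""
    total = 0
    n = len(hex_str)
    for position, char in enumerate(hex_str):
        digit = int(char, 16)          # 'a'..'f' become 10..15
        exponent = n - 1 - position    # 7, 6, ..., 0 for an 8-digit string
        total += digit * 16 ** exponent
    return total


raw_values = [b"\x00\x0A\x0C\x10", b"\x00\x0A\x0C\x00"]
values = [hex_digits_to_uint(raw.hex()) for raw in raw_values]
print(values)  # [658448, 658432]
```

This reproduces the two values in the result table, which confirms that the pipeline interprets each 4-byte value as a big-endian unsigned integer.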
To do the same for little-endian data, the order of the bytes would have to be reversed. How would I do that in Polars for a given string?
hex = "FD00FE00"
hex_reversed = "00FE00FD"
Question 1:
We can split hex into ["F","D","0","0","F","E","0","0"] and then use list.reverse(), but that reverses single digits. We would need to split hex into ["FD","00","FE","00"] and then reverse. How could this be done?
Also, this approach only works for casting to integers, and it is very cumbersome.
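For reference, the byte-wise reversal itself (split into two-character pairs, then reverse the pairs) can be sketched in plain Python. This is not a Polars expression, and the function name `reverse_hex_bytes` is hypothetical; it only illustrates the intended transformation.

```python
def reverse_hex_bytes(hex_str: str) -> str:
    """Reverse a hex string byte by byte (two hex digits per byte)."""
    if len(hex_str) % 2 != 0:
        raise ValueError("hex string must contain an even number of digits")
    # Split into two-character chunks: "FD00FE00" -> ["FD", "00", "FE", "00"]
    pairs = [hex_str[i:i + 2] for i in range(0, len(hex_str), 2)]
    # Reverse the order of the chunks, not the characters inside them.
    return "".join(reversed(pairs))


print(reverse_hex_bytes("FD00FE00"))  # 00FE00FD
```

Note that reversing the string character by character would give "00EF00DF" instead, which is why the split has to happen on byte boundaries.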
Question 2:
Will casting binary -> int8, ..., float64 ever be implemented in Polars, or is it out of scope?
Thanks!
Answer 1
Score: 1
Polars is happy to accept the NumPy vector that @juanpa.arrivillaga suggests:
>>> import numpy as np
>>> import polars as pl
>>>
>>> data = { "test": [b'\x00\x0A\x0C\x10',b'\x00\x0A\x0C\x00'] }
>>>
>>> np.frombuffer(b''.join(data['test']), dtype='>u4')
array([658448, 658432], dtype=uint32)
>>>
>>> pl.Series(list(np.frombuffer(b''.join(data['test']), dtype='>u4')), dtype=pl.UInt32)
shape: (2,)
Series: '' [u32]
[
	658448
	658432
]
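The `'>u4'` dtype above means big-endian unsigned 4-byte integers; the byte order is what decides the result. As a rough illustration using only the Python standard library (not a Polars or NumPy feature, and with variable names chosen for this sketch), the same bytes can be read in either byte order, or as a float:

```python
import struct

raw_values = [b"\x00\x0A\x0C\x10", b"\x00\x0A\x0C\x00"]

# Big-endian: most significant byte first (matches dtype='>u4' above).
big = [int.from_bytes(raw, byteorder="big") for raw in raw_values]

# Little-endian: least significant byte first, so no manual reversal is needed.
little = [int.from_bytes(raw, byteorder="little") for raw in raw_values]

# Non-integer types work the same way: ">f" is a big-endian 32-bit float.
(one,) = struct.unpack(">f", b"\x3f\x80\x00\x00")

print(big)     # [658448, 658432]
print(little)  # [269224448, 788992]
print(one)     # 1.0
```

The resulting Python lists can then be handed to `pl.Series` with an explicit dtype, in the same way as the NumPy array in the answer.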