Cast binary to any dtype in polars

Question

I am working with binary data and have to cast it into specific dtypes. In a linked question, casting from binary to list(u8) was made available. At the moment I am doing the cast for other dtypes as follows:

binary -> uint32

import polars as pl

data = {
	"test": [b'\x00\x0A\x0C\x10',b'\x00\x0A\x0C\x00']
}
schema = {
	"test": pl.Binary
}

df = pl.DataFrame(data, schema).with_row_count('id')

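# Numeric value of each hex letter; the digits "0"-"9" pass through unchanged
hex_dict = {
    "a": "10",
    "b": "11",
    "c": "12",
    "d": "13",
    "e": "14",
    "f": "15",
}

# Note: with_row_count, map_dict and groupby are the method names of the
# Polars release used here; newer releases deprecate them in favor of
# with_row_index, replace and group_by.
df = df.with_columns([
    pl.col('test')
    .bin.encode('hex')   # bytes -> hex string
    .str.split('')       # split into single characters
    .list.slice(1, 8)    # keep 8 elements starting at index 1
]).explode('test').with_columns([
    # map hex letters to numbers, keep digits as they are
    pl.col('test').map_dict(hex_dict, default=pl.first()).cast(pl.Int8)
]).with_row_count().with_columns([
    # exponent 7..0 for the 8 hex digits of each value
    ((pl.col('row_nr').mod(8).cast(pl.Int8) - 7).abs()).alias('exp')
]).with_columns([
    (pl.col('test') * pl.lit(16).pow(pl.col('exp'))).alias('value')
]).groupby('id').agg([
    pl.col("value").sum().cast(pl.UInt32)
])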

result:

┌─────┬────────┐
│ id  ┆ value  │
│ --- ┆ ---    │
│ u32 ┆ u32    │
╞═════╪════════╡
│ 0   ┆ 658448 │
│ 1   ┆ 658432 │
└─────┴────────┘

Basically, the binary data is encoded as hex, split into a list of single characters, and the DataFrame is exploded. The hex letters are mapped to their numeric values, an exponent is assigned to each position, and finally the resulting value is calculated as n1 * 16⁷ + n2 * 16⁶ + n3 * 16⁵ + ...
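
The positional formula can be checked in plain Python, independent of Polars. This is a minimal sketch: the helper name hex_digits_to_int is hypothetical and only illustrates the arithmetic, it is not part of any library.

```python
def hex_digits_to_int(raw: bytes) -> int:
    """Sum digit * 16**exponent over the hex digits of raw (big-endian)."""
    digits = raw.hex()  # 4 bytes give 8 hex digits
    n = len(digits)
    # The leftmost digit gets the highest exponent (n - 1), the last one gets 0
    return sum(int(d, 16) * 16 ** (n - 1 - i) for i, d in enumerate(digits))


values = [hex_digits_to_int(b) for b in (b"\x00\x0A\x0C\x10", b"\x00\x0A\x0C\x00")]
print(values)  # [658448, 658432]
```

The result agrees with int.from_bytes(raw, "big"), which performs the same big-endian interpretation in one call.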

To do the same for little-endian data, the order of the bytes has to be reversed. How would I do that in Polars for a given string?

hex = "FD00FE00"
hex_reversed = "00FE00FD"

Question 1:
We can split hex into ["F","D","0","0","F","E","0","0"] and then use list.reverse(), but that reverses single digits. We would need to split hex into ["FD","00","FE","00"] (one element per byte) and then reverse. How could this be done?
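
As an illustration of the intended byte-wise reversal, here is a plain-Python sketch outside of Polars. The function name reverse_hex_bytes is hypothetical. It splits the string into two-character chunks and reverses the chunks, not the characters.

```python
def reverse_hex_bytes(hex_str: str) -> str:
    """Reverse a hex string byte by byte (two hex digits at a time)."""
    if len(hex_str) % 2:
        raise ValueError("hex string must have an even number of digits")
    # "FD00FE00" -> ["FD", "00", "FE", "00"]
    pairs = [hex_str[i:i + 2] for i in range(0, len(hex_str), 2)]
    return "".join(reversed(pairs))


print(reverse_hex_bytes("FD00FE00"))  # 00FE00FD
```

Reversing character by character, as in "FD00FE00"[::-1], gives "00EF00DF" instead, because the two digits inside each byte are swapped as well.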

So far this only covers casting into integers, and it is very cumbersome.

Question 2:
Will casting (binary -> int8, ..., float64) ever be implemented in Polars, or is it out of scope?

Thanks!

Answer 1

Score: 1

Polars is happy to accept the NumPy vector that @juanpa.arrivillaga suggests:

>>> import numpy as np
>>> import polars as pl
>>> 
>>> data = { "test": [b'\x00\x0A\x0C\x10',b'\x00\x0A\x0C\x00'] }
>>> 
>>> np.frombuffer(b''.join(data['test']), dtype='>u4')
array([658448, 658432], dtype=uint32)
>>> 
>>> pl.Series(list(np.frombuffer(b''.join(data['test']), dtype='>u4')), dtype=pl.UInt32)
shape: (2,)
Series: '' [u32]
[
	658448
	658432
]
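
The same decoding can be sketched with only the standard library, which also shows the little-endian and float64 cases raised in the question. This is an illustrative sketch, not a Polars feature: unpack_rows is a hypothetical helper, and it assumes every row holds exactly one value of the given struct format.

```python
import struct

rows = [b"\x00\x0A\x0C\x10", b"\x00\x0A\x0C\x00"]


def unpack_rows(rows, fmt):
    """Decode each bytes value with one struct format, e.g. '>I' or '<d'."""
    size = struct.calcsize(fmt)
    out = []
    for raw in rows:
        if len(raw) != size:
            raise ValueError(f"expected {size} bytes, got {len(raw)}")
        out.append(struct.unpack(fmt, raw)[0])
    return out


big = unpack_rows(rows, ">I")     # big-endian uint32, like dtype '>u4'
little = unpack_rows(rows, "<I")  # little-endian uint32
print(big)     # [658448, 658432]
print(little)  # [269224448, 788992]

# float64: pack a known value, then decode it again
packed = [struct.pack("<d", 1.5)]
print(unpack_rows(packed, "<d"))  # [1.5]
```

The resulting Python lists can be passed to pl.Series with an explicit dtype, in the same way as the list built from the NumPy array above.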

huangapple
  • Posted on August 11, 2023, 03:13:03
  • Please keep the link to this article when reposting: https://go.coder-hub.com/76878712.html