Cast binary to any dtype in polars

Question

I am working with binary data and need to cast it to specific dtypes. Casting from binary to list(u8) was made available in this question. At the moment I handle the cast to other dtypes as follows:

binary -> uint32

    # Written against the Polars API available when this was posted (2023);
    # newer releases rename some of these methods (for example groupby became group_by).
    import polars as pl

    # Two 4-byte big-endian values stored as raw bytes
    data = {
        "test": [b'\x00\x0A\x0C\x10', b'\x00\x0A\x0C\x00']
    }
    schema = {
        "test": pl.Binary
    }
    df = pl.DataFrame(data, schema).with_row_count('id')

    # Hex letters and their numeric values; the digits 0-9 pass through unchanged
    hex_dict = {
        "a": "10",
        "b": "11",
        "c": "12",
        "d": "13",
        "e": "14",
        "f": "15",
    }

    df = df.with_columns([
        pl.col('test')
        .bin.encode('hex')   # bytes -> hex string
        .str.split('')       # one list element per character
        .list.slice(1, 8)    # keep the 8 hex digits
    ]).explode('test').with_columns([
        # numeric value of each hex digit
        pl.col('test').map_dict(hex_dict, default=pl.first()).cast(pl.Int8)
    ]).with_row_count().with_columns([
        # exponent of 16 for each digit position: 7, 6, ..., 0
        ((pl.col('row_nr').mod(8).cast(pl.Int8) - 7).abs()).alias('exp')
    ]).with_columns([
        (pl.col('test') * pl.lit(16).pow(pl.col('exp'))).alias('value')
    ]).groupby('id').agg([
        # sum the weighted digits per original row
        pl.col("value").sum().cast(pl.UInt32)
    ])

Result:

    ┌─────┬────────┐
    │ id  ┆ value  │
    │ --- ┆ ---    │
    │ u32 ┆ u32    │
    ╞═════╪════════╡
    │ 0   ┆ 658448 │
    │ 1   ┆ 658432 │
    └─────┴────────┘

In short, the binary data is hex-encoded, split into single characters, and the DataFrame is exploded so that each hex digit gets its own row. The letters a-f are mapped to 10-15, an exponent is derived from each digit's position, and the result is the weighted sum n1·16⁷ + n2·16⁶ + n3·16⁵ + ... + n8·16⁰.
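The positional sum described above can be checked in plain Python. This is only a sketch of the arithmetic, not a Polars expression; the variable names are illustrative.

```python
# Worked example of the hex-digit positional sum for one 4-byte value.
raw = b"\x00\x0A\x0C\x10"

hex_digits = raw.hex()                              # "000a0c10"
digit_values = [int(ch, 16) for ch in hex_digits]   # a-f become 10-15
n = len(digit_values)

# n1 * 16^7 + n2 * 16^6 + ... + n8 * 16^0
value = sum(d * 16 ** (n - 1 - i) for i, d in enumerate(digit_values))

print(value)  # 658448
# The same number falls out of a direct big-endian interpretation of the bytes.
assert value == int.from_bytes(raw, byteorder="big")
```

The last line shows why the hex round trip is avoidable: the weighted sum is exactly the big-endian integer value of the original bytes.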

To handle little-endian data, the order of the bytes would have to be reversed. How would I do that in Polars for a given string:

    hex = "FD00FE00"
    hex_reversed = "00FE00FD"

Question 1:
We can split hex into ["F","D","0","0","F","E","0","0"] and then use list.reverse(), but that reverses individual digits. We would need to split hex into byte pairs, ["FD","00","FE","00"], and reverse those instead. How could this be done?
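To make the target behaviour concrete, here is a minimal plain-Python sketch of the pair-wise reversal. It is not a Polars expression, and `reverse_hex_bytes` is a hypothetical helper name introduced only for illustration.

```python
def reverse_hex_bytes(hex_string: str) -> str:
    """Reverse the byte order of a hex string (two hex digits per byte)."""
    if len(hex_string) % 2 != 0:
        raise ValueError("hex string must contain an even number of digits")
    # Cut the string into two-character chunks, one per byte.
    pairs = [hex_string[i:i + 2] for i in range(0, len(hex_string), 2)]
    # Reverse the chunks, not the individual characters.
    return "".join(reversed(pairs))


print(reverse_hex_bytes("FD00FE00"))  # 00FE00FD
```

Reversing the string character by character would give "00EF00DF" instead, which is why the split has to happen on byte boundaries.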

All of this only covers casting to integers, and it is very cumbersome.

Question 2:
Will casting from binary to other dtypes (int8, ..., float64) ever be implemented in Polars, or is it out of scope?

Thanks!

Answer 1

Score: 1

Polars is happy to accept the NumPy vector that @juanpa.arrivillaga suggests:

    >>> import numpy as np
    >>> import polars as pl
    >>>
    >>> data = { "test": [b'\x00\x0A\x0C\x10', b'\x00\x0A\x0C\x00'] }
    >>>
    >>> np.frombuffer(b''.join(data['test']), dtype='>u4')
    array([658448, 658432], dtype=uint32)
    >>>
    >>> pl.Series(list(np.frombuffer(b''.join(data['test']), dtype='>u4')), dtype=pl.UInt32)
    shape: (2,)
    Series: '' [u32]
    [
        658448
        658432
    ]
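For readers who want to see how byte order and target type enter the picture without NumPy, the following is a dependency-free sketch using the standard library's `struct` module. It is an alternative illustration, not the answer's method; `decode_fixed_width` is a hypothetical helper, and it assumes every value has exactly the width implied by the format string.

```python
import struct


def decode_fixed_width(chunks, fmt):
    """Decode each bytes object in `chunks` with a single-value struct format.

    `fmt` is a struct format such as ">I" (big-endian uint32),
    "<I" (little-endian uint32) or "<d" (little-endian float64).
    """
    size = struct.calcsize(fmt)
    values = []
    for chunk in chunks:
        if len(chunk) != size:
            raise ValueError(f"expected {size} bytes, got {len(chunk)}")
        values.append(struct.unpack(fmt, chunk)[0])
    return values


data = [b"\x00\x0A\x0C\x10", b"\x00\x0A\x0C\x00"]

big = decode_fixed_width(data, ">I")     # same values as dtype='>u4' above
little = decode_fixed_width(data, "<I")  # bytes read in the opposite order

print(big)     # [658448, 658432]
print(little)  # [269224448, 788992]

# Floats work the same way; here a float64 is packed and decoded again.
as_float = decode_fixed_width([struct.pack("<d", 1.5)], "<d")
print(as_float)  # [1.5]
```

The resulting Python list can then be handed to `pl.Series(..., dtype=pl.UInt32)` in the same way as the NumPy result in the answer. For large columns the NumPy route is likely the faster of the two, since this sketch loops in Python.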

huangapple
  • Published on August 11, 2023, 03:13:03
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76878712.html