Write and read a pyarrow schema from file

Question

I'm transforming 120 JSON tables (held in memory in Python as List[Dict]) with varying schemas to Arrow so that I can write them to .parquet files on ADLS, using the pyarrow package.

I want to store the schema of each table in a separate file so I don't have to hardcode it for all 120 tables. As I iterate over my tables, I want to load each schema from its file and convert the JSON data to Arrow by passing that schema.

    import pyarrow as pa

    data = [
        {"col1": 1, "col2": "a"},
        {"col1": 2, "col2": "b"},
        {"col1": 3, "col2": "c"},
        {"col1": 4, "col2": "d"},
        {"col1": 5, "col2": "e"},
    ]

    # How to load the schema from file and parse it into a `pa.schema`?
    # Field names must match the keys in `data`.
    my_schema = pa.schema([
        pa.field("col1", pa.int64()),
        pa.field("col2", pa.string()),
    ])
    arrow_table = pa.Table.from_pylist(data, schema=my_schema)

    # How to write this schema to file?
    arrow_table.schema

I could design a custom file format for the schema and write a parser that reads the (e.g. .txt) file and turns its content into the corresponding pyarrow data types, but I hope there is an easier, "official" way to do this?

Answer 1

Score: 1

You can store the schema using pyarrow.parquet.write_metadata and read it back using pyarrow.parquet.read_schema.

    import pyarrow as pa
    import pyarrow.parquet as pq

    table = pa.table({"col1": [1, 2, 3]})
    # Write only the schema to a metadata-only Parquet file.
    pq.write_metadata(table.schema, "table.metadata")
    # Read the schema back from that file.
    schema = pq.read_schema("table.metadata")

huangapple
  • Posted on June 1, 2023, 14:15:04
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76379139.html