Write and read a pyarrow schema from file

Question

I'm transforming 120 JSON tables (of type List[Dict], held in memory in Python) with varying schemata to Arrow so that I can write them to .parquet files on ADLS, using the pyarrow package.

I want to store the schema of each table in a separate file so I don't have to hardcode it for the 120 tables. As I iterate over my tables, I want to load each schema from file and transform the JSON data to Arrow by passing the schema.

import pyarrow as pa

data = [
    {"col1": 1, "col2": "a"},
    {"col1": 2, "col2": "b"},
    {"col1": 3, "col2": "c"},
    {"col1": 4, "col2": "d"},
    {"col1": 5, "col2": "e"}
]

# How to load the schema from file and parse it into a `pa.schema`?
# Field names must match the keys used in `data`.
my_schema = pa.schema([
    pa.field('col1', pa.int64()),
    pa.field('col2', pa.string()),
])
arrow_table = pa.Table.from_pylist(data, schema=my_schema)

# How to write this schema to file? 
arrow_table.schema

I could invent a custom file format for the schema and write a parser that reads the file (e.g. a .txt) and maps its content to the corresponding pyarrow data types, but I hope there is an easier, "official" solution to this?

Answer 1

Score: 1

You can store the schema using pyarrow.parquet.write_metadata and read it back using pyarrow.parquet.read_schema.

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"col1": [1,2,3]})

pq.write_metadata(table.schema, "table.metadata")
schema = pq.read_schema("table.metadata")

huangapple
  • Published on June 1, 2023, 14:15:04
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76379139.html