Write and read a pyarrow schema from file
Question
I'm transforming 120 JSON tables (held in memory in Python as List[Dict]) with varying schemas to Arrow so that I can write them to .parquet files on ADLS, using the pyarrow package.
I want to store the schema of each table in a separate file so I don't have to hardcode it for the 120 tables. As I iterate over my tables, I want to load each schema from file and transform the JSON data to Arrow by passing the schema.
import pyarrow as pa

data = [
    {"col1": 1, "col2": "a"},
    {"col1": 2, "col2": "b"},
    {"col1": 3, "col2": "c"},
    {"col1": 4, "col2": "d"},
    {"col1": 5, "col2": "e"},
]

# How to load the schema from file and parse it into a `pa.schema`?
# (Field names must match the keys in `data`, otherwise the columns come out null.)
my_schema = pa.schema([
    pa.field("col1", pa.int64()),
    pa.field("col2", pa.string()),
])
arrow_table = pa.Table.from_pylist(data, schema=my_schema)

# How to write this schema to file?
arrow_table.schema
I could write a custom file format for the schema and a parser that reads the (e.g. txt) file and turns its content into pyarrow data types (pa.int64(), pa.string(), and so on), but I hope there is an easier, "official" solution to this?
Answer 1
Score: 1
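You can store the schema using pyarrow.parquet.write_metadata, which writes a metadata-only Parquet file (no row data), and read it back using pyarrow.parquet.read_schema:
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"col1": [1, 2, 3]})
# Write a metadata-only Parquet file that holds just the schema
pq.write_metadata(table.schema, "table.metadata")
# Read it back as a pa.Schema
schema = pq.read_schema("table.metadata")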