June 1, 2023 14:15:04

Write and read a pyarrow schema from file

Question

I'm transforming 120 JSON tables (of type List[Dict] in python in-memory) of varying schemata to Arrow to write it to .parquet files on ADLS, utilizing the pyarrow package.

I want to store the schema of each table in a separate file so I don't have to hardcode it for the 120 tables. As I iterate over my tables, I want to load each schema from file and transform the JSON data to Arrow by passing the schema.

import pyarrow as pa
data = [
    {"col1": 1, "col2": "a"},
    {"col1": 2, "col2": "b"},
    {"col1": 3, "col2": "c"},
    {"col1": 4, "col2": "d"},
    {"col1": 5, "col2": "e"}
]
# How to load the schema from file and parse it into a `pa.schema`?
# (field names must match the keys in `data`, otherwise the columns come out null)
my_schema = pa.schema([
    pa.field('col1', pa.int64()),
    pa.field('col2', pa.string())
])
arrow_table = pa.Table.from_pylist(data, schema=my_schema)
# How to write this schema to file?
arrow_table.schema

I could write a custom file format for the schema and write a parser that reads the (e.g. txt) file, transforming its content into the pa.datatype() stuff, but I hope there is an easier, "official" solution to this?

Answer 1

Score: 1

You can store the schema using pyarrow.parquet.write_metadata and read it back using pyarrow.parquet.read_schema.

import pyarrow as pa
import pyarrow.parquet as pq
table = pa.table({"col1": [1,2,3]})
pq.write_metadata(table.schema, "table.metadata")
schema = pq.read_schema("table.metadata")
