Can't use pyarrow dictionary in schema when appending tables to file

Question
I have a very large dataset which i am attempting to process in chunks. The source data is a pandas dataframe, which I am able to read in pieces and process chunk-by-chunk. I am attempting to then write the chunks to a pyarrow feather file, appending chunks as they are processed. When writing these chunks I am using the following approach:
import pyarrow as pa

fp = pa.OSFile('my_file.arrow', 'wb')
schema = pa.schema(
    [
        ('localtime', pa.timestamp('ns')),
        ('exchtime', pa.timestamp('ns')),
        ('msgtype', pa.string()),
        ('symbol', pa.dictionary(pa.int8(), pa.string())),
        ('exch', pa.string()),
        ('price', pa.float64()),
        ('size', pa.float64()),
        ('side', pa.dictionary(pa.int8(), pa.string())),
        ('ref', pa.uint64()),
        ('oldref', pa.uint64()),
        ('mpid', pa.string()),
        ('esd', pa.dictionary(pa.int8(), pa.string())),
        ('platform', pa.dictionary(pa.int8(), pa.string())),
    ]
)
writer = pa.ipc.new_file(fp, schema)
...
for df in [... list of dataframe chunks ...]:
    table = pa.Table.from_pandas(df, schema=schema)
    writer.write(table)
writer.close()
fp.close()
This all works fine if I don't use the pa.dictionary() data type in the schema. Unfortunately, this also results in very large files, since pyarrow isn't able to index string fields with common repeating values (e.g. "symbol" in the example above has the same string in every entry; "exch" is one of ~20 values, etc).
Running the code as written above generates an error string: "Dictionary replacement detected when writing IPC file format. Arrow IPC files only support a single non-delta dictionary for a given field across all batches."
I think the issue is that, even when you use an explicit schema for converting the dataframe to a table, the dictionary mapping of strings to values might be different depending on which particular strings appear.
Is there a way to explicitly provide this dictionary mapping (of string to integer) if the possible string values can be known ahead of time?
Some things that I've tried:
- Works if you don't use dictionary types
- Dictionary types work if every chunk has the same value in a column, e.g. symbol
- Doesn't work if you infer the schema for each chunk, rather than explicitly setting it, because schema inference is not constant across chunks
Answer 1

Score: 0
emit_dictionary_deltas in IpcWriteOptions must be set to True in order for this to work.
import pyarrow as pa

fp = pa.OSFile("my_file.arrow", "wb")
schema = pa.schema(
    [
        ("symbol", pa.dictionary(pa.int8(), pa.string())),
    ]
)
options = pa.ipc.IpcWriteOptions(emit_dictionary_deltas=True)
writer = pa.ipc.new_file(fp, schema, options=options)
writer.write(pa.table({"symbol": ["A", "B"]}, schema=schema))
writer.write(pa.table({"symbol": ["A", "B", "C"]}, schema=schema))
writer.close()
fp.close()
I think the reason it is not enabled by default is that it can cause compatibility issues with other implementations that don't support it.