无法在向文件追加表时在scheme中使用pyarrow字典。

huangapple go评论64阅读模式
英文:

Can't use pyarrow dictionary in scheme when appending tables to file

问题

以下是代码部分的翻译:

import pyarrow as pa
fp = pa.OSFile('my_file.arrow', 'wb')
schema = pa.schema(
    [
        ('localtime', pa.timestamp('ns')),
        ('exchtime', pa.timestamp('ns')),
        ('msgtype', pa.string()),
        ('symbol', pa.dictionary(pa.int8(), pa.string())),
        ('exch', pa.string()),
        ('price', pa.float64()),
        ('size', pa.float64()),
        ('side', pa.dictionary(pa.int8(), pa.string())),
        ('ref', pa.uint64()),
        ('oldref', pa.uint64()),
        ('mpid', pa.string()),
        ('esd', pa.dictionary(pa.int8(), pa.string())),
        ('platform', pa.dictionary(pa.int8(), pa.string())),
    ]
)
writer = pa.ipc.new_file(fp, schema)

...

for df in [... 数据帧块的列表 ...]:
    table = pa.Table.from_pandas(df, schema)
    writer.write(table)

writer.close()
fp.close()

请注意,这是您提供的代码的翻译部分,不包括问题和其他内容。如果您有任何其他需要或问题,请随时提出。

英文:

I have a very large dataset which i am attempting to process in chunks. The source data is a pandas dataframe, which I am able to read in pieces and process chunk-by-chunk. I am attempting to then write the chunks to a pyarrow feather file, appending chunks as they are processed. When writing these chunks I am using the following approach:

  import pyarrow as pa
  fp = pa.OSFile('my_file.arrow', 'wb')
  schema = pa.schema(
      [
          ('localtime', pa.timestamp('ns')),
          ('exchtime', pa.timestamp('ns')),
          ('msgtype', pa.string()),
          ('symbol', pa.dictionary(pa.int8(), pa.string())),
          ('exch', pa.string()),
          ('price', pa.float64()),
          ('size', pa.float64()),
          ('side', pa.dictionary(pa.int8(), pa.string())),
          ('ref', pa.uint64()),
          ('oldref', pa.uint64()),
          ('mpid', pa.string()),
          ('esd', pa.dictionary(pa.int8(), pa.string())),
          ('platform', pa.dictionary(pa.int8(), pa.string())),
      ]
  )
  writer = pa.ipc.new_file(fp, schema)

  ...

  for df in [... list of dataframes chunks ...]:
    table = pa.Table.from_pandas(df, schema)
    writer.write(table)

  writer.close()
  fp.close()

This all works fine if I don't use the pa.dictionary() data type in the schema. Unfortunately, this also results in very large files, since pyarrow isn't able to index string fields with common repeating values (e.g. "symbol" in the example above has the same string in every entry; "exch" is one of ~20 values, etc).

Running the code as written above generates an error string: "Dictionary replacement detected when writing IPC file format. Arrow IPC files only support a single non-delta dictionary for a given field across all batches."

I think the issue is that, even when you use an explicit schema for converting the dataframe to a table, the dictionary mapping of strings to values might be different depending on which particular strings appear.

Is there a way to explicitly provide this dictionary mapping (of string to integer) if the possible string values can be known ahead of time?

Some things that I've tried:

  • Works if you don't use dictionary types
  • Dictionary types work if every chunk has the same value in a column, e.g. symbol
  • Doesn't work if you infer the schema for each chunk, rather than explicitly setting it, because schema inference is not constant across chunks

答案1

得分: 0

IpcWriteOptions 中的 emit_dictionary_deltas 必须设置为 True 才能使此功能生效。

import pyarrow as pa

fp = pa.OSFile("my_file.arrow", "wb")
schema = pa.schema(
    [
        ("symbol", pa.dictionary(pa.int8(), pa.string())),
    ]
)

options = pa.ipc.IpcWriteOptions(emit_dictionary_deltas=True)
writer = pa.ipc.new_file(fp, schema, options=options)
writer.write(pa.table({"symbol": ["A", "B"]}, schema=schema))
writer.write(pa.table({"symbol": ["A", "B", "C"]}, schema=schema))

我认为默认情况下未启用此选项的原因是它可能会导致与不支持它的其他实现的兼容性问题。

英文:

emit_dictionary_deltas in IpcWriteOptions must be set to True in order for this to work.

import pyarrow as pa

fp = pa.OSFile("my_file.arrow", "wb")
schema = pa.schema(
    [
        ("symbol", pa.dictionary(pa.int8(), pa.string())),
    ]
)

options = pa.ipc.IpcWriteOptions(emit_dictionary_deltas=True)
writer = pa.ipc.new_file(fp, schema, options=options)
writer.write(pa.table({"symbol": ["A", "B"]}, schema=schema))
writer.write(pa.table({"symbol": ["A", "B", "C"]}, schema=schema))

I think the reason it is not enabled by default is it causes compatibility issues with other implementations that don't support it

huangapple
  • 本文由 发表于 2023年6月16日 01:56:58
  • 转载请务必保留本文链接:https://go.coder-hub.com/76484338.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定