Can't use pyarrow dictionary in schema when appending tables to file

Question
I have a very large dataset which i am attempting to process in chunks. The source data is a pandas dataframe, which I am able to read in pieces and process chunk-by-chunk. I am attempting to then write the chunks to a pyarrow feather file, appending chunks as they are processed. When writing these chunks I am using the following approach:
import pyarrow as pa

fp = pa.OSFile('my_file.arrow', 'wb')
schema = pa.schema(
    [
        ('localtime', pa.timestamp('ns')),
        ('exchtime', pa.timestamp('ns')),
        ('msgtype', pa.string()),
        ('symbol', pa.dictionary(pa.int8(), pa.string())),
        ('exch', pa.string()),
        ('price', pa.float64()),
        ('size', pa.float64()),
        ('side', pa.dictionary(pa.int8(), pa.string())),
        ('ref', pa.uint64()),
        ('oldref', pa.uint64()),
        ('mpid', pa.string()),
        ('esd', pa.dictionary(pa.int8(), pa.string())),
        ('platform', pa.dictionary(pa.int8(), pa.string())),
    ]
)
writer = pa.ipc.new_file(fp, schema)
...
for df in [... list of dataframe chunks ...]:
    table = pa.Table.from_pandas(df, schema=schema)
    writer.write(table)
writer.close()
fp.close()
This all works fine if I don't use the pa.dictionary() data type in the schema. Unfortunately, this also results in very large files, since pyarrow isn't able to index string fields with common repeating values (e.g. "symbol" in the example above has the same string in every entry; "exch" is one of ~20 values, etc).
Running the code as written above generates an error string: "Dictionary replacement detected when writing IPC file format. Arrow IPC files only support a single non-delta dictionary for a given field across all batches."
I think the issue is that, even when you use an explicit schema for converting the dataframe to a table, the dictionary mapping of strings to values might be different depending on which particular strings appear.
Is there a way to explicitly provide this dictionary mapping (of string to integer) if the possible string values can be known ahead of time?
Some things that I've tried:
- Works if you don't use dictionary types
- Dictionary types work if every chunk has the same value in a column, e.g. symbol
- Doesn't work if you infer the schema for each chunk, rather than explicitly setting it, because schema inference is not constant across chunks
Answer 1

Score: 0
emit_dictionary_deltas in IpcWriteOptions must be set to True in order for this to work.
import pyarrow as pa

fp = pa.OSFile("my_file.arrow", "wb")
schema = pa.schema(
    [
        ("symbol", pa.dictionary(pa.int8(), pa.string())),
    ]
)
options = pa.ipc.IpcWriteOptions(emit_dictionary_deltas=True)
writer = pa.ipc.new_file(fp, schema, options=options)
writer.write(pa.table({"symbol": ["A", "B"]}, schema=schema))
writer.write(pa.table({"symbol": ["A", "B", "C"]}, schema=schema))
writer.close()
fp.close()
I think the reason it is not enabled by default is that it can cause compatibility issues with other implementations that don't support it.