How to encrypt pandas Dataframe with pyarrow and parquet
Question
I would like to encrypt a pandas DataFrame as a Parquet file using Parquet modular encryption. I thought the best way to do that would be to convert the DataFrame to a pyarrow Table and then save it to Parquet with the modular encryption option. Something like this:
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
d = {'col1': [1, 2], 'col2': [3, 4]}
df = pd.DataFrame(data=d)
# write_table expects a pyarrow Table, not a pandas DataFrame
table = pa.Table.from_pandas(df)
# enc_prop is the encryption properties object I don't know how to create
pq.write_table(table, "test.parquet", encryption_properties=enc_prop)
My problem is that I'm stuck on creating the encryption_properties.
Does anyone have an idea how to create them?
Big thanks,
Seb
Answer 1
Score: 2
There is an example Python file in the Apache Arrow repo, described as:
> An example for writing an encrypted parquet and reading an encrypted
> parquet using master keys managed by Hashicorp Vault KMS.
More info:
- Parquet Modular Encryption (Columnar Encryption) general docs: https://arrow.apache.org/docs/python/parquet.html#parquet-modular-encryption-columnar-encryption
- A test that writes an encrypted file: https://github.com/apache/arrow/blob/45918a90a6ca1cf3fd67c256a7d6a240249e555a/python/pyarrow/tests/parquet/test_encryption.py#L68-L98
Hope that helps.
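For orientation, both the linked example and the test revolve around a KMS client: an object that wraps (encrypts) the generated data keys with a master key on write and unwraps them on read. Below is a minimal pure-Python sketch of that contract. The method names `wrap_key` and `unwrap_key` follow pyarrow's `KmsClient` interface as I understand it; the class name `ToyKmsClient`, the key layout, and the base64 "wrapping" are assumptions made purely for illustration and provide no real security. With pyarrow, such a class would subclass `pyarrow.parquet.encryption.KmsClient` and delegate to a real key management service.

```python
import base64


class ToyKmsClient:
    """Illustrative stand-in for a KMS client. NOT secure."""

    def __init__(self, master_keys):
        # Hypothetical layout: master key identifier -> master key string.
        self.master_keys = master_keys

    def wrap_key(self, key_bytes, master_key_identifier):
        # A real KMS would encrypt key_bytes with the master key.
        # Here the master key is only prepended and the result base64-encoded.
        master_key = self.master_keys[master_key_identifier].encode("utf-8")
        return base64.b64encode(master_key + key_bytes)

    def unwrap_key(self, wrapped_key, master_key_identifier):
        # Reverse of wrap_key: check the master key prefix, return the data key.
        master_key = self.master_keys[master_key_identifier].encode("utf-8")
        decoded = base64.b64decode(wrapped_key)
        if not decoded.startswith(master_key):
            raise ValueError("wrong master key for this wrapped key")
        return decoded[len(master_key):]


client = ToyKmsClient({"footer": "1234567890123450"})
data_key = b"0123456789abcdef"
wrapped = client.wrap_key(data_key, "footer")
assert client.unwrap_key(wrapped, "footer") == data_key
```

The round trip at the end is the property that matters: whatever `wrap_key` returns, `unwrap_key` must turn it back into the original data key when given the same master key identifier.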
Answer 2
Score: 0
I was able to encrypt and decrypt Parquet files as follows:
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.parquet.encryption as pe
# Test-only helper from pyarrow's test suite; not a production KMS client
from pyarrow.tests.parquet.encryption import InMemoryKmsClient
# pandas dataframe
d = {'col1': [1, 2], 'col2': [3, 4], 'col3': [5, 6]}
df = pd.DataFrame(data=d)
table = pa.Table.from_pandas(df)
# -------------------------------------------------
# encryption
# secret 128-bit AES key
private_key = b"1234567890123450"
encryption_config = pe.EncryptionConfiguration(
    footer_key="footer",
    # master key ID -> names of the columns encrypted with that key
    column_keys={
        "columns": df.columns.tolist(),
    },
    encryption_algorithm="AES_GCM_V1",
    data_key_length_bits=128,
)
kms_connection_config = pe.KmsConnectionConfig(
    # master key ID -> key material, as read by InMemoryKmsClient
    custom_kms_conf={
        "footer": private_key.decode("UTF-8"),
        "columns": private_key.decode("UTF-8"),
    }
)
def kms_factory(kms_connection_configuration):
    return InMemoryKmsClient(kms_connection_configuration)
crypto_factory = pe.CryptoFactory(kms_factory)
encryption_properties = (
    crypto_factory.file_encryption_properties(
        kms_connection_config,
        encryption_config
    )
)
with pq.ParquetWriter(
    'encrypted_table.parquet',
    table.schema,
    encryption_properties=encryption_properties
) as writer:
    writer.write_table(table)
# -------------------------------------------------
# decryption
decryption_properties = (
    crypto_factory.file_decryption_properties(
        kms_connection_config
    )
)
parquet_file = pq.ParquetFile(
    'encrypted_table.parquet',
    decryption_properties=decryption_properties
)
print(parquet_file.read().to_pandas())
- This encrypts all columns. To leave some columns unencrypted, list only the columns you want encrypted in encryption_config.column_keys["columns"].
- Different keys can be used to encrypt the dataframe body (the columns) and the footer. I just used the same key for both.
- Both 128-bit and 256-bit AES work.
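The first note can be made concrete with a small helper that builds the `column_keys` mapping for a chosen subset of columns. This is only a sketch: `build_column_keys` is a hypothetical helper name, not part of pyarrow, and it assumes the mapping shape used in the code above (master key ID to list of column names).

```python
def build_column_keys(all_columns, exclude=(), key_id="columns"):
    """Return a column_keys dict covering every column not in `exclude`."""
    excluded = set(exclude)
    unknown = excluded - set(all_columns)
    if unknown:
        # Fail early on typos instead of silently encrypting the wrong set.
        raise ValueError(f"unknown columns: {sorted(unknown)}")
    encrypted = [c for c in all_columns if c not in excluded]
    # With nothing left to encrypt, return an empty mapping.
    return {key_id: encrypted} if encrypted else {}


# Encrypt col1 and col2, leave col3 in plaintext.
column_keys = build_column_keys(["col1", "col2", "col3"], exclude=["col3"])
print(column_keys)  # {'columns': ['col1', 'col2']}
```

The resulting dict would be passed as the `column_keys` argument of `pe.EncryptionConfiguration`; columns left out of it are written unencrypted.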