How to encrypt a pandas DataFrame with pyarrow and Parquet

Question

I would like to encrypt a pandas DataFrame as a Parquet file using modular encryption. I thought the best way to do that is to convert the DataFrame to the pyarrow format and then save it to Parquet with a modular encryption option. Something like this:

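import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

d = {'col1': [1, 2], 'col2': [3, 4]}
df = pd.DataFrame(data=d)
# write_table expects a pyarrow Table, not a DataFrame
table = pa.Table.from_pandas(df)
# enc_prop is the part I don't know how to create
pq.write_table(table, "test.parquet", encryption_properties=enc_prop)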

My problem is that I'm stuck on creating the encryption_properties.
Does anyone have an idea how to create them?

Big Thanks,
Seb

Answer 1

Score: 2

There is an example Python file in the Apache Arrow repo with

> An example for writing an encrypted parquet and reading an encrypted
> parquet using master keys managed by Hashicorp Vault KMS.

More info:

  • General documentation on Parquet Modular Encryption (Columnar Encryption): https://arrow.apache.org/docs/python/parquet.html#parquet-modular-encryption-columnar-encryption
  • Tests for writing an encrypted file: https://github.com/apache/arrow/blob/45918a90a6ca1cf3fd67c256a7d6a240249e555a/python/pyarrow/tests/parquet/test_encryption.py#L68-L98

Hope that helps.

Answer 2

Score: 0

I was able to encrypt and decrypt Parquet files as follows:

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.parquet.encryption as pe
# InMemoryKmsClient is a mock KMS client from pyarrow's test suite (testing only, not secure)
from pyarrow.tests.parquet.encryption import InMemoryKmsClient

# pandas dataframe

d = {'col1': [1, 2], 'col2': [3, 4], 'col3': [5, 6]}
df = pd.DataFrame(data=d)
table = pa.Table.from_pandas(df)

# -------------------------------------------------
# encryption

# secret 128-bit AES key
private_key = b"1234567890123450"

encryption_config = pe.EncryptionConfiguration(
    footer_key="footer",
    # maps a master key identifier to the columns encrypted with that key
    column_keys={
        "columns": df.columns.tolist(),
    },
    encryption_algorithm="AES_GCM_V1",
    data_key_length_bits=128,
)

kms_connection_config = pe.KmsConnectionConfig(
    custom_kms_conf={
        "footer": private_key.decode("UTF-8"),
        "columns": private_key.decode("UTF-8"),
    }
)

def kms_factory(kms_connection_configuration):
    return InMemoryKmsClient(kms_connection_configuration)

crypto_factory = pe.CryptoFactory(kms_factory)

encryption_properties = (
    crypto_factory.file_encryption_properties(
        kms_connection_config,
        encryption_config
    )
)

with pq.ParquetWriter(
    'encrypted_table.parquet',
    table.schema,
    encryption_properties=encryption_properties
) as writer:
    writer.write_table(table)

# -------------------------------------------------
# decryption

decryption_properties = (
    crypto_factory.file_decryption_properties(
        kms_connection_config
    )
)

parquet_file = pq.ParquetFile(
    'encrypted_table.parquet',
    decryption_properties=decryption_properties
)

print(parquet_file.read().to_pandas())
  • This encrypts all columns. To leave some columns unencrypted, list only the columns you want encrypted in encryption_config.column_keys["columns"].
  • Different keys can be used to encrypt the DataFrame body and the footer. I just used the same private key for both.
  • Both 128-bit and 256-bit AES work.

huangapple
  • Posted on 2023-02-23 20:21:30
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75544762.html