How to encrypt a pandas DataFrame with pyarrow and Parquet

Question

I would like to encrypt a pandas DataFrame as a Parquet file using modular encryption. I thought the best way to do that is to convert the DataFrame to a pyarrow table and then save it to Parquet with a modular encryption option. Something like this:

    import pandas as pd
    import pyarrow as pa
    import pyarrow.parquet as pq

    d = {'col1': [1, 2], 'col2': [3, 4]}
    df = pd.DataFrame(data=d)
    table = pa.Table.from_pandas(df)
    # enc_prop is the missing piece: the encryption properties I cannot create
    pq.write_table(table, "test.parquet", encryption_properties=enc_prop)

My problem is that I'm stuck on creating the encryption_properties.
Does anyone have an idea how to create them?

Many thanks,
Seb

Answer 1

Score: 2

The Apache Arrow repository contains an example Python file, described as:

> An example for writing an encrypted parquet and reading an encrypted
> parquet using master keys managed by Hashicorp Vault KMS.

More info:

  • General documentation on Parquet Modular Encryption (Columnar Encryption): https://arrow.apache.org/docs/python/parquet.html#parquet-modular-encryption-columnar-encryption
  • A test that writes an encrypted file: https://github.com/apache/arrow/blob/45918a90a6ca1cf3fd67c256a7d6a240249e555a/python/pyarrow/tests/parquet/test_encryption.py#L68-L98

Hope that helps.

Answer 2

Score: 0

I was able to encrypt and decrypt Parquet files as follows:

    import pandas as pd
    import pyarrow as pa
    import pyarrow.parquet as pq
    import pyarrow.parquet.encryption as pe
    # Test-only helper from pyarrow's test suite; it is not a secure KMS
    from pyarrow.tests.parquet.encryption import InMemoryKmsClient

    # pandas dataframe
    d = {'col1': [1, 2], 'col2': [3, 4], 'col3': [5, 6]}
    df = pd.DataFrame(data=d)
    table = pa.Table.from_pandas(df)

    # -------------------------------------------------
    # encryption
    # secret 128-bit (16-byte) AES key, used as the master key for both key IDs
    private_key = b"1234567890123450"

    encryption_config = pe.EncryptionConfiguration(
        footer_key="footer",
        # maps a master key ID to the list of columns encrypted with it
        column_keys={
            "columns": df.columns.tolist(),
        },
        encryption_algorithm="AES_GCM_V1",
        data_key_length_bits=128,
    )

    # the in-memory client reads its master keys from custom_kms_conf
    kms_connection_config = pe.KmsConnectionConfig(
        custom_kms_conf={
            "footer": private_key.decode("UTF-8"),
            "columns": private_key.decode("UTF-8"),
        }
    )

    def kms_factory(kms_connection_configuration):
        return InMemoryKmsClient(kms_connection_configuration)

    crypto_factory = pe.CryptoFactory(kms_factory)

    encryption_properties = crypto_factory.file_encryption_properties(
        kms_connection_config,
        encryption_config,
    )

    with pq.ParquetWriter(
        'encrypted_table.parquet',
        table.schema,
        encryption_properties=encryption_properties,
    ) as writer:
        writer.write_table(table)

    # -------------------------------------------------
    # decryption
    decryption_properties = crypto_factory.file_decryption_properties(
        kms_connection_config
    )

    parquet_file = pq.ParquetFile(
        'encrypted_table.parquet',
        decryption_properties=decryption_properties,
    )
    print(parquet_file.read().to_pandas())

  • This encrypts all columns. To leave some columns unencrypted, list only the columns you want encrypted under column_keys["columns"] in the encryption config.
  • Different keys can be used to encrypt the column data and the footer. I just used the same key for both.
  • Both 128-bit and 256-bit AES worked for me.

Posted by huangapple on February 23, 2023 at 20:21:30
Link to this article: https://go.coder-hub.com/75544762.html