How to save a Spark dataset in encrypted format?
Question
I am saving my Spark dataset as a Parquet file on my local machine. I would like to know whether there is any way to encrypt the data using some encryption algorithm. The code I am using to save the data as a Parquet file looks something like this.
dataset.write().mode("overwrite").parquet(parquetFile);
I saw a similar question, but my case is different because I am writing to my local disk.
Answer 1
Score: 5
Since Spark 3.2, columnar encryption is supported for Parquet tables.
For example:
hadoopConfiguration.set("parquet.encryption.kms.client.class",
    "org.apache.parquet.crypto.keytools.mocks.InMemoryKMS");

// Explicit master keys (base64 encoded) - required only for mock InMemoryKMS
hadoopConfiguration.set("parquet.encryption.key.list",
    "keyA:AAECAwQFBgcICQoLDA0ODw== , keyB:AAECAAECAAECAAECAAECAA==");

// Activate Parquet encryption, driven by Hadoop properties
hadoopConfiguration.set("parquet.crypto.factory.class",
    "org.apache.parquet.crypto.keytools.PropertiesDrivenCryptoFactory");

// Write encrypted dataframe files.
// Column "square" will be protected with master key "keyA".
// Parquet file footers will be protected with master key "keyB".
squaresDF.write()
    .option("parquet.encryption.column.keys", "keyA:square")
    .option("parquet.encryption.footer.key", "keyB")
    .parquet("/path/to/table.parquet.encrypted");

// Read encrypted dataframe files
Dataset<Row> df2 = spark.read().parquet("/path/to/table.parquet.encrypted");
This is based on the usage example in:
https://spark.apache.org/docs/3.2.0/sql-data-sources-parquet.html#columnar-encryption
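
For completeness, here is a minimal, self-contained Java sketch of the same flow. The SparkSession setup, the sample squaresDF built from a range, and the /tmp output path are assumptions added for illustration; the property names and key list are the ones from the snippet above.

import org.apache.hadoop.conf.Configuration;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import static org.apache.spark.sql.functions.expr;

public class EncryptedParquetDemo {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("encrypted-parquet-demo")
                .master("local[*]")  // assumed: running locally
                .getOrCreate();

        // The hadoopConfiguration used above lives on the SparkContext
        Configuration hadoopConfiguration =
                spark.sparkContext().hadoopConfiguration();
        hadoopConfiguration.set("parquet.crypto.factory.class",
                "org.apache.parquet.crypto.keytools.PropertiesDrivenCryptoFactory");
        hadoopConfiguration.set("parquet.encryption.kms.client.class",
                "org.apache.parquet.crypto.keytools.mocks.InMemoryKMS");
        hadoopConfiguration.set("parquet.encryption.key.list",
                "keyA:AAECAwQFBgcICQoLDA0ODw== , keyB:AAECAAECAAECAAECAAECAA==");

        // A stand-in for squaresDF: a "square" column derived from a range
        Dataset<Row> squaresDF =
                spark.range(1, 6).withColumn("square", expr("id * id"));

        // Write with per-column encryption, then read the file back
        squaresDF.write()
                .mode("overwrite")
                .option("parquet.encryption.column.keys", "keyA:square")
                .option("parquet.encryption.footer.key", "keyB")
                .parquet("/tmp/table.parquet.encrypted");  // assumed path

        Dataset<Row> df2 = spark.read().parquet("/tmp/table.parquet.encrypted");
        df2.show();

        spark.stop();
    }
}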
Answer 2
Score: 1
I don't think you can do this with Spark directly; however, there are other projects you can use around Parquet, in particular Apache Arrow. I think this video explains how to do it:
https://databricks.com/session_na21/data-security-at-scale-through-spark-and-parquet-encryption
UPDATE: since Spark 3.2.0, this seems to be possible (see the answer above).
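
As a sketch of what the non-Spark route could look like, the low-level parquet-mr API (parquet-hadoop 1.12+) lets you build encryption properties and hand them to a ParquetWriter. The class and method names below (FileEncryptionProperties, ColumnEncryptionProperties) are my reading of that API, and the "square" column and hard-coded keys are hypothetical; verify against the parquet-mr version you use.

import java.nio.charset.StandardCharsets;
import java.util.Collections;
import java.util.Map;
import org.apache.parquet.crypto.ColumnEncryptionProperties;
import org.apache.parquet.crypto.FileEncryptionProperties;
import org.apache.parquet.hadoop.metadata.ColumnPath;

public class ParquetEncryptionSketch {
    // Builds file encryption properties that a ParquetWriter builder can
    // accept via its withEncryption(...) method. Keys are hard-coded for
    // illustration only; real code should fetch them from a KMS.
    public static FileEncryptionProperties encryptionProperties() {
        byte[] footerKey = "0123456789012345".getBytes(StandardCharsets.UTF_8); // 16-byte AES key
        byte[] columnKey = "1234567890123450".getBytes(StandardCharsets.UTF_8);

        // Encrypt the (hypothetical) "square" column with its own key
        ColumnEncryptionProperties columnProps =
                ColumnEncryptionProperties.builder("square")
                        .withKey(columnKey)
                        .build();
        Map<ColumnPath, ColumnEncryptionProperties> encryptedColumns =
                Collections.singletonMap(columnProps.getPath(), columnProps);

        // The footer key also protects the file metadata
        return FileEncryptionProperties.builder(footerKey)
                .withEncryptedColumns(encryptedColumns)
                .build();
    }
}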