How to save a Spark dataset in encrypted format?


Question

I am saving my Spark dataset as a Parquet file on my local machine. I would like to know whether there is any way to encrypt the data with some encryption algorithm. The code I am using to save the data as a Parquet file looks something like this:

dataset.write().mode("overwrite").parquet(parquetFile);

I saw a similar question, but my query is different because I am writing to my local disk.

Answer 1

Score: 5

Since Spark 3.2, columnar encryption is supported for Parquet tables.

For example:

hadoopConfiguration.set("parquet.encryption.kms.client.class",
   "org.apache.parquet.crypto.keytools.mocks.InMemoryKMS");

// Explicit master keys (base64 encoded) - required only for the mock InMemoryKMS
hadoopConfiguration.set("parquet.encryption.key.list",
   "keyA:AAECAwQFBgcICQoLDA0ODw== ,  keyB:AAECAAECAAECAAECAAECAA==");

// Activate Parquet encryption, driven by Hadoop properties
hadoopConfiguration.set("parquet.crypto.factory.class",
   "org.apache.parquet.crypto.keytools.PropertiesDrivenCryptoFactory");

// Write encrypted DataFrame files.
// Column "square" will be protected with master key "keyA".
// Parquet file footers will be protected with master key "keyB".
squaresDF.write()
   .option("parquet.encryption.column.keys", "keyA:square")
   .option("parquet.encryption.footer.key", "keyB")
   .parquet("/path/to/table.parquet.encrypted");

// Read encrypted DataFrame files
Dataset<Row> df2 = spark.read().parquet("/path/to/table.parquet.encrypted");

This is based on the usage example in:
https://spark.apache.org/docs/3.2.0/sql-data-sources-parquet.html#columnar-encryption

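A note on the snippet above: hadoopConfiguration is not defined in the answer itself. In a typical Java Spark application it is the Hadoop Configuration of the active session, obtained via the SparkContext. A minimal sketch of that setup, assuming a local-mode SparkSession (the app name and master settings are illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.spark.sql.SparkSession;

// Local-mode session, matching the question's local-disk use case (illustrative settings)
SparkSession spark = SparkSession.builder()
   .appName("parquet-encryption-example")
   .master("local[*]")
   .getOrCreate();

// The Hadoop configuration on which the parquet.encryption.* and
// parquet.crypto.* properties from the answer are set
Configuration hadoopConfiguration = spark.sparkContext().hadoopConfiguration();

Note also that reading the encrypted files back requires the same key/KMS configuration in the reading session; the read at the end of the example works because it runs with the same hadoopConfiguration.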

Answer 2

Score: 1

I don't think you can do this with Spark directly; however, there are other projects you can put around Parquet, in particular Apache Arrow. I think this video explains how to do it:

https://databricks.com/session_na21/data-security-at-scale-through-spark-and-parquet-encryption

UPDATE: since Spark 3.2.0 this seems to be possible (see Answer 1).
