How to save a Spark dataset in encrypted format?


Question

I am saving my Spark dataset as a Parquet file on my local machine. I would like to know whether there is any way to encrypt the data with some encryption algorithm. The code I am using to save the data as a Parquet file looks something like this:

dataset.write().mode("overwrite").parquet(parquetFile);

I saw a similar question, but my query is different because I am writing to my local disk.

Answer 1

Score: 5

Since Spark 3.2, columnar encryption is supported for Parquet tables.

For example:

hadoopConfiguration.set("parquet.encryption.kms.client.class",
   "org.apache.parquet.crypto.keytools.mocks.InMemoryKMS");

// Explicit master keys (base64 encoded) - required only for the mock InMemoryKMS
hadoopConfiguration.set("parquet.encryption.key.list",
   "keyA:AAECAwQFBgcICQoLDA0ODw== ,  keyB:AAECAAECAAECAAECAAECAA==");

// Activate Parquet encryption, driven by Hadoop properties
hadoopConfiguration.set("parquet.crypto.factory.class",
   "org.apache.parquet.crypto.keytools.PropertiesDrivenCryptoFactory");

// Write encrypted DataFrame files.
// Column "square" will be protected with master key "keyA".
// Parquet file footers will be protected with master key "keyB".
squaresDF.write()
   .option("parquet.encryption.column.keys", "keyA:square")
   .option("parquet.encryption.footer.key", "keyB")
   .parquet("/path/to/table.parquet.encrypted");

// Read encrypted DataFrame files
Dataset<Row> df2 = spark.read().parquet("/path/to/table.parquet.encrypted");

This is based on the usage example in:
https://spark.apache.org/docs/3.2.0/sql-data-sources-parquet.html#columnar-encryption

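A note on the snippet above: hadoopConfiguration is not defined in the answer itself. In a typical Java Spark application it is the Hadoop Configuration of the active session, obtained via the SparkContext. A minimal sketch of that setup, assuming a local-mode SparkSession (the app name and master settings are illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.spark.sql.SparkSession;

// Local-mode session, matching the question's local-disk use case (illustrative settings)
SparkSession spark = SparkSession.builder()
   .appName("parquet-encryption-example")
   .master("local[*]")
   .getOrCreate();

// The Hadoop configuration on which the parquet.encryption.* and
// parquet.crypto.* properties from the answer are set
Configuration hadoopConfiguration = spark.sparkContext().hadoopConfiguration();

Note also that reading the encrypted files back requires the same key/KMS configuration in the reading session; the read at the end of the example works because it runs with the same hadoopConfiguration.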

Answer 2

Score: 1

I don't think you can do this with Spark directly; however, there are other projects you can put around Parquet, in particular Apache Arrow. I think this video explains how to do it:

https://databricks.com/session_na21/data-security-at-scale-through-spark-and-parquet-encryption

UPDATE: since Spark 3.2.0 this seems to be possible (see Answer 1).
