PySpark: make DataFrame no longer accessible

Question
My goal is to write two functions capture and release which take a PySpark DataFrame as input and make it "inaccessible" to the user. The behavior I'm looking for is something like:
df = spark.read.csv("...") # or some other way of creating a DataFrame
df.show() # this should work
capture(df)
df.show() # this should no longer work, but raise an exception, ideally one I control
df.count() # this should not work either
release(df)
df.count() # now this should work again
The method should ideally be independent of how the DataFrame was generated, e.g. approaches like "renaming the underlying file path" are not great (the program might also not have permissions to do so).
It would be even better to also prevent the DataFrame from being used indirectly, e.g.:
df = spark.read.csv("...") # or some other way of creating a DataFrame
df2 = df.withColumnRenamed("foo", "bar")
df2.show() # this should work
capture(df)
df2.show() # this should not work
release(df)
df2.show() # now this should work again
Is this possible? What's the cleanest way to get a behavior like this? I'm not looking for a solution that provides bulletproof security, just a way of warning the user if they're trying to do something that they likely didn't realize would cause issues.
More context on the question: we're building a library to make it easy to write and run differentially private pipelines on PySpark. For the privacy guarantees to hold, the private data must only be used once: as input to the differentially private program, and nowhere else. Using the private data in other places of the program (e.g. when defining hyperparameters) is a common class of pitfalls: users might do it because it's convenient, and fail to realize that this breaks the privacy guarantees. We're looking for ways of catching the most common examples of this pitfall, and making the private data inaccessible after initialization would be a useful mitigation.
Answer 1
Score: 1
For the first part, I can offer this piece of code:

class CustomDataframe:
    def __init__(self, dataframe):
        self._dataframe = dataframe
        self._locked = False

    def __getattr__(self, attr):
        """
        When we access an attribute or method, we want to get the
        attribute/method of the underlying DataFrame object.

        Args:
            attr (str): Name of the attribute

        Returns:
            Attribute "attr" of the _dataframe object
        """
        if self._locked:
            raise Exception("custom exception")
        return getattr(self._dataframe, attr)

    def capture(self):
        self._locked = True

    def release(self):
        self._locked = False
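As a quick sanity check, here is a self-contained usage sketch of the wrapper above. A plain stand-in object (the hypothetical FakeDataFrame, purely for illustration) replaces a real PySpark DataFrame so the snippet runs without a Spark session; with Spark you would wrap spark.read.csv(...) instead.

```python
# The wrapper class from the answer, repeated so the snippet is self-contained.
class CustomDataframe:
    def __init__(self, dataframe):
        self._dataframe = dataframe
        self._locked = False

    def __getattr__(self, attr):
        # Only called for names NOT found on the wrapper itself, i.e.
        # everything except _dataframe, _locked, capture and release.
        if self._locked:
            raise Exception("custom exception")
        return getattr(self._dataframe, attr)

    def capture(self):
        self._locked = True

    def release(self):
        self._locked = False


class FakeDataFrame:
    """Hypothetical stand-in for pyspark.sql.DataFrame."""
    def count(self):
        return 3


df = CustomDataframe(FakeDataFrame())
print(df.count())      # delegates to the wrapped object: prints 3

df.capture()
try:
    df.count()
except Exception as exc:
    print(exc)         # prints "custom exception"

df.release()
print(df.count())      # accessible again: prints 3
```

Note that capture and release themselves stay callable while locked: they are found on the wrapper class by normal attribute lookup, so Python never invokes `__getattr__` for them.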
Here, release and capture are not standalone functions but methods of CustomDataframe.
For the second use case, df2 doesn't carry the information that it was originally derived from df, so I do not really know how to do it...
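One possible direction for the derived-DataFrame case (my own sketch, not part of the answer above): share a single mutable lock object between a wrapper and every wrapper created from it, and re-wrap the results of method calls so derived frames observe the same lock. The class and method names below (Lock, SharedLockFrame, FakeFrame) are hypothetical; a real version would re-wrap only results that are instances of pyspark.sql.DataFrame.

```python
class Lock:
    """Mutable lock state shared by a frame and all frames derived from it."""
    def __init__(self):
        self.locked = False


class SharedLockFrame:
    def __init__(self, inner, lock=None):
        self._inner = inner
        self._lock = lock or Lock()

    def __getattr__(self, attr):
        if self._lock.locked:
            raise Exception("this DataFrame has been captured")
        value = getattr(self._inner, attr)
        if callable(value):
            def wrapped(*args, **kwargs):
                result = value(*args, **kwargs)
                # Re-wrap "DataFrame-like" results so derived frames share
                # the same lock. Here: anything of the same type as _inner;
                # with Spark, check isinstance(result, pyspark.sql.DataFrame).
                if isinstance(result, type(self._inner)):
                    return SharedLockFrame(result, self._lock)
                return result
            return wrapped
        return value

    def capture(self):
        self._lock.locked = True

    def release(self):
        self._lock.locked = False


class FakeFrame:
    """Hypothetical stand-in for pyspark.sql.DataFrame."""
    def __init__(self, n):
        self.n = n

    def withColumnRenamed(self, old, new):
        # Like Spark, transformations return a NEW frame.
        return FakeFrame(self.n)

    def count(self):
        return self.n


df = SharedLockFrame(FakeFrame(3))
df2 = df.withColumnRenamed("foo", "bar")  # re-wrapped, shares df's lock
df.capture()
# df2.count() would now raise the same exception as df.count()
df.release()
print(df2.count())  # prints 3
```

This only covers derived frames created through the wrapper; if the user unwraps the raw DataFrame (or created df2 before wrapping), the lock is bypassed, which may be acceptable given that the question asks for a warning mechanism rather than bulletproof security.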