PySpark: make DataFrame no longer accessible

Question

My goal is to write two functions, capture and release, which take a PySpark DataFrame as input and make it "inaccessible" to the user until it is released. The behavior I'm looking for is something like:

df = spark.read.csv("...") # or some other way of creating a DataFrame

df.show() # this should work

capture(df)

df.show() # this should no longer work, but raise an exception, ideally one I control
df.count() # this should not work either

release(df)

df.count() # now this should work again

The method should ideally be independent of how the DataFrame was generated, e.g. approaches like "renaming the underlying file path" are not great (the program might also not have permissions to do so).

It would be even better to also prevent the DataFrame from being used indirectly, e.g.:

df = spark.read.csv("...") # or some other way of creating a DataFrame
df2 = df.withColumnRenamed("foo", "bar")

df2.show() # this should work

capture(df)

df2.show() # this should not work

release(df)

df2.show() # now this should work again

Is this possible? What's the cleanest way to get a behavior like this? I'm not looking for a solution that provides bulletproof security, just a way of warning the user if they're trying to do something that they likely didn't realize would cause issues.

More context on the question: we're building a library to make it easy to write and run differentially private pipelines on PySpark. For the privacy guarantees to hold, the private data must only be used once: as input to the differentially private program, and nowhere else. Using the private data in other places of the program (e.g. when defining hyperparameters) is a common class of pitfalls: users might do it because it's convenient, and fail to realize that this breaks the privacy guarantees. We're looking for ways of catching the most common examples of this pitfall, and making the private data inaccessible after initialization would be a useful mitigation.

Answer 1

Score: 1

For the first part, I can offer this piece of code:

class CustomDataframe:
    def __init__(self, dataframe):
        self._dataframe = dataframe
        self._locked = False

    def __getattr__(self, attr):
        """
        When we access an attribute or method we want to get the
        attribute/method of the dataframe object.

        Args:
            attr (str): Name of the attribute

        Returns:
            Attribute "attr" of the _dataframe object
        """
        if self._locked:
            raise Exception("custom exception")
        return getattr(self._dataframe, attr)

    def capture(self):
        self._locked = True

    def release(self):
        self._locked = False

Here, capture and release are methods of CustomDataframe rather than standalone functions, so the user has to work with the wrapper instead of the raw DataFrame.
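As a rough usage sketch of the wrapper above: the version below raises a dedicated exception class, so calling code can catch it specifically instead of a bare Exception. The names DataFrameLockedError and FakeDataFrame are hypothetical and not part of PySpark; FakeDataFrame is only a stand-in so the example runs without a Spark session. In real use you would presumably pass an actual DataFrame to CustomDataframe.

```python
class DataFrameLockedError(RuntimeError):
    """Raised when a captured DataFrame is accessed (hypothetical name)."""


class CustomDataframe:
    """Same wrapper as above, raising a dedicated exception when locked."""

    def __init__(self, dataframe):
        self._dataframe = dataframe
        self._locked = False

    def __getattr__(self, attr):
        # Only called when normal lookup fails, so capture/release and the
        # two private fields never pass through this method once they exist.
        if attr in ("_dataframe", "_locked"):
            raise AttributeError(attr)  # avoids recursion if __init__ was skipped
        if self._locked:
            raise DataFrameLockedError(
                f"DataFrame is captured; cannot access {attr!r}"
            )
        return getattr(self._dataframe, attr)

    def capture(self):
        self._locked = True

    def release(self):
        self._locked = False


class FakeDataFrame:
    """Stand-in for a PySpark DataFrame so the sketch runs without Spark."""

    def __init__(self, rows):
        self._rows = list(rows)

    def count(self):
        return len(self._rows)


df = CustomDataframe(FakeDataFrame([1, 2, 3]))
print(df.count())      # works before capture

df.capture()
try:
    df.count()
except DataFrameLockedError as exc:
    print("blocked:", exc)

df.release()
print(df.count())      # works again after release
```

Two limitations are worth keeping in mind. Python looks up special methods such as `__getitem__` on the class rather than through `__getattr__`, so `df["col"]` is not forwarded by this wrapper. Also, anyone who still holds a reference to the raw DataFrame can bypass the lock entirely, which fits the question's stated goal of a warning mechanism rather than real security.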


For the second use case, df2 keeps no record that it was derived from df, so locking df has no effect on df2. I do not really know how to handle that case.
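One possible direction for the second use case, which is not part of the original answer, is to let the wrapper re-wrap every DataFrame it returns and have all derived wrappers share one lock object. The sketch below assumes that approach; LockableDataFrame, _Lock, DataFrameLockedError and FakeDataFrame are hypothetical names, and FakeDataFrame again stands in for a real PySpark DataFrame so the code runs without Spark.

```python
class DataFrameLockedError(RuntimeError):
    """Raised when a captured DataFrame is accessed (hypothetical name)."""


class _Lock:
    """Mutable flag shared by a wrapper and everything derived from it."""

    def __init__(self):
        self.locked = False


class LockableDataFrame:
    def __init__(self, dataframe, lock=None):
        self._dataframe = dataframe
        self._lock = lock if lock is not None else _Lock()

    def _check(self, attr):
        if self._lock.locked:
            raise DataFrameLockedError(
                f"DataFrame is captured; cannot access {attr!r}"
            )

    def __getattr__(self, attr):
        if attr in ("_dataframe", "_lock"):
            raise AttributeError(attr)  # avoids recursion if __init__ was skipped
        self._check(attr)
        value = getattr(self._dataframe, attr)
        if not callable(value):
            return value

        def checked(*args, **kwargs):
            self._check(attr)  # re-check in case the method was stored earlier
            result = value(*args, **kwargs)
            # Re-wrap derived DataFrames so they share the same lock.
            if isinstance(result, type(self._dataframe)):
                return LockableDataFrame(result, self._lock)
            return result

        return checked


def capture(df):
    df._lock.locked = True


def release(df):
    df._lock.locked = False


class FakeDataFrame:
    """Stand-in for a PySpark DataFrame so the sketch runs without Spark."""

    def __init__(self, columns, rows):
        self.columns = list(columns)
        self._rows = list(rows)

    def withColumnRenamed(self, existing, new):
        cols = [new if c == existing else c for c in self.columns]
        return FakeDataFrame(cols, self._rows)

    def count(self):
        return len(self._rows)


df = LockableDataFrame(FakeDataFrame(["foo"], [(1,), (2,)]))
df2 = df.withColumnRenamed("foo", "bar")
print(df2.columns)     # derived wrapper works before capture

capture(df)
try:
    df2.count()
except DataFrameLockedError as exc:
    print("blocked:", exc)

release(df)
print(df2.count())     # works again after release
```

This only covers results whose type matches the wrapped object. Intermediate objects that are not DataFrames (for example, the result of a groupBy) would pass through unwrapped, and wrappers passed as arguments to methods such as join would need to be unwrapped first, so a real implementation would need more care than this sketch shows.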

huangapple
  • Posted on March 7, 2023, 16:56:44
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75659800.html