Identifying personal information from column description

Question

I have a question about identifying sentences related to GDPR (General Data Protection Regulation).
Is there a tool or method in Python, Java, etc. that can identify whether a database column contains personally identifiable information from its description alone?

One idea is to use word embeddings to get the "most_similar" or "most_similar_cosmul" words for a given sentence and then look for GDPR-related keywords (biometric, personal, id, photo...), but the results depend on the robustness of the word embedding model.

Thank you in advance,

Answer 1

Score: 0

There is no such thing as "personally identifiable information" in GDPR. The term (from GDPR article 4(1)) is "personal data", defined as:

> any information relating to an identified or identifiable natural person

and it doesn't itself have to be identifying to qualify. What's an "identifiable natural person"? GDPR says:

> an identifiable natural person is one who can be identified, directly or indirectly, in particular by reference to an identifier such as a name, an identification number, location data, an online identifier or to one or more factors specific to the physical, physiological, genetic, mental, economic, cultural or social identity of that natural person

The key thing that turns regular "data" into "personal data" here is that "one or more factors" phrase. A single field, such as a phone number, could reasonably be considered as uniquely identifying a person. By itself a postal code probably doesn't, but when combined with a street address and a first name, we'd be very close to being able to identify someone, and hence all other data would become "personal". It's hard to evaluate whether a collection of fields is enough to uniquely identify someone or not – you might think that first name and city might not identify an individual, given "John" and "London", but "Esmerelda" and "Ulaanbaatar" might be pretty easy to track down, and it's the "worst case" that counts.
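As a rough illustration of the "worst case" point, here is a minimal Python sketch that counts how many rows share each combination of field values. The records and the helper names (`smallest_group`, `unique_rows`) are made up for this example and are not from any library; this is a toy check, not a legal test for identifiability.

```python
from collections import Counter

# Hypothetical user records as (first_name, city). Purely illustrative data.
records = [
    ("John", "London"),
    ("John", "London"),
    ("John", "London"),
    ("Mary", "London"),
    ("Mary", "London"),
    ("Esmerelda", "Ulaanbaatar"),
]


def smallest_group(rows):
    """Return the size of the smallest group of rows sharing the same values."""
    counts = Counter(rows)
    return min(counts.values())


def unique_rows(rows):
    """Return the value combinations that occur exactly once."""
    counts = Counter(rows)
    return [combo for combo, n in counts.items() if n == 1]


# The smallest group decides the outcome, not the average one.
print(smallest_group(records))
print(unique_rows(records))
```

Even though most rows here hide in a group of two or three, a single combination that occurs once is enough for that pair of fields to single someone out.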

For a simpler example: A colour value such as #663399 by itself is just plain "data", is not "personal data", and is not subject to GDPR. That exact same value stored as "favourite colour" in a field in a table linking that data to a person is personal data. "City" in a table of cities is not personal data, but a "city" field in a user table is.
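To make that concrete, here is a small Python sketch of a naive keyword check over column descriptions. The keyword set, the table and column names, and the `looks_personal` helper are all assumptions invented for this example; it is not an official GDPR vocabulary or an existing tool.

```python
import re

# Hypothetical hint words, chosen for illustration only.
PERSONAL_HINTS = {"name", "email", "phone", "address", "birth", "biometric", "photo"}


def looks_personal(description):
    """Naive check: does the column description contain a hint keyword?"""
    words = set(re.findall(r"[a-z]+", description.lower()))
    return bool(words & PERSONAL_HINTS)


# Hypothetical schema: (table, column) mapped to the column description.
columns = {
    ("users", "phone"): "Phone number of the user",
    ("users", "favourite_colour"): "Favourite colour as a hex value",
    ("users", "city"): "City",
    ("cities", "city"): "City",
}

for (table, column), description in columns.items():
    print(table, column, looks_personal(description))
```

The check flags the phone column but misses the favourite colour, and it necessarily gives the same verdict for "city" in both tables, because the description carries no information about which table the column lives in.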

In short, you're not going to be able to do what you want. You can't tell whether a field is personal data or not from its name because you have insufficient context.

huangapple
  • Posted on July 28, 2020 at 16:48:17
  • When reposting, please keep the link to this article: https://go.coder-hub.com/63130327.html