Entity resolution - creating a unique identifier based on 3 columns

Question
I'm trying to find a way to create a unique identifier using 3 columns (user_id, universal_id and session_id). The "expected_result" column is what this unique_id should be after processing the other 3 columns.
- Sometimes user_id is not available, and in that case the other two columns should be used to create the unique id.
- When user_id doesn't have a match and universal_id has a match, those should be treated as different (separate unique id).
- "id" column is the order in which data is written into the database. If a new row shows up that matches any of the previous rows (with already calculated unique id) by any of the 3 columns, the already existing unique id should be added to the new row.
Here's a list of possible relationships between columns:
- user_id:universal_id = 1:N OR N:1 (if N:1 then each N needs a unique_id)
- user_id:session_id = 1:N
- universal_id:session_id = 1:N or N:1
I'm trying to find something in Python (or PySpark, since I may be running this on millions of rows) that can help me cluster this data (or whatever this process is called in data science). The idea is to create a map of universal_id:unique_id. If you know how this is done, please help, or at least point me to a topic I should research to be able to do this. Thanks!
I have Snowflake and Databricks at my disposal.
Here's my test dataset:
import pandas as pd
data = [
[1, 1, 'apple', 'fiat', 1],
[2, 1, 'pear', 'bmw', 1],
[3, 2, 'bananna', 'citroen', 2],
[4, 3, 'bananna', 'kia', 3],
[5, 4, 'blueberry', 'peugeot', 4],
[6, None, 'blueberry', 'peugeot', 4],
[7, None, 'blueberry', 'yamaha', 4],
[8, 5, 'plum', 'ford', 5],
[9, None, 'watermelon', 'ford', 5],
[10, None, 'raspberry', 'honda', 6],
[11, None, 'raspberry', 'toyota', 6],
[12, None, 'avocado', 'mercedes', 7],
[13, None, 'cherry', 'mercedes', 7],
[14, None, 'apricot', 'volkswagen', 2],
[15, 2, 'apricot', 'volkswagen', 2],
[16, 6, 'blueberry', 'audi', 8],
[17, None, 'blackberry', 'bmw', 1],
[18, 7, 'plum', 'porsche', 9]
]
df = pd.DataFrame(data, columns=['id', 'user_id', 'universal_id', 'session_id', 'expected_result'])
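For what it's worth, the behaviour described above is usually called entity resolution (or identity stitching), and the transitive part of it is a connected-components problem. Below is a minimal union-find sketch over a subset of the rows above. It is illustrative only: it implements plain transitive matching and deliberately ignores the rule that a new user_id must start a separate cluster even when its universal_id matches an earlier row (so e.g. rows 5 and 16 would wrongly merge on the full dataset).

```python
# Minimal sketch: rows that share any non-null key end up in the same
# component. This is plain transitive matching, NOT the full rule set
# from the question.

class UnionFind:
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra

# (id, user_id, universal_id, session_id) - a subset of the test data
rows = [
    (1, 1, 'apple', 'fiat'),
    (2, 1, 'pear', 'bmw'),
    (10, None, 'raspberry', 'honda'),
    (11, None, 'raspberry', 'toyota'),
    (17, None, 'blackberry', 'bmw'),
]

uf = UnionFind()
for rid, user, univ, sess in rows:
    # Link each row node to its non-null key nodes; shared keys merge rows.
    for key in (('user', user), ('universal', univ), ('session', sess)):
        if key[1] is not None:
            uf.union(('row', rid), key)

groups = {rid: uf.find(('row', rid)) for rid, _, _, _ in rows}
```

Here rows 1, 2 and 17 land in one component (shared user_id 1 and the bmw session), while rows 10 and 11 form another (shared raspberry). Topics to research: entity resolution, record linkage, connected components; for the millions-of-rows case on Databricks, the GraphFrames library for Spark ships a connectedComponents() routine.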
Answer 1
Score: 1
Based on what you described, we can formulate an algorithm as follows, referring to the new ID as global_id. Update: the algorithm now features an arbitrary tie-break when multiple user_ids match multiple universal_ids. Update: since you were concerned about the risk of duplicates when using fully randomly generated UUID4s, I coded you a little function that lets you generate a UUID from UUID1 and/or UUID4 - personally I would not be worried about collisions of UUID4 values whatsoever, but it's up to you.
- Create a new global_id for every user_id that has multiple occurrences (n>1)
- Propagate values for global_id to all rows with a matching user_id
- Create a new global_id for every user_id that has a single occurrence (n=1)
- Propagate values for global_id to all rows with a matching universal_id, i.e. rows which don't match on user_id but match on universal_id. There is an arbitrary tie-break if multiple universal_ids match on one or more user_ids, where all matching universal_ids are assigned to the same user_id
- Create a new global_id for every universal_id which cannot be linked to a user_id but has multiple occurrences (n>1)
- Propagate values for global_id to all rows with a matching universal_id
- Propagate existing values for global_id to all rows with a matching session_id, i.e. rows which match on neither user_id nor universal_id but do match on session_id
- Create a new global_id for every session_id which can be linked to neither a user_id nor a universal_id but has multiple occurrences (n>1)
- Propagate values for global_id to all rows with a matching session_id
- (not needed in your example but might be useful): create a new global_id for every session_id which does not have multiple occurrences (n=1)
Hope this helps!
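As an aside on the collision concern: a UUID4 has roughly 2^122 possible values, so duplicates in any realistically sized batch would be astonishing. A purely illustrative sanity check:

```python
import uuid

# Generate a batch of random UUID4s and check for duplicates.
ids = [str(uuid.uuid4()) for _ in range(10_000)]
unique_ids = set(ids)
# In practice the set has the same size as the list: no collisions.
```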
import uuid
import pandas as pd
import numpy as np
data = [
[1, 1, 'apple', 'fiat', 1],
[2, 1, 'pear', 'bmw', 1],
[3, 2, 'bananna', 'citroen', 2],
[4, 3, 'bananna', 'kia', 3],
[5, 4, 'blueberry', 'peugeot', 4],
[6, None, 'blueberry', 'peugeot', 4],
[7, None, 'blueberry', 'yamaha', 4],
[8, 5, 'plum', 'ford', 5],
[9, None, 'watermelon', 'ford', 5],
[10, None, 'raspberry', 'honda', 6],
[11, None, 'raspberry', 'toyota', 6],
[12, None, 'avocado', 'mercedes', 7],
[13, None, 'cherry', 'mercedes', 7],
[14, None, 'apricot', 'volkswagen', 2],
[15, 2, 'apricot', 'volkswagen', 2],
[16, 6, 'blueberry', 'audi', 8],
[17, None, 'blackberry', 'bmw', 1],
[18, 7, 'plum', 'porsche', 9]
]
def generate_uuid(use_uuid1: bool = True, use_uuid4: bool = False) -> str:
    """Helper function creating UUIDs.

    Arguments:
        use_uuid1: Whether the generated UUID string should feature a UUID1 part.
            Defaults to `True`.
        use_uuid4: Whether the generated UUID string should feature a UUID4 part.
            Defaults to `False`.

    Returns:
        Universally unique identifier based on UUID1 and/or UUID4.
    """
    uuid_str = ""
    if not use_uuid1 and not use_uuid4:
        raise ValueError("Both use_uuid1 and use_uuid4 are set to `False`, cannot create UUID.")
    elif use_uuid1 and use_uuid4:
        uuid_str += f"{str(uuid.uuid1())}-{str(uuid.uuid4())}"
    elif use_uuid1:
        uuid_str += str(uuid.uuid1())
    else:
        uuid_str += str(uuid.uuid4())
    return uuid_str
df = pd.DataFrame(data, columns=['id', 'user_id', 'universal_id', 'session_id', 'expected_result'])
# STEP 1
df.sort_values(by='user_id', inplace=True)
df['_same_user_id'] = (
(df['user_id'] == df['user_id'].shift(-1))
& (df['user_id'] != df['user_id'].shift(1))
)
df['global_id'] = [generate_uuid() if value else np.nan for value in df['_same_user_id'].values]
# STEP 2
df['global_id'] = df.groupby('user_id')['global_id'].ffill()
# STEP 3
df['_new_ids'] = [generate_uuid() if not np.isnan(value) else np.nan for value in df['user_id'].values]
df['global_id'] = df['global_id'].fillna(df['_new_ids'])
# STEP 4
df.sort_values(by='universal_id', inplace=True)
df['global_id'] = df.groupby('universal_id')['global_id'].ffill()
# STEP 5
df['_count_universal_id'] = df['universal_id'].groupby(df['universal_id']).transform('count')
df['_same_universal_id'] = (
(df['universal_id'] == df['universal_id'].shift(-1))
& (df['universal_id'] != df['universal_id'].shift(1))
)
df['_new_id_for_universal_id'] = (
df['_count_universal_id'].gt(1)
& (df['global_id'].isnull())
& df['_same_universal_id']
)
df['_new_ids'] = [generate_uuid() if value else np.nan for value in df['_new_id_for_universal_id'].values]
df['global_id'] = df['global_id'].fillna(df['_new_ids'])
# STEP 6
df['global_id'] = df.groupby('universal_id')['global_id'].ffill()
# STEP 7
df.sort_values(by='session_id', inplace=True)
df['global_id'] = df.groupby('session_id')['global_id'].ffill()
# STEP 8
df['_count_session_id'] = df['session_id'].groupby(df['session_id']).transform('count')
df['_same_session_id'] = (
(df['session_id'] == df['session_id'].shift(-1))
& (df['session_id'] != df['session_id'].shift(1))
)
df['_new_id_for_session_id'] = (
df['_count_session_id'].gt(1)
& (df['global_id'].isnull())
& df['_same_session_id']
)
df['_new_ids'] = [generate_uuid() if value else np.nan for value in df['_new_id_for_session_id'].values]
df['global_id'] = df['global_id'].fillna(df['_new_ids'])
# STEP 9
df['global_id'] = df.groupby('session_id')['global_id'].ffill()
# STEP 10
df['_new_ids'] = [generate_uuid() if value == 1 else np.nan for value in df['_count_session_id'].values]
df['global_id'] = df['global_id'].fillna(df['_new_ids'])
# DROP INTERNAL COLUMNS
cols_to_drop = [col for col in df.columns if col.startswith("_")]
df.drop(columns=cols_to_drop, inplace=True)
Results (since we're now using UUID1, the IDs look very similar, but they are not the same):
| id | user_id | universal_id | session_id | expected_result | global_id |
|-----:|----------:|:---------------|:-------------|------------------:|:-------------------------------------|
| 2 | 1 | pear | bmw | 1 | ee52b80a-f0da-11ed-8f35-0242ac1c000c |
| 17 | nan | blackberry | bmw | 1 | ee52b80a-f0da-11ed-8f35-0242ac1c000c |
| 1 | 1 | apple | fiat | 1 | ee52b80a-f0da-11ed-8f35-0242ac1c000c |
| 3 | 2 | bananna | citroen | 2 | ee52ba1c-f0da-11ed-8f35-0242ac1c000c |
| 14 | nan | apricot | volkswagen | 2 | ee52ba1c-f0da-11ed-8f35-0242ac1c000c |
| 15 | 2 | apricot | volkswagen | 2 | ee52ba1c-f0da-11ed-8f35-0242ac1c000c |
| 4 | 3 | bananna | kia | 3 | ee530d00-f0da-11ed-8f35-0242ac1c000c |
| 5 | 4 | blueberry | peugeot | 4 | ee530e04-f0da-11ed-8f35-0242ac1c000c |
| 6 | nan | blueberry | peugeot | 4 | ee530e04-f0da-11ed-8f35-0242ac1c000c |
| 7 | nan | blueberry | yamaha | 4 | ee530e04-f0da-11ed-8f35-0242ac1c000c |
| 9 | nan | watermelon | ford | 5 | ee530f08-f0da-11ed-8f35-0242ac1c000c |
| 8 | 5 | plum | ford | 5 | ee530f08-f0da-11ed-8f35-0242ac1c000c |
| 16 | 6 | blueberry | audi | 8 | ee53100c-f0da-11ed-8f35-0242ac1c000c |
| 18 | 7 | plum | porsche | 9 | ee531106-f0da-11ed-8f35-0242ac1c000c |
| 10 | nan | raspberry | honda | 6 | ee54a002-f0da-11ed-8f35-0242ac1c000c |
| 11 | nan | raspberry | toyota | 6 | ee54a002-f0da-11ed-8f35-0242ac1c000c |
| 13 | nan | cherry | mercedes | 7 | ee567260-f0da-11ed-8f35-0242ac1c000c |
| 12 | nan | avocado | mercedes | 7 | ee567260-f0da-11ed-8f35-0242ac1c000c |
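Finally, since the original goal was a universal_id:unique_id map: once global_id is filled in, deriving that map is a one-liner. A toy illustration with placeholder IDs (not the real UUIDs above):

```python
import pandas as pd

# Stand-in for the resolved frame: placeholder global_id values.
df = pd.DataFrame({
    'universal_id': ['apple', 'pear', 'blackberry', 'raspberry'],
    'global_id':    ['g1',    'g1',   'g1',         'g2'],
})

# One unique id per universal_id (first occurrence wins if a
# universal_id ever carried more than one global_id).
mapping = df.groupby('universal_id')['global_id'].first().to_dict()
```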