实体解析 – 基于3列创建唯一标识符

huangapple go评论65阅读模式
英文:

Entity resolution - creating a unique identifier based on 3 columns

问题

I'm trying to find a way to create a unique identifier by using 3 columns (user_id, universal_id, and session_id). Column "expected_result" is what this unique_id should be after processing other 3 columns.

有三列(user_id、universal_id 和 session_id)要创建一个唯一标识符,"expected_result"列是在处理其他三列后应该得到的唯一标识符。

  • Sometimes user_id is not available, and in that case, the other two columns should be used to create the unique id.

  • 有时候 user_id 不可用,在这种情况下,应使用其他两列来创建唯一标识符。

  • When user_id doesn't have a match and universal_id has a match, those should be treated as different (separate unique id).

  • 当 user_id 没有匹配项而 universal_id 有匹配项时,它们应被视为不同的(单独的唯一标识符)。

  • "id" column is the order in which data is written into the database. If a new row shows up that matches any of the previous rows (with already calculated unique id) by any of the 3 columns, the already existing unique id should be added to the new row.

  • "id" 列是数据写入数据库的顺序。如果出现一个新行与之前的任何行(已经计算出唯一标识符)匹配,无论是哪三列,已经存在的唯一标识符应该被添加到新行。

Here's a list of possible relationships between columns:
以下是列之间可能的关系列表:

  • user_id:universal_id = 1:N OR N:1 (if N:1 then each N needs a unique_id)
  • user_id:universal_id = 1:N 或 N:1(如果是 N:1,那么每个 N 需要一个唯一标识符)
  • user_id:session_id = 1:N
  • universal_id:session_id = 1:N or N:1
  • user_id:session_id = 1:N
  • universal_id:session_id = 1:N 或 N:1

I'm trying to find a thing in python (or pyspark because I may be using this on millions of rows) that can help me do the clustering of this data (or however this process is called in data science). The idea is to create a map of universal_id:unique_id. If you know how this is done please help, or at least point me to a subject that I should research to be able to do this. Thanks!

我试图在 Python 中找到一种方法(或者可能在百万行数据上使用 PySpark),以帮助我对这些数据进行聚类(或者无论在数据科学中如何称呼这个过程)。思路是创建一个 universal_id:unique_id 的映射。如果您知道如何做到这一点,请帮忙,或者至少指引我应该研究的主题,以便能够完成这个任务。谢谢!

I have Snowflake and Databricks at my disposal.
我可以使用 Snowflake 和 Databricks。

英文:

I'm trying to find a way to create a unique identifier by using 3 columns (user_id, universal_id and session_id). Column "expected_result" is what this unique_id should be after processing other 3 columns.

  • Sometimes user_id is not available, and in that case the other two columns should be used to create the unique id.
  • When user_id doesn't have a match and universal_id has a match, those should be treated as different (separate unique id).
  • "id" column is the order in which data is written into the database. If a new row shows up that matches any of the previous rows (with already calculated unique id) by any of the 3 columns, the already existing unique id should be added to the new row.

Here's a list of possible relationships between columns:

  • user_id:universal_id = 1:N OR N:1 (if N:1 then each N needs a unique_id)
  • user_id:session_id = 1:N
  • universal_id:session_id = 1:N or N:1

I'm trying to find a thing in python (or pyspark because I may be using this on millions of rows) that can help me do the clustering of this data (or however this process is called in data science). The idea is to create a map of universal_id:unique_id. If you know how this is done please help, or at least point me to a subject that I should research to be able to do this. Thanks!

I have Snowflake and Databricks at my disposal.

Here's my test dataset:

import pandas as pd

data = [
    [1, 1, 'apple', 'fiat', 1],
    [2, 1, 'pear', 'bmw', 1],
    [3, 2, 'bananna', 'citroen', 2],
    [4, 3, 'bananna', 'kia', 3],
    [5, 4, 'blueberry', 'peugeot', 4],
    [6, None, 'blueberry', 'peugeot', 4],
    [7, None, 'blueberry', 'yamaha', 4],
    [8, 5, 'plum', 'ford', 5],
    [9, None, 'watermelon', 'ford', 5],
    [10, None, 'raspberry', 'honda', 6],
    [11, None, 'raspberry', 'toyota', 6],
    [12, None, 'avocado', 'mercedes', 7],
    [13, None, 'cherry', 'mercedes', 7],
    [14, None, 'apricot', 'volkswagen', 2],
    [15, 2, 'apricot', 'volkswagen', 2],
    [16, 6, 'blueberry', 'audi', 8],
    [17, None, 'blackberry', 'bmw', 1],
    [18, 7, 'plum', 'porsche', 9]
]

df = pd.DataFrame(data, columns=['id', 'user_id', 'universal_id', 'session_id', 'expected_result'])

答案1

得分: 1

根据您描述的内容,我们可以制定如下算法,将新的ID称为global_id更新: 当多个user_id匹配多个universal_id时,算法现在具有任意的决定性因素。 更新: 由于您担心完全随机生成的UUID4可能会发生重复,我为您编写了一个小函数,允许您生成UUID,同时利用UUID1和/或UUID4 - 我个人认为不必担心UUID4值的冲突,但这取决于您。

  1. 为每个具有多次出现(n>1)的user_id创建一个新的global_id
  2. global_id的值传播到所有具有匹配的user_id的行
  3. 为每个具有单次出现(n=1)的user_id创建一个新的global_id
  4. global_id的值传播到所有具有匹配的universal_id的行,即不匹配user_id但匹配universal_id的行。如果多个universal_id在一个或多个user_id上匹配,则会进行任意的决定性因素,所有匹配的universal_id都分配给相同的user_id
  5. 为每个不能与user_id链接的具有多次出现(n>1)的universal_id创建新的global_id
  6. global_id的值传播到所有具有匹配的universal_id的行
  7. 将现有的global_id的值传播到所有具有匹配的session_id的行,即在user_iduniversal_id上都不匹配但在session_id上匹配的行
  8. 为每个不能与user_iduniversal_id链接的具有多次出现(n>1)的session_id创建新的global_id
  9. global_id的值传播到所有具有匹配的session_id的行
  10. (在您的示例中不需要,但可能有用)为每个不具有多次出现(n=1)的session_id创建新的global_id

希望这有所帮助!

英文:

Based on what you described, we can formulate an algorithm as follows, referring to the new ID as global_id. Update: The algorithm now features the arbitrary tie-break when multiple user_ids match multiple universal_ids. Update: Since you were concerned about the risk of duplicates using fully randomly generated UUID4s, I coded you a little function which allows you to generate a UUID leveraging both UUID1 and/or UUID4 - I personally would not be worried about clashes of UUID4 values whatsoever, but it's up to you.

  1. Create a new global_id for every user_id that has
    multiple occurrences
    (n>1)
  2. Propagate values for global_id to all rows with matching
    user_id
  3. Create a new global_id for every user_id that has single occurrence (n=1)
  4. Propagate values for global_id to all rows with matching universal_id, i.e. rows which don't match on user_id but match on universal_id. There is an arbitrary tie break if multiple universal_ids match on one or more user_ids, where all matching universal_ids are assigned to the same user_id
  5. Create a new global_id for every universal_id which
    cannot be linked to a user_id but has multiple occurrences
    (n>1)
  6. Propagate values for global_id to all rows with
    matching universal_id
  7. Propagate existing values for global_id to all rows with matching session_id, i.e. rows with don't match on neither user_id or universal_id but on session_id
  8. Create a new global_id for every session_id which cannot
    be linked to neither a user_id nor universal_id but has
    multiple occurrences
    (n>1)
  9. Propagate values for global_id to all rows with matching session_id
  10. (not needed in your example but might be useful): Create a
    new global_id for every session_id which does not have
    multiple occurrences (n=1)

Hope this helps!

import uuid
import pandas as pd
import numpy as np


data = [
    [1, 1, 'apple', 'fiat', 1],
    [2, 1, 'pear', 'bmw', 1],
    [3, 2, 'bananna', 'citroen', 2],
    [4, 3, 'bananna', 'kia', 3],
    [5, 4, 'blueberry', 'peugeot', 4],
    [6, None, 'blueberry', 'peugeot', 4],
    [7, None, 'blueberry', 'yamaha', 4],
    [8, 5, 'plum', 'ford', 5],
    [9, None, 'watermelon', 'ford', 5],
    [10, None, 'raspberry', 'honda', 6],
    [11, None, 'raspberry', 'toyota', 6],
    [12, None, 'avocado', 'mercedes', 7],
    [13, None, 'cherry', 'mercedes', 7],
    [14, None, 'apricot', 'volkswagen', 2],
    [15, 2, 'apricot', 'volkswagen', 2],
    [16, 6, 'blueberry', 'audi', 8],
    [17, None, 'blackberry', 'bmw', 1],
    [18, 7, 'plum', 'porsche', 9]
]


def generate_uuid(use_uuid1: bool=True, use_uuid4: bool=False) -> str:
    """Helper function creating UUIDs.
    
    Arguments:
        use_uuid1: Whether generated UUID string should feature a UUID1 part.
            Defaults to `True`.
        use_uuid1: Whether generated UUID string should feature a UUID4 part.
            Defaults to `False`.
    
    Returns:
        Universally unique identifier based on UUID1 and/or UUID4.
    """
    uuid_str = ""

    if not use_uuid1 and not use_uuid4:
        raise ValueError("Both use_uuid1 and use_uuid4 are set to `False`, cannot create UUID.")

    elif use_uuid1 and use_uuid4:
        uuid_str += f"{str(uuid.uuid1())}-{str(uuid.uuid4())}"
    elif use_uuid1:
        uuid_str += str(uuid.uuid1())
    else:
        uuid_str += str(uuid.uuid4())
    
    return uuid_str


df = pd.DataFrame(data, columns=['id', 'user_id', 'universal_id', 'session_id', 'expected_result'])

# STEP 1
df.sort_values(by='user_id', inplace=True)
df['_same_user_id'] = (
    (df['user_id'] == df['user_id'].shift(-1))
    & (df['user_id'] != df['user_id'].shift(1))
)
df['global_id'] = [generate_uuid() if value else np.NaN for value in df['_same_user_id'].values]

# STEP 2
df['global_id'] = df.groupby('user_id')['global_id'].ffill()

# STEP 3
df['_new_ids'] = [generate_uuid() if not np.isnan(value) else np.NaN for value in df['user_id'].values]
df['global_id'].fillna(df['_new_ids'], inplace=True)

# # STEP 4
df.sort_values(by='universal_id', inplace=True)
df['global_id'] = df.groupby('universal_id')['global_id'].ffill()

# STEP 5
df['_count_universal_id'] = df['universal_id'].groupby(df['universal_id']).transform('count')
df['_same_universal_id'] = (
    (df['universal_id'] == df['universal_id'].shift(-1))
    & (df['universal_id'] != df['universal_id'].shift(1))
)
df['_new_id_for_universal_id'] = (
    df['_count_universal_id'].gt(1)
    & (df['global_id'].isnull()) 
    & df['_same_universal_id']
)
df['_new_ids'] = [generate_uuid() if value else np.NaN for value in df['_new_id_for_universal_id'].values]
df['global_id'].fillna(df['_new_ids'], inplace=True)

# STEP 6
df['global_id'] = df.groupby('universal_id')['global_id'].ffill()


# STEP 7
df.sort_values(by='session_id', inplace=True)
df['global_id'] = df.groupby('session_id')['global_id'].ffill()

# STEP 8
df['_count_session_id'] = df['session_id'].groupby(df['session_id']).transform('count')
df['_same_session_id'] = (
    (df['session_id'] == df['session_id'].shift(-1))
    & (df['session_id'] != df['session_id'].shift(1))
)
df['_new_id_for_session_id'] = (
    df['_count_session_id'].gt(1)
    & (df['global_id'].isnull()) 
    & df['_same_session_id']
)
df['_new_ids'] = [generate_uuid() if value else np.NaN for value in df['_new_id_for_session_id'].values]
df['global_id'].fillna(df['_new_ids'], inplace=True)

# STEP 9
df['global_id'] = df.groupby('session_id')['global_id'].ffill()

# STEP 10
df['_new_ids'] = [generate_uuid() if value == 1 else np.NaN for value in df['_count_session_id'].values]
df['global_id'].fillna(df['_new_ids'], inplace=True)

# DROP INTERNAL COLUMNS
cols_to_drop = [col for col in df.columns if col.startswith("_")]
df.drop(columns=cols_to_drop, inplace=True)

Results (since we're now using UUID1, the ID's look very similar but they are not the same).

|   id |   user_id | universal_id   | session_id   |   expected_result | global_id                            |
|-----:|----------:|:---------------|:-------------|------------------:|:-------------------------------------|
|    2 |         1 | pear           | bmw          |                 1 | ee52b80a-f0da-11ed-8f35-0242ac1c000c |
|   17 |       nan | blackberry     | bmw          |                 1 | ee52b80a-f0da-11ed-8f35-0242ac1c000c |
|    1 |         1 | apple          | fiat         |                 1 | ee52b80a-f0da-11ed-8f35-0242ac1c000c |
|    3 |         2 | bananna        | citroen      |                 2 | ee52ba1c-f0da-11ed-8f35-0242ac1c000c |
|   14 |       nan | apricot        | volkswagen   |                 2 | ee52ba1c-f0da-11ed-8f35-0242ac1c000c |
|   15 |         2 | apricot        | volkswagen   |                 2 | ee52ba1c-f0da-11ed-8f35-0242ac1c000c |
|    4 |         3 | bananna        | kia          |                 3 | ee530d00-f0da-11ed-8f35-0242ac1c000c |
|    5 |         4 | blueberry      | peugeot      |                 4 | ee530e04-f0da-11ed-8f35-0242ac1c000c |
|    6 |       nan | blueberry      | peugeot      |                 4 | ee530e04-f0da-11ed-8f35-0242ac1c000c |
|    7 |       nan | blueberry      | yamaha       |                 4 | ee530e04-f0da-11ed-8f35-0242ac1c000c |
|    9 |       nan | watermelon     | ford         |                 5 | ee530f08-f0da-11ed-8f35-0242ac1c000c |
|    8 |         5 | plum           | ford         |                 5 | ee530f08-f0da-11ed-8f35-0242ac1c000c |
|   16 |         6 | blueberry      | audi         |                 8 | ee53100c-f0da-11ed-8f35-0242ac1c000c |
|   18 |         7 | plum           | porsche      |                 9 | ee531106-f0da-11ed-8f35-0242ac1c000c |
|   10 |       nan | raspberry      | honda        |                 6 | ee54a002-f0da-11ed-8f35-0242ac1c000c |
|   11 |       nan | raspberry      | toyota       |                 6 | ee54a002-f0da-11ed-8f35-0242ac1c000c |
|   13 |       nan | cherry         | mercedes     |                 7 | ee567260-f0da-11ed-8f35-0242ac1c000c |
|   12 |       nan | avocado        | mercedes     |                 7 | ee567260-f0da-11ed-8f35-0242ac1c000c |

huangapple
  • 本文由 发表于 2023年5月11日 00:19:27
  • 转载请务必保留本文链接:https://go.coder-hub.com/76220639.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定