Entity resolution - creating a unique identifier based on 3 columns

Question
I'm trying to find a way to create a unique identifier using 3 columns (user_id, universal_id and session_id). The "expected_result" column is what this unique_id should be after processing the other 3 columns.
- Sometimes user_id is not available, and in that case the other two columns should be used to create the unique id.
- When user_id doesn't have a match and universal_id has a match, those should be treated as different (separate unique id).
- "id" column is the order in which data is written into the database. If a new row shows up that matches any of the previous rows (with already calculated unique id) by any of the 3 columns, the already existing unique id should be added to the new row.
Here's a list of possible relationships between columns:
- user_id:universal_id = 1:N OR N:1 (if N:1 then each N needs a unique_id)
- user_id:session_id = 1:N
- universal_id:session_id = 1:N or N:1
I'm trying to find something in Python (or PySpark, since I may be running this on millions of rows) that can help me cluster this data (or whatever this process is called in data science). The idea is to create a map of universal_id:unique_id. If you know how this is done, please help, or at least point me to a topic I should research to be able to do this. Thanks!
I have Snowflake and Databricks at my disposal.
Here's my test dataset:
import pandas as pd
data = [
[1, 1, 'apple', 'fiat', 1],
[2, 1, 'pear', 'bmw', 1],
[3, 2, 'bananna', 'citroen', 2],
[4, 3, 'bananna', 'kia', 3],
[5, 4, 'blueberry', 'peugeot', 4],
[6, None, 'blueberry', 'peugeot', 4],
[7, None, 'blueberry', 'yamaha', 4],
[8, 5, 'plum', 'ford', 5],
[9, None, 'watermelon', 'ford', 5],
[10, None, 'raspberry', 'honda', 6],
[11, None, 'raspberry', 'toyota', 6],
[12, None, 'avocado', 'mercedes', 7],
[13, None, 'cherry', 'mercedes', 7],
[14, None, 'apricot', 'volkswagen', 2],
[15, 2, 'apricot', 'volkswagen', 2],
[16, 6, 'blueberry', 'audi', 8],
[17, None, 'blackberry', 'bmw', 1],
[18, 7, 'plum', 'porsche', 9]
]
df = pd.DataFrame(data, columns=['id', 'user_id', 'universal_id', 'session_id', 'expected_result'])
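For what it's worth, the behaviour described above is usually called entity resolution (or identity stitching), and the transitive part of it is a connected-components problem. Below is a minimal union-find sketch over a subset of the rows above. It is illustrative only: it implements plain transitive matching and deliberately ignores the rule that a new user_id must start a separate cluster even when its universal_id matches an earlier row (so e.g. rows 5 and 16 would wrongly merge on the full dataset).

```python
# Minimal sketch: rows that share any non-null key end up in the same
# component. This is plain transitive matching, NOT the full rule set
# from the question.

class UnionFind:
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra

# (id, user_id, universal_id, session_id) - a subset of the test data
rows = [
    (1, 1, 'apple', 'fiat'),
    (2, 1, 'pear', 'bmw'),
    (10, None, 'raspberry', 'honda'),
    (11, None, 'raspberry', 'toyota'),
    (17, None, 'blackberry', 'bmw'),
]

uf = UnionFind()
for rid, user, univ, sess in rows:
    # Link each row node to its non-null key nodes; shared keys merge rows.
    for key in (('user', user), ('universal', univ), ('session', sess)):
        if key[1] is not None:
            uf.union(('row', rid), key)

groups = {rid: uf.find(('row', rid)) for rid, _, _, _ in rows}
```

Here rows 1, 2 and 17 land in one component (shared user_id 1 and the bmw session), while rows 10 and 11 form another (shared raspberry). Topics to research: entity resolution, record linkage, connected components; for the millions-of-rows case on Databricks, the GraphFrames library for Spark ships a connectedComponents() routine.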
Answer 1
Score: 1
Based on what you described, we can formulate an algorithm as follows, referring to the new ID as global_id. Update: the algorithm now features an arbitrary tie-break when multiple user_ids match multiple universal_ids. Update: since you were concerned about the risk of duplicates when using fully randomly generated UUID4s, I coded you a little function that lets you generate a UUID from UUID1 and/or UUID4 - personally I would not be worried about collisions of UUID4 values whatsoever, but it's up to you.
- Create a new global_id for every user_id that has multiple occurrences (n>1)
- Propagate values for global_id to all rows with a matching user_id
- Create a new global_id for every user_id that has a single occurrence (n=1)
- Propagate values for global_id to all rows with a matching universal_id, i.e. rows which don't match on user_id but match on universal_id. There is an arbitrary tie-break if multiple universal_ids match on one or more user_ids, where all matching universal_ids are assigned to the same user_id
- Create a new global_id for every universal_id which cannot be linked to a user_id but has multiple occurrences (n>1)
- Propagate values for global_id to all rows with a matching universal_id
- Propagate existing values for global_id to all rows with a matching session_id, i.e. rows which match on neither user_id nor universal_id but do match on session_id
- Create a new global_id for every session_id which can be linked to neither a user_id nor a universal_id but has multiple occurrences (n>1)
- Propagate values for global_id to all rows with a matching session_id
- (not needed in your example but might be useful): create a new global_id for every session_id which does not have multiple occurrences (n=1)
Hope this helps!
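As an aside on the collision concern: a UUID4 has roughly 2^122 possible values, so duplicates in any realistically sized batch would be astonishing. A purely illustrative sanity check:

```python
import uuid

# Generate a batch of random UUID4s and check for duplicates.
ids = [str(uuid.uuid4()) for _ in range(10_000)]
unique_ids = set(ids)
# In practice the set has the same size as the list: no collisions.
```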
import uuid
import pandas as pd
import numpy as np
data = [
[1, 1, 'apple', 'fiat', 1],
[2, 1, 'pear', 'bmw', 1],
[3, 2, 'bananna', 'citroen', 2],
[4, 3, 'bananna', 'kia', 3],
[5, 4, 'blueberry', 'peugeot', 4],
[6, None, 'blueberry', 'peugeot', 4],
[7, None, 'blueberry', 'yamaha', 4],
[8, 5, 'plum', 'ford', 5],
[9, None, 'watermelon', 'ford', 5],
[10, None, 'raspberry', 'honda', 6],
[11, None, 'raspberry', 'toyota', 6],
[12, None, 'avocado', 'mercedes', 7],
[13, None, 'cherry', 'mercedes', 7],
[14, None, 'apricot', 'volkswagen', 2],
[15, 2, 'apricot', 'volkswagen', 2],
[16, 6, 'blueberry', 'audi', 8],
[17, None, 'blackberry', 'bmw', 1],
[18, 7, 'plum', 'porsche', 9]
]
def generate_uuid(use_uuid1: bool = True, use_uuid4: bool = False) -> str:
    """Helper function creating UUIDs.

    Arguments:
        use_uuid1: Whether the generated UUID string should feature a UUID1 part.
            Defaults to `True`.
        use_uuid4: Whether the generated UUID string should feature a UUID4 part.
            Defaults to `False`.

    Returns:
        Universally unique identifier based on UUID1 and/or UUID4.
    """
    uuid_str = ""
    if not use_uuid1 and not use_uuid4:
        raise ValueError("Both use_uuid1 and use_uuid4 are set to `False`, cannot create UUID.")
    elif use_uuid1 and use_uuid4:
        uuid_str += f"{str(uuid.uuid1())}-{str(uuid.uuid4())}"
    elif use_uuid1:
        uuid_str += str(uuid.uuid1())
    else:
        uuid_str += str(uuid.uuid4())
    return uuid_str
df = pd.DataFrame(data, columns=['id', 'user_id', 'universal_id', 'session_id', 'expected_result'])
# STEP 1
df.sort_values(by='user_id', inplace=True)
df['_same_user_id'] = (
(df['user_id'] == df['user_id'].shift(-1))
& (df['user_id'] != df['user_id'].shift(1))
)
df['global_id'] = [generate_uuid() if value else np.nan for value in df['_same_user_id'].values]
# STEP 2
df['global_id'] = df.groupby('user_id')['global_id'].ffill()
# STEP 3
df['_new_ids'] = [generate_uuid() if not np.isnan(value) else np.nan for value in df['user_id'].values]
df['global_id'] = df['global_id'].fillna(df['_new_ids'])
# STEP 4
df.sort_values(by='universal_id', inplace=True)
df['global_id'] = df.groupby('universal_id')['global_id'].ffill()
# STEP 5
df['_count_universal_id'] = df['universal_id'].groupby(df['universal_id']).transform('count')
df['_same_universal_id'] = (
(df['universal_id'] == df['universal_id'].shift(-1))
& (df['universal_id'] != df['universal_id'].shift(1))
)
df['_new_id_for_universal_id'] = (
df['_count_universal_id'].gt(1)
& (df['global_id'].isnull())
& df['_same_universal_id']
)
df['_new_ids'] = [generate_uuid() if value else np.nan for value in df['_new_id_for_universal_id'].values]
df['global_id'] = df['global_id'].fillna(df['_new_ids'])
# STEP 6
df['global_id'] = df.groupby('universal_id')['global_id'].ffill()
# STEP 7
df.sort_values(by='session_id', inplace=True)
df['global_id'] = df.groupby('session_id')['global_id'].ffill()
# STEP 8
df['_count_session_id'] = df['session_id'].groupby(df['session_id']).transform('count')
df['_same_session_id'] = (
(df['session_id'] == df['session_id'].shift(-1))
& (df['session_id'] != df['session_id'].shift(1))
)
df['_new_id_for_session_id'] = (
df['_count_session_id'].gt(1)
& (df['global_id'].isnull())
& df['_same_session_id']
)
df['_new_ids'] = [generate_uuid() if value else np.nan for value in df['_new_id_for_session_id'].values]
df['global_id'] = df['global_id'].fillna(df['_new_ids'])
# STEP 9
df['global_id'] = df.groupby('session_id')['global_id'].ffill()
# STEP 10
df['_new_ids'] = [generate_uuid() if value == 1 else np.nan for value in df['_count_session_id'].values]
df['global_id'] = df['global_id'].fillna(df['_new_ids'])
# DROP INTERNAL COLUMNS
cols_to_drop = [col for col in df.columns if col.startswith("_")]
df.drop(columns=cols_to_drop, inplace=True)
Results (since we're now using UUID1, the IDs look very similar, but they are not the same):
| id | user_id | universal_id | session_id | expected_result | global_id |
|-----:|----------:|:---------------|:-------------|------------------:|:-------------------------------------|
| 2 | 1 | pear | bmw | 1 | ee52b80a-f0da-11ed-8f35-0242ac1c000c |
| 17 | nan | blackberry | bmw | 1 | ee52b80a-f0da-11ed-8f35-0242ac1c000c |
| 1 | 1 | apple | fiat | 1 | ee52b80a-f0da-11ed-8f35-0242ac1c000c |
| 3 | 2 | bananna | citroen | 2 | ee52ba1c-f0da-11ed-8f35-0242ac1c000c |
| 14 | nan | apricot | volkswagen | 2 | ee52ba1c-f0da-11ed-8f35-0242ac1c000c |
| 15 | 2 | apricot | volkswagen | 2 | ee52ba1c-f0da-11ed-8f35-0242ac1c000c |
| 4 | 3 | bananna | kia | 3 | ee530d00-f0da-11ed-8f35-0242ac1c000c |
| 5 | 4 | blueberry | peugeot | 4 | ee530e04-f0da-11ed-8f35-0242ac1c000c |
| 6 | nan | blueberry | peugeot | 4 | ee530e04-f0da-11ed-8f35-0242ac1c000c |
| 7 | nan | blueberry | yamaha | 4 | ee530e04-f0da-11ed-8f35-0242ac1c000c |
| 9 | nan | watermelon | ford | 5 | ee530f08-f0da-11ed-8f35-0242ac1c000c |
| 8 | 5 | plum | ford | 5 | ee530f08-f0da-11ed-8f35-0242ac1c000c |
| 16 | 6 | blueberry | audi | 8 | ee53100c-f0da-11ed-8f35-0242ac1c000c |
| 18 | 7 | plum | porsche | 9 | ee531106-f0da-11ed-8f35-0242ac1c000c |
| 10 | nan | raspberry | honda | 6 | ee54a002-f0da-11ed-8f35-0242ac1c000c |
| 11 | nan | raspberry | toyota | 6 | ee54a002-f0da-11ed-8f35-0242ac1c000c |
| 13 | nan | cherry | mercedes | 7 | ee567260-f0da-11ed-8f35-0242ac1c000c |
| 12 | nan | avocado | mercedes | 7 | ee567260-f0da-11ed-8f35-0242ac1c000c |
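Finally, since the original goal was a universal_id:unique_id map: once global_id is filled in, deriving that map is a one-liner. A toy illustration with placeholder IDs (not the real UUIDs above):

```python
import pandas as pd

# Stand-in for the resolved frame: placeholder global_id values.
df = pd.DataFrame({
    'universal_id': ['apple', 'pear', 'blackberry', 'raspberry'],
    'global_id':    ['g1',    'g1',   'g1',         'g2'],
})

# One unique id per universal_id (first occurrence wins if a
# universal_id ever carried more than one global_id).
mapping = df.groupby('universal_id')['global_id'].first().to_dict()
```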