Entity resolution - creating a unique identifier based on 3 columns

Question
I'm trying to find a way to create a unique identifier by using 3 columns (user_id, universal_id, and session_id). The "expected_result" column is what this unique_id should be after processing the other 3 columns.

- Sometimes user_id is not available; in that case, the other two columns should be used to create the unique id.
- When user_id doesn't have a match but universal_id does, those rows should be treated as different (given a separate unique id).
- The "id" column is the order in which data is written into the database. If a new row shows up that matches any of the previous rows (with an already calculated unique id) on any of the 3 columns, the existing unique id should be added to the new row.

Here's a list of possible relationships between the columns:

- user_id:universal_id = 1:N or N:1 (if N:1, then each N needs its own unique_id)
- user_id:session_id = 1:N
- universal_id:session_id = 1:N or N:1

I'm trying to find something in Python (or PySpark, because I may be using this on millions of rows) that can help me do the clustering of this data (or however this process is called in data science). The idea is to create a map of universal_id:unique_id. If you know how this is done, please help, or at least point me to a subject I should research to be able to do this. Thanks!

I have Snowflake and Databricks at my disposal.
Here's my test dataset:
import pandas as pd
data = [
[1, 1, 'apple', 'fiat', 1],
[2, 1, 'pear', 'bmw', 1],
[3, 2, 'bananna', 'citroen', 2],
[4, 3, 'bananna', 'kia', 3],
[5, 4, 'blueberry', 'peugeot', 4],
[6, None, 'blueberry', 'peugeot', 4],
[7, None, 'blueberry', 'yamaha', 4],
[8, 5, 'plum', 'ford', 5],
[9, None, 'watermelon', 'ford', 5],
[10, None, 'raspberry', 'honda', 6],
[11, None, 'raspberry', 'toyota', 6],
[12, None, 'avocado', 'mercedes', 7],
[13, None, 'cherry', 'mercedes', 7],
[14, None, 'apricot', 'volkswagen', 2],
[15, 2, 'apricot', 'volkswagen', 2],
[16, 6, 'blueberry', 'audi', 8],
[17, None, 'blackberry', 'bmw', 1],
[18, 7, 'plum', 'porsche', 9]
]
df = pd.DataFrame(data, columns=['id', 'user_id', 'universal_id', 'session_id', 'expected_result'])
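One way to make the rules above concrete is to read them as a priority-ordered matching problem: a row with a user_id links only through user_id, while a row without one falls back to universal_id and then session_id. The sketch below is my own plain-Python reading of the question (the two-pass structure, the `assign_global_ids` helper, and the "first registered match wins" tie-break are assumptions, not something the question prescribes), checked against the test dataset:

```python
# Same test dataset as above: [id, user_id, universal_id, session_id, expected_result]
data = [
    [1, 1, 'apple', 'fiat', 1],
    [2, 1, 'pear', 'bmw', 1],
    [3, 2, 'bananna', 'citroen', 2],
    [4, 3, 'bananna', 'kia', 3],
    [5, 4, 'blueberry', 'peugeot', 4],
    [6, None, 'blueberry', 'peugeot', 4],
    [7, None, 'blueberry', 'yamaha', 4],
    [8, 5, 'plum', 'ford', 5],
    [9, None, 'watermelon', 'ford', 5],
    [10, None, 'raspberry', 'honda', 6],
    [11, None, 'raspberry', 'toyota', 6],
    [12, None, 'avocado', 'mercedes', 7],
    [13, None, 'cherry', 'mercedes', 7],
    [14, None, 'apricot', 'volkswagen', 2],
    [15, 2, 'apricot', 'volkswagen', 2],
    [16, 6, 'blueberry', 'audi', 8],
    [17, None, 'blackberry', 'bmw', 1],
    [18, 7, 'plum', 'porsche', 9],
]

def assign_global_ids(rows):
    """Group rows under the question's rules: a row with a user_id links
    only through user_id; a row without one falls back to universal_id,
    then session_id. "First registered match wins" is the tie-break."""
    user_gid, uni_gid, ses_gid = {}, {}, {}
    result, counter = {}, 0
    # Pass 1: rows with a user_id group by user_id alone; their
    # universal_id/session_id become entry points for user-less rows.
    for rid, user, uni, ses, _ in rows:
        if user is None:
            continue
        if user not in user_gid:
            counter += 1
            user_gid[user] = counter
        gid = user_gid[user]
        uni_gid.setdefault(uni, gid)
        ses_gid.setdefault(ses, gid)
        result[rid] = gid
    # Pass 2: user-less rows attach via universal_id, then session_id;
    # if neither is known yet, they start a new group.
    for rid, user, uni, ses, _ in rows:
        if user is not None:
            continue
        gid = uni_gid.get(uni) or ses_gid.get(ses)
        if gid is None:
            counter += 1
            gid = counter
        uni_gid.setdefault(uni, gid)
        ses_gid.setdefault(ses, gid)
        result[rid] = gid
    return result

# The grouping should match expected_result up to relabeling:
gids = assign_global_ids(data)
by_expected = {}
for rid, *_rest, exp in data:
    by_expected.setdefault(exp, set()).add(gids[rid])
assert all(len(s) == 1 for s in by_expected.values())               # consistent inside each group
assert len(set().union(*by_expected.values())) == len(by_expected)  # distinct across groups
```

On real data the same two passes translate naturally to joins and window functions in PySpark or Snowflake, which is why it helps to keep the rules this explicit before scaling up.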
Answer 1
Score: 1
Based on what you described, we can formulate an algorithm as follows, referring to the new ID as global_id.

Update: the algorithm now features an arbitrary tie-break when multiple user_ids match multiple universal_ids.

Update: since you were concerned about the risk of duplicates when using fully randomly generated UUID4s, I coded you a little function which lets you generate a UUID leveraging UUID1 and/or UUID4. I personally would not be worried about clashes of UUID4 values whatsoever, but it's up to you.

1. Create a new global_id for every user_id that has multiple occurrences (n>1).
2. Propagate values for global_id to all rows with a matching user_id.
3. Create a new global_id for every user_id that has a single occurrence (n=1).
4. Propagate values for global_id to all rows with a matching universal_id, i.e. rows which don't match on user_id but match on universal_id. There is an arbitrary tie-break if multiple universal_ids match on one or more user_ids, where all matching universal_ids are assigned to the same user_id.
5. Create a new global_id for every universal_id which cannot be linked to a user_id but has multiple occurrences (n>1).
6. Propagate values for global_id to all rows with a matching universal_id.
7. Propagate existing values for global_id to all rows with a matching session_id, i.e. rows that match on neither user_id nor universal_id but do match on session_id.
8. Create a new global_id for every session_id which can be linked to neither a user_id nor a universal_id but has multiple occurrences (n>1).
9. Propagate values for global_id to all rows with a matching session_id.
10. (Not needed in your example, but might be useful:) create a new global_id for every session_id which does not have multiple occurrences (n=1).

Hope this helps!
import uuid
import pandas as pd
import numpy as np
data = [
[1, 1, 'apple', 'fiat', 1],
[2, 1, 'pear', 'bmw', 1],
[3, 2, 'bananna', 'citroen', 2],
[4, 3, 'bananna', 'kia', 3],
[5, 4, 'blueberry', 'peugeot', 4],
[6, None, 'blueberry', 'peugeot', 4],
[7, None, 'blueberry', 'yamaha', 4],
[8, 5, 'plum', 'ford', 5],
[9, None, 'watermelon', 'ford', 5],
[10, None, 'raspberry', 'honda', 6],
[11, None, 'raspberry', 'toyota', 6],
[12, None, 'avocado', 'mercedes', 7],
[13, None, 'cherry', 'mercedes', 7],
[14, None, 'apricot', 'volkswagen', 2],
[15, 2, 'apricot', 'volkswagen', 2],
[16, 6, 'blueberry', 'audi', 8],
[17, None, 'blackberry', 'bmw', 1],
[18, 7, 'plum', 'porsche', 9]
]
def generate_uuid(use_uuid1: bool = True, use_uuid4: bool = False) -> str:
    """Helper function creating UUIDs.

    Arguments:
        use_uuid1: Whether the generated UUID string should feature a UUID1 part.
            Defaults to `True`.
        use_uuid4: Whether the generated UUID string should feature a UUID4 part.
            Defaults to `False`.

    Returns:
        Universally unique identifier based on UUID1 and/or UUID4.
    """
    if not use_uuid1 and not use_uuid4:
        raise ValueError("Both use_uuid1 and use_uuid4 are set to `False`, cannot create UUID.")
    if use_uuid1 and use_uuid4:
        return f"{uuid.uuid1()}-{uuid.uuid4()}"
    if use_uuid1:
        return str(uuid.uuid1())
    return str(uuid.uuid4())
df = pd.DataFrame(data, columns=['id', 'user_id', 'universal_id', 'session_id', 'expected_result'])
# STEP 1: create a new global_id on the first row of every user_id that occurs more than once
df.sort_values(by='user_id', inplace=True)
df['_same_user_id'] = (
(df['user_id'] == df['user_id'].shift(-1))
& (df['user_id'] != df['user_id'].shift(1))
)
df['global_id'] = [generate_uuid() if value else np.NaN for value in df['_same_user_id'].values]
# STEP 2: propagate that global_id to all rows with a matching user_id
df['global_id'] = df.groupby('user_id')['global_id'].ffill()
# STEP 3: create a new global_id for every remaining row that has a user_id
df['_new_ids'] = [generate_uuid() if not np.isnan(value) else np.NaN for value in df['user_id'].values]
df['global_id'].fillna(df['_new_ids'], inplace=True)
# STEP 4: propagate global_id to rows matching on universal_id
df.sort_values(by='universal_id', inplace=True)
df['global_id'] = df.groupby('universal_id')['global_id'].ffill()
# STEP 5: new global_id for universal_ids with multiple occurrences and no global_id yet
df['_count_universal_id'] = df['universal_id'].groupby(df['universal_id']).transform('count')
df['_same_universal_id'] = (
(df['universal_id'] == df['universal_id'].shift(-1))
& (df['universal_id'] != df['universal_id'].shift(1))
)
df['_new_id_for_universal_id'] = (
df['_count_universal_id'].gt(1)
& (df['global_id'].isnull())
& df['_same_universal_id']
)
df['_new_ids'] = [generate_uuid() if value else np.NaN for value in df['_new_id_for_universal_id'].values]
df['global_id'].fillna(df['_new_ids'], inplace=True)
# STEP 6: propagate that global_id to all rows with a matching universal_id
df['global_id'] = df.groupby('universal_id')['global_id'].ffill()
# STEP 7: propagate existing global_id values to rows matching on session_id
df.sort_values(by='session_id', inplace=True)
df['global_id'] = df.groupby('session_id')['global_id'].ffill()
# STEP 8: new global_id for session_ids with multiple occurrences and no global_id yet
df['_count_session_id'] = df['session_id'].groupby(df['session_id']).transform('count')
df['_same_session_id'] = (
(df['session_id'] == df['session_id'].shift(-1))
& (df['session_id'] != df['session_id'].shift(1))
)
df['_new_id_for_session_id'] = (
df['_count_session_id'].gt(1)
& (df['global_id'].isnull())
& df['_same_session_id']
)
df['_new_ids'] = [generate_uuid() if value else np.NaN for value in df['_new_id_for_session_id'].values]
df['global_id'].fillna(df['_new_ids'], inplace=True)
# STEP 9: propagate that global_id to all rows with a matching session_id
df['global_id'] = df.groupby('session_id')['global_id'].ffill()
# STEP 10: new global_id for single-occurrence session_ids that are still unassigned
df['_new_ids'] = [generate_uuid() if value == 1 else np.NaN for value in df['_count_session_id'].values]
df['global_id'].fillna(df['_new_ids'], inplace=True)
# DROP INTERNAL COLUMNS
cols_to_drop = [col for col in df.columns if col.startswith("_")]
df.drop(columns=cols_to_drop, inplace=True)
Results (since we're now using UUID1, the IDs look very similar, but they are not the same):
| id | user_id | universal_id | session_id | expected_result | global_id |
|-----:|----------:|:---------------|:-------------|------------------:|:-------------------------------------|
| 2 | 1 | pear | bmw | 1 | ee52b80a-f0da-11ed-8f35-0242ac1c000c |
| 17 | nan | blackberry | bmw | 1 | ee52b80a-f0da-11ed-8f35-0242ac1c000c |
| 1 | 1 | apple | fiat | 1 | ee52b80a-f0da-11ed-8f35-0242ac1c000c |
| 3 | 2 | bananna | citroen | 2 | ee52ba1c-f0da-11ed-8f35-0242ac1c000c |
| 14 | nan | apricot | volkswagen | 2 | ee52ba1c-f0da-11ed-8f35-0242ac1c000c |
| 15 | 2 | apricot | volkswagen | 2 | ee52ba1c-f0da-11ed-8f35-0242ac1c000c |
| 4 | 3 | bananna | kia | 3 | ee530d00-f0da-11ed-8f35-0242ac1c000c |
| 5 | 4 | blueberry | peugeot | 4 | ee530e04-f0da-11ed-8f35-0242ac1c000c |
| 6 | nan | blueberry | peugeot | 4 | ee530e04-f0da-11ed-8f35-0242ac1c000c |
| 7 | nan | blueberry | yamaha | 4 | ee530e04-f0da-11ed-8f35-0242ac1c000c |
| 9 | nan | watermelon | ford | 5 | ee530f08-f0da-11ed-8f35-0242ac1c000c |
| 8 | 5 | plum | ford | 5 | ee530f08-f0da-11ed-8f35-0242ac1c000c |
| 16 | 6 | blueberry | audi | 8 | ee53100c-f0da-11ed-8f35-0242ac1c000c |
| 18 | 7 | plum | porsche | 9 | ee531106-f0da-11ed-8f35-0242ac1c000c |
| 10 | nan | raspberry | honda | 6 | ee54a002-f0da-11ed-8f35-0242ac1c000c |
| 11 | nan | raspberry | toyota | 6 | ee54a002-f0da-11ed-8f35-0242ac1c000c |
| 13 | nan | cherry | mercedes | 7 | ee567260-f0da-11ed-8f35-0242ac1c000c |
| 12 | nan | avocado | mercedes | 7 | ee567260-f0da-11ed-8f35-0242ac1c000c |
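Since the global_id values are random UUIDs, they can't be compared to expected_result directly; what can be checked is that both columns induce the same grouping of rows. Here is a small generic helper for that (my own addition, not part of the answer's pipeline; `same_partition` is a hypothetical name):

```python
import pandas as pd

def same_partition(df: pd.DataFrame, a: str, b: str) -> bool:
    """True iff columns a and b split the rows into identical groups,
    ignoring the actual labels used in each column."""
    # Every a-group must map to exactly one b-label, and vice versa.
    return bool(df.groupby(a)[b].nunique().eq(1).all()
                and df.groupby(b)[a].nunique().eq(1).all())

# Tiny illustration with stand-in labels:
demo = pd.DataFrame({'expected_result': [1, 1, 2, 2, 3],
                     'global_id': ['x', 'x', 'y', 'y', 'z']})
assert same_partition(demo, 'expected_result', 'global_id')
```

If the pipeline above reproduced the expected grouping, `same_partition(df, 'expected_result', 'global_id')` on its output should come back True.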