Pandas most efficient way to set value of df when searching nested dictionary for value

Question


I have a dataframe with millions of rows. For each row, I search for the column values inside a dictionary of lists to retrieve the matching key, use that key to look up a value in a metadata table, and then set that value as a new column in the df.

import re
import pandas as pd

map_dict = {'AP017903.1': "['BAX03457', 'BAX03456', 'BAX03455', 'BAX03454']",
            'BK013208': "['BK013208', 'BK013208', 'BK013208', 'BK013208']"}

metadata = pd.DataFrame({'ID': ['AP017903.1', 'BK013208'], 'length': [99517, 102321]})

df = pd.DataFrame({'qseqid': ['BAX03457.1', 'BAX03457.1', 'BAX03456.1', 'BAX03455.1'],
                   'sseqid': ['BK013208_1', 'BK013208_2', 'BK013208_3', 'BK013208_4']})

My code runs extremely slowly because I'm iterating through the dataframe and setting the value for each row in place. I'm wondering if anyone has suggestions on how to speed up the code, or whether I'm doing this in a really inefficient way. The dictionary shown here is scaled down; in reality each key can have hundreds of values.

for idx, row in df.iterrows():
    # regex to match everything up until first occurrence of '.' or '_'
    qseqid_pattern = re.search(r'(?:(?![\.|\_]).)*', row['qseqid']).group(0)
    sseqid_pattern = re.search(r'(?:(?![\.|\_]).)*', row['sseqid']).group(0)
    qseqid_id = [key for key, value in map_dict.items() if qseqid_pattern in value][0]
    sseqid_id = [key for key, value in map_dict.items() if sseqid_pattern in value][0]
    if qseqid_id:
        df.loc[idx, 'qseqid_length'] = metadata[metadata['ID'] == qseqid_id]['length'].values[0]
    if sseqid_id:
        df.loc[idx, 'sseqid_length'] = metadata[metadata['ID'] == sseqid_id]['length'].values[0]

Would it be faster to just append all the values to a list, memory permitting? Any thoughts or insights are greatly appreciated! I'm considering trying awk since this is taking so long.
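
A minimal sketch of that list-append idea, for reference: it still does the same per-row dictionary scans, so it only saves the cost of the repeated df.loc assignments. The variable names (length_by_id, q_prefix, etc.) are placeholders, not part of the original code.

# build the result columns in plain Python lists, then assign once at the end
length_by_id = metadata.set_index('ID')['length']        # ID -> length lookup
qseqid_lengths, sseqid_lengths = [], []

for qseqid, sseqid in zip(df['qseqid'], df['sseqid']):
    q_prefix = re.split(r'[._]', qseqid, maxsplit=1)[0]   # part before first '.' or '_'
    s_prefix = re.split(r'[._]', sseqid, maxsplit=1)[0]
    q_key = next((k for k, v in map_dict.items() if q_prefix in v), None)
    s_key = next((k for k, v in map_dict.items() if s_prefix in v), None)
    qseqid_lengths.append(length_by_id.get(q_key) if q_key else None)
    sseqid_lengths.append(length_by_id.get(s_key) if s_key else None)

df['qseqid_length'] = qseqid_lengths
df['sseqid_length'] = sseqid_lengths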

Answer 1

Score: 0


It seems you just want to do a map? Just construct your mapping series in multiple steps.

I've done a quick and dirty map from your map_dict, as follows. I would not run this code in production because it uses eval. However your mapping is actually produced, parse the output that looks like str(your_list) into a real list and then explode it.

# long-format mapping from map_dict: 'level_0' is the key, 'l' is each list element
map1 = pd.DataFrame(map_dict, index=[0, 1]).unstack().reset_index() \
    .assign(l=lambda d: d[0].apply(eval)).explode('l')[['level_0', 'l']]

Then just merge the mapping key against your metadata frame:

>>> more_metadata = metadata.merge(map1, left_on='ID', right_on='level_0', how='outer')
           ID  length     level_0         l
0   AP017903.1   99517  AP017903.1  BAX03457
1   AP017903.1   99517  AP017903.1  BAX03456
2   AP017903.1   99517  AP017903.1  BAX03455
3   AP017903.1   99517  AP017903.1  BAX03454
4   AP017903.1   99517  AP017903.1  BAX03457
5   AP017903.1   99517  AP017903.1  BAX03456
6   AP017903.1   99517  AP017903.1  BAX03455
7   AP017903.1   99517  AP017903.1  BAX03454
8     BK013208  102321    BK013208  BK013208
9     BK013208  102321    BK013208  BK013208
10    BK013208  102321    BK013208  BK013208
11    BK013208  102321    BK013208  BK013208
12    BK013208  102321    BK013208  BK013208
13    BK013208  102321    BK013208  BK013208
14    BK013208  102321    BK013208  BK013208
15    BK013208  102321    BK013208  BK013208

You can definitely improve the intermediate column names by renaming them to something more meaningful for your data. I don't know what these specific IDs represent, so I left them basically with defaults and placeholders.
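
For example (map_id and member_id are arbitrary placeholder names; if you do rename, adjust the column references in the snippets below accordingly):

more_metadata = more_metadata.rename(columns={'level_0': 'map_id', 'l': 'member_id'})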

Then you can get your lengths like so:

# lookup Series: list element -> length
other_metadata = more_metadata[['length', 'l']] \
    .drop_duplicates().set_index('l')['length']
df['qseqid_len'] = df['qseqid'].str.replace(r'[\._].*$', '', regex=True) \
    .map(other_metadata)
df['sseqid_len'] = df['sseqid'].str.replace(r'[\._].*$', '', regex=True) \
    .map(other_metadata)

Thus (the new columns are qseqid_len and sseqid_len; the *_length columns shown here are left over from the original loop):

       qseqid      sseqid  qseqid_length  sseqid_length  qseqid_len  sseqid_len
0  BAX03457.1  BK013208_1        99517.0       102321.0       99517      102321
1  BAX03457.1  BK013208_2        99517.0       102321.0       99517      102321
2  BAX03456.1  BK013208_3        99517.0       102321.0       99517      102321
3  BAX03455.1  BK013208_4        99517.0       102321.0       99517      102321

Answer 2

Score: 0


I added code to identify whether your values are strings or lists and then build the reversed dict accordingly, so it should work either way.

First you need to extract the part that comes before the first '.' or '_' using str.extract with a regex, and then use Series.map.

Code:

reversed_dict = {}
is_value_str = isinstance(next(iter(map_dict.values())), str)
if is_value_str:
    # values are strings that look like lists, e.g. "['BAX03457', ...]"
    from ast import literal_eval
    reversed_dict = {e: k for k, v in map_dict.items() for e in literal_eval(v)}
else:
    # values are real lists
    reversed_dict = {e: k for k, v in map_dict.items() for e in v}

get_length = (
    # prefix before the first '.' or '_' -> map_dict key -> length from metadata
    lambda s: s.str.extract(r"(.*?)[\._]", expand=False)
    .map(reversed_dict)
    .map(metadata.set_index("ID")["length"])
)
cols = df.filter(regex="seqid").columns        # ['qseqid', 'sseqid']
length_cols = cols.map("{}_length".format)     # ['qseqid_length', 'sseqid_length']
df[length_cols] = df[cols].apply(get_length)

print(df)

       qseqid      sseqid  qseqid_length  sseqid_length
0  BAX03457.1  BK013208_1          99517         102321
1  BAX03457.1  BK013208_2          99517         102321
2  BAX03456.1  BK013208_3          99517         102321
3  BAX03455.1  BK013208_4          99517         102321
