Pandas most efficient way to set value of df when searching nested dictionary for value
Question
I have a dataframe with millions of rows. I'm searching for the dataframe's column values inside a dictionary of lists to retrieve the key, using that key to look up a value in a metadata table, and then setting that value as a new column in the df.
import re
import pandas as pd

map_dict = {'AP017903.1': "['BAX03457', 'BAX03456', 'BAX03455', 'BAX03454']",
            'BK013208': "['BK013208', 'BK013208', 'BK013208', 'BK013208']"}
metadata = pd.DataFrame({'ID': ['AP017903.1', 'BK013208'], 'length': [99517, 102321]})
df = pd.DataFrame({'qseqid': ['BAX03457.1', 'BAX03457.1', 'BAX03456.1', 'BAX03455.1'],
                   'sseqid': ['BK013208_1', 'BK013208_2', 'BK013208_3', 'BK013208_4']})
My code runs extremely slowly because I'm iterating through the dataframe and setting the value for each row in place. I'm wondering if anyone has suggestions on how to speed up the code, or whether I'm doing this in a really inefficient way. The dictionary shown here is reduced in scale; in reality each key can have hundreds of values.
for idx, row in df.iterrows():
    # regex to match everything up until first occurrence of '.' or '_'
    qseqid_pattern = re.search(r'(?:(?![\.|\_]).)*', row['qseqid']).group(0)
    sseqid_pattern = re.search(r'(?:(?![\.|\_]).)*', row['sseqid']).group(0)
    qseqid_id = [key for key, value in map_dict.items() if qseqid_pattern in value][0]
    sseqid_id = [key for key, value in map_dict.items() if sseqid_pattern in value][0]
    if qseqid_id:
        df.loc[idx, 'qseqid_length'] = metadata[metadata['ID'] == qseqid_id]['length'].values[0]
    if sseqid_id:
        df.loc[idx, 'sseqid_length'] = metadata[metadata['ID'] == sseqid_id]['length'].values[0]
Would it be faster to just append all the values to a list, memory permitting? Any thoughts or insight greatly appreciated! I'm considering trying awk since this is taking so long.
Answer 1
Score: 0
It seems you just want to do a map? Just construct your mapping series in multiple steps.

I've done a quick and dirty map from your map_dict, as follows. I would not run this code in production because it uses eval. However your mapping is currently built, parse the output that looks like str(your_list) into a real list, then explode.
map1 = pd.DataFrame(map_dict, index=[0, 1]).unstack().reset_index() \
.assign(l=lambda d: d[0].apply(eval)).explode('l')[['level_0', 'l']]
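If you'd rather avoid eval entirely, here is a sketch of the same mapping built with ast.literal_eval instead, assuming the stringified values really are plain Python list literals. It keeps the column names level_0 and l so the merge below still applies, and it also avoids the duplicated rows that index=[0, 1] introduces:

from ast import literal_eval
import pandas as pd

map1 = (
    pd.Series(map_dict)          # index: metadata IDs, values: stringified lists
      .apply(literal_eval)       # safely parse each string into a real Python list
      .explode()                 # one row per list member
      .rename_axis('level_0')    # match the column name the merge below expects
      .reset_index(name='l')     # -> columns ['level_0', 'l']
)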
Then just merge the mapping key against your metadata frame:
>>> more_metadata = metadata.merge(map1, left_on='ID', right_on='level_0', how='outer')
ID length level_0 l
0 AP017903.1 99517 AP017903.1 BAX03457
1 AP017903.1 99517 AP017903.1 BAX03456
2 AP017903.1 99517 AP017903.1 BAX03455
3 AP017903.1 99517 AP017903.1 BAX03454
4 AP017903.1 99517 AP017903.1 BAX03457
5 AP017903.1 99517 AP017903.1 BAX03456
6 AP017903.1 99517 AP017903.1 BAX03455
7 AP017903.1 99517 AP017903.1 BAX03454
8 BK013208 102321 BK013208 BK013208
9 BK013208 102321 BK013208 BK013208
10 BK013208 102321 BK013208 BK013208
11 BK013208 102321 BK013208 BK013208
12 BK013208 102321 BK013208 BK013208
13 BK013208 102321 BK013208 BK013208
14 BK013208 102321 BK013208 BK013208
15 BK013208 102321 BK013208 BK013208
You can definitely improve on the intermediate column names by renaming them to things that are more meaningful to you. I don't know what these specific IDs represent, so I basically left the defaults and placeholders.
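For instance, a rename along these lines (map_key and member_id are placeholder names of my own; the snippets below keep the defaults):

more_metadata = more_metadata.rename(columns={'level_0': 'map_key', 'l': 'member_id'})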
Then you can get your lengths like so:
other_metadata = more_metadata[['length', 'l']] \
    .drop_duplicates().set_index('l')['length']

df['qseqid_len'] = df['qseqid'].str.replace(r'[\._].*$', '', regex=True) \
    .map(other_metadata)
df['sseqid_len'] = df['sseqid'].str.replace(r'[\._].*$', '', regex=True) \
    .map(other_metadata)
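Since both columns get identical treatment, the same two steps can also be written as a small loop, assuming the other_metadata series built above:

for col in ('qseqid', 'sseqid'):
    # strip everything from the first '.' or '_' onward, then look up the length
    df[f'{col}_len'] = (
        df[col].str.replace(r'[\._].*$', '', regex=True)
               .map(other_metadata)
    )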
Thus:
qseqid sseqid qseqid_length sseqid_length qseqid_len sseqid_len
0 BAX03457.1 BK013208_1 99517.0 102321.0 99517 102321
1 BAX03457.1 BK013208_2 99517.0 102321.0 99517 102321
2 BAX03456.1 BK013208_3 99517.0 102321.0 99517 102321
3 BAX03455.1 BK013208_4 99517.0 102321.0 99517 102321
Answer 2
Score: 0
I added code to identify whether your values are strings or lists and then build the reversed dict accordingly, so it should work either way.

First you need to extract the part before the first '.' or '_' using str.extract and a regex, and then use Series.map.
Code:
reversed_dict = {}
is_value_str = isinstance(next(iter(map_dict.values())), str)

if is_value_str:
    # Values are stringified lists: parse them safely before inverting.
    from ast import literal_eval
    reversed_dict = {e: k for k, v in map_dict.items() for e in literal_eval(v)}
else:
    reversed_dict = {e: k for k, v in map_dict.items() for e in v}

get_length = (
    lambda s: s.str.extract(r"(.*?)[\._]", expand=False)  # part before first '.' or '_'
    .map(reversed_dict)                                   # member ID -> metadata key
    .map(metadata.set_index("ID")["length"])              # metadata key -> length
)

cols = df.filter(regex="seqid").columns
length_cols = cols.map("{}_length".format)
df[length_cols] = df[cols].apply(get_length)
print(df)
qseqid sseqid qseqid_length sseqid_length
0 BAX03457.1 BK013208_1 99517 102321
1 BAX03457.1 BK013208_2 99517 102321
2 BAX03456.1 BK013208_3 99517 102321
3 BAX03455.1 BK013208_4 99517 102321
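As a quick sanity check on the sample data (every qseqid here belongs to AP017903.1 and every sseqid to BK013208, per the question's map_dict and metadata):

assert (df['qseqid_length'] == 99517).all()
assert (df['sseqid_length'] == 102321).all()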
Comments