How to efficiently flatten JSON structure returned in elasticsearch_dsl queries?
I'm using elasticsearch_dsl to make queries for and searches of an elasticsearch DB.
One of the fields I'm querying is an address, which has a structure like this:
address.first_line
address.second_line
address.city
address.code
The returned documents hold this in JSON structures, with the address stored in a dictionary with a field for each sub-field of the address.
I would like to put this into a (pandas) dataframe, with one column per sub-field of the address.
Directly putting the address into the dataframe gives me a column of address dictionaries, and iterating through the rows to manually unpack each address dictionary (json.normalize()) takes a long time (4 days, ~200,000 rows).
From the docs, I can't figure out how to get elasticsearch_dsl to return flattened results. Is there a faster way of doing this?
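For illustration, here is a minimal sketch of the situation described above. The field names and data are placeholders; the point is that each hit carries its address as a nested dict, so the naive DataFrame ends up with a column of dicts rather than one column per sub-field:

```python
import pandas as pd

# Hypothetical hits mimicking the structure described above: each
# document holds its address as a nested dict.
hits = [
    {"name": "Alice", "address": {"first_line": "1 Main St", "second_line": "Flat 2",
                                  "city": "London", "code": "E1 1AA"}},
    {"name": "Bob", "address": {"first_line": "5 High Rd", "second_line": "",
                                "city": "Leeds", "code": "LS1 4DT"}},
]

df = pd.DataFrame(hits)
# The address column holds whole dicts, not scalar values:
print(df["address"].iloc[0])  # {'first_line': '1 Main St', ...}
```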
Answer 1
Score: 0
Searching for a way to solve this problem, I came across my own answer and found it lacking, so I'll update it with a better way.
Specifically: pd.json_normalize(df['json_column'])
In context: pd.concat([df, pd.json_normalize(df['json_column'])], axis=1)
Then drop the original column if required.
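A minimal sketch of that approach, using a made-up dict column (the column name and data are placeholders, not from the original question):

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2],
    "address": [
        {"first_line": "1 Main St", "city": "London", "code": "E1 1AA"},
        {"first_line": "5 High Rd", "city": "Leeds", "code": "LS1 4DT"},
    ],
})

# Flatten the dict column in one vectorised call, then attach the new
# columns and drop the original dict column.
flat = pd.json_normalize(df["address"])
df = pd.concat([df.drop(columns="address"), flat], axis=1)

print(df.columns.tolist())  # ['id', 'first_line', 'city', 'code']
```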
Original answer from last year, which does the same thing much more slowly:
df.column_of_dicts.apply(pd.Series)
returns a DataFrame with those dicts flattened.
pd.concat([df, new_df], axis=1)
gets the new columns onto the old dataframe.
Then delete the original column_of_dicts.
pd.concat([df, df.address.apply(pd.Series)], axis=1)
is the actual code I used.
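For comparison, a sketch of that older approach on the same kind of placeholder data; it yields the same flattened columns, just row by row and therefore much more slowly:

```python
import pandas as pd

df = pd.DataFrame({
    "address": [
        {"first_line": "1 Main St", "city": "London"},
        {"first_line": "5 High Rd", "city": "Leeds"},
    ],
})

# apply(pd.Series) builds one Series per row, which is far slower than
# pd.json_normalize but produces the same flattened columns.
flat = df["address"].apply(pd.Series)
df = pd.concat([df, flat], axis=1).drop(columns="address")

print(df["city"].tolist())  # ['London', 'Leeds']
```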