How to efficiently flatten JSON structure returned in elasticsearch_dsl queries?
Question

I'm using elasticsearch_dsl to query and search an Elasticsearch DB.

One of the fields I'm querying is an address, which has a structure like this:

address.first_line
address.second_line
address.city
address.code

The returned documents hold this in JSON structures, with the address stored in a dictionary, with a field for each sub-field of the address.

I would like to put this into a (pandas) dataframe, with one column per sub-field of the address.

Directly putting the address into the dataframe gives me a column of address dictionaries, and iterating through the rows to manually unpack (json_normalize()) each address dictionary takes a long time (4 days, ~200,000 rows).

From the docs, I can't figure out how to get elasticsearch_dsl to return flattened results. Is there a faster way of doing this?
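To illustrate the slow path described above: unpacking row by row triggers one normalize call and one small DataFrame allocation per row. The dicts below are illustrative stand-ins for what elasticsearch_dsl hits yield (field values are invented):

```python
import pandas as pd

# Stand-ins for the per-hit dicts returned by an elasticsearch_dsl query
rows = [
    {"id": 1, "address": {"city": "Leeds", "code": "LS1"}},
    {"id": 2, "address": {"city": "York", "code": "YO1"}},
]
df = pd.DataFrame(rows)

# The slow path: one json_normalize call per row, producing
# ~200,000 one-row DataFrames before a final concat
parts = [pd.json_normalize(addr) for addr in df["address"]]
flattened = pd.concat(parts, ignore_index=True)
```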


Answer 1

Score: 0


Searching for a way to solve this problem, I came across my own answer and found it lacking, so I'm updating it with a better approach.

Specifically: pd.json_normalize(df['json_column'])

In context: pd.concat([df, pd.json_normalize(df['json_column'])], axis=1)

Then drop the original column if required.
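A minimal runnable sketch of this faster approach, on toy data (the column names and values are invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2],
    "address": [
        {"first_line": "1 Main St", "second_line": "Flat 2", "city": "Leeds", "code": "LS1"},
        {"first_line": "5 High St", "second_line": "", "city": "York", "code": "YO1"},
    ],
})

# One json_normalize call over the whole column, then attach the
# new columns and drop the original dict column
flat = pd.concat([df, pd.json_normalize(df["address"])], axis=1)
flat = flat.drop(columns=["address"])
```

Because json_normalize is called once over the entire column rather than once per row, this avoids allocating one tiny DataFrame per document.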

Original answer from last year, which does the same thing much more slowly:

df.column_of_dicts.apply(pd.Series) returns a DataFrame with those dicts flattened.

pd.concat([df, new_df], axis=1) attaches the new columns to the old dataframe.

Then delete the original column_of_dicts.

pd.concat([df, df.address.apply(pd.Series)], axis=1) is the actual code I used.
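The same slow approach as a runnable sketch on toy data (values invented). It produces the same flattened columns as json_normalize, but apply() builds one Series per row, which is far slower at scale:

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2],
    "address": [
        {"city": "Leeds", "code": "LS1"},
        {"city": "York", "code": "YO1"},
    ],
})

# Each dict becomes its own pd.Series, so apply() does per-row work
# that json_normalize would do in a single vectorised pass
result = pd.concat([df, df["address"].apply(pd.Series)], axis=1)
result = result.drop(columns=["address"])
```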

huangapple
  • Posted on 2020-01-06 22:57:04
  • Please retain this link when reposting: https://go.coder-hub.com/59614267.html