2023-03-07 03:11:03

Is it possible to join reference data into a nested dict in a pandas dataframe?

Question

I am trying to join two pandas DataFrames: the "left" table contains a column with a complex type (an array of dicts), and the "right" table is a flat reference table.

A pseudo-table representation of these is as follows:

left_df

parent_id	array_column
1	[{id: 1}, {id: 3}]
2	[{id: 2}, {id: 4}]

right_df

id	value
1	one
2	two
3	three
4	four

I'm aiming to look up/join the values from right_df into the array in the array_column of left_df using the ids, but have found this quite tricky.

Desired outcome:

parent_id	array_column
1	[{id: 1, value: 'one'}, {id: 3, value: 'three'}]
2	[{id: 2, value: 'two'}, {id: 4, value: 'four'}]

My naive approach to start with was to use a merge, as follows:

desired_df = pd.merge(left_df, right_df, how='outer', left_on="array_column.['id']", right_on='id')

Obviously this failed, and I'm not sure how to progress further. Effectively, the aim is to look up reference data onto dicts within an array, but after much searching I've not been able to articulate the problem well enough for a Google result to show something that helps.

Appreciate any guidance anyone can share on this, whether using pandas or not. Thank you!

Answer 1

Score: 0

Merge might not be the right approach here, since you are storing complex objects (lists of dicts) in the column. That said, you can build an id-to-value lookup from right_df and use it with map to add the new key-value pair to each dict in left_df:

# Series indexed by id, used as an id -> value lookup.
d = right_df.set_index('id')['value']
# Rebuild each dict with an added 'value' key; ids missing from d get None.
left_df['array_column'] = left_df['array_column'].map(lambda x: [{**y, 'value': d.get(y['id'])} for y in x])

Result:

   parent_id                                              array_column
0          1  [{'id': 1, 'value': 'one'}, {'id': 3, 'value': 'three'}]
1          2   [{'id': 2, 'value': 'two'}, {'id': 4, 'value': 'four'}]
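
For anyone who wants to run this end to end, here is a minimal self-contained sketch of the same lookup idea. The sample frames are rebuilt from the tables in the question, and the helper name add_values and its default parameter are hypothetical, not part of pandas. It assumes every dict in array_column has an 'id' key.

```python
import pandas as pd

# Sample data rebuilt from the tables in the question.
left_df = pd.DataFrame({
    "parent_id": [1, 2],
    "array_column": [[{"id": 1}, {"id": 3}], [{"id": 2}, {"id": 4}]],
})
right_df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "value": ["one", "two", "three", "four"],
})

# Plain dict mapping id -> value.
lookup = dict(zip(right_df["id"], right_df["value"]))


def add_values(records, lookup, default=None):
    """Hypothetical helper: return new dicts with a 'value' key added.

    Ids that are absent from the lookup get `default`.
    """
    return [{**rec, "value": lookup.get(rec["id"], default)} for rec in records]


# assign() returns a new frame, so left_df itself is left untouched.
result = left_df.assign(
    array_column=left_df["array_column"].map(lambda recs: add_values(recs, lookup))
)
print(result.to_dict("records"))
```

Building new dicts with {**rec, ...} rather than updating them in place keeps the original left_df unchanged, which can be handy when comparing before and after.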

Answer 2

Score: 0

With merge it would look like:

# One row per dict in array_column.
temp = left_df.explode("array_column")
# Join on the id stored inside each dict, then drop the merged "id" column.
temp = temp.merge(
    right_df, left_on=temp["array_column"].apply(lambda x: x.get("id")), right_on="id"
).drop(columns="id")
# Add the looked-up value to each dict.
temp["array_column"] = temp.apply(
    lambda x: {**x["array_column"], "value": x["value"]}, axis=1
)
# Collect the dicts back into one list per parent_id.
out = temp.groupby("parent_id")["array_column"].agg(list).reset_index()
print(out)
   parent_id                                        array_column
0          1  [{'id': 1, 'value': 'one'}, {'id': 3, 'value':...
1          2  [{'id': 2, 'value': 'two'}, {'id': 4, 'value':...
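
A possible variation on the explode/merge idea, shown as a self-contained sketch rather than a definitive implementation: pull the id out into its own column first, so the join is an ordinary column-to-column merge. The sample frames are rebuilt from the question. Using how="left" is my own choice here; with it, ids that have no match in right_df would stay in the result with a missing value instead of being dropped.

```python
import pandas as pd

# Sample data rebuilt from the tables in the question.
left_df = pd.DataFrame({
    "parent_id": [1, 2],
    "array_column": [[{"id": 1}, {"id": 3}], [{"id": 2}, {"id": 4}]],
})
right_df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "value": ["one", "two", "three", "four"],
})

# One row per dict, with the dict's id copied into a real column.
temp = left_df.explode("array_column", ignore_index=True)
temp["id"] = temp["array_column"].map(lambda d: d["id"])

# Ordinary join on the id column; a left join keeps every exploded row.
temp = temp.merge(right_df, on="id", how="left")

# Rebuild each dict with the looked-up value attached.
temp["array_column"] = [
    {**d, "value": v} for d, v in zip(temp["array_column"], temp["value"])
]

# Collect the dicts back into one list per parent_id.
out = (
    temp.groupby("parent_id", sort=False)["array_column"]
    .agg(list)
    .reset_index()
)
print(out.to_dict("records"))
```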
