英文:
Matching up array data to sub-array data
问题
I have a sub-array itemdata with a length of 6 rows. This data was originally in found in the main array but has been reformatted so there is 1 unique product per line.
我有一个子数组itemdata,长度为6行。这些数据最初在主数组中找到,但已经重新格式化,所以每行只有一个唯一的产品。
I have the main array saledata with a length of 4 rows which looks a bit like this:
我有一个主数组saledata,长度为4行,看起来有点像这样:
id sub-array
0 001 [{'type': 'line_items', 'id': '78', 'attributes': {'status': 'allocated', 'quantity': 1, 'various_other_data': 'etc'}}]
1 002 [{'type': 'line_items', 'id': '80', 'attributes': {'status': 'allocated', 'quantity': 2, 'various_other_data': 'etc'}}]
2 003 [{'type': 'line_items', 'id': '85', 'attributes': {'status': 'allocated', 'quantity': 1, 'various_other_data': 'etc'}}, {'type': 'line_items', 'id': '86', 'attributes': {'status': 'allocated', 'quantity': 1, 'various_other_data': 'etc'}}]
3 004 [{'type': 'line_items', 'id': '92', 'attributes': {'status': 'allocated', 'quantity': 2, 'various_other_data': 'etc'}}, {'type': 'line_items', 'id': '93', 'attributes': {'status': 'allocated', 'quantity': 2, 'various_other_data': 'etc'}}]
Then I have the sub-array itemdata (which is basically just json normalized column sub-array):
然后我有子数组itemdata(基本上只是json规范化的sub-array列):
type id attributes.status attributes.quantity attributes.various_other_data
0 line_item 78 allocated 1 etc
0 line_item 80 allocated 2 etc
0 line_item 85 allocated 1 etc
1 line_item 86 allocated 1 etc
0 line_item 92 allocated 2 etc
1 line_item 93 allocated 2 etc
At the moment, I'm treating sub-array as a string (after it's been json normalized for the second dataframe) which allows me to perform this:
目前,我将子数组视为一个字符串(在第二个数据框规范化为JSON之后),这使我能够执行以下操作:
for f in itemdata['id']:
df['sub-array'].str.contains(f)
Which yields the following:
这产生了以下结果:
0 True
1 False
2 False
3 False
Name: relationships.line_items.data, dtype: bool
0 False
1 True
2 False
3 False
Name: relationships.line_items.data, dtype: bool
0 False
1 False
2 True
3 False
Name: relationships.line_items.data, dtype: bool
0 False
1 False
2 True
3 False
Name: relationships.line_items.data, dtype: bool
0 False
1 False
2 False
3 True
Name: relationships.line_items.data, dtype: bool
0 False
1 False
2 False
3 True
Name: relationships.line_items.data, dtype: bool
Which is all correct! But now I'm trying to match up the sub-array to the parent array, matching the index of the above results to the initial array saledata where True but struggling to find the right way to do this.
这都是正确的!但现在我正在尝试将子数组与父数组匹配,将上述结果的索引与初始数组saledata匹配为True,但不知道如何正确做到这一点。
Python doesn't seem to like the below approach (truth value of a Series is ambiguous yada yada yada) and not sure how to proceed.
Python似乎不喜欢下面的方法(Series的真值是含糊的之类的),不确定该如何继续。
for f in itemdata['id']:
if df['sub-array'].str.contains(f) == True:
Any advice greatly appreciated!
非常感谢任何建议!
Edit:
This is what I'm looking for (note the etc's are off & unsure of pandas will allow multiple rows to have the same index value - not a huge issue if not):
这是我正在寻找的(请注意etc的问题,不确定pandas是否允许多行具有相同的索引值 - 如果不允许,这不是一个大问题):
id type itemdata.id itemdata.attributes.status itemdata.attributes.quantity
0 001 line_items 78 allocated etc
1 002 line_items 80 allocated etc
2 003 line_items 85 allocated etc
2 003 line_items 86 allocated etc
3 004 line_items 92 allocated etc
3 004 line_items 93 allocated etc
英文:
I have a sub-array itemdata with a length of 6 rows. This data was originally in found in the main array but has been reformatted so there is 1 unique product per line.
I have the main array saledata with a length of 4 rows which looks a bit like this:
id sub-array
0 001 [{'type': 'line_items', 'id': '78', 'attributes': {'status': 'allocated', 'quantity': 1, 'various_other_data': 'etc'}}]
1 002 [{'type': 'line_items', 'id': '80', 'attributes': {'status': 'allocated', 'quantity': 2, 'various_other_data': 'etc'}}]
2 003 [{'type': 'line_items', 'id': '85', 'attributes': {'status': 'allocated', 'quantity': 1, 'various_other_data': 'etc'}}, {'type': 'line_items', 'id': '86', 'attributes': {'status': 'allocated', 'quantity': 1, 'various_other_data': 'etc'}}]
3 004 [{'type': 'line_items', 'id': '92', 'attributes': {'status': 'allocated', 'quantity': 2, 'various_other_data': 'etc'}}, {'type': 'line_items', 'id': '93', 'attributes': {'status': 'allocated', 'quantity': 2, 'various_other_data': 'etc'}}]
Then I have the sub-array itemdata (which is basically just json normalized column sub-array):
type id attributes.status attributes.quantity attributes.various_other_data
0 line_item 78 allocated 1 etc
0 line_item 80 allocated 2 etc
0 line_item 85 allocated 1 etc
1 line_item 86 allocated 1 etc
0 line_item 92 allocated 2 etc
1 line_item 93 allocated 2 etc
At the moment, I'm treating sub-array as a string (after it's been json normalized for the second dataframe) which allows me to perform this:
for f in itemdata['id']:
df['sub-array'].str.contains(f)
Which yields the following:
0 True
1 False
2 False
3 False
Name: relationships.line_items.data, dtype: bool
0 False
1 True
2 False
3 False
Name: relationships.line_items.data, dtype: bool
0 False
1 False
2 True
3 False
Name: relationships.line_items.data, dtype: bool
0 False
1 False
2 True
3 False
Name: relationships.line_items.data, dtype: bool
0 False
1 False
2 False
3 True
Name: relationships.line_items.data, dtype: bool
0 False
1 False
2 False
3 True
Name: relationships.line_items.data, dtype: bool
Which is all correct! But now I'm trying to match up the sub-array to the parent array, matching the index of the above results to the initial array saledata where True but struggling to find the right way to do this.
Python doesn't seem to like the below approach (truth value of a Series is ambiguous yada yada yada) and not sure how to proceed.
for f in itemdata['id']:
if df['sub-array'].str.contains(f) == True:
Any advice greatly appreciated!
Edit:
This is what I'm looking for (note the etc's are off & unsure of pandas will allow multiple rows to have the same index value - not a huge issue if not):
id type itemdata.id itemdata.attributes.status itemdata.attributes.quantity
0 001 line_items 78 allocated etc
1 002 line_items 80 allocated etc
2 003 line_items 85 allocated etc
2 003 line_items 86 allocated etc
3 004 line_items 92 allocated etc
3 004 line_items 93 allocated etc
答案1
得分: 1
你可以使用 DataFrame.join
来在规范化 sub-array
并通过 Series.explode
按行的方式设置索引后,将 id
(或多个列)附加到它:
import ast
df['sub-array'] = df['sub-array'].apply(ast.literal_eval)
s = df['sub-array'].explode()
cols = ['id']
df = df[cols].add_suffix('_parent').join(pd.json_normalize(s).set_index(s.index))
print(df)
id_parent type id attributes.status attributes.quantity \
0 001 line_items 78 allocated 1
1 002 line_items 80 allocated 2
2 003 line_items 85 allocated 1
2 003 line_items 86 allocated 1
3 004 line_items 92 allocated 2
3 004 line_items 93 allocated 2
attributes.various_other_data
0 etc
1 etc
2 etc
2 etc
3 etc
3 etc
如果只需处理 id
列,且 id
值是唯一的,可以创建辅助 Series 并使用 Series.map
:
s = df.set_index('id')['sub-array'].apply(ast.literal_eval).explode().str.get('id')
df['id_parent'] = df['id'].map(s)
英文:
You can use DataFrame.join
if need append id
(or multiple columns) after normalize sub-array
with set indices by exploded rows by Series.explode
:
import ast
df['sub-array'] = df['sub-array'].apply(ast.literal_eval)
s = df['sub-array'].explode()
cols = ['id']
df = df[cols].add_suffix('_parent').join(pd.json_normalize(s).set_index(s.index))
print (df)
id_parent type id attributes.status attributes.quantity \
0 001 line_items 78 allocated 1
1 002 line_items 80 allocated 2
2 003 line_items 85 allocated 1
2 003 line_items 86 allocated 1
3 004 line_items 92 allocated 2
3 004 line_items 93 allocated 2
attributes.various_other_data
0 etc
1 etc
2 etc
2 etc
3 etc
3 etc
If need processing only id
column and id
values are unique create helper Series and use Series.map
:
s = df.set_index('id')['sub-array'].apply(ast.literal_eval).explode().str.get('id')
df['id_parent'] = df['id'].map(s)
答案2
得分: 1
你可以通过以下方式生成主数组:
import pandas as pd
# 示例数据
saledata = pd.DataFrame({
'id': ['001', '002', '003', '004'],
'sub-array': [[{'type': 'line_items', 'id': '78', 'attributes': {'status': 'allocated', 'quantity': 1, 'various_other_data': 'etc'}}],
[{'type': 'line_items', 'id': '80', 'attributes': {'status': 'allocated', 'quantity': 2, 'various_other_data': 'etc'}}],
[{'type': 'line_items', 'id': '85', 'attributes': {'status': 'allocated', 'quantity': 1, 'various_other_data': 'etc'}},
{'type': 'line_items', 'id': '86', 'attributes': {'status': 'allocated', 'quantity': 1, 'various_other_data': 'etc'}}],
[{'type': 'line_items', 'id': '92', 'attributes': {'status': 'allocated', 'quantity': 2, 'various_other_data': 'etc'}},
{'type': 'line_items', 'id': '93', 'attributes': {'status': 'allocated', 'quantity': 2, 'various_other_data': 'etc'}}]
]
})
itemdata = pd.DataFrame({
'type': ['line_item', 'line_item', 'line_item', 'line_item', 'line_item', 'line_item'],
'id': ['78', '80', '85', '86', '92', '93'],
'attributes.status': ['allocated', 'allocated', 'allocated', 'allocated', 'allocated', 'allocated'],
'attributes.quantity': [1, 2, 1, 1, 2, 2],
'attributes.various_other_data': ['etc', 'etc', 'etc', 'etc', 'etc', 'etc']
})
import numpy as np
item_id2sale_ids = {i_id: np.where(df['sub-array'].apply(lambda x: any(item['id'] == i_id for item in x))) for i_id in itemdata['id']}
item_id2sale_ids
请注意,代码中的中文字符已经翻译并转换为英文字符。
英文:
So you can generate the main array by:
import pandas as pd
# Sample data
saledata = pd.DataFrame({
'id': ['001', '002', '003', '004'],
'sub-array': [[{'type': 'line_items', 'id': '78', 'attributes': {'status': 'allocated', 'quantity': 1, 'various_other_data': 'etc'}}],
[{'type': 'line_items', 'id': '80', 'attributes': {'status': 'allocated', 'quantity': 2, 'various_other_data': 'etc'}}],
[{'type': 'line_items', 'id': '85', 'attributes': {'status': 'allocated', 'quantity': 1, 'various_other_data': 'etc'}},
{'type': 'line_items', 'id': '86', 'attributes': {'status': 'allocated', 'quantity': 1, 'various_other_data': 'etc'}}],
[{'type': 'line_items', 'id': '92', 'attributes': {'status': 'allocated', 'quantity': 2, 'various_other_data': 'etc'}},
{'type': 'line_items', 'id': '93', 'attributes': {'status': 'allocated', 'quantity': 2, 'various_other_data': 'etc'}}]
]
})
itemdata = pd.DataFrame({
'type': ['line_item', 'line_item', 'line_item', 'line_item', 'line_item', 'line_item'],
'id': ['78', '80', '85', '86', '92', '93'],
'attributes.status': ['allocated', 'allocated', 'allocated', 'allocated', 'allocated', 'allocated'],
'attributes.quantity': [1, 2, 1, 1, 2, 2],
'attributes.various_other_data': ['etc', 'etc', 'etc', 'etc', 'etc', 'etc']
})
In [4]: import numpy as np
In [5]: item_id2sale_ids = {i_id: np.where(df['sub-array'].apply(lambda x: any(item['id'] == i_id for item in x))) for i_id
...: in itemdata['id']}
In [6]: item_id2sale_ids
Out[6]:
{'78': (array([0], dtype=int32),),
'80': (array([1], dtype=int32),),
'85': (array([2], dtype=int32),),
'86': (array([2], dtype=int32),),
'92': (array([3], dtype=int32),),
'93': (array([3], dtype=int32),)}
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论