匹配数组数据与子数组数据

huangapple go评论66阅读模式
英文:

Matching up array data to sub-array data

问题

I have a sub-array itemdata with a length of 6 rows. This data was originally in found in the main array but has been reformatted so there is 1 unique product per line.

我有一个子数组itemdata,长度为6行。这些数据最初在主数组中找到,但已经重新格式化,所以每行只有一个唯一的产品。

I have the main array saledata with a length of 4 rows which looks a bit like this:

我有一个主数组saledata,长度为4行,看起来有点像这样:

                id    sub-array
            0   001   [{'type': 'line_items', 'id': '78', 'attributes': {'status': 'allocated', 'quantity': 1, 'various_other_data': 'etc'}}]
            1   002   [{'type': 'line_items', 'id': '80', 'attributes': {'status': 'allocated', 'quantity': 2, 'various_other_data': 'etc'}}]
            2   003   [{'type': 'line_items', 'id': '85', 'attributes': {'status': 'allocated', 'quantity': 1, 'various_other_data': 'etc'}}, {'type': 'line_items', 'id': '86', 'attributes': {'status': 'allocated', 'quantity': 1, 'various_other_data': 'etc'}}]
            3   004   [{'type': 'line_items', 'id': '92', 'attributes': {'status': 'allocated', 'quantity': 2, 'various_other_data': 'etc'}}, {'type': 'line_items', 'id': '93', 'attributes': {'status': 'allocated', 'quantity': 2, 'various_other_data': 'etc'}}]

Then I have the sub-array itemdata (which is basically just json normalized column sub-array):

然后我有子数组itemdata(基本上只是json规范化的sub-array列):

        type        id   attributes.status   attributes.quantity    attributes.various_other_data
    0   line_item   78   allocated           1                      etc
    0   line_item   80   allocated           2                      etc
    0   line_item   85   allocated           1                      etc
    1   line_item   86   allocated           1                      etc
    0   line_item   92   allocated           2                      etc
    1   line_item   93   allocated           2                      etc

At the moment, I'm treating sub-array as a string (after it's been json normalized for the second dataframe) which allows me to perform this:

目前,我将子数组视为一个字符串(在第二个数据框规范化为JSON之后),这使我能够执行以下操作:

for f in itemdata['id']:
    df['sub-array'].str.contains(f)

Which yields the following:

这产生了以下结果:

0     True
1    False
2    False
3    False
Name: relationships.line_items.data, dtype: bool
0    False
1     True
2    False
3    False
Name: relationships.line_items.data, dtype: bool
0    False
1    False
2     True
3    False
Name: relationships.line_items.data, dtype: bool
0    False
1    False
2     True
3    False
Name: relationships.line_items.data, dtype: bool
0    False
1    False
2    False
3     True
Name: relationships.line_items.data, dtype: bool
0    False
1    False
2    False
3     True
Name: relationships.line_items.data, dtype: bool

Which is all correct! But now I'm trying to match up the sub-array to the parent array, matching the index of the above results to the initial array saledata where True but struggling to find the right way to do this.

这都是正确的!但现在我正在尝试将子数组与父数组匹配,将上述结果的索引与初始数组saledata匹配为True,但不知道如何正确做到这一点。

Python doesn't seem to like the below approach (truth value of a Series is ambiguous yada yada yada) and not sure how to proceed.

Python似乎不喜欢下面的方法(Series的真值是含糊的之类的),不确定该如何继续。

for f in itemdata['id']:
    if df['sub-array'].str.contains(f) == True:

Any advice greatly appreciated!

非常感谢任何建议!

Edit:

This is what I'm looking for (note the etc's are off & unsure of pandas will allow multiple rows to have the same index value - not a huge issue if not):

这是我正在寻找的(请注意etc的问题,不确定pandas是否允许多行具有相同的索引值 - 如果不允许,这不是一个大问题):

             id   type         itemdata.id   itemdata.attributes.status   itemdata.attributes.quantity
        0   001   line_items   78            allocated              etc
        1   002   line_items   80            allocated              etc
        2   003   line_items   85            allocated              etc
        2   003   line_items   86            allocated              etc
        3   004   line_items   92            allocated              etc
        3   004   line_items   93            allocated              etc
英文:

I have a sub-array itemdata with a length of 6 rows. This data was originally in found in the main array but has been reformatted so there is 1 unique product per line.

I have the main array saledata with a length of 4 rows which looks a bit like this:

            id    sub-array
        0   001   [{'type': 'line_items', 'id': '78', 'attributes': {'status': 'allocated', 'quantity': 1, 'various_other_data': 'etc'}}]
        1   002   [{'type': 'line_items', 'id': '80', 'attributes': {'status': 'allocated', 'quantity': 2, 'various_other_data': 'etc'}}]
        2   003   [{'type': 'line_items', 'id': '85', 'attributes': {'status': 'allocated', 'quantity': 1, 'various_other_data': 'etc'}}, {'type': 'line_items', 'id': '86', 'attributes': {'status': 'allocated', 'quantity': 1, 'various_other_data': 'etc'}}]
        3   004   [{'type': 'line_items', 'id': '92', 'attributes': {'status': 'allocated', 'quantity': 2, 'various_other_data': 'etc'}}, {'type': 'line_items', 'id': '93', 'attributes': {'status': 'allocated', 'quantity': 2, 'various_other_data': 'etc'}}]

Then I have the sub-array itemdata (which is basically just json normalized column sub-array):

    type        id   attributes.status   attributes.quantity    attributes.various_other_data
0   line_item   78   allocated           1                      etc
0   line_item   80   allocated           2                      etc
0   line_item   85   allocated           1                      etc
1   line_item   86   allocated           1                      etc
0   line_item   92   allocated           2                      etc
1   line_item   93   allocated           2                      etc

At the moment, I'm treating sub-array as a string (after it's been json normalized for the second dataframe) which allows me to perform this:

for f in itemdata['id']:
    df['sub-array'].str.contains(f)

Which yields the following:

0     True
1    False
2    False
3    False
Name: relationships.line_items.data, dtype: bool
0    False
1     True
2    False
3    False
Name: relationships.line_items.data, dtype: bool
0    False
1    False
2     True
3    False
Name: relationships.line_items.data, dtype: bool
0    False
1    False
2     True
3    False
Name: relationships.line_items.data, dtype: bool
0    False
1    False
2    False
3     True
Name: relationships.line_items.data, dtype: bool
0    False
1    False
2    False
3     True
Name: relationships.line_items.data, dtype: bool

Which is all correct! But now I'm trying to match up the sub-array to the parent array, matching the index of the above results to the initial array saledata where True but struggling to find the right way to do this.

Python doesn't seem to like the below approach (truth value of a Series is ambiguous yada yada yada) and not sure how to proceed.

for f in itemdata['id']:
    if df['sub-array'].str.contains(f) == True:

Any advice greatly appreciated!

Edit:

This is what I'm looking for (note the etc's are off & unsure of pandas will allow multiple rows to have the same index value - not a huge issue if not):

         id   type         itemdata.id   itemdata.attributes.status   itemdata.attributes.quantity
    0   001   line_items   78            allocated              etc
    1   002   line_items   80            allocated              etc
    2   003   line_items   85            allocated              etc
    2   003   line_items   86            allocated              etc
    3   004   line_items   92            allocated              etc
    3   004   line_items   93            allocated              etc

答案1

得分: 1

你可以使用 DataFrame.join 来在规范化 sub-array 并通过 Series.explode 按行的方式设置索引后,将 id(或多个列)附加到它:

import ast

df['sub-array'] = df['sub-array'].apply(ast.literal_eval)

s = df['sub-array'].explode()

cols = ['id']
df = df[cols].add_suffix('_parent').join(pd.json_normalize(s).set_index(s.index))
print(df)
  id_parent        type  id attributes.status  attributes.quantity  \
0       001  line_items  78         allocated                    1   
1       002  line_items  80         allocated                    2   
2       003  line_items  85         allocated                    1   
2       003  line_items  86         allocated                    1   
3       004  line_items  92         allocated                    2   
3       004  line_items  93         allocated                    2   

  attributes.various_other_data  
0                           etc  
1                           etc  
2                           etc  
2                           etc  
3                           etc  
3                           etc  

如果只需处理 id 列,且 id 值是唯一的,可以创建辅助 Series 并使用 Series.map

s = df.set_index('id')['sub-array'].apply(ast.literal_eval).explode().str.get('id')
df['id_parent'] = df['id'].map(s)
英文:

You can use DataFrame.join if need append id (or multiple columns) after normalize sub-array with set indices by exploded rows by Series.explode:

import ast

df['sub-array'] = df['sub-array'].apply(ast.literal_eval)

s = df['sub-array'].explode()

cols = ['id']
df = df[cols].add_suffix('_parent').join(pd.json_normalize(s).set_index(s.index))
print (df)
  id_parent        type  id attributes.status  attributes.quantity  \
0       001  line_items  78         allocated                    1   
1       002  line_items  80         allocated                    2   
2       003  line_items  85         allocated                    1   
2       003  line_items  86         allocated                    1   
3       004  line_items  92         allocated                    2   
3       004  line_items  93         allocated                    2   

  attributes.various_other_data  
0                           etc  
1                           etc  
2                           etc  
2                           etc  
3                           etc  
3                           etc  

If need processing only id column and id values are unique create helper Series and use Series.map:

s = df.set_index('id')['sub-array'].apply(ast.literal_eval).explode().str.get('id')
df['id_parent'] = df['id'].map(s)

答案2

得分: 1

你可以通过以下方式生成主数组:

import pandas as pd

# 示例数据
saledata = pd.DataFrame({
    'id': ['001', '002', '003', '004'],
    'sub-array': [[{'type': 'line_items', 'id': '78', 'attributes': {'status': 'allocated', 'quantity': 1, 'various_other_data': 'etc'}}],
                  [{'type': 'line_items', 'id': '80', 'attributes': {'status': 'allocated', 'quantity': 2, 'various_other_data': 'etc'}}],
                  [{'type': 'line_items', 'id': '85', 'attributes': {'status': 'allocated', 'quantity': 1, 'various_other_data': 'etc'}},
                   {'type': 'line_items', 'id': '86', 'attributes': {'status': 'allocated', 'quantity': 1, 'various_other_data': 'etc'}}],
                  [{'type': 'line_items', 'id': '92', 'attributes': {'status': 'allocated', 'quantity': 2, 'various_other_data': 'etc'}},
                   {'type': 'line_items', 'id': '93', 'attributes': {'status': 'allocated', 'quantity': 2, 'various_other_data': 'etc'}}]
                 ]
})

itemdata = pd.DataFrame({
    'type': ['line_item', 'line_item', 'line_item', 'line_item', 'line_item', 'line_item'],
    'id': ['78', '80', '85', '86', '92', '93'],
    'attributes.status': ['allocated', 'allocated', 'allocated', 'allocated', 'allocated', 'allocated'],
    'attributes.quantity': [1, 2, 1, 1, 2, 2],
    'attributes.various_other_data': ['etc', 'etc', 'etc', 'etc', 'etc', 'etc']
})
import numpy as np

item_id2sale_ids = {i_id: np.where(df['sub-array'].apply(lambda x: any(item['id'] == i_id for item in x))) for i_id in itemdata['id']}

item_id2sale_ids

请注意,代码中的中文字符已经翻译并转换为英文字符。

英文:

So you can generate the main array by:

import pandas as pd

# Sample data
saledata = pd.DataFrame({
    'id': ['001', '002', '003', '004'],
    'sub-array': [[{'type': 'line_items', 'id': '78', 'attributes': {'status': 'allocated', 'quantity': 1, 'various_other_data': 'etc'}}],
                  [{'type': 'line_items', 'id': '80', 'attributes': {'status': 'allocated', 'quantity': 2, 'various_other_data': 'etc'}}],
                  [{'type': 'line_items', 'id': '85', 'attributes': {'status': 'allocated', 'quantity': 1, 'various_other_data': 'etc'}},
                   {'type': 'line_items', 'id': '86', 'attributes': {'status': 'allocated', 'quantity': 1, 'various_other_data': 'etc'}}],
                  [{'type': 'line_items', 'id': '92', 'attributes': {'status': 'allocated', 'quantity': 2, 'various_other_data': 'etc'}},
                   {'type': 'line_items', 'id': '93', 'attributes': {'status': 'allocated', 'quantity': 2, 'various_other_data': 'etc'}}]
                 ]
})

itemdata = pd.DataFrame({
    'type': ['line_item', 'line_item', 'line_item', 'line_item', 'line_item', 'line_item'],
    'id': ['78', '80', '85', '86', '92', '93'],
    'attributes.status': ['allocated', 'allocated', 'allocated', 'allocated', 'allocated', 'allocated'],
    'attributes.quantity': [1, 2, 1, 1, 2, 2],
    'attributes.various_other_data': ['etc', 'etc', 'etc', 'etc', 'etc', 'etc']
})
In [4]: import numpy as np                                                                                                 

In [5]: item_id2sale_ids = {i_id: np.where(df['sub-array'].apply(lambda x: any(item['id'] == i_id for item in x))) for i_id
   ...:  in itemdata['id']}                                                                                                

In [6]: item_id2sale_ids                                                                                                   
Out[6]: 
{'78': (array([0], dtype=int32),),
 '80': (array([1], dtype=int32),),
 '85': (array([2], dtype=int32),),
 '86': (array([2], dtype=int32),),
 '92': (array([3], dtype=int32),),
 '93': (array([3], dtype=int32),)}

huangapple
  • 本文由 发表于 2023年6月9日 13:57:45
  • 转载请务必保留本文链接:https://go.coder-hub.com/76437569.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定