Pandas: Convert group into list of jsons without using groupby or apply
Suggested answer
You can achieve the desired result by using groupby together with a custom aggregation function that converts each group's rows into dictionaries. Here's one way to structure it:
import pandas as pd

# Your item_dict and item_df here...

# Define a custom aggregation function to convert a group's rows to dictionaries
def to_dict_records(group):
    return [dict(zip(group.columns, row)) for row in group.itertuples(index=False)]

# Group by the key columns, keep only the value columns, and aggregate each group
value_cols = ["BarCode", "Description", "Category", "Code", "Quantity", "Price"]
result_df = (
    item_df.groupby(["BarCode_x", "Extracted_Code", "Unique_Code_x"])[value_cols]
    .apply(to_dict_records)
    .reset_index(name="Grouped")
)

# Print the result_df
print(result_df)
This keeps the row-to-dictionary conversion in a named helper and iterates each group's rows with itertuples. Note, however, that it still goes through groupby + apply, so on a very large dataframe the speed-up over the question's original one-liner is likely to be modest.
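As a quick sanity check (not part of the original answer), the result can be compared against the question's one-liner on the sample data; a minimal sketch:

# Hypothetical check: both approaches should yield identical rows on the sample data.
expected = (
    item_df.groupby(["BarCode_x", "Extracted_Code", "Unique_Code_x"])[
        ["BarCode", "Description", "Category", "Code", "Quantity", "Price"]
    ]
    .apply(lambda g: g.to_dict("records"))
    .reset_index(name="Grouped")
)
print(result_df.equals(expected))  # should print True if the two results match exactly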
Question
I have an item dataframe such as:
import pandas as pd

item_dict = {
'index': [18, 24, 25, 26, 30, 31, 37, 38, 61, 62, 63, 67, 68, 69],
'BarCode_x': ['12345678ABCD', '12345678IJKL', '12345678IJKL', '12345678IJKL', '12345678EFGH', '12345678EFGH', '67890123IJKL', '67890123IJKL', '67890123ABCD', '67890123ABCD', '67890123ABCD', '67890123EFGH', '67890123EFGH', '67890123EFGH'],
'Extracted_Code': ['12345678', '12345678', '12345678', '12345678', '12345678', '12345678', '67890123', '67890123', '67890123', '67890123', '67890123', '67890123', '67890123', '67890123'],
'Description_x': ['Apples', 'Mangoes', 'Mangoes', 'Mangoes', 'Oranges', 'Oranges', 'Oats', 'Oats', 'Yoghurt', 'Yoghurt', 'Yoghurt', 'Cookies', 'Cookies', 'Cookies'],
'Unique_Code_x': ['EFG', 'LMO', 'LMO', 'LMO', 'JKL', 'JKL', 'OPZ', 'OPZ', 'YQA', 'YQA', 'YQA', 'CDF', 'CDF', 'CDF'],
'Category_x': ['M', 'S', 'S', 'S', 'T', 'T', 'F', 'F', 'M', 'M', 'M', 'M', 'M', 'M'],
'Code_x': [1, 4, 4, 4, 2, 2, 2, 2, 3, 3, 3, 4, 4, 4],
'Quantity_x': [52, 90, 90, 90, 11, 11, 90, 90, 52, 52, 52, 11, 11, 11],
'Price_x': [15.6, 67.0, 67.0, 67.0, 12.9, 12.9, 67.0, 67.0, 15.6, 15.6, 15.6, 12.9, 12.9, 12.9],
'BarCode': ['12345678AAAA', '12345678AAAA', '12345678BBBB', '12345678CCCC', '12345678AAAA', '12345678BBBB', '67890123XXXX', '67890123YYYY', '67890123XXXX', '67890123YYYY', '67890123ZZZZ', '67890123XXXX', '67890123YYYY', '67890123ZZZZ'],
'Description': ['Fruits', 'Fruits', 'Fruits', 'Fruits', 'Fruits', 'Fruits', 'Snacks', 'Snacks', 'Snacks', 'Snacks', 'Snacks', 'Snacks', 'Snacks', 'Snacks'],
'Unique_Code': ['ABC', 'ABC', 'ABC', 'ABC', 'ABC', 'ABC', 'XYZ', 'XYZ', 'XYZ', 'XYZ', 'XYZ', 'XYZ', 'XYZ', 'XYZ'],
'Category': ['H', 'H', 'H', 'H', 'H', 'H', 'H', 'H', 'H', 'H', 'H', 'H', 'H', 'H'],
'Code': [0, 0, 2, 3, 0, 2, 0, 2, 0, 2, 3, 0, 2, 3],
'Quantity': [99, 99, 77, 10, 99, 77, 99, 77, 99, 77, 10, 99, 77, 10],
'Price': [12.0, 12.0, 10.5, 11.0, 12.0, 10.5, 12.0, 10.5, 12.0, 10.5, 11.0, 12.0, 10.5, 11.0]
}
item_df = pd.DataFrame(item_dict)
I am trying to group the dataframe based on ['BarCode_x', 'Extracted_Code', 'Unique_Code_x'], convert each group into a list of jsons and store it in a new column Grouped. My desired result is:
BarCode_x Extracted_Code Unique_Code_x Grouped
12345678ABCD 12345678 EFG [{'BarCode': '12345678AAAA', 'Description': 'Fruits', 'Category': 'H', 'Code': 0, 'Quantity': 99, 'Price': 12.0}]
12345678EFGH 12345678 JKL [{'BarCode': '12345678AAAA', 'Description': 'Fruits', 'Category': 'H', 'Code': 0, 'Quantity': 99, 'Price': 12.0}, {'BarCode': '12345678BBBB', 'Description': 'Fruits', 'Category': 'H', 'Code': 2, 'Quantity': 77, 'Price': 10.5}]
12345678IJKL 12345678 LMO [{'BarCode': '12345678AAAA', 'Description': 'Fruits', 'Category': 'H', 'Code': 0, 'Quantity': 99, 'Price': 12.0}, {'BarCode': '12345678BBBB', 'Description': 'Fruits', 'Category': 'H', 'Code': 2, 'Quantity': 77, 'Price': 10.5}, {'BarCode': '12345678CCCC', 'Description': 'Fruits', 'Category': 'H', 'Code': 3, 'Quantity': 10, 'Price': 11.0}]
67890123ABCD 67890123 YQA [{'BarCode': '67890123XXXX', 'Description': 'Snacks', 'Category': 'H', 'Code': 0, 'Quantity': 99, 'Price': 12.0}, {'BarCode': '67890123YYYY', 'Description': 'Snacks', 'Category': 'H', 'Code': 2, 'Quantity': 77, 'Price': 10.5}, {'BarCode': '67890123ZZZZ', 'Description': 'Snacks', 'Category': 'H', 'Code': 3, 'Quantity': 10, 'Price': 11.0}]
67890123EFGH 67890123 CDF [{'BarCode': '67890123XXXX', 'Description': 'Snacks', 'Category': 'H', 'Code': 0, 'Quantity': 99, 'Price': 12.0}, {'BarCode': '67890123YYYY', 'Description': 'Snacks', 'Category': 'H', 'Code': 2, 'Quantity': 77, 'Price': 10.5}, {'BarCode': '67890123ZZZZ', 'Description': 'Snacks', 'Category': 'H', 'Code': 3, 'Quantity': 10, 'Price': 11.0}]
67890123IJKL 67890123 OPZ [{'BarCode': '67890123XXXX', 'Description': 'Snacks', 'Category': 'H', 'Code': 0, 'Quantity': 99, 'Price': 12.0}, {'BarCode': '67890123YYYY', 'Description': 'Snacks', 'Category': 'H', 'Code': 2, 'Quantity': 77, 'Price': 10.5}]
This is what I have done:
item_df.groupby(['BarCode_x', 'Extracted_Code', 'Unique_Code_x'])[["BarCode", "Description", "Category", "Code", "Quantity", "Price"]].apply(lambda group: group.to_dict("records")).reset_index(name="Grouped")
The item_df shown above is a small representation of another dataframe that contains over 3 million records. When I apply the above logic using groupby+apply, the process takes 2 hours to complete, which is not feasible. Therefore, is there any way I can achieve the same result in a shorter amount of time using another optimized method instead of groupby+apply?
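To time candidate solutions on something closer to the real data size, one option is to tile the 14-row sample; a minimal sketch, assuming plain repetition of the sample rows is representative enough for benchmarking:

import pandas as pd

# Hypothetical helper (not from the original post): repeat the sample rows until the
# frame has roughly `n` rows, so candidate answers can be timed at a realistic scale.
def scale_up(df, n=3_000_000):
    reps = max(1, n // len(df))
    return pd.concat([df] * reps, ignore_index=True)

big_df = scale_up(item_df)
print(len(big_df))  # roughly 3 million rows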
Answer 1
Score: 1
You can try:
# Build the groups in a single pass over the columns: (a, b, c) is the group key,
# *d collects the remaining per-row values.
out = {}
for a, b, c, *d in zip(
    item_df.BarCode_x,
    item_df.Extracted_Code,
    item_df.Unique_Code_x,
    item_df.BarCode,
    item_df.Description,
    item_df.Category,
    item_df.Code,
    item_df.Quantity,
    item_df.Price,
):
    out.setdefault((a, b, c), []).append(d)

# Turn each group's rows into dictionaries and assemble the final dataframe.
ks = ["BarCode", "Description", "Category", "Code", "Quantity", "Price"]
df = pd.DataFrame(
    ((*k, [dict(zip(ks, l)) for l in v]) for k, v in out.items()),
    columns=["BarCode_x", "Extracted_Code", "Unique_Code_x", "Grouped"],
)
print(df)
Prints:
BarCode_x Extracted_Code Unique_Code_x Grouped
0 12345678ABCD 12345678 EFG [{'BarCode': '12345678AAAA', 'Description': 'Fruits', 'Category': 'H', 'Code': 0, 'Quantity': 99, 'Price': 12.0}]
1 12345678IJKL 12345678 LMO [{'BarCode': '12345678AAAA', 'Description': 'Fruits', 'Category': 'H', 'Code': 0, 'Quantity': 99, 'Price': 12.0}, {'BarCode': '12345678BBBB', 'Description': 'Fruits', 'Category': 'H', 'Code': 2, 'Quantity': 77, 'Price': 10.5}, {'BarCode': '12345678CCCC', 'Description': 'Fruits', 'Category': 'H', 'Code': 3, 'Quantity': 10, 'Price': 11.0}]
2 12345678EFGH 12345678 JKL [{'BarCode': '12345678AAAA', 'Description': 'Fruits', 'Category': 'H', 'Code': 0, 'Quantity': 99, 'Price': 12.0}, {'BarCode': '12345678BBBB', 'Description': 'Fruits', 'Category': 'H', 'Code': 2, 'Quantity': 77, 'Price': 10.5}]
3 67890123IJKL 67890123 OPZ [{'BarCode': '67890123XXXX', 'Description': 'Snacks', 'Category': 'H', 'Code': 0, 'Quantity': 99, 'Price': 12.0}, {'BarCode': '67890123YYYY', 'Description': 'Snacks', 'Category': 'H', 'Code': 2, 'Quantity': 77, 'Price': 10.5}]
4 67890123ABCD 67890123 YQA [{'BarCode': '67890123XXXX', 'Description': 'Snacks', 'Category': 'H', 'Code': 0, 'Quantity': 99, 'Price': 12.0}, {'BarCode': '67890123YYYY', 'Description': 'Snacks', 'Category': 'H', 'Code': 2, 'Quantity': 77, 'Price': 10.5}, {'BarCode': '67890123ZZZZ', 'Description': 'Snacks', 'Category': 'H', 'Code': 3, 'Quantity': 10, 'Price': 11.0}]
5 67890123EFGH 67890123 CDF [{'BarCode': '67890123XXXX', 'Description': 'Snacks', 'Category': 'H', 'Code': 0, 'Quantity': 99, 'Price': 12.0}, {'BarCode': '67890123YYYY', 'Description': 'Snacks', 'Category': 'H', 'Code': 2, 'Quantity': 77, 'Price': 10.5}, {'BarCode': '67890123ZZZZ', 'Description': 'Snacks', 'Category': 'H', 'Code': 3, 'Quantity': 10, 'Price': 11.0}]
Quick benchmark:
from timeit import timeit
def fn1(item_df):
x = (
item_df.groupby(["BarCode_x", "Extracted_Code", "Unique_Code_x"])[
["BarCode", "Description", "Category", "Code", "Quantity", "Price"]
]
.apply(lambda group: group.to_dict("records"))
.reset_index(name="Grouped")
)
return x
def fn2(item_df):
out = {}
for a, b, c, *d in zip(
item_df.BarCode_x,
item_df.Extracted_Code,
item_df.Unique_Code_x,
item_df.BarCode,
item_df.Description,
item_df.Category,
item_df.Code,
item_df.Quantity,
item_df.Price,
):
out.setdefault((a, b, c), []).append(d)
ks = ["BarCode", "Description", "Category", "Code", "Quantity", "Price"]
return pd.DataFrame(
((*k, [dict(zip(ks, l)) for l in v]) for k, v in out.items()),
columns=["BarCode_x", "Extracted_Code", "Unique_Code_x", "Grouped"],
)
t1 = timeit('fn1(x)', 'x=item_df.copy()',number=1000, globals=globals())
t2 = timeit('fn2(x)', 'x=item_df.copy()',number=1000, globals=globals())
print(t1)
print(t2)
Prints on my machine (AMD 5700X, Ubuntu 20.04, Python 3.10.9, Pandas 1.5.3):
2.507308159954846
0.1901476769708097
So it should be ~12x faster.
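One caveat: the Grouped column above contains Python dictionaries, not JSON strings. If actual JSON text is needed (as "list of jsons" in the title may imply), a minimal follow-up sketch, assuming the df built above:

import json

# Hypothetical extra step (not part of the original answer): serialise each group's
# list of dicts into a JSON string. If numpy scalar types end up inside the dicts,
# json.dumps may need a default= handler or a prior conversion to native Python types.
df["Grouped_json"] = [json.dumps(records) for records in df["Grouped"]]
print(df.loc[0, "Grouped_json"])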
Answer 2
Score: 1
Here is an alternative approach (that avoids using GroupBy):
from collections import OrderedDict
use_cols = [
"BarCode_x", "Extracted_Code", "Unique_Code_x", # Groupers
"BarCode", "Description", "Category", "Code", # Aggs (p1)
"Quantity", "Price" # Aggs (p2)
]
def gby_records(df):
data = {}
for r in df[use_cols].to_dict("records"):
key = tuple(r[col] for col in use_cols[:3])
data.setdefault(key, []).append(OrderedDict((k, r[k]) for k in use_cols[3:]))
for k, rs in data.items():
yield {**dict(zip(use_cols[:3], k)), "Grouped": rs}
out = pd.DataFrame(gby_records(item_df))
Output:

|   | BarCode_x    | Extracted_Code | Unique_Code_x | Grouped                                            |
|---|--------------|----------------|---------------|----------------------------------------------------|
| 0 | 12345678ABCD | 12345678       | EFG           | [{'BarCode': '12345678AAAA', 'Description': 'F...  |
| 1 | 12345678IJKL | 12345678       | LMO           | [{'BarCode': '12345678AAAA', 'Description': 'F...  |
| 2 | 12345678EFGH | 12345678       | JKL           | [{'BarCode': '12345678AAAA', 'Description': 'F...  |
| 3 | 67890123IJKL | 67890123       | OPZ           | [{'BarCode': '67890123XXXX', 'Description': 'S...  |
| 4 | 67890123ABCD | 67890123       | YQA           | [{'BarCode': '67890123XXXX', 'Description': 'S...  |
| 5 | 67890123EFGH | 67890123       | CDF           | [{'BarCode': '67890123XXXX', 'Description': 'S...  |
Timings:
🥉 %%timeit #Animeartist
5.84 ms ± 340 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
🥈 %%timeit #Timeless
1.28 ms ± 51.4 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
🥇 %%timeit #Andrej Kesely
612 µs ± 23.8 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)