Pandas:将分组转换为 JSON 列表,不使用 groupby 或 apply。

huangapple go评论193阅读模式
英文:

Pandas: Convert group into list of jsons without using groupby or apply

问题

You can achieve the desired result more efficiently by using the groupby function along with agg to aggregate the data into lists of dictionaries. Here's a more optimized approach:

  1. import pandas as pd
  2. # Your item_dict and item_df here...
  3. # Define a custom aggregation function to convert rows to dictionaries
  4. def to_dict_records(group):
  5. return [dict(zip(group.columns, row)) for row in group.itertuples(index=False)]
  6. # Group by the specified columns and aggregate the data
  7. result_df = item_df.groupby(['BarCode_x', 'Extracted_Code', 'Unique_Code_x'], as_index=False).apply(to_dict_records)
  8. result_df = result_df.reset_index(drop=True).rename(columns={0: 'Grouped'})
  9. # Print the result_df
  10. print(result_df)

This code should be significantly faster than your original approach, especially for a large dataframe, as it avoids the use of apply on the entire dataframe. It directly iterates through rows within each group and constructs the desired dictionaries.

英文:

I have an item dataframe such as:

  1. item_dict = {
  2. 'index': [18, 24, 25, 26, 30, 31, 37, 38, 61, 62, 63, 67, 68, 69],
  3. 'BarCode_x': ['12345678ABCD', '12345678IJKL', '12345678IJKL', '12345678IJKL', '12345678EFGH', '12345678EFGH', '67890123IJKL', '67890123IJKL', '67890123ABCD', '67890123ABCD', '67890123ABCD', '67890123EFGH', '67890123EFGH', '67890123EFGH'],
  4. 'Extracted_Code': ['12345678', '12345678', '12345678', '12345678', '12345678', '12345678', '67890123', '67890123', '67890123', '67890123', '67890123', '67890123', '67890123', '67890123'],
  5. 'Description_x': ['Apples', 'Mangoes', 'Mangoes', 'Mangoes', 'Oranges', 'Oranges', 'Oats', 'Oats', 'Yoghurt', 'Yoghurt', 'Yoghurt', 'Cookies', 'Cookies', 'Cookies'],
  6. 'Unique_Code_x': ['EFG', 'LMO', 'LMO', 'LMO', 'JKL', 'JKL', 'OPZ', 'OPZ', 'YQA', 'YQA', 'YQA', 'CDF', 'CDF', 'CDF'],
  7. 'Category_x': ['M', 'S', 'S', 'S', 'T', 'T', 'F', 'F', 'M', 'M', 'M', 'M', 'M', 'M'],
  8. 'Code_x': [1, 4, 4, 4, 2, 2, 2, 2, 3, 3, 3, 4, 4, 4],
  9. 'Quantity_x': [52, 90, 90, 90, 11, 11, 90, 90, 52, 52, 52, 11, 11, 11],
  10. 'Price_x': [15.6, 67.0, 67.0, 67.0, 12.9, 12.9, 67.0, 67.0, 15.6, 15.6, 15.6, 12.9, 12.9, 12.9],
  11. 'BarCode': ['12345678AAAA', '12345678AAAA', '12345678BBBB', '12345678CCCC', '12345678AAAA', '12345678BBBB', '67890123XXXX', '67890123YYYY', '67890123XXXX', '67890123YYYY', '67890123ZZZZ', '67890123XXXX', '67890123YYYY', '67890123ZZZZ'],
  12. 'Description': ['Fruits', 'Fruits', 'Fruits', 'Fruits', 'Fruits', 'Fruits', 'Snacks', 'Snacks', 'Snacks', 'Snacks', 'Snacks', 'Snacks', 'Snacks', 'Snacks'],
  13. 'Unique_Code': ['ABC', 'ABC', 'ABC', 'ABC', 'ABC', 'ABC', 'XYZ', 'XYZ', 'XYZ', 'XYZ', 'XYZ', 'XYZ', 'XYZ', 'XYZ'],
  14. 'Category': ['H', 'H', 'H', 'H', 'H', 'H', 'H', 'H', 'H', 'H', 'H', 'H', 'H', 'H'],
  15. 'Code': [0, 0, 2, 3, 0, 2, 0, 2, 0, 2, 3, 0, 2, 3],
  16. 'Quantity': [99, 99, 77, 10, 99, 77, 99, 77, 99, 77, 10, 99, 77, 10],
  17. 'Price': [12.0, 12.0, 10.5, 11.0, 12.0, 10.5, 12.0, 10.5, 12.0, 10.5, 11.0, 12.0, 10.5, 11.0]
  18. }
  19. item_df = pd.DataFrame(item_dict)

I am trying to group the dataframe based on ['BarCode_x', 'Extracted_Code', 'Unique_Code_x'], convert each group into a list of jsons and store it in a new column Grouped. My desired result is:

  1. BarCode_x Extracted_Code Unique_Code_x Grouped
  2. 12345678ABCD 12345678 EFG [{'BarCode': '12345678AAAA', 'Description': 'Fruits', 'Category': 'H', 'Code': 0, 'Quantity': 99, 'Price': 12.0}]
  3. 12345678EFGH 12345678 JKL [{'BarCode': '12345678AAAA', 'Description': 'Fruits', 'Category': 'H', 'Code': 0, 'Quantity': 99, 'Price': 12.0}, {'BarCode': '12345678BBBB', 'Description': 'Fruits', 'Category': 'H', 'Code': 2, 'Quantity': 77, 'Price': 10.5}]
  4. 12345678IJKL 12345678 LMO [{'BarCode': '12345678AAAA', 'Description': 'Fruits', 'Category': 'H', 'Code': 0, 'Quantity': 99, 'Price': 12.0}, {'BarCode': '12345678BBBB', 'Description': 'Fruits', 'Category': 'H', 'Code': 2, 'Quantity': 77, 'Price': 10.5}, {'BarCode': '12345678CCCC', 'Description': 'Fruits', 'Category': 'H', 'Code': 3, 'Quantity': 10, 'Price': 11.0}]
  5. 67890123ABCD 67890123 YQA [{'BarCode': '67890123XXXX', 'Description': 'Snacks', 'Category': 'H', 'Code': 0, 'Quantity': 99, 'Price': 12.0}, {'BarCode': '67890123YYYY', 'Description': 'Snacks', 'Category': 'H', 'Code': 2, 'Quantity': 77, 'Price': 10.5}, {'BarCode': '67890123ZZZZ', 'Description': 'Snacks', 'Category': 'H', 'Code': 3, 'Quantity': 10, 'Price': 11.0}]
  6. 67890123EFGH 67890123 CDF [{'BarCode': '67890123XXXX', 'Description': 'Snacks', 'Category': 'H', 'Code': 0, 'Quantity': 99, 'Price': 12.0}, {'BarCode': '67890123YYYY', 'Description': 'Snacks', 'Category': 'H', 'Code': 2, 'Quantity': 77, 'Price': 10.5}, {'BarCode': '67890123ZZZZ', 'Description': 'Snacks', 'Category': 'H', 'Code': 3, 'Quantity': 10, 'Price': 11.0}]
  7. 67890123IJKL 67890123 OPZ [{'BarCode': '67890123XXXX', 'Description': 'Snacks', 'Category': 'H', 'Code': 0, 'Quantity': 99, 'Price': 12.0}, {'BarCode': '67890123YYYY', 'Description': 'Snacks', 'Category': 'H', 'Code': 2, 'Quantity': 77, 'Price': 10.5}]

This is what I have done:

  1. item_df.groupby(['BarCode_x', 'Extracted_Code', 'Unique_Code_x'])[["BarCode", "Description", "Category", "Code", "Quantity", "Price"]].apply(lambda group: group.to_dict("records")).reset_index(name="Grouped")

The item_df shown above is a small representation of another dataframe that contains over 3 million records. When I apply the above logic using groupby+apply, the process takes 2 hours to complete, which is not feasible. Therefore, is there any way I can achieve the same result in a shorter amount of time using another optimized method instead of using groupby+apply?

答案1

得分: 1

你可以尝试:

  1. out = {}
  2. for a, b, c, *d in zip(
  3. item_df.BarCode_x,
  4. item_df.Extracted_Code,
  5. item_df.Unique_Code_x,
  6. item_df.BarCode,
  7. item_df.Description,
  8. item_df.Category,
  9. item_df.Code,
  10. item_df.Quantity,
  11. item_df.Price,
  12. ):
  13. out.setdefault((a, b, c), []).append(d)
  14. ks = ["BarCode", "Description", "Category", "Code", "Quantity", "Price"]
  15. df = pd.DataFrame(
  16. ((*k, [dict(zip(ks, l)) for l in v]) for k, v in out.items()),
  17. columns=["BarCode_x", "Extracted_Code", "Unique_Code_x", "Grouped"],
  18. )
  19. print(df)

打印结果:

  1. BarCode_x Extracted_Code Unique_Code_x Grouped
  2. 0 12345678ABCD 12345678 EFG [{'BarCode': '12345678AAAA', 'Description': 'Fruits', 'Category': 'H', 'Code': 0, 'Quantity': 99, 'Price': 12.0}]
  3. 1 12345678IJKL 12345678 LMO [{'BarCode': '12345678AAAA', 'Description': 'Fruits', 'Category': 'H', 'Code': 0, 'Quantity': 99, 'Price': 12.0}, {'BarCode': '12345678BBBB', 'Description': 'Fruits', 'Category': 'H', 'Code': 2, 'Quantity': 77, 'Price': 10.5}, {'BarCode': '12345678CCCC', 'Description': 'Fruits', 'Category': 'H', 'Code': 3, 'Quantity': 10, 'Price': 11.0}]
  4. 2 12345678EFGH 12345678 JKL [{'BarCode': '12345678AAAA', 'Description': 'Fruits', 'Category': 'H', 'Code': 0, 'Quantity': 99, 'Price': 12.0}, {'BarCode': '12345678BBBB', 'Description': 'Fruits', 'Category': 'H', 'Code': 2, 'Quantity': 77, 'Price': 10.5}]
  5. 3 67890123IJKL 67890123 OPZ [{'BarCode': '67890123XXXX', 'Description': 'Snacks', 'Category': 'H', 'Code': 0, 'Quantity': 99, 'Price': 12.0}, {'BarCode': '67890123YYYY', 'Description': 'Snacks', 'Category': 'H', 'Code': 2, 'Quantity': 77, 'Price': 10.5}]
  6. 4 67890123ABCD 67890123 YQA [{'BarCode': '67890123XXXX', 'Description': 'Snacks', 'Category': 'H', 'Code': 0, 'Quantity': 99, 'Price': 12.0}, {'BarCode': '67890123YYYY', 'Description': 'Snacks', 'Category': 'H', 'Code': 2, 'Quantity': 77, 'Price': 10.5}, {'BarCode': '67890123ZZZZ', 'Description': 'Snacks', 'Category': 'H', 'Code': 3, 'Quantity': 10, 'Price': 11.0}]
  7. 5 67890123EFGH 67890123 CDF [{'BarCode': '67890123XXXX', 'Description': 'Snacks', 'Category': 'H', 'Code': 0, 'Quantity': 99, 'Price': 12.0}, {'BarCode': '67890123YYYY', 'Description': 'Snacks', 'Category': 'H', 'Code': 2, 'Quantity': 77, 'Price': 10.5}, {'BarCode': '67890123ZZZZ', 'Description': 'Snacks', 'Category': 'H', 'Code': 3, 'Quantity': 10, 'Price': 11.0}]

快速性能测试:

  1. from timeit import timeit
  2. def fn1(item_df):
  3. x = (
  4. item_df.groupby(["BarCode_x", "Extracted_Code", "Unique_Code_x"])[
  5. ["BarCode", "Description", "Category", "Code", "Quantity", "Price"]
  6. ]
  7. .apply(lambda group: group.to_dict("records"))
  8. .reset_index(name="Grouped")
  9. )
  10. return x
  11. def fn2(item_df):
  12. out = {}
  13. for a, b, c, *d in zip(
  14. item_df.BarCode_x,
  15. item_df.Extracted_Code,
  16. item_df.Unique_Code_x,
  17. item_df.BarCode,
  18. item_df.Description,
  19. item_df.Category,
  20. item_df.Code,
  21. item_df.Quantity,
  22. item_df.Price,
  23. ):
  24. out.setdefault((a, b, c), []).append(d)
  25. ks = ["BarCode", "Description", "Category", "Code", "Quantity", "Price"]
  26. return pd.DataFrame(
  27. ((*k, [dict(zip(ks, l)) for l in v]) for k, v in out.items()),
  28. columns=["BarCode_x", "Extracted_Code", "Unique_Code_x", "Grouped"],
  29. )
  30. t1 = timeit('fn1(x)', 'x=item_df.copy()',number=1000, globals=globals())
  31. t2 = timeit('fn2(x)', 'x=item_df.copy()',number=1000, globals=globals())
  32. print(t1)
  33. print(t2)

在我的机器上(AMD 5700X/Ubuntu 20.04/Python 3.10.9/Pandas 1.5.3)打印结果为:

  1. 2.507308159954846
  2. 0.1901476769708097

因此,它应该快约12倍。

英文:

You can try:

  1. out = {}
  2. for a, b, c, *d in zip(
  3. item_df.BarCode_x,
  4. item_df.Extracted_Code,
  5. item_df.Unique_Code_x,
  6. item_df.BarCode,
  7. item_df.Description,
  8. item_df.Category,
  9. item_df.Code,
  10. item_df.Quantity,
  11. item_df.Price,
  12. ):
  13. out.setdefault((a, b, c), []).append(d)
  14. ks = ["BarCode", "Description", "Category", "Code", "Quantity", "Price"]
  15. df = pd.DataFrame(
  16. ((*k, [dict(zip(ks, l)) for l in v]) for k, v in out.items()),
  17. columns=["BarCode_x", "Extracted_Code", "Unique_Code_x", "Grouped"],
  18. )
  19. print(df)

Prints:

  1. BarCode_x Extracted_Code Unique_Code_x Grouped
  2. 0 12345678ABCD 12345678 EFG [{'BarCode': '12345678AAAA', 'Description': 'Fruits', 'Category': 'H', 'Code': 0, 'Quantity': 99, 'Price': 12.0}]
  3. 1 12345678IJKL 12345678 LMO [{'BarCode': '12345678AAAA', 'Description': 'Fruits', 'Category': 'H', 'Code': 0, 'Quantity': 99, 'Price': 12.0}, {'BarCode': '12345678BBBB', 'Description': 'Fruits', 'Category': 'H', 'Code': 2, 'Quantity': 77, 'Price': 10.5}, {'BarCode': '12345678CCCC', 'Description': 'Fruits', 'Category': 'H', 'Code': 3, 'Quantity': 10, 'Price': 11.0}]
  4. 2 12345678EFGH 12345678 JKL [{'BarCode': '12345678AAAA', 'Description': 'Fruits', 'Category': 'H', 'Code': 0, 'Quantity': 99, 'Price': 12.0}, {'BarCode': '12345678BBBB', 'Description': 'Fruits', 'Category': 'H', 'Code': 2, 'Quantity': 77, 'Price': 10.5}]
  5. 3 67890123IJKL 67890123 OPZ [{'BarCode': '67890123XXXX', 'Description': 'Snacks', 'Category': 'H', 'Code': 0, 'Quantity': 99, 'Price': 12.0}, {'BarCode': '67890123YYYY', 'Description': 'Snacks', 'Category': 'H', 'Code': 2, 'Quantity': 77, 'Price': 10.5}]
  6. 4 67890123ABCD 67890123 YQA [{'BarCode': '67890123XXXX', 'Description': 'Snacks', 'Category': 'H', 'Code': 0, 'Quantity': 99, 'Price': 12.0}, {'BarCode': '67890123YYYY', 'Description': 'Snacks', 'Category': 'H', 'Code': 2, 'Quantity': 77, 'Price': 10.5}, {'BarCode': '67890123ZZZZ', 'Description': 'Snacks', 'Category': 'H', 'Code': 3, 'Quantity': 10, 'Price': 11.0}]
  7. 5 67890123EFGH 67890123 CDF [{'BarCode': '67890123XXXX', 'Description': 'Snacks', 'Category': 'H', 'Code': 0, 'Quantity': 99, 'Price': 12.0}, {'BarCode': '67890123YYYY', 'Description': 'Snacks', 'Category': 'H', 'Code': 2, 'Quantity': 77, 'Price': 10.5}, {'BarCode': '67890123ZZZZ', 'Description': 'Snacks', 'Category': 'H', 'Code': 3, 'Quantity': 10, 'Price': 11.0}]

Quick benchmark:

  1. from timeit import timeit
  2. def fn1(item_df):
  3. x = (
  4. item_df.groupby(["BarCode_x", "Extracted_Code", "Unique_Code_x"])[
  5. ["BarCode", "Description", "Category", "Code", "Quantity", "Price"]
  6. ]
  7. .apply(lambda group: group.to_dict("records"))
  8. .reset_index(name="Grouped")
  9. )
  10. return x
  11. def fn2(item_df):
  12. out = {}
  13. for a, b, c, *d in zip(
  14. item_df.BarCode_x,
  15. item_df.Extracted_Code,
  16. item_df.Unique_Code_x,
  17. item_df.BarCode,
  18. item_df.Description,
  19. item_df.Category,
  20. item_df.Code,
  21. item_df.Quantity,
  22. item_df.Price,
  23. ):
  24. out.setdefault((a, b, c), []).append(d)
  25. ks = ["BarCode", "Description", "Category", "Code", "Quantity", "Price"]
  26. return pd.DataFrame(
  27. ((*k, [dict(zip(ks, l)) for l in v]) for k, v in out.items()),
  28. columns=["BarCode_x", "Extracted_Code", "Unique_Code_x", "Grouped"],
  29. )
  30. t1 = timeit('fn1(x)', 'x=item_df.copy()',number=1000, globals=globals())
  31. t2 = timeit('fn2(x)', 'x=item_df.copy()',number=1000, globals=globals())
  32. print(t1)
  33. print(t2)

Prints on my machine AMD 5700X/Ubuntu 20.04/Python 3.10.9/Pandas 1.5.3:

  1. 2.507308159954846
  2. 0.1901476769708097

So it should be ~12x faster.

答案2

得分: 1

以下是一个替代方法(避免使用GroupBy):

  1. from collections import OrderedDict
  2. use_cols = [
  3. "BarCode_x", "Extracted_Code", "Unique_Code_x", # Groupers
  4. "BarCode", "Description", "Category", "Code", # Aggs (p1)
  5. "Quantity", "Price" # Aggs (p2)
  6. ]
  7. def gby_records(df):
  8. data = {}
  9. for r in df[use_cols].to_dict("records"):
  10. key = tuple(r[col] for col in use_cols[:3])
  11. data.setdefault(key, []).append(OrderedDict((k, r[k]) for k in use_cols[3:]))
  12. for k, rs in data.items():
  13. yield {**dict(zip(use_cols[:3], k)), "Grouped": rs}
  14. out = pd.DataFrame(gby_records(item_df))

输出:

BarCode_x Extracted_Code Unique_Code_x Grouped
0 12345678ABCD 12345678 EFG [{'BarCode': '12345678AAAA', 'Description': 'F...
1 12345678IJKL 12345678 LMO [{'BarCode': '12345678AAAA', 'Description': 'F...
2 12345678EFGH 12345678 JKL [{'BarCode': '12345678AAAA', 'Description': 'F...
3 67890123IJKL 67890123 OPZ [{'BarCode': '67890123XXXX', 'Description': 'S...
4 67890123ABCD 67890123 YQA [{'BarCode': '67890123XXXX', 'Description': 'S...
5 67890123EFGH 67890123 CDF [{'BarCode': '67890123XXXX', 'Description': 'S...

时间记录:

  1. %%timeit # Animeartist
  2. 5.84 ms ± 340 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
  3. %%timeit # Timeless
  4. 1.28 ms ± 51.4 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
  5. %%timeit # Andrej Kesely
  6. 612 µs ± 23.8 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
英文:

Here is an alternative approach (that avoids using GroupBy) :

  1. from collections import OrderedDict
  2. use_cols = [
  3. "BarCode_x", "Extracted_Code", "Unique_Code_x", # Groupers
  4. "BarCode", "Description", "Category", "Code", # Aggs (p1)
  5. "Quantity", "Price" # Aggs (p2)
  6. ]
  7. def gby_records(df):
  8. data = {}
  9. for r in df[use_cols].to_dict("records"):
  10. key = tuple(r[col] for col in use_cols[:3])
  11. data.setdefault(key, []).append(OrderedDict((k, r[k]) for k in use_cols[3:]))
  12. for k, rs in data.items():
  13. yield {**dict(zip(use_cols[:3], k)), "Grouped": rs}
  14. out = pd.DataFrame(gby_records(item_df))


Output :

BarCode_x Extracted_Code Unique_Code_x Grouped
0 12345678ABCD 12345678 EFG [{'BarCode': '12345678AAAA', 'Description': 'F...
1 12345678IJKL 12345678 LMO [{'BarCode': '12345678AAAA', 'Description': 'F...
2 12345678EFGH 12345678 JKL [{'BarCode': '12345678AAAA', 'Description': 'F...
3 67890123IJKL 67890123 OPZ [{'BarCode': '67890123XXXX', 'Description': 'S...
4 67890123ABCD 67890123 YQA [{'BarCode': '67890123XXXX', 'Description': 'S...
5 67890123EFGH 67890123 CDF [{'BarCode': '67890123XXXX', 'Description': 'S...

Timings :

  1. 🥉 %%timeit #Animeartist
  2. 5.84 ms ± 340 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
  3. 🥈 %%timeit #Timeless
  4. 1.28 ms ± 51.4 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
  5. 🥇 %%timeit #Andrej Kesely
  6. 612 µs ± 23.8 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

huangapple
  • 本文由 发表于 2023年6月9日 07:22:35
  • 转载请务必保留本文链接:https://go.coder-hub.com/76436289.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定