2023-06-22 01:23:57

Dataframe: Permutations after grouping column data into a dictionary/list with recurring elements

Question

I'm struggling to reshape a large dataset into condensed rows that I can run permutations on to get all possible combinations.
Basically, this is the structure:

	PLOT	SEED	SUB	EDGE	PT	0	1	2	3
1	117	1	1	1	2	8	None	None	None
2	117	1	1	1	3	8	None	None	None
3	117	1	1	2	1	8	None	None	None
4	117	1	1	2	2	6	7	8	None
5	117	1	2	1	1	8	None	None	None

I'm not sure if I should use lists or dictionaries, as this data should have thousands to even millions of rows in the future.

The idea is to add the ['SUB','EDGE','PT'] values to each value in the [0:] columns (it won't necessarily end at 3), so that for example row number 4 will look like:

	PLOT	SEED	SUB	TYPES
4	117	1	1	[1,2,2,6],[1,2,2,7],[1,2,2,8]

OR dictionaries:

	PLOT	SEED	SUB	TYPES
4	117	1	1	{'SUB':1, 'EDGE':2, 'PT':2, 'TYPE':6},{'SUB':1, 'EDGE':2, 'PT':2, 'TYPE':7},{'SUB':1, 'EDGE':2, 'PT':2, 'TYPE':8}

And then run permutations on each ['PLOT','SEED'] using the ['SUB'] column.
That means that if Plot 117, Seed 1 has 2 Subs, it will run all possible combinations between all the options in Sub1 and all the options in Sub2, while keeping the ['PLOT','SEED'] data. So in this case (building on the original 5-row table at the top):

# 1st Sub:
[1,1,2,8],[1,1,3,8],[1,2,1,8],[1,2,2,6],[1,2,2,7],[1,2,2,8]
# 2nd Sub:
[2,1,1,8]

And what I should get is:

#Plot 117 Seed 1 Permutations:
[[1,1,2,8],[2,1,1,8]], [[1,1,3,8],[2,1,1,8]], [[1,2,1,8],[2,1,1,8]], ...
# Or with dictionaries:
[{'SUB':1, 'EDGE':2, 'PT':2, 'TYPE':6},{'SUB':2, 'EDGE':1, 'PT':1, 'TYPE':8}], ...

Obviously this needs to iterate over the entire dataset for each ['PLOT','SEED'], of which there will be many, and also in many cases there will be more than 2 'SUB' per PLOT+SEED.

I tried many things like 'groupby', for example:

df_grouped_edge = df_joined.groupby(['PLOT','SEED','SUB','EDGE'], as_index=False).agg(np.array)
df_grouped_edge

For which I get:

	PLOT	SEED	SUB	EDGE	PT	0	1	2	3
0	117	1	1	1	[2, 3]	[8, 8]	[None, None]	[None, None]	[None, None]
1	117	1	1	2	[1, 2, 3]	[8, 6, 8]	[None, 7, None]	[None, 8, None]	[None, None, None]

This is not what I'm looking for, or at least I don't think it is, and I don't know how to work with that output. I tried many other grouping approaches over the last two days but can't seem to get it to work. I understand that looping over a large dataset item by item or column by column is the wrong way to go about it, and that I should vectorize it; I just don't understand how yet.

I hope I explained myself well. I know this is a big question, but I'm really at a loss.
I'm starting to learn SQL very soon, but I'm not there yet and I have to get this done.

Thank you very much!
Lev

Answer 1

Score: 1

First, group your data by the columns ['PLOT', 'SEED']. This will allow you to iterate over each unique combination of ['PLOT', 'SEED'] in your dataset.

grouped_data = df.groupby(['PLOT', 'SEED'])
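
To make the grouping step concrete, here is a minimal sketch of what iterating over the groups yields. It assumes the five sample rows from the question, with the value columns named by the integers 0 to 3; `group_sizes` is a name introduced only for this illustration.

```python
import pandas as pd

# Sample rows copied from the question (assumption: value columns are named 0..3).
df = pd.DataFrame(
    [
        [117, 1, 1, 1, 2, 8, None, None, None],
        [117, 1, 1, 1, 3, 8, None, None, None],
        [117, 1, 1, 2, 1, 8, None, None, None],
        [117, 1, 1, 2, 2, 6, 7, 8, None],
        [117, 1, 2, 1, 1, 8, None, None, None],
    ],
    columns=["PLOT", "SEED", "SUB", "EDGE", "PT", 0, 1, 2, 3],
)

grouped_data = df.groupby(["PLOT", "SEED"])

# Each iteration yields the (PLOT, SEED) key and the sub-DataFrame holding its rows.
group_sizes = {key: len(group) for key, group in grouped_data}
print(group_sizes)
```

With this sample there is a single (PLOT, SEED) pair, so the dictionary has one entry covering all five rows.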

Define a function that takes a single group and generates the combinations for it. This function is then called once for each group.

import itertools
import pandas as pd

def generate_combinations(group):
    # Reshape to one row per (SUB, EDGE, PT, TYPE) option and drop the empty cells
    long = group.melt(id_vars=['PLOT', 'SEED', 'SUB', 'EDGE', 'PT'], value_name='TYPE')
    long = long.dropna(subset=['TYPE'])

    # Collect the options of each SUB as [SUB, EDGE, PT, TYPE] lists
    options = [sub[['SUB', 'EDGE', 'PT', 'TYPE']].values.tolist()
               for _, sub in long.groupby('SUB')]

    # A combination picks exactly one option from every SUB in the group
    combinations = [list(c) for c in itertools.product(*options)]

    # Keep the PLOT and SEED values next to every combination
    return pd.DataFrame({'PLOT': group['PLOT'].iloc[0],
                         'SEED': group['SEED'].iloc[0],
                         'COMBINATION': combinations})

Apply the generate_combinations function to each group in the grouped data and concatenate the results.

result = pd.concat([generate_combinations(group) for _, group in grouped_data], ignore_index=True)

Now, the result should contain the DataFrame with all the possible combinations for each unique ['PLOT', 'SEED'] combination, based on the values in the 'SUB' column.

Answer 2

Score: 0

I ended up solving it using .melt():

# melt() returns a new DataFrame, so the result has to be assigned back
df = df.melt(id_vars=["PLOT", "SEED", "SUB", "EDGE", "PT"],
             var_name="DROP",
             value_name="TYPO",
             ignore_index=True)  # False here leaves duplicates of the original indexes
df = df.drop('DROP', axis=1)
# Drop scenarios with no matching typos
df = df.dropna()
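
As a rough illustration, the reshaping step can be tried on the five sample rows from the question. This is only a sketch: the integer column names 0 to 3 and the final cast of `TYPO` to `int` are assumptions made here, and `long_df` is a name introduced for the example.

```python
import pandas as pd

# Sample rows copied from the question (assumption: value columns are named 0..3).
df = pd.DataFrame(
    [
        [117, 1, 1, 1, 2, 8, None, None, None],
        [117, 1, 1, 1, 3, 8, None, None, None],
        [117, 1, 1, 2, 1, 8, None, None, None],
        [117, 1, 1, 2, 2, 6, 7, 8, None],
        [117, 1, 2, 1, 1, 8, None, None, None],
    ],
    columns=["PLOT", "SEED", "SUB", "EDGE", "PT", 0, 1, 2, 3],
)

long_df = (
    df.melt(id_vars=["PLOT", "SEED", "SUB", "EDGE", "PT"],
            var_name="DROP", value_name="TYPO")
      .drop(columns="DROP")        # the original column label is not needed
      .dropna(subset=["TYPO"])     # remove the empty cells
      .astype({"TYPO": int})       # assumption: every remaining value is an integer
      .reset_index(drop=True)
)
print(long_df)
```

Every non-empty cell of the value columns becomes its own row, so the seven filled cells in the sample give seven rows.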

After getting each typo value as its own row, I used .groupby(['PLOT', 'SEED']) to split the data into dictionaries, then iterated over each entry in the dict and used groupby('SUB') to put everything in place.

That left me with a large list of dicts. From each dict I took only the key that holds the list of lists (one inner list of options per SUB) and used itertools.product to get all the combinations between those lists.
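
The grouping and product steps described above might look roughly like the following. This is a sketch under assumptions, not the exact code that was used: the long-format rows are written out by hand for the question's sample, and the names `scenarios`, `options_per_sub` and `COMBINATIONS` are made up for the illustration.

```python
import itertools
import pandas as pd

# Long-format rows (one TYPO per row), written out by hand for the question's sample.
long_df = pd.DataFrame(
    [
        [117, 1, 1, 1, 2, 8],
        [117, 1, 1, 1, 3, 8],
        [117, 1, 1, 2, 1, 8],
        [117, 1, 1, 2, 2, 6],
        [117, 1, 1, 2, 2, 7],
        [117, 1, 1, 2, 2, 8],
        [117, 1, 2, 1, 1, 8],
    ],
    columns=["PLOT", "SEED", "SUB", "EDGE", "PT", "TYPO"],
)

scenarios = []
for (plot, seed), group in long_df.groupby(["PLOT", "SEED"]):
    # One inner list per SUB, each holding that SUB's [SUB, EDGE, PT, TYPO] options.
    options_per_sub = [
        sub_rows[["SUB", "EDGE", "PT", "TYPO"]].values.tolist()
        for _, sub_rows in group.groupby("SUB")
    ]
    # A combination takes exactly one option from every SUB.
    combos = [list(c) for c in itertools.product(*options_per_sub)]
    scenarios.append({"PLOT": int(plot), "SEED": int(seed), "COMBINATIONS": combos})

print(len(scenarios[0]["COMBINATIONS"]))
print(scenarios[0]["COMBINATIONS"][0])
```

For the sample, SUB 1 has six options and SUB 2 has one, so Plot 117, Seed 1 ends up with six combinations, the first being [[1, 1, 2, 8], [2, 1, 1, 8]].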

Obviously this required a lot of work on the data structures, adding back the corresponding data into each dict, probably not the cleanest work, but it did the job.

My problem now is that .product stopped working due to MemoryError when the dataset became too big:
https://stackoverflow.com/questions/76559405/using-np-meshgrid-on-list-of-lists-with-different-lengths-in-a-large-dataset-in
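
One thing that may help with the memory problem: itertools.product is lazy, so the combinations only fill memory once they are collected into a list. The sketch below counts the combinations up front and then consumes them in batches. The option lists are taken from the question's sample, `iter_batches` is a hypothetical helper, and whether this is enough for the full dataset is an assumption, since the total number of combinations stays the same.

```python
import itertools
import math

# Option lists, one inner list per SUB (values taken from the question's sample).
options_per_sub = [
    [[1, 1, 2, 8], [1, 1, 3, 8], [1, 2, 1, 8], [1, 2, 2, 6], [1, 2, 2, 7], [1, 2, 2, 8]],
    [[2, 1, 1, 8]],
]

# The number of combinations is the product of the list lengths,
# so it can be checked before anything is generated.
total = math.prod(len(options) for options in options_per_sub)

def iter_batches(option_lists, batch_size):
    """Yield combinations in lists of at most batch_size, without storing them all."""
    combos = itertools.product(*option_lists)
    while True:
        batch = list(itertools.islice(combos, batch_size))
        if not batch:
            return
        yield batch

batch_sizes = [len(batch) for batch in iter_batches(options_per_sub, batch_size=4)]
print(total, batch_sizes)
```

Each batch can be processed or written to disk and then discarded, so only `batch_size` combinations are held in memory at any time.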
