Dataframe: Permutations after grouping column data into a dictionary/list with recurring elements

Question

I'm struggling with manipulating large sets of data into condensed rows that will let me run permutations on them and get all possible combinations.
Basically this is the structure:

     PLOT  SEED  SUB  EDGE  PT  0  1     2     3
1    117   1     1    1     2   8  None  None  None
2    117   1     1    1     3   8  None  None  None
3    117   1     1    2     1   8  None  None  None
4    117   1     1    2     2   6  7     8     None
5    117   1     2    1     1   8  None  None  None
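
For reference, here is the same sample as a small reproducible DataFrame (the 0-3 column names are kept as strings here and the empty cells as plain None):

import pandas as pd

# Minimal reproducible version of the sample table above
df = pd.DataFrame(
    {
        "PLOT": [117, 117, 117, 117, 117],
        "SEED": [1, 1, 1, 1, 1],
        "SUB":  [1, 1, 1, 1, 2],
        "EDGE": [1, 1, 2, 2, 1],
        "PT":   [2, 3, 1, 2, 1],
        "0":    [8, 8, 8, 6, 8],
        "1":    [None, None, None, 7, None],
        "2":    [None, None, None, 8, None],
        "3":    [None, None, None, None, None],
    },
    index=[1, 2, 3, 4, 5],
)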

I'm not sure whether I should use lists or dictionaries, as this data could have thousands or even millions of rows in the future.

The idea is to add the ['SUB','EDGE','PT'] values to each value in the [0:] columns (it won't necessarily end at 3), so that for example row number 4 will look like:

     PLOT  SEED  SUB  TYPES
4    117   1     1    [1,2,2,6],[1,2,2,7],[1,2,2,8]

OR dictionaries:

     PLOT  SEED  SUB  TYPES
4    117   1     1    {'SUB':1, 'EDGE':2, 'PT':2, 'TYPE':6},{'SUB':1, 'EDGE':2, 'PT':2, 'TYPE':7},{'SUB':1, 'EDGE':2, 'PT':2, 'TYPE':8}
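
Just to make that expansion concrete, for row 4 the list form is simply (a plain-Python illustration, not how I expect to do it on the full frame):

sub, edge, pt = 1, 2, 2           # row 4's SUB, EDGE, PT
types = [6, 7, 8]                 # the non-None values in row 4's 0-3 columns
expanded = [[sub, edge, pt, t] for t in types]
# -> [[1, 2, 2, 6], [1, 2, 2, 7], [1, 2, 2, 8]]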

And then run permutations on each ['PLOT','SEED'] using the ['SUB'] column.
That means that if Plot 117, Seed 1 has 2 Subs, it will run all possible combinations between all the options in Sub 1 and all the options in Sub 2, while keeping the ['PLOT','SEED'] data. So in this case (building on the original 5-row table up top):

# 1st Sub:
[1,1,2,8],[1,1,3,8],[1,2,1,8],[1,2,2,6],[1,2,2,7],[1,2,2,8]
# 2nd Sub:
[2,1,1,8]

And what I should get is:

#Plot 117 Seed 1 Permutations:
[[1,1,2,8],[2,1,1,8]], [[1,1,3,8],[2,1,1,8]], [[1,2,1,8],[2,1,1,8]], ...
Or with Dictionaries:
[{'SUB':1, 'EDGE':2, 'PT':2, 'TYPE':6},{'SUB':2, 'EDGE':1, 'PT':1, 'TYPE':8}], ...
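
For what it's worth, on this toy case itertools.product over the two option lists gives exactly the shape I'm after:

from itertools import product

sub1 = [[1, 1, 2, 8], [1, 1, 3, 8], [1, 2, 1, 8], [1, 2, 2, 6], [1, 2, 2, 7], [1, 2, 2, 8]]
sub2 = [[2, 1, 1, 8]]

# One entry from each Sub -> 6 x 1 = 6 combinations for Plot 117, Seed 1
combos = [list(pair) for pair in product(sub1, sub2)]
# -> [[[1, 1, 2, 8], [2, 1, 1, 8]], [[1, 1, 3, 8], [2, 1, 1, 8]], ...]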

Obviously this needs to iterate over the entire dataset for each ['PLOT','SEED'], of which there will be many, and also in many cases there will be more than 2 'SUB' per PLOT+SEED.

I tried many things like 'groupby', for example:

import numpy as np

df_grouped_edge = df_joined.groupby(['PLOT','SEED','SUB','EDGE'], as_index=False).agg(np.array)
df_grouped_edge

For which I get:

     PLOT  SEED  SUB  EDGE  PT         0          1                2                3
0    117   1     1    1     [2, 3]     [8, 8]     [None, None]     [None, None]     [None, None]
1    117   1     1    2     [1, 2, 3]  [8, 6, 8]  [None, 7, None]  [None, 8, None]  [None, None, None]

This is not what I'm looking for, or at least I don't think it is, and I don't know how to work with that output. I've tried many other grouping approaches over the last two days but can't seem to get it to work, and I understand that looping over a large dataset per item/column is a very wrong way of working (I should vectorize it, I just don't understand how yet).

I hope I explained myself well; I know this is a big question, but I'm really at a loss.
I'm starting to learn SQL very soon, but I'm not there yet, and I have to get this done.

Thank you very much!
Lev

Answer 1

Score: 1


First, group your data by the columns ['PLOT', 'SEED']. This will allow you to iterate over each unique combination of ['PLOT', 'SEED'] in your dataset.

grouped_data = df.groupby(['PLOT', 'SEED'])

Define a function that takes a group and performs the necessary operations to generate the combinations. This function will then be applied to each group and the results concatenated.

import itertools

import pandas as pd

def generate_combinations(group):
    # One list of row labels per SUB value within this PLOT/SEED group
    options_per_sub = [sub_df.index.tolist() for _, sub_df in group.groupby('SUB')]

    # Cartesian product: pick one row from every SUB in the group
    frames = []
    for combo_id, combination in enumerate(itertools.product(*options_per_sub)):
        rows = group.loc[list(combination)].copy()
        rows['COMBO'] = combo_id  # labels which combination these rows belong to
        frames.append(rows)

    # DataFrame.append() was removed in pandas 2.0, so collect the rows and concat
    return pd.concat(frames, ignore_index=True)

Apply the generate_combinations function to each group in the grouped data and concatenate the results.

result = pd.concat([generate_combinations(group) for _, group in grouped_data], ignore_index=True)

Now, result should be a DataFrame with all the possible combinations for each unique ['PLOT', 'SEED'] pair, based on the values in the 'SUB' column.

Answer 2

Score: 0

I ended up solving it using .melt():

df = df.melt(id_vars=["PLOT", "SEED", "SUB", "EDGE", "PT"],
             var_name="DROP",
             value_name="TYPO",
             ignore_index=True)  # False here leaves duplicates of the original indexes
df = df.drop('DROP', axis=1)
# Drop scenarios with no matching typo
df = df.dropna()
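
On the five sample rows above this leaves seven rows, one per non-None value in the 0-3 columns; the original row 4, for example, becomes three rows whose TYPO values are 6, 7 and 8.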

After getting each typo value as its own row, I used .groupby(['PLOT', 'SEED']) to build dictionaries, then iterated over each entry in the dict and used groupby('SUB') to put everything in place.

Then, in each dict in that large list, I only used the key that holds the list of lists (with all the options) and used itertools.product to get all the combinations between said lists.
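
Roughly, that pipeline looked something like this (the variable names below are illustrative rather than my exact code; TYPO is the melted value column from above):

import itertools

# After the melt/dropna above, every typo value is its own row
options = {}  # (PLOT, SEED) -> one list of options per SUB
for (plot, seed), grp in df.groupby(['PLOT', 'SEED']):
    per_sub = []
    for sub, sub_grp in grp.groupby('SUB'):
        # One option = [SUB, EDGE, PT, TYPO] for every row of this SUB
        per_sub.append(sub_grp[['SUB', 'EDGE', 'PT', 'TYPO']].values.tolist())
    options[(plot, seed)] = per_sub

# One option from every SUB of a PLOT/SEED; materialising this list is what
# eventually ran into the MemoryError mentioned below
combos = {key: list(itertools.product(*lists)) for key, lists in options.items()}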

Obviously this required a lot of work on the data structures, adding back the corresponding data into each dict, probably not the cleanest work, but it did the job.

My problem now is that .product stopped working due to a MemoryError once the dataset became too big:
https://stackoverflow.com/questions/76559405/using-np-meshgrid-on-list-of-lists-with-different-lengths-in-a-large-dataset-in
