How can I rearrange a DataFrame by node index in the PyTorch Geometric manner?

Question

I'd like to rearrange my DataFrame (data) so that each node index is shown alongside its original name, in the PyTorch Geometric manner, so that I can extract node embeddings.

import pandas as pd


data = {'Source': ['Rainfall', 'SP2', 'SP2', 'Inflow', 'Rainfall', 'Inflow', 'Inflow', 'Inflow', 'SWT', 'SP1', 'SP1', 'SWD'],
        'Target': ['SP1', 'Evp', 'Outflow', 'SP2', 'SWD', 'SWD', 'SP2', 'SP1', 'SP1', 'SP2', 'Evp', 'Loss']}
df = pd.DataFrame(data)
nodes = pd.concat([df['Source'], df['Target']]).unique()
node_indices = {node: i for i, node in enumerate(nodes)}
df['Source'] = df['Source'].map(node_indices)
df['Target'] = df['Target'].map(node_indices)

This is my expected output:

[Image: expected output]

I'd appreciate any thoughts or suggestions.

Answer 1

Score: 0

I'm assuming the goal is to get an intermediate encoding of the node names and to build a visual representation of the relationships between source and target values (which all seem to come from the same pool, i.e. the nodes in this case).

import pandas as pd

data = {
    'Source': ['Rainfall', 'SP2', 'SP2', 'Inflow', 'Rainfall', 'Inflow', 'Inflow', 'Inflow', 'SWT', 'SP1', 'SP1', 'SWD'],
    'Target': ['SP1', 'Evp', 'Outflow', 'SP2', 'SWD', 'SWD', 'SP2', 'SP1', 'SP1', 'SP2', 'Evp', 'Loss']
}
df = pd.DataFrame(data)

# collect all unique values (node names) and give each one an integer code;
# a set has no guaranteed order, so the codes can differ between runs
nodes = {name: code for code, name in enumerate({*df.values.flat})}

# replace values with their codes
df_coded = df.replace(nodes)

# combine the original and encoded data in one DataFrame;
# multilevel headers keep the name columns and code columns apart
df_repr = pd.concat([df, df_coded], axis=1, keys=['Name','Code'])

# rearrange columns as (source|name, source|code, target|code, target|name)
df_repr = df_repr.iloc[:, [0,2,3,1]]

# display the transposed representation, hiding the original row index
print(df_repr.T.to_string(header=False))

With this we obtain the following output:

[Image: output of the code above]
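
Because a set has no guaranteed order, the codes produced above can change from run to run, which makes it harder to map indices back to names later. Below is a minimal sketch of a variant with a stable order, assuming the first-appearance order used in the question is acceptable. The names name_to_code, code_to_name and edge_index are illustrative and not part of the original answer; the 2 x num_edges layout is the shape PyTorch Geometric uses for edge_index, but no PyTorch call is made here.

```python
import pandas as pd

data = {
    'Source': ['Rainfall', 'SP2', 'SP2', 'Inflow', 'Rainfall', 'Inflow',
               'Inflow', 'Inflow', 'SWT', 'SP1', 'SP1', 'SWD'],
    'Target': ['SP1', 'Evp', 'Outflow', 'SP2', 'SWD', 'SWD',
               'SP2', 'SP1', 'SP1', 'SP2', 'Evp', 'Loss'],
}
df = pd.DataFrame(data)

# unique names in order of first appearance (all sources, then all targets),
# the same ordering the question builds with pd.concat(...).unique()
names = pd.concat([df['Source'], df['Target']]).unique()

# forward mapping (name -> code) and inverse mapping (code -> name)
name_to_code = {name: code for code, name in enumerate(names)}
code_to_name = dict(enumerate(names))

# encode both columns with a plain dictionary lookup per value
df_coded = df.apply(lambda col: col.map(name_to_code))

# integer array of shape (2, num_edges): row 0 = sources, row 1 = targets
# (assumption: this array would later be wrapped in a tensor for PyG)
edge_index = df_coded.to_numpy().T

print(code_to_name)
print(edge_index)
```

With code_to_name at hand, row i of a node embedding matrix can be labelled as code_to_name[i], provided the embeddings are ordered by node index.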

Update

As for the line df_repr = df_repr.iloc[:, [0,2,3,1]]: it reorders the columns by their positions. Here [0,2,3,1] means "move the second column (position 1) to the very end". We can do the same with .loc by passing the column names in the desired order. It's just somewhat longer in this case:

df_repr = df_repr.loc[:, [('Name','Source'), 
                          ('Code','Source'), 
                          ('Code','Target'), 
                          ('Name','Target')]]

But if readability is a priority, then of course it's better to use .loc instead of .iloc.

P.S. Because we are only manipulating columns here, the following also works:

reordered_columns = [
    ('Name','Source'), 
    ('Code','Source'), 
    ('Code','Target'), 
    ('Name','Target')
]
df_repr = df_repr[reordered_columns]
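
The three reorderings above (.iloc, .loc and plain indexing with a list of column tuples) should give the same result. A small sketch that checks this on a two-row toy frame; the codes dictionary and the variable names by_position, by_label and by_getitem are made up for illustration:

```python
import pandas as pd

# toy data with a hand-written name -> code mapping (illustrative only)
df = pd.DataFrame({'Source': ['Rainfall', 'SP2'], 'Target': ['SP1', 'Evp']})
codes = {'Rainfall': 0, 'SP2': 1, 'SP1': 2, 'Evp': 3}
df_coded = df.apply(lambda col: col.map(codes))

# columns become a MultiIndex:
# (Name, Source), (Name, Target), (Code, Source), (Code, Target)
df_repr = pd.concat([df, df_coded], axis=1, keys=['Name', 'Code'])

order = [
    ('Name', 'Source'),
    ('Code', 'Source'),
    ('Code', 'Target'),
    ('Name', 'Target'),
]

by_position = df_repr.iloc[:, [0, 2, 3, 1]]  # reorder by integer position
by_label = df_repr.loc[:, order]             # reorder by column label
by_getitem = df_repr[order]                  # shorthand for selecting columns

# all three produce the same column order
print(by_position.columns.tolist() == by_label.columns.tolist()
      == by_getitem.columns.tolist())
```

The positional form is the shortest, while the label-based forms keep working if the columns of df_repr are ever created in a different order.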

huangapple
  • Posted on July 31, 2023 at 18:54:57
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76802927.html