PyTorch Geometric - How to sample huge graph to train GNN with mini-batching

Question

I want to do node regression on a huge graph (around 1M nodes) using PyTorch Geometric, but I cannot create a Data object because the full graph does not fit in RAM, so I cannot use the DataLoader class for mini-batching and training.

Some examples (such as 4. Scaling Graph Neural Networks) introduce the ClusterData and ClusterLoader classes, but this does not help my case, because they still load the entire graph into memory.

I have already pre-computed node embeddings and edges and stored them in separate files, which I can read very quickly to load subsets of the graph and the embeddings of specific nodes. However, I do not know how I should sample the graph during training, or whether any existing PyTorch modules already do this.

My question is: are there any modules in PyTorch Geometric that can create mini-batches to train my GCN without loading the entire graph into memory? If not, how should I do the graph sampling?

In the PyTorch Geometric docs there are many examples of node and graph regression, classification, etc., but none of them explain how to handle such large graphs, as they use datasets composed of many small graphs which all fit in RAM.

In another Google Colab notebook example (2. Node Classification with Graph Neural Networks), the entire graph from an existing dataset (Planetoid) is loaded in RAM.

dataset = Planetoid(root='data/Planetoid', name='Cora', transform=NormalizeFeatures())
...
data = dataset[0]  # Get the first graph object.

Later, a train function is defined for one training epoch of the model, which uses the full data.x and data.edge_index of the graph.

def train():
    model.train()
    optimizer.zero_grad()  # Clear gradients.
    out = model(data.x, data.edge_index)  # Perform a single forward pass.
    loss = criterion(out[data.train_mask], data.y[data.train_mask])  # Compute the loss solely based on the training nodes.
    loss.backward()  # Derive gradients.
    optimizer.step()  # Update parameters based on gradients.
    return loss

From this example, I guess that each mini-batch from my graph should be created by:

  1. Selecting a random subset of nodes from my graph.
  2. Reading all edges and node embeddings involving the selected nodes.

And then, perform one training step on this mini-batch. However, what if none of my randomly selected nodes are adjacent, so that no message passing happens? Is there a proper way to sample from the graph that avoids this? For instance, could we pick one random node and then take some neighborhood of it as the subset, as sketched below?
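To make that last idea concrete, here is a minimal sketch of seed-node neighborhood sampling under the stated setup. It picks a random seed, expands a k-hop neighborhood with a capped fan-out (so every sampled node is reachable from the seed and message passing always has edges to work with), and relabels the nodes to a local index space as PyG expects. load_neighbors and load_embeddings are hypothetical helpers standing in for reads of the pre-computed edge and embedding files; adapt them to your storage format.

import random
import torch

def sample_subgraph(all_node_ids, num_hops=2, fanout=30):
    """Pick a random seed node and expand a bounded neighborhood around it."""
    seed = random.choice(all_node_ids)
    nodes = {seed}
    frontier = {seed}
    edges = []
    for _ in range(num_hops):
        next_frontier = set()
        for u in frontier:
            neighbors = load_neighbors(u)  # hypothetical: neighbor ids of u from the edge file
            # Cap the fan-out so the subgraph stays small but connected to the seed
            for v in random.sample(neighbors, min(fanout, len(neighbors))):
                edges.append((u, v))
                if v not in nodes:
                    nodes.add(v)
                    next_frontier.add(v)
        frontier = next_frontier
    # Relabel global ids to a local 0..N-1 index space, as PyG expects
    ordered = sorted(nodes)
    local = {n: i for i, n in enumerate(ordered)}
    edge_index = torch.tensor(
        [[local[u] for u, v in edges], [local[v] for u, v in edges]],
        dtype=torch.long,
    )
    x = load_embeddings(ordered)  # hypothetical: tensor of shape [len(ordered), num_features]
    return x, edge_index, local[seed]

Because every batch is grown outward from a seed, the "no adjacent nodes" problem cannot occur by construction. This is essentially what the NeighborLoader described in the answer below does for you.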


Answer 1

Score: 1


> Are there any modules from PyTorch Geometric that can create mini batches to train my GCN without loading the entire graph in memory?

Yes, by using a NeighborLoader.

If you have a graph stored in a Data object, you can feed it to a NeighborLoader.

from torch_geometric.loader import NeighborLoader

loader = NeighborLoader(
    data,
    # Sample 30 neighbors for each node for 2 iterations
    num_neighbors=[30] * 2,
    # Use a batch size of 128 for sampling training nodes
    batch_size=128,
    input_nodes=data.train_mask,
)
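The loader then yields subgraph Data objects that fit in memory. As a usage sketch (assuming model, optimizer, and criterion are defined as in the Colab example): within each mini-batch, only the first batch.batch_size nodes are the sampled seed nodes; the remaining nodes are neighbors pulled in to support message passing, so the loss is computed on the seeds only.

def train():
    model.train()
    total_loss = 0.0
    for batch in loader:
        optimizer.zero_grad()
        out = model(batch.x, batch.edge_index)
        # The first `batch.batch_size` rows of `out` correspond to the seed nodes;
        # the rest are sampled neighbors included only for message passing.
        loss = criterion(out[:batch.batch_size], batch.y[:batch.batch_size])
        loss.backward()
        optimizer.step()
        total_loss += float(loss)
    return total_loss / len(loader)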
