PyTorch Geometric - How to sample a huge graph to train a GNN with mini-batching
Question
I want to do node regression on a huge graph (around 1M nodes) using PyTorch Geometric, but I cannot create a Data object because the full graph does not fit in RAM, so I cannot use the DataLoader class for mini-batching and training.
Some examples (such as 4. Scaling Graph Neural Networks) introduce the ClusterData and ClusterLoader classes, but this does not help my case because they actually load the entire graph.
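For reference, this is roughly what the ClusterData pipeline from that example looks like (a sketch based on the tutorial; the num_parts and batch_size values are arbitrary); note that ClusterData already takes the full in-memory Data object, which is exactly what I cannot build:
from torch_geometric.loader import ClusterData, ClusterLoader

# Partition the graph into sub-graphs, then load batches of clusters.
cluster_data = ClusterData(data, num_parts=128)  # requires the full graph in memory
train_loader = ClusterLoader(cluster_data, batch_size=32, shuffle=True)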
I have already pre-computed node embeddings and edges into separate files, which I can read very quickly to load graph subsets and the embeddings of specific nodes. However, I do not know how I should sample the graph during training, or whether any existing PyTorch modules already do this.
My question is: Are there any modules from PyTorch Geometric that can create mini-batches to train my GCN without loading the entire graph in memory? If not, how should I do the graph sampling?
In the PyTorch Geometric docs there are many examples of node and graph regression, classification, etc., but none of them explain how to handle such large graphs, since they use datasets composed of many small graphs that all fit in RAM.
In another Google Colab notebook example (2. Node Classification with Graph Neural Networks), the entire graph from an existing dataset (Planetoid) is loaded in RAM.
dataset = Planetoid(root='data/Planetoid', name='Cora', transform=NormalizeFeatures())
...
data = dataset[0] # Get the first graph object.
Then later, a train function for one training epoch of the model is defined, which uses the full data.x and data.edge_index of the graph.
def train():
    model.train()
    optimizer.zero_grad()  # Clear gradients.
    out = model(data.x, data.edge_index)  # Perform a single forward pass.
    loss = criterion(out[data.train_mask], data.y[data.train_mask])  # Compute the loss solely based on the training nodes.
    loss.backward()  # Derive gradients.
    optimizer.step()  # Update parameters based on gradients.
    return loss
From this example, I guess that each mini-batch from my graph should be created by:
- Selecting a random subset of nodes from my graph.
- Reading all edges and node embeddings involving the selected nodes.
And then train on this mini-batch. However, what if none of my randomly selected nodes are adjacent, so that no message passing is done? Is there a proper way to sample from this graph that avoids this? For instance, could we pick one random node and then take some neighborhood of it as the subset?
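To make the idea concrete, here is a rough sketch of the neighborhood-based sampling I have in mind; read_neighbors and read_embedding are hypothetical stand-ins for the fast file-based lookups I already have:
import random
import torch

def sample_neighborhood(all_node_ids, num_hops=2):
    # Pick one random seed node and expand outwards for a few hops,
    # so every sampled node is connected to the seed.
    seed = random.choice(all_node_ids)
    nodes, frontier = {seed}, {seed}
    for _ in range(num_hops):
        next_frontier = set()
        for node in frontier:
            # Hypothetical helper: reads the neighbors of `node`
            # from my pre-computed edge files.
            next_frontier.update(read_neighbors(node))
        frontier = next_frontier - nodes
        nodes |= frontier
    nodes = sorted(nodes)
    local = {n: i for i, n in enumerate(nodes)}  # global id -> local id
    # Hypothetical helper: reads the pre-computed embedding of a node.
    x = torch.stack([read_embedding(n) for n in nodes])
    # Collect the edges among the sampled nodes, re-indexed locally.
    src, dst = [], []
    for u in nodes:
        for v in read_neighbors(u):
            if v in local:
                src.append(local[u])
                dst.append(local[v])
    edge_index = torch.tensor([src, dst], dtype=torch.long)
    return x, edge_index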
Answer 1
Score: 1
> Are there any modules from PyTorch Geometric that can create mini-batches to train my GCN without loading the entire graph in memory?

Yes, by using NeighborLoader.
If you have a graph stored as a Data object, you can feed it to a NeighborLoader object.
from torch_geometric.loader import NeighborLoader

loader = NeighborLoader(
    data,
    # Sample 30 neighbors for each node for 2 iterations
    num_neighbors=[30] * 2,
    # Use a batch size of 128 for sampling training nodes
    batch_size=128,
    input_nodes=data.train_mask,
)
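Each batch produced by the loader is itself a small Data object, so the usual training loop runs on it directly. As a minimal sketch, assuming the model, optimizer, and criterion from the question (in a NeighborLoader batch, the first batch_size nodes are the seed nodes the loss should be computed on):
def train():
    model.train()
    total_loss = 0
    for batch in loader:
        optimizer.zero_grad()
        out = model(batch.x, batch.edge_index)  # Forward pass on the sampled subgraph only.
        # Only the first `batch.batch_size` nodes are seed nodes; the rest
        # were sampled as neighbors and serve message passing only.
        loss = criterion(out[:batch.batch_size], batch.y[:batch.batch_size])
        loss.backward()
        optimizer.step()
        total_loss += float(loss)
    return total_loss / len(loader)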