PyTorch Geometric - How to sample a huge graph to train a GNN with mini-batching
Question
I want to do node regression on a huge graph (around 1M nodes) using PyTorch Geometric, but I cannot create a Data object because the full graph does not fit in RAM, so I cannot use the DataLoader class for mini-batching and training.
Some examples (such as 4. Scaling Graph Neural Networks) introduce the ClusterData and ClusterLoader classes, but this does not help my case because they actually load the entire graph.
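For reference, this is roughly what the ClusterData pipeline from that example looks like (a sketch based on the tutorial; the num_parts and batch_size values are arbitrary); note that ClusterData already takes the full in-memory Data object, which is exactly what I cannot build:
from torch_geometric.loader import ClusterData, ClusterLoader

# Partition the graph into sub-graphs, then load batches of clusters.
cluster_data = ClusterData(data, num_parts=128)  # requires the full graph in memory
train_loader = ClusterLoader(cluster_data, batch_size=32, shuffle=True)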
I have already pre-computed node embeddings and edges into separate files, which I can read very quickly to load graph subsets and the embeddings of specific nodes. However, I do not know how I should sample the graph during training, or whether any existing PyTorch modules already do this.
My question is: Are there any modules from PyTorch Geometric that can create mini-batches to train my GCN without loading the entire graph in memory? If not, how should I do the graph sampling?
In the PyTorch Geometric docs there are many examples of node and graph regression, classification, etc., but none of them explain how to handle such large graphs, since they use datasets composed of many small graphs that all fit in RAM.
In another Google Colab notebook example (2. Node Classification with Graph Neural Networks), the entire graph from an existing dataset (Planetoid) is loaded in RAM.
dataset = Planetoid(root='data/Planetoid', name='Cora', transform=NormalizeFeatures())
...
data = dataset[0] # Get the first graph object.
Then later, a train function for one training epoch of the model is defined, which uses the full data.x and data.edge_index of the graph.
def train():
    model.train()
    optimizer.zero_grad()  # Clear gradients.
    out = model(data.x, data.edge_index)  # Perform a single forward pass.
    loss = criterion(out[data.train_mask], data.y[data.train_mask])  # Compute the loss solely based on the training nodes.
    loss.backward()  # Derive gradients.
    optimizer.step()  # Update parameters based on gradients.
    return loss
From this example, I guess that each mini-batch from my graph should be created by:
- Selecting a random subset of nodes from my graph.
- Reading all edges and node embeddings involving the selected nodes.
And then train on this mini-batch. However, what if none of my randomly selected nodes are adjacent, so that no message passing is done? Is there a proper way to sample from this graph that avoids this? For instance, could we pick one random node and then take some neighborhood of it as the subset?
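To make the idea concrete, here is a rough sketch of the neighborhood-based sampling I have in mind; read_neighbors and read_embedding are hypothetical stand-ins for the fast file-based lookups I already have:
import random
import torch

def sample_neighborhood(all_node_ids, num_hops=2):
    # Pick one random seed node and expand outwards for a few hops,
    # so every sampled node is connected to the seed.
    seed = random.choice(all_node_ids)
    nodes, frontier = {seed}, {seed}
    for _ in range(num_hops):
        next_frontier = set()
        for node in frontier:
            # Hypothetical helper: reads the neighbors of `node`
            # from my pre-computed edge files.
            next_frontier.update(read_neighbors(node))
        frontier = next_frontier - nodes
        nodes |= frontier
    nodes = sorted(nodes)
    local = {n: i for i, n in enumerate(nodes)}  # global id -> local id
    # Hypothetical helper: reads the pre-computed embedding of a node.
    x = torch.stack([read_embedding(n) for n in nodes])
    # Collect the edges among the sampled nodes, re-indexed locally.
    src, dst = [], []
    for u in nodes:
        for v in read_neighbors(u):
            if v in local:
                src.append(local[u])
                dst.append(local[v])
    edge_index = torch.tensor([src, dst], dtype=torch.long)
    return x, edge_index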
Answer 1
Score: 1
> Are there any modules from PyTorch Geometric that can create mini-batches to train my GCN without loading the entire graph in memory?

Yes, by using NeighborLoader.
If you have a graph stored as a Data object, you can feed it to a NeighborLoader object.
from torch_geometric.loader import NeighborLoader

loader = NeighborLoader(
    data,
    # Sample 30 neighbors for each node for 2 iterations
    num_neighbors=[30] * 2,
    # Use a batch size of 128 for sampling training nodes
    batch_size=128,
    input_nodes=data.train_mask,
)
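Each batch produced by the loader is itself a small Data object, so the usual training loop runs on it directly. As a minimal sketch, assuming the model, optimizer, and criterion from the question (in a NeighborLoader batch, the first batch_size nodes are the seed nodes the loss should be computed on):
def train():
    model.train()
    total_loss = 0
    for batch in loader:
        optimizer.zero_grad()
        out = model(batch.x, batch.edge_index)  # Forward pass on the sampled subgraph only.
        # Only the first `batch.batch_size` nodes are seed nodes; the rest
        # were sampled as neighbors and serve message passing only.
        loss = criterion(out[:batch.batch_size], batch.y[:batch.batch_size])
        loss.backward()
        optimizer.step()
        total_loss += float(loss)
    return total_loss / len(loader)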