Optimizing huge graph traversal with OpenMP

Question

I am trying to optimize this function, which according to the perf tool is the bottleneck preventing close-to-linear scaling. Performance gets worse as the number of threads goes up. When I drill down into the assembly generated by perf, it shows most of the time is spent checking for visited and unvisited vertices. I've done a ton of Google searches trying to improve the performance, to no avail. Is there a way to improve the performance of this function? Or is there a thread-safe way of implementing it? Thanks for your help in advance!

typedef uint32_t vidType;

template<typename T, typename U, typename V>
bool compare_and_swap(T &x, U old_val, V new_val) {
  return __sync_bool_compare_and_swap(&x, old_val, new_val);
}

template<bool map_vertices, bool map_edges>
VertexSet GraphT<map_vertices, map_edges>::N(vidType vid) const {
  assert(vid >= 0);
  assert(vid < n_vertices);
  eidType begin = vertices[vid], end = vertices[vid+1];
  if (begin > end || end > n_edges) {
    fprintf(stderr, "vertex %u bounds error: [%lu, %lu)\n", vid, begin, end);
    exit(1);
  }
  assert(end <= n_edges);
  return VertexSet(edges + begin, end - begin, vid);
}

void bfs_step(Graph &g, vidType *depth, SlidingQueue<vidType> &queue) {
  #pragma omp parallel
  {
    QueueBuffer<vidType> lqueue(queue);

    #pragma omp for
    for (auto q_iter = queue.begin(); q_iter < queue.end(); q_iter++) {
      auto src = *q_iter;
      for (auto dst : g.N(src)) {
        //int curr_val = parent[dst];
        auto curr_val = depth[dst];
        if (curr_val == MYINFINITY) { // not visited
          //if (compare_and_swap(parent[dst], curr_val, src)) { 
          if (compare_and_swap(depth[dst], curr_val, depth[src] + 1)) {
            lqueue.push_back(dst);
          }
        }
      }
    }
    lqueue.flush();
  }
}


Answer 1

Score: 2


First of all, you're using a very traditional formulation of graph algorithms. Good for textbooks, not for computation. If you write this as a generalized matrix-vector product with the adjacency matrix you lose all those fiddly queues and the parallelism becomes quite obvious.

In your formulation, the problem is the push_back on the queue, which is hard to parallelize. The solution is to give each thread its own queue and then use a reduction. This works if you define the plus operator on your queue object to merge the local queues.

huangapple
  • Published 2023-02-19 03:21:58
  • Original link: https://go.coder-hub.com/75495816.html