Optimizing large-scale graph traversal with OpenMP

Question

I am trying to optimize this function, which according to perf is the bottleneck that keeps me from achieving close-to-linear scaling. Performance gets worse as the number of threads goes up. When I drill down into the assembly annotated by perf, most of the time is spent checking whether vertices have been visited. I've done a ton of Google searches on improving the performance, to no avail. Is there a way to improve the performance of this function? Or is there a thread-safe way of implementing it? Thanks in advance for your help!

typedef uint32_t vidType;
template<typename T, typename U, typename V>
bool compare_and_swap(T &x, U old_val, V new_val) {
  return __sync_bool_compare_and_swap(&x, old_val, new_val);
}

template<bool map_vertices, bool map_edges>
VertexSet GraphT<map_vertices, map_edges>::N(vidType vid) const {
  assert(vid >= 0);
  assert(vid < n_vertices);
  eidType begin = vertices[vid], end = vertices[vid+1];
  if (begin > end || end > n_edges) {
    fprintf(stderr, "vertex %u bounds error: [%lu, %lu)\n", vid, begin, end);
    exit(1);
  }
  assert(end <= n_edges);
  return VertexSet(edges + begin, end - begin, vid);
}

void bfs_step(Graph &g, vidType *depth, SlidingQueue<vidType> &queue) {
  #pragma omp parallel
  {
    QueueBuffer<vidType> lqueue(queue);

    #pragma omp for
    for (auto q_iter = queue.begin(); q_iter < queue.end(); q_iter++) {
      auto src = *q_iter;
      for (auto dst : g.N(src)) {
        //int curr_val = parent[dst];
        auto curr_val = depth[dst];
        if (curr_val == MYINFINITY) { // not visited
          //if (compare_and_swap(parent[dst], curr_val, src)) {
          if (compare_and_swap(depth[dst], curr_val, depth[src] + 1)) {
            lqueue.push_back(dst);
          }
        }
      }
    }
    lqueue.flush();
  }
}

Answer 1

Score: 2

First of all, you're using a very traditional formulation of graph algorithms. Good for textbooks, not for computation. If you write this as a generalized matrix-vector product with the adjacency matrix you lose all those fiddly queues and the parallelism becomes quite obvious.
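
To make the matrix-vector view concrete, here is a minimal sketch; it is an illustration under stated assumptions, not the answerer's code. It assumes the transposed adjacency matrix is stored in CSR form (row v lists the in-neighbors of v) and treats one BFS level as a masked product y = Aᵀx over the Boolean (OR, AND) semiring. The names CsrMatrix, bfs_matvec and UNVISITED are hypothetical. Each iteration writes only depth[v] and y[v] for its own v, so no atomics or queues are needed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

using vidType = uint32_t;
constexpr vidType UNVISITED = std::numeric_limits<vidType>::max();

// Hypothetical CSR storage of the TRANSPOSED adjacency matrix:
// row v holds the in-neighbors of vertex v.
struct CsrMatrix {
  std::vector<uint64_t> row_ptr;  // size n + 1
  std::vector<vidType> col_idx;   // in-neighbors, row by row
};

// BFS as repeated masked Boolean matrix-vector products.
// x marks the current frontier, y the next one (one byte per vertex,
// so neighboring entries can be written by different threads safely).
std::vector<vidType> bfs_matvec(const CsrMatrix &At, vidType source) {
  const std::size_t n = At.row_ptr.size() - 1;
  std::vector<vidType> depth(n, UNVISITED);
  std::vector<uint8_t> x(n, 0), y(n, 0);
  depth[source] = 0;
  x[source] = 1;

  int changed = 1;
  for (vidType level = 0; changed; ++level) {
    changed = 0;
    #pragma omp parallel for schedule(dynamic, 1024) reduction(| : changed)
    for (std::size_t v = 0; v < n; ++v) {
      y[v] = 0;
      if (depth[v] != UNVISITED) continue;  // mask: skip visited rows
      for (uint64_t e = At.row_ptr[v]; e < At.row_ptr[v + 1]; ++e) {
        if (x[At.col_idx[e]]) {             // OR over (A^T[v][u] AND x[u])
          depth[v] = level + 1;
          y[v] = 1;
          changed = 1;
          break;                            // one frontier parent is enough
        }
      }
    }
    x.swap(y);
  }
  return depth;
}
```

The trade-off in this sketch is that every level scans all unvisited rows, so it tends to pay off when the frontier is large and may do more work than a queue-based step when the frontier is small.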

In your formulation, the problem is with the push_back function on the queue, which is hard to parallelize. The solution is to let each thread have its own queue and then use a reduction. This works if you define the plus operator on your queue object to effect a merge of the local queues.
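
The per-thread-queue idea could be sketched as follows, assuming an OpenMP user-defined reduction in place of an overloaded plus operator. This is a hedged illustration, not the asker's classes: a plain std::vector (aliased as Frontier) stands in for SlidingQueue/QueueBuffer, and CsrGraph, bfs_step_reduce and UNVISITED are hypothetical names. The GCC/Clang __atomic builtins are used in the same spirit as the __sync builtin in the question.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

using vidType = uint32_t;
using Frontier = std::vector<vidType>;
constexpr vidType UNVISITED = std::numeric_limits<vidType>::max();

// Hypothetical CSR graph: out-neighbors of v are
// targets[offsets[v]] .. targets[offsets[v + 1] - 1].
struct CsrGraph {
  std::vector<uint64_t> offsets;  // size n + 1
  std::vector<vidType> targets;
};

// User-defined reduction: merging two frontiers appends one to the other.
#pragma omp declare reduction(merge : Frontier : \
    omp_out.insert(omp_out.end(), omp_in.begin(), omp_in.end())) \
    initializer(omp_priv = Frontier())

// Expands one BFS level and returns the next frontier.
// Each thread pushes into its private copy of `next`; OpenMP merges the
// copies once at the end of the loop.
Frontier bfs_step_reduce(const CsrGraph &g, std::vector<vidType> &depth,
                         const Frontier &frontier, vidType level) {
  Frontier next;
  vidType *d = depth.data();
  #pragma omp parallel for schedule(dynamic, 64) reduction(merge : next)
  for (std::size_t i = 0; i < frontier.size(); ++i) {
    const vidType src = frontier[i];
    for (uint64_t e = g.offsets[src]; e < g.offsets[src + 1]; ++e) {
      const vidType dst = g.targets[e];
      // Cheap atomic read first; attempt the CAS only if still unvisited.
      if (__atomic_load_n(&d[dst], __ATOMIC_RELAXED) != UNVISITED) continue;
      vidType expected = UNVISITED;
      if (__atomic_compare_exchange_n(&d[dst], &expected, level + 1, false,
                                      __ATOMIC_RELAXED, __ATOMIC_RELAXED)) {
        next.push_back(dst);  // only the thread that won the CAS enqueues dst
      }
    }
  }
  return next;
}
```

The order of vertices in the returned frontier depends on how the runtime combines the private copies, so callers should not rely on it.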
