How do I parallelize this matrix by vector multiplication?
I need to parallelize this code. As you can see, the q array accumulates values on every loop iteration.
/// <summary>
/// q = A * d, with A stored in CSR format.
/// </summary>
public static double[] Multiplication(int order, double[] d, List<double> A, List<int> J, int[] I)
{
    double[] q = new double[order];
    for (int i = 0; i < order; i++)
    {
        int indexFirst = I[i];
        q[i] += A[indexFirst] * d[J[indexFirst]];
        for (int j = indexFirst + 1; j < I[i + 1]; j++)
        {
            int col = J[j];
            double a = A[j];
            q[i] += a * d[col];
            q[col] += a * d[i];
        }
    }
    return q;
}
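To make the CSR convention concrete (each row stores its diagonal entry first, followed by its upper-triangle entries; symmetry supplies the lower triangle via the `q[col] += a * d[i]` update), here is a minimal worked example. The 2x2 matrix and vector values are made up purely for illustration:

```csharp
using System;
using System.Collections.Generic;

class CsrDemo
{
    static void Main()
    {
        // Symmetric matrix  [ 2 1 ]   stored diagonal-first in CSR;
        //                   [ 1 3 ]   only the upper triangle is kept.
        var A = new List<double> { 2, 1, 3 };  // nonzero values
        var J = new List<int>    { 0, 1, 1 };  // column index of each value
        int[] I = { 0, 2, 3 };                 // start of each row in A/J
        double[] d = { 1, 1 };
        int order = 2;

        double[] q = new double[order];
        for (int i = 0; i < order; i++)
        {
            int indexFirst = I[i];
            q[i] += A[indexFirst] * d[J[indexFirst]];  // diagonal entry
            for (int j = indexFirst + 1; j < I[i + 1]; j++)
            {
                int col = J[j];
                double a = A[j];
                q[i] += a * d[col];   // upper-triangle contribution
                q[col] += a * d[i];   // mirrored lower-triangle contribution
            }
        }

        Console.WriteLine(string.Join(", ", q));  // prints "3, 4"
    }
}
```

Checking by hand: [[2,1],[1,3]] * [1,1] = [3,4], which matches the output.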
I was able to parallelize the code in this way (splitting the problem into a number of sections equal to the CPU count and summing the resulting q arrays at the end), but it seems overcomplicated. Any better idea?
Parallel.For(0, cpuCount, delegate(int i)
{
    MultiplySection(startIndices[i], endIndices[i], d, A, J, I, qSection[i]);
});

double[] q = new double[order];
for (int j = 0; j < cpuCount; j++)
{
    // Each section also writes q[col] for columns beyond its own range,
    // so the merge has to run from startIndices[j] to the end of the array.
    for (int i = startIndices[j]; i < order; i++)
    {
        q[i] += qSection[j][i];
    }
}

private static void MultiplySection(int startIndex, int endIndex, double[] d, List<double> A, List<int> J, int[] I, double[] q)
{
    for (int i = startIndex; i < endIndex; i++)
    {
        int indexFirst = I[i];
        q[i] += A[indexFirst] * d[J[indexFirst]];
        for (int j = indexFirst + 1; j < I[i + 1]; j++)
        {
            int col = J[j];
            double a = A[j];
            q[i] += a * d[col];
            q[col] += a * d[i];
        }
    }
}
Answer 1
Score: 1
Your current implementation is coarse-grained, so if a CPU core is busy with something else you risk poor utilization toward the end of the process.
I would at least try the built-in partitioning of Parallel.For. This should give much finer-grained concurrency, which may help with efficiency. You can also use the localInit/localFinally overload to create and merge your per-thread arrays:
public static double[] Multiplication(int order, double[] d, List<double> A, List<int> J, int[] I)
{
    double[] qTotal = new double[order];
    Parallel.For(0,
        order,
        () => new double[order],  // localInit
        (i, loopState, q) =>      // main body
        {
            int indexFirst = I[i];
            q[i] += A[indexFirst] * d[J[indexFirst]];
            for (int j = indexFirst + 1; j < I[i + 1]; j++)
            {
                int col = J[j];
                double a = A[j];
                q[i] += a * d[col];
                q[col] += a * d[i];
            }
            return q;
        },
        q =>                      // localFinally
        {
            lock (qTotal)
            {
                for (int i = 0; i < q.Length; i++)
                {
                    qTotal[i] += q[i];
                }
            }
        });
    return qTotal;
}
If the amount of work inside the loop body is small you might still benefit from batching, but you might want to use a fixed batch size instead of basing it on the number of processors. This should help a bit with various kinds of overhead. You will likely need to experiment a bit to find your optimal batch size.
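One way to sketch that fixed-size batching: Partitioner.Create with a range size hands out contiguous index ranges, and the localInit/localFinally overload of Parallel.ForEach merges the per-thread arrays, exactly as above. The default batchSize of 1024 is only a placeholder to tune:

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

static class BatchedMultiply
{
    public static double[] Multiplication(int order, double[] d,
        List<double> A, List<int> J, int[] I, int batchSize = 1024)
    {
        double[] qTotal = new double[order];
        // Fixed-size contiguous ranges instead of one range per CPU.
        var ranges = Partitioner.Create(0, order, batchSize);
        Parallel.ForEach(ranges,
            () => new double[order],       // per-thread accumulator
            (range, _, q) =>
            {
                for (int i = range.Item1; i < range.Item2; i++)
                {
                    int indexFirst = I[i];
                    q[i] += A[indexFirst] * d[J[indexFirst]];
                    for (int j = indexFirst + 1; j < I[i + 1]; j++)
                    {
                        int col = J[j];
                        double a = A[j];
                        q[i] += a * d[col];
                        q[col] += a * d[i];
                    }
                }
                return q;
            },
            q =>                           // merge once per thread
            {
                lock (qTotal)
                {
                    for (int i = 0; i < order; i++) qTotal[i] += q[i];
                }
            });
        return qTotal;
    }
}
```

Because the q[col] writes go to the thread-local array, no synchronization is needed inside the hot loop; only the final merge takes the lock.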
Answer 2
Score: 1
The following code, from an answer to Parallel vs Serial results aren't understandable, results in 9x faster performance on a 6-core hyper-threaded machine. That's 1.5x the number of cores. It could perhaps be improved by another 2x-4x with SIMD operations, depending on the CPU's SSE/AVX register size.
static double[,] MatrixProductPVect(double[,] matrixA, double[,] matrixB)
{
    int aRows = matrixA.GetLength(0); int aCols = matrixA.GetLength(1);
    int bRows = matrixB.GetLength(0); int bCols = matrixB.GetLength(1);
    if (aCols != bRows)
        throw new Exception("xxx");
    double[,] result = new double[aRows, bCols];
    Parallel.For(0, aRows, i =>
    {
        // Local copy of each row of A
        double[] row = new double[aCols];
        System.Buffer.BlockCopy(matrixA, i * aCols * sizeof(double), row, 0, aCols * sizeof(double));
        for (int j = 0; j < bCols; ++j)      // each col of B
        {
            double temp = 0;
            for (int k = 0; k < aCols; ++k)  // could use k < bRows
                temp += row[k] * matrixB[k, j];
            result[i, j] = temp;
        }
    });
    return result;
}
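The 2x-4x SIMD headroom mentioned above could be explored with System.Numerics.Vector&lt;double&gt;. This is only a sketch of the inner dot product, not the answer's code: it assumes the right-hand matrix has been transposed first so that both operands are contiguous rows (the column-wise `matrixB[k, j]` access above would defeat vector loads), and `Dot` is a helper name introduced here:

```csharp
using System.Numerics;

static class SimdDot
{
    // Dot product of two equal-length arrays using Vector<double> lanes.
    public static double Dot(double[] x, double[] y)
    {
        int lanes = Vector<double>.Count;  // 2 with SSE2, 4 with AVX
        var acc = Vector<double>.Zero;
        int k = 0;
        for (; k <= x.Length - lanes; k += lanes)
            acc += new Vector<double>(x, k) * new Vector<double>(y, k);
        double sum = Vector.Dot(acc, Vector<double>.One);  // horizontal sum of lanes
        for (; k < x.Length; k++)          // scalar tail
            sum += x[k] * y[k];
        return sum;
    }
}
```

The JIT maps Vector&lt;double&gt; operations to the widest SIMD registers the CPU supports, so the same code runs (and falls back to scalar) everywhere.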
The main performance killers in parallel processing are synchronization and non-local memory access.
- It's faster to access values that are already in the CPU's cache than to go to RAM.
- It's also faster to have each core work on its own partition of the data than to try to synchronize access to the same data.
- With cached data, it's faster to access values along a row than to jump N items to get to the next column on the next row.
- Finally, when calculating totals it's faster to accumulate a local total, stored in a CPU register, than to try to increment a shared variable.
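That last point can be sketched with the localInit/localFinally overload of Parallel.For, which keeps the hot loop free of shared writes; the data array here is made up for illustration:

```csharp
using System;
using System.Threading.Tasks;

class LocalTotalDemo
{
    static void Main()
    {
        double[] data = new double[1_000_000];
        for (int i = 0; i < data.Length; i++) data[i] = 1.0;

        double total = 0;
        object gate = new object();
        Parallel.For(0, data.Length,
            () => 0.0,                          // per-thread local total
            (i, _, local) => local + data[i],   // no shared writes in the hot loop
            local => { lock (gate) total += local; });  // one merge per thread

        Console.WriteLine(total);  // prints 1000000
    }
}
```

Incrementing `total` directly inside the loop body would either race (unsynchronized) or serialize the threads (under a lock); the per-thread local avoids both.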
This code takes advantage of these rules in the following way:
- Parallel.For partitions the rows among all the cores, with each core accessing its own consecutive rows.
- Processing works along rows, which means the data are close together and usually already in cache.
- System.Buffer.BlockCopy makes a local copy of each row, so the cores don't have to synchronize access to RAM.
- The temp total is calculated locally, inside the column loop.
The multiplication code was borrowed from this article.
The answer uses BenchmarkDotNet to compare multiple multiplication versions: sequential, naive parallel, and parallel with local row copies.
BenchmarkDotNet=v0.13.1, OS=Windows 10.0.22621
Intel Core i7-10850H CPU 2.70GHz, 1 CPU, 12 logical and 6 physical cores
.NET SDK=6.0.302
[Host] : .NET 6.0.7 (6.0.722.32202), X64 RyuJIT
DefaultJob : .NET 6.0.7 (6.0.722.32202), X64 RyuJIT
Method | Mean | Error | StdDev | Ratio |
----------------- |----------:|---------:|---------:|------:|
SerialMult | 288.63 ms | 5.736 ms | 8.585 ms | 1.00 |
ParallelMult | 75.92 ms | 1.311 ms | 1.162 ms | 0.26 |
ParallelMultTemp | 49.63 ms | 0.873 ms | 0.817 ms | 0.17 |
ParallelMultVect | 33.04 ms | 0.588 ms | 0.521 ms | 0.11 |