January 6, 2020, 18:04:29

Is it a good idea to run dynamically generated code on my GPU using OpenCL, or are there better ways?

Introduction

So, let me introduce the problem. Currently, I'm writing a program in C# that does a lot of computation (more precisely, it's a neural network library). So far I've used standard arrays to store matrices, but I think it would be better to create 2D and 3D matrix classes that encapsulate all the matrix operations I need, and then clean the loops out of my code.

As you may know, this is pretty easy to accomplish with basic operator overloading, but I ran into another problem: it would be slower than regular for loops over arrays, because in a big equation the intermediate objects produced by the overloaded operators can cause overhead. I googled it and found an article that turned out to be very useful for me. In short, the writer uses additional classes to first build an equation tree and then compile it into a C# method using MSIL (Microsoft Intermediate Language), so that the equation is solved in one pass.
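To make the equation-tree idea concrete, here is a minimal sketch, written in Python only for brevity (the same structure maps onto C# operator overloads). The names `Expr`, `Vec`, `BinOp`, and `evaluate` are hypothetical, and the sketch interprets the tree element by element rather than compiling it to MSIL as the article does. It only shows how overloaded operators can build a tree so that the whole equation is evaluated in one pass, without intermediate arrays.

```python
import operator
from dataclasses import dataclass
from typing import List


class Expr:
    """Base node: operators build a tree instead of computing a result."""

    def __add__(self, other: "Expr") -> "Expr":
        return BinOp("+", self, other)

    def __mul__(self, other: "Expr") -> "Expr":
        return BinOp("*", self, other)


@dataclass
class Vec(Expr):
    """Leaf node wrapping the actual data."""
    data: List[float]


@dataclass
class BinOp(Expr):
    """Inner node: an operator applied to two sub-expressions."""
    op: str
    left: Expr
    right: Expr


OPS = {"+": operator.add, "*": operator.mul}


def eval_at(node: Expr, i: int) -> float:
    """Evaluate the whole tree for element i; no temporary arrays are created."""
    if isinstance(node, Vec):
        return node.data[i]
    return OPS[node.op](eval_at(node.left, i), eval_at(node.right, i))


def length(node: Expr) -> int:
    """Length of the leftmost leaf (all leaves are assumed to be equally long)."""
    return len(node.data) if isinstance(node, Vec) else length(node.left)


def evaluate(node: Expr) -> List[float]:
    """Single loop over the output; each element is computed from the full tree."""
    return [eval_at(node, i) for i in range(length(node))]


if __name__ == "__main__":
    a, b, c = Vec([1.0, 2.0, 3.0]), Vec([4.0, 5.0, 6.0]), Vec([2.0, 2.0, 2.0])
    tree = a * b + c          # builds BinOp("+", BinOp("*", a, b), c); nothing is computed yet
    print(evaluate(tree))     # [6.0, 12.0, 20.0]
```

Nothing is computed when `a * b + c` is written; the work happens only in `evaluate`, which is the point where a real implementation could compile the tree instead of interpreting it.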

But then I thought of running the matrix calculations on my GPU, as that should be even faster. I came across the NuGet package Cloo, which uses OpenCL, and a wrapper for it (I'd like it to work on any video card, not only NVidia with its CUDA) to run C code on the GPU, but the C code it runs has to be written as a string.

Question

Finally, my question: is it a good idea to generate the C code string dynamically from the equation tree, so that my optimized equations are calculated on the GPU, or are there other ways to accomplish that?
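For illustration, generating the kernel string from an equation tree could look roughly like the sketch below. It is written in Python for brevity and covers only element-wise `+` and `*` on float buffers; the names `Buf`, `BinOp`, `emit`, and `kernel_source` are hypothetical, and the code only produces the OpenCL C source text. Compiling and running it (for example via the OpenCL API functions `clCreateProgramWithSource` and `clBuildProgram`, or whatever the chosen wrapper exposes) is not shown.

```python
from dataclasses import dataclass
from typing import List, Optional


class Expr:
    """Base node: operators build a tree instead of computing a result."""

    def __add__(self, other: "Expr") -> "Expr":
        return BinOp("+", self, other)

    def __mul__(self, other: "Expr") -> "Expr":
        return BinOp("*", self, other)


@dataclass(eq=False)
class Buf(Expr):
    """Leaf node: a named input buffer that becomes a kernel argument."""
    name: str


@dataclass(eq=False)
class BinOp(Expr):
    op: str
    left: Expr
    right: Expr


def emit(node: Expr) -> str:
    """Turn the tree into a C expression for the element at index i."""
    if isinstance(node, Buf):
        return f"{node.name}[i]"
    return f"({emit(node.left)} {node.op} {emit(node.right)})"


def inputs(node: Expr, seen: Optional[List[str]] = None) -> List[str]:
    """Collect distinct buffer names in first-use order."""
    seen = [] if seen is None else seen
    if isinstance(node, Buf):
        if node.name not in seen:
            seen.append(node.name)
    else:
        inputs(node.left, seen)
        inputs(node.right, seen)
    return seen


def kernel_source(node: Expr, name: str = "fused") -> str:
    """Build OpenCL C source for one kernel that evaluates the whole tree."""
    args = ", ".join(f"__global const float* {n}" for n in inputs(node))
    return (
        f"__kernel void {name}({args}, __global float* out) {{\n"
        "    int i = get_global_id(0);\n"
        f"    out[i] = {emit(node)};\n"
        "}\n"
    )


if __name__ == "__main__":
    print(kernel_source(Buf("a") * Buf("b") + Buf("c")))
```

The generated string is ordinary OpenCL C, so it can be logged and inspected before it is handed to the OpenCL compiler, which helps when debugging the generator.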

Answer 1

Score: 0

First of all, the right answer to your question is: implement both approaches and benchmark them, because we don't know which GPU and CPU are used, how large the matrices are, and so on.
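A benchmark of this kind can be quite small. The sketch below is in Python and uses two CPU implementations as stand-ins for the candidates being compared; the names `with_temporaries`, `fused`, and `benchmark` are hypothetical. For a real CPU-versus-GPU comparison, the timed function should also include the host-to-device and device-to-host transfers, since those can dominate for small matrices.

```python
import timeit
from typing import Callable, Dict, List, Tuple

Vector = List[float]
Candidate = Callable[[Vector, Vector, Vector], Vector]


def with_temporaries(a: Vector, b: Vector, c: Vector) -> Vector:
    """a * b + c with an intermediate list, as naive operator overloading would do."""
    tmp = [x * y for x, y in zip(a, b)]
    return [t + z for t, z in zip(tmp, c)]


def fused(a: Vector, b: Vector, c: Vector) -> Vector:
    """a * b + c in a single pass, without an intermediate list."""
    return [x * y + z for x, y, z in zip(a, b, c)]


def benchmark(
    candidates: Dict[str, Candidate],
    sizes: List[int],
    repeats: int = 5,
    number: int = 10,
) -> Dict[Tuple[str, int], float]:
    """Return the best observed seconds per call for each (candidate, size) pair."""
    results: Dict[Tuple[str, int], float] = {}
    for n in sizes:
        a = [float(i) for i in range(n)]
        b = [2.0] * n
        c = [1.0] * n
        expected = fused(a, b, c)
        for name, fn in candidates.items():
            # Check correctness first: a fast wrong answer is not worth timing.
            assert fn(a, b, c) == expected, f"{name} disagrees at n={n}"
            best = min(timeit.repeat(lambda: fn(a, b, c), number=number, repeat=repeats))
            results[(name, n)] = best / number
    return results


if __name__ == "__main__":
    timings = benchmark(
        {"temporaries": with_temporaries, "fused": fused},
        sizes=[1_000, 100_000],
    )
    for (name, n), seconds in sorted(timings.items(), key=lambda kv: (kv[0][1], kv[0][0])):
        print(f"n={n:>7}  {name:<12} {seconds * 1e6:10.1f} us per call")
```

Running it over several sizes matters because the ranking can change with matrix size: fixed overheads dominate for small inputs and throughput dominates for large ones.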

Theoretically, the GPU should be faster if you are able to use a SIMD/SIMT approach without branches (e.g. without ifs). So if you can write flat, branch-free code that operates on the internal arrays, then a GPU (even an embedded one) will work faster. However, the key word here is theoretically.

Practically:

.NET-based (and JVM-based) code is much simpler to support.
OpenCL code behaves differently on NVidia/AMD/Intel GPUs (sometimes you can store a lot of data in local memory and sometimes you can't; sometimes you can rely on fast GDDR6, and sometimes the video card, for example an embedded one, just shares the computer's RAM).
Some GPU memory profiling requires a video freeze (to mitigate fluctuations), and you will run into many other interesting issues during GPU development.

However, to help you:

Try TensorFlow matrix multiplication first. It has .NET bindings and can run mathematical operations on both the CPU and the GPU.
Compare the .NET code with native code, for example Rust, Kotlin/Native, or C++. You can probably just move all computations into the native part (all of these options are much simpler than writing and supporting OpenCL code).
From my perspective (no proof here, sorry), it is much simpler to write code in multiple languages than to generate code in language X from language Y.
