Is it possible to use CPU cache in Golang?
Question
Consider a task that is both memory- and CPU-intensive:
e.g. a Task Block: read 16 bytes from memory, do some CPU work on them, then write the result back to memory.
This Task Block is parallelizable, meaning each core can run one Task Block.
e.g. 8 CPUs need 8*16 bytes of cache, all in use concurrently.
Answer 1
Score: 3
Yes. Just like all other code running on your machine, Go code uses the CPU cache.
The question is much too broad for me to tell you how to code your app so it makes the most efficient use of cache. I highly recommend setting up Go benchmarks, then refactoring your code and comparing times. (Note: do not benchmark within a VM. VMs of any kind, on any platform, do not have accurate enough clocks for Go's benchmarking. Run all benchmarks natively on your OS instead.)
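As a rough sketch of what such a benchmark could look like for the Task Block in the question: the code below times a serial pass against a parallel one over the same buffer. `processBlock` is a hypothetical stand-in for the real CPU job, and the 1 MiB buffer size and one-chunk-per-worker split are assumptions made for illustration. It uses `testing.Benchmark` so it can run as a plain program; in a real project you would more likely put `BenchmarkXxx` functions in a `_test.go` file and run `go test -bench=.`.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
	"testing"
)

const blockSize = 16 // bytes per Task Block, as in the question

// processBlock is a hypothetical stand-in for the "CPU job": it reads a
// 16-byte block, transforms it, and writes it back in place.
func processBlock(block []byte) {
	for i := range block {
		block[i] = block[i]*31 + 7
	}
}

// processSerial handles every full block on the calling goroutine.
func processSerial(data []byte) {
	for off := 0; off+blockSize <= len(data); off += blockSize {
		processBlock(data[off : off+blockSize])
	}
}

// processParallel gives each worker one contiguous chunk of blocks, so
// workers only write to neighbouring bytes at chunk boundaries.
func processParallel(data []byte, workers int) {
	blocks := len(data) / blockSize
	perWorker := (blocks + workers - 1) / workers
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		start := w * perWorker * blockSize
		end := start + perWorker*blockSize
		if end > blocks*blockSize {
			end = blocks * blockSize
		}
		if start >= end {
			break // more workers than blocks
		}
		wg.Add(1)
		go func(chunk []byte) {
			defer wg.Done()
			processSerial(chunk)
		}(data[start:end])
	}
	wg.Wait()
}

func main() {
	data := make([]byte, 1<<20) // 1 MiB of blocks (arbitrary size)

	serial := testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			processSerial(data)
		}
	})
	parallel := testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			processParallel(data, runtime.NumCPU())
		}
	})

	fmt.Println("serial:  ", serial)
	fmt.Println("parallel:", parallel)
}
```

Handing each worker a contiguous chunk rather than interleaving 16-byte blocks between workers is a deliberate choice here: interleaved blocks from different cores would usually land on the same cache line.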
It all comes down to your ability to code the application to make efficient use of that CPU cache. This is a much broader topic: how you use your variables, how often they get updated, what stays on the stack versus what escapes to the heap and has to be garbage collected, how often that happens, etc.
One tiny example to point you in the right direction for reading more about cache-efficient development for L1 and L2...
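L1 cache works in lines, typically 64 bytes each on current x86-64 CPUs. If you store four 16-bit int16 values together, they will most likely all sit on the same cache line.
Say one core wants to update one of the int16s while other cores are using the rest. Caches are kept coherent a whole line at a time, not per variable: the write invalidates the entire line in every other core's cache, and those cores must fetch the line again even though the 3 values they care about did not change. This is known as false sharing.
Very inefficient.
One solution to that problem is to space the values out, for example by padding each one to a full cache line, so that a write invalidates only the line holding the updated value and the other 3 stay in cache for quick reads. Simply widening them to int64 is not enough when lines are 64 bytes, since eight int64s still fit in one line. Are you doing more pushes or pops? etc.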
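To make the cache-line point concrete, here is a minimal sketch that compares counters packed next to each other with counters padded apart. The 64-byte `cacheLineSize` is an assumption (typical on x86-64, different on some other CPUs), and `paddedCounter`, `countPacked` and `countPadded` are hypothetical names invented for this example. Both versions compute the same totals; only the memory layout differs, so any speed difference has to be measured with a benchmark on your own hardware.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// cacheLineSize is an assumption: 64 bytes is typical on x86-64.
const cacheLineSize = 64

// paddedCounter fills one assumed cache line, so consecutive elements of
// a []paddedCounter keep their n fields on different lines.
type paddedCounter struct {
	n uint64
	_ [cacheLineSize - 8]byte
}

// countPacked lets each goroutine increment its own element of a
// []uint64. Neighbouring elements share a cache line, so the cores keep
// invalidating each other's copy of that line.
func countPacked(workers, iters int) uint64 {
	counters := make([]uint64, workers)
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func(w int) {
			defer wg.Done()
			for i := 0; i < iters; i++ {
				atomic.AddUint64(&counters[w], 1)
			}
		}(w)
	}
	wg.Wait()
	var total uint64
	for _, c := range counters {
		total += c
	}
	return total
}

// countPadded does the same work, but each counter sits in its own
// padded slot, 64 bytes away from its neighbours.
func countPadded(workers, iters int) uint64 {
	counters := make([]paddedCounter, workers)
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func(w int) {
			defer wg.Done()
			for i := 0; i < iters; i++ {
				atomic.AddUint64(&counters[w].n, 1)
			}
		}(w)
	}
	wg.Wait()
	var total uint64
	for i := range counters {
		total += counters[i].n
	}
	return total
}

func main() {
	fmt.Println("packed total:", countPacked(8, 1000000))
	fmt.Println("padded total:", countPadded(8, 1000000))
}
```

The padding trades memory for isolation: 8 counters take 512 bytes instead of 64. Whether that is worth it depends on how often the values are written from different cores.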
Again, it highly depends on your use case: this may even slow things down if access to those 4 ints involves a lot of context switching (e.g. mutex locks), in which case that's a whole different problem to optimize.
I recommend reading up on high frequency scaling and on memory allocation on the stack and the heap.