How to manage devices that cannot access d-cache in ARM

Question


I'm using an SPI device with DMA enabled on an STM32H7 SoC. The DMA peripheral cannot access d-cache, so to make it work I have disabled d-cache entirely (for more info about this, see this explanation). However, I would like to avoid disabling d-cache globally for a problem that only affects a small region of memory.

I have read this post about the meaning of clean and invalidate cache operations, in the ARM domain. My understanding is that, by cleaning a cache area, you force it to be written in the actual memory. On the other hand, by invalidating a cache area, you force the actual memory to be cached. Is this correct?
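
To pin down those semantics, here is a toy software model (my own illustration, not how real hardware is built): one cache "line" shadowing one memory word, with `memory` standing in for what a DMA engine would see behind the cache.

```c
#include <assert.h>

/* Toy model of one cache line shadowing one memory location,
 * to illustrate what clean and invalidate do. Not real hardware. */
typedef struct {
    int data;   /* cached copy                          */
    int valid;  /* line holds a copy of memory          */
    int dirty;  /* cached copy is newer than memory     */
} line_t;

static int    memory;  /* what the DMA engine sees */
static line_t line;    /* what the CPU sees on a hit */

/* CPU write goes into the cache and marks the line dirty (write-back). */
static void cpu_write(int v) { line.data = v; line.valid = 1; line.dirty = 1; }

/* CPU read hits the cache if valid, otherwise refills from memory. */
static int cpu_read(void) {
    if (!line.valid) {
        line.data  = memory;
        line.valid = 1;
        line.dirty = 0;
    }
    return line.data;
}

/* Clean: push a valid, dirty line out to memory. */
static void clean(void) {
    if (line.valid && line.dirty) { memory = line.data; line.dirty = 0; }
}

/* Invalidate: drop the cached copy; the next read refills from memory. */
static void invalidate(void) { line.valid = 0; }
```

With this model, a CPU write is invisible to "DMA" until `clean()`, and a "DMA" write to `memory` is invisible to the CPU until `invalidate()`, which is exactly the coherency problem in question.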

My intention with this is to follow these steps to transmit something over SPI (with DMA):

  1. Write the value you want on the buffer that DMA will read from.
  2. Clean d-cache for that area to force it to go to actual memory, so DMA can see it.
  3. Launch the operation: DMA will read the value from the area above and write it to the SPI's Tx buffer.
  4. SPI reads data at the same time it writes, so there will be data in the SPI's Rx buffer, which DMA will read and then write to the receive buffer provided by the user. An observer of that buffer may well go through d-cache, which might not yet hold the new value received over SPI, so invalidate the receive-buffer area to force d-cache to be refreshed from memory.

Does the above make sense?
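
The steps above can be sketched with the CMSIS cache-maintenance calls. The code below stubs out the CMSIS/HAL functions so the call order can be checked on a host; on target you would include core_cm7.h and the ST HAL instead, and the real HAL_SPI_TransmitReceive_DMA also takes an SPI_HandleTypeDef* as its first argument (omitted here). wait_for_transfer_complete is a hypothetical helper, and buffer names and sizes are illustrative.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Host-side stubs standing in for CMSIS/HAL. On target these come from
 * core_cm7.h and the ST HAL; here they just record the call order. */
static char order[4];
static int  n;
static void SCB_CleanDCache_by_Addr(uint32_t *a, int32_t s)      { (void)a; (void)s; order[n++] = 'C'; }
static void SCB_InvalidateDCache_by_Addr(uint32_t *a, int32_t s) { (void)a; (void)s; order[n++] = 'I'; }
static void HAL_SPI_TransmitReceive_DMA(uint8_t *tx, uint8_t *rx, uint16_t len)
                                                                 { (void)tx; (void)rx; (void)len; order[n++] = 'D'; }
static void wait_for_transfer_complete(void)                     { order[n++] = 'W'; } /* hypothetical helper */

/* Cortex-M7 d-cache lines are 32 bytes: buffers passed to the maintenance
 * functions should be 32-byte aligned and a whole number of lines long,
 * or the invalidate can clobber neighbouring data sharing a line. */
#define BUF_LEN 64
static uint8_t tx_buf[BUF_LEN] __attribute__((aligned(32)));
static uint8_t rx_buf[BUF_LEN] __attribute__((aligned(32)));

void spi_xfer(void)
{
    memset(tx_buf, 0xA5, BUF_LEN);                             /* 1. fill tx buffer        */
    SCB_CleanDCache_by_Addr((uint32_t *)tx_buf, BUF_LEN);      /* 2. clean: cache -> RAM   */
    HAL_SPI_TransmitReceive_DMA(tx_buf, rx_buf, BUF_LEN);      /* 3. start the DMA transfer */
    wait_for_transfer_complete();
    SCB_InvalidateDCache_by_Addr((uint32_t *)rx_buf, BUF_LEN); /* 4. invalidate rx buffer  */
}
```

The invalidate must happen after the DMA has written the receive buffer (or at least before the CPU reads it); invalidating before starting the transfer is not enough if the CPU touches anything on the same cache lines in the meantime.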

EDIT

Adding some more sources/examples of the problem I'm facing:

Example from the ST github: https://github.com/STMicroelectronics/STM32CubeH7/issues/153

Post in the ST forums answering and explaining the d-cache problem: https://community.st.com/s/question/0D53W00000m2fjHSAQ/confused-about-dma-and-cache-on-stm32-h7-devices

Here is the interconnect between the memories and the DMA controllers:

[Figure: STM32H7 bus-matrix interconnect diagram]

As you can see, DMA1 can access SRAM1, 2 and 3. I'm using SRAM2.

Here are the cache attributes of SRAM2:

[Figure: SRAM2 memory-attribute table]

As you can see, it is write-back, write-allocate, but not write-through. I'm not familiar with these attributes, so I read the definitions from here. However, that article seems to talk about the CPU's physical caches (L1, L2, etc.). I'm not sure whether the ARM i-cache and d-cache refer to these physical caches. In any case, I'm assuming the definitions of write-through and the other terms are valid for the d-cache as well.

Answer 1 (Score: 1)

关于清除和使内存无效的问题,答案是肯定的:清除将强制缓存写入内存,使内存无效将强制内存被缓存。

关于我提出的步骤,同样是可以理解的。

以下是4个视频的顺序,解释了这种确切情况(DMA和内存一致性)。正如可以看到,视频中提出的“软件”解决方案(不涉及MPU)与我发布的步骤序列完全一致。

https://youtu.be/5xVKIGCPy2s

https://youtu.be/2q8IvCxSjaY

https://youtu.be/6IEtoG7m0jI

https://youtu.be/0DhYTqPCRiA

另一个提出的解决方案是配置Cortex-M7的MPU以更改特定内存区域的属性以保持内存一致性。

这还不包括最简单的解决方案,即全局禁用数据缓存,尽管自然地,这并不是理想的解决方案。

英文:

I have investigated a bit more:

With regard to the clean and invalidate question, the answer is yes: clean forces cached data to be written out to memory, and invalidate discards the cached copy so the next access fetches from memory.

With regards to the steps I proposed, again yes, it makes sense.

Here is a sequence of 4 videos that explains this exact situation (DMA and memory coherency). As can be seen, the 'software' solution proposed by the videos (the one that doesn't involve the MPU) and by the other resources provided above is exactly the sequence of steps I posted.

https://youtu.be/5xVKIGCPy2s

https://youtu.be/2q8IvCxSjaY

https://youtu.be/6IEtoG7m0jI

https://youtu.be/0DhYTqPCRiA

The other proposed solution is to configure the Cortex-M7 MPU to change the attributes of a particular memory region so as to keep memory coherent.
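
A minimal sketch of that MPU approach, assuming the ST HAL naming (MPU_Region_InitTypeDef, HAL_MPU_ConfigRegion) and the SRAM2 base address 0x30020000 from the H7 reference manual. The types and constants are stubbed here so the fragment is self-contained; on target you would take them from the HAL headers and wrap the configuration in HAL_MPU_Disable()/HAL_MPU_Enable().

```c
#include <stdint.h>

/* Host-side stand-ins for the ST HAL MPU types/constants (on target they
 * come from the stm32h7xx HAL Cortex driver); values are illustrative. */
typedef struct {
    uint8_t  Enable, Number, Size, IsCacheable, IsBufferable, IsShareable;
    uint32_t BaseAddress;
} MPU_Region_InitTypeDef;
enum {
    MPU_REGION_ENABLE        = 1,
    MPU_REGION_NUMBER0       = 0,
    MPU_REGION_SIZE_16KB     = 0x0D,  /* region size = 2^(SIZE+1) bytes */
    MPU_ACCESS_NOT_CACHEABLE = 0,
    MPU_ACCESS_BUFFERABLE    = 1,
    MPU_ACCESS_NOT_SHAREABLE = 0
};
static MPU_Region_InitTypeDef applied;
static void HAL_MPU_ConfigRegion(MPU_Region_InitTypeDef *r) { applied = *r; }

/* Carve a 16 KB chunk of SRAM2 into a non-cacheable region for DMA
 * buffers, so no clean/invalidate is needed for transfers through it. */
void mpu_dma_region_init(void)
{
    MPU_Region_InitTypeDef r = {0};
    r.Enable       = MPU_REGION_ENABLE;
    r.Number       = MPU_REGION_NUMBER0;
    r.BaseAddress  = 0x30020000;               /* SRAM2 base on the H7 */
    r.Size         = MPU_REGION_SIZE_16KB;
    r.IsCacheable  = MPU_ACCESS_NOT_CACHEABLE; /* key attribute: bypass d-cache */
    r.IsBufferable = MPU_ACCESS_BUFFERABLE;
    r.IsShareable  = MPU_ACCESS_NOT_SHAREABLE;
    HAL_MPU_ConfigRegion(&r);
}
```

The trade-off versus the clean/invalidate recipe is that all CPU accesses to this region go straight to SRAM (slower), but DMA buffers placed there need no cache maintenance at all.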

All of this is apart from the easiest solution, which is to globally disable the d-cache, although, naturally, that is not desirable.

Answer 2 (Score: -1)


I forget off-hand how the data cache works on the Cortex-M7/ARMv7-M. I seem to remember it does not have an MMU and that caching is based on address. ARM and ST should be smart enough to provide both cached and non-cached paths from the processor core to the SRAM.

If you want to send or receive data using DMA, the transfer does not go through the cache.

You linked a question from before to which I had provided an answer.

Caches contain some amount of SRAM; we tend to see a spec of so many KBytes or MBytes. But there are also tag RAMs and other infrastructure. How does the cache know whether there is a hit or a miss? Not from the data, but from other bits of information taken from the address of the transaction. Some number of bits of that address are taken and compared across however many "ways" you have; with 8 ways, for example, there are 8 small memories; think of them as arrays of structures in C. In that structure is some information: is this cache line valid? If valid, which tag (bits of address) is it tied to? Is it clean or dirty?
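
To make that lookup concrete, here is how an address splits into line offset, set index, and tag for a hypothetical 4 KB, 4-way cache with 32-byte lines. The Cortex-M7 d-cache does use 4 ways and 32-byte lines, but its total size varies per part, so treat the geometry as illustrative.

```c
#include <stdint.h>

/* Hypothetical geometry: 4 KB total, 4 ways, 32-byte lines.
 * sets = 4096 / (4 * 32) = 32, so 5 offset bits and 5 index bits. */
#define LINE_BYTES  32u
#define WAYS         4u
#define CACHE_BYTES 4096u
#define SETS        (CACHE_BYTES / (WAYS * LINE_BYTES))  /* 32 sets */

/* Byte position within a cache line (low bits of the address). */
static uint32_t offset_of(uint32_t addr) { return addr % LINE_BYTES; }

/* Which set the line lands in (middle bits of the address). */
static uint32_t index_of(uint32_t addr)  { return (addr / LINE_BYTES) % SETS; }

/* Tag stored in the tag RAM and compared against all WAYS entries
 * of the set to decide hit or miss (remaining high bits). */
static uint32_t tag_of(uint32_t addr)    { return addr / (LINE_BYTES * SETS); }
```

Two addresses one set-stride (SETS * LINE_BYTES bytes) apart map to the same set and compete for its 4 ways, which is how pathological access patterns can keep evicting each other's lines.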

Clean or dirty: the whole caching infrastructure is designed (that is kind of the point) to hold information in a faster SRAM (SRAM in MCUs is already very fast, so why a cache in the first place?). Write transactions, if they go through the cache (in some form they should), get written to the cache and then, based on design/policy, get written out into system memory, or at least onto the memory side of the cache. While the cache contains written information that is not also in system memory, that data is dirty. When you clean the cache (ARM's term is clean; flush is another term), you go through all of the cache looking for lines that are valid and dirty and initiate writes to system memory to clean them. This is how you force things out of the cache into system memory for coherency reasons, if you have a need to do that.

Invalidating a cache line simply means you go through the tag RAMs and change the valid bit to mark that line invalid. Basically that "loses" all information about the line; it is now available for reuse. It will not produce any hits, and it will not cause a write to the system as a clean/flush would. The actual line in the cache memory does not have to be zeroed or put in any other state; technically it is just the valid/invalid bit or bits.

Things generally get into a cache through reads. Depending on the design and settings, if a read is cacheable, the cache first looks to see whether it has a valid tag for that item; if so, it simply takes the information from the cache and returns it. On a miss, when the data has no copy in the cache, it initiates one or more cache-line reads on the system side. So a single-byte read can (and will) cause a larger, sometimes much larger, read on the system side; the transaction is held until that data returns, then it is put in the cache and the requested item is returned to the processor.

Depending on the architecture and settings, writes may or may not create an entry in the cache. If a (cacheable) write happens and there is no hit in the cache, it may go straight to the system side as a write of that size and shape, as if the cache were not there. If there is a cache hit, it goes into the cache, and that/those cache lines are marked dirty; then, depending on the design, it may be written to system memory as a side effect of the processor-side write. The processor is freed to continue execution, but the cache and other logic (a write buffer) may continue processing the transaction, moving the new data to the system side, essentially cleaning/flushing automatically. One normally does not want that, as it takes away the performance the cache was there to provide in the first place.

In any case, if a transaction misses and is to be cached, then, based on the tag, the ways have already been examined to determine whether there was a hit, and one of the ways will be chosen to hold the new cache line. How that is determined depends on the design and, in some cases, on programmable settings. Hopefully, if any way is invalid, it goes to one of those; but round-robin, random, oldest-first, etc. are replacement policies you may see. If there is dirty data in that slot, it has to be written out first to make room for the new information. So a single-byte or single-word read (they have the same performance in a system like this) can absolutely require the flush of a cache line, then a read from the system, before the result is returned: more clock cycles than if the cache were not there. That is the nature of the beast. Caches are not perfect; with the right information and experience you can easily write code that makes the cache degrade the performance of the application.

Clean means: if a cache line is valid and dirty, write it out to system memory and mark it as clean.

Invalidate means: if the cache line is valid, mark it as invalid. If it was valid and dirty, that information is lost.

In your case you do not want to deal with the cache at all for these transactions. The cache in question is inside the ARM core, so nobody but the ARM core has access to it; nobody else sits behind the cache, they are all on the system side.

Taking a quick look at the ARM ARM for ARMv7-M, it does use address space to determine write-through and cached-or-not. One then needs to look at the Cortex-M7 TRM for further information, and then, particularly in this case, since this is a chip thing and not an ARM thing anyway, at the whole system. The ARM processor is just a bit of IP that ST bought and glued into a chip along with a bunch of other IP, including IP of their own. Like the engine in a car: the engine manufacturer can't answer questions about the rear differential or the transmission; that is the car company's domain, not the engine company's.

  1. arm knows what they are doing

  2. st knows what they are doing

  3. if a chip company makes a chip with DMA, but the only path between the processor and the memory shared with the DMA engine goes through the processor's cache when the cache is enabled, so that clean/flush and invalidate of address ranges are constantly required to use that DMA engine... then you need to immediately discard that chip, blacklist that company's products (if this product is that poorly designed, assume all of their products are), and find a better company to buy from.

I can't imagine that is the case here, so

  1. Initialize the peripheral, choosing to use DMA and configure the peripheral or dma engine or both (for each direction).

  2. Start the peripheral (this might be part of 4)

  3. write the tx data to the configured address space for dma

  4. tell the peripheral to start the transfer

  5. monitor for completion of transfer

  6. read the received data from the configured address space for dma

That is generic, but it is what you are looking for; caches are not involved. For a part/family like this there should be countless examples, including the (choose your own word for the quality) one or more library solutions and examples that come from the chip vendor. Look at how others are using the part, compare that to the documentation, determine your risk tolerance for their solution, and use it, modify it, or at least learn from it.

I know that ST products do not have an ARM instruction cache; they do their own thing, or at least that is what I remember (some trademarked name for a flash cache, which on most parts you cannot turn off). Does that mean they have not implemented a data cache on these products either? Possibly. Just because the architecture for an IP product has a feature (FPU, caches, ...) does not automatically mean the chip vendor has enabled/implemented it.

Depending on the IP there are various ways to do that, as some IP does not have a compile-time option for the chip vendor to leave a feature out. If nothing else, the chip vendor could simply stub out the cache memory interfaces and write a few lines in the docs saying there is no cache; you could then write the control registers and appear to enable the feature, but it simply would not work. One expects that ARM provides compile-time options that are not in the public documentation we can see but are available to the chip vendor in some form.

Sometimes when you buy the IP you are given a menu, like ordering a custom burger at a fancy burger shop: a list of checkboxes; mayo, mustard, pickles... FPU, cache, 16-bit fetch, 32-bit fetch, one-cycle multiply, x-cycle multiply, divide, etc. The IP vendor then produces your custom burger. With other vendors you get the whole burger and have to pick off the pickles and onions yourself.

First, find out whether this part even has a d-cache. Look between the ARM ARM, the Cortex-M7 TRM, and the documentation for the chip's address spaces (as well as the countless examples), and determine what address space or what settings are needed to access portions of SRAM in a non-cached way, if it has a data-cache feature at all.

Published by huangapple on 2023-02-18. Source: https://go.coder-hub.com/75494288.html