Can I align DPDK rte_mbufs to enable direct file IO (O_DIRECT) on Linux
Question
Background
I am writing an application that requires me to stream data over a network interface and write it to disk at a very high throughput. The network and file IO components were implemented separately, and both are able to independently achieve the throughput required for the project. The networking side leverages DPDK (more relevant) and the file IO side leverages io_uring (less relevant). To achieve the high file IO throughput that I need, I must use direct IO (O_DIRECT); this is true regardless of the technology used to achieve the file IO. Using the page cache simply is not an option. The application must be zero-copy from the NIC to the NVMes we are using for storage.
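For readers unfamiliar with the O_DIRECT constraint: the kernel requires the user buffer address, the file offset, and the transfer length to all be aligned to the device's logical block size, and 4096 bytes is the common safe value for NVMe. A minimal sketch of what the file IO side must do (the path and sizes here are placeholders, not from the project):

#include <fcntl.h>     // open, O_DIRECT (g++ defines _GNU_SOURCE, which exposes it)
#include <unistd.h>    // pwrite, close
#include <cstdlib>     // posix_memalign, free

int main()
{
    // "/mnt/nvme/stream.bin" is a placeholder path.
    int fd = open("/mnt/nvme/stream.bin", O_WRONLY | O_CREAT | O_DIRECT, 0644);
    void *buf;
    // Buffer address and transfer length must both be multiples of the
    // logical block size, or the write fails with EINVAL.
    posix_memalign(&buf, 4096, 4096);
    pwrite(fd, buf, 4096, 0);  // the file offset must also be aligned
    free(buf);
    close(fd);
}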
The problem
I have been unable to align the DPDK message buffers (rte_mbuf) to enable direct IO. This severely limits my file IO throughput, and if it is not possible, I will likely need to find an alternative to DPDK, which, of course, I would like to avoid. Does anyone know how this memory alignment can be achieved? The message buffers should be aligned to addresses that are multiples of 4096.
Code Snippets
There are a number of ways to set up the DPDK mempools (rte_mempool) and message buffers. Right now, I am using rte_pktmbuf_pool_create() (as seen below), which creates a mempool and allocates the message buffers all with one function call, but I am open to going with a different approach if it helps me to get the alignment I need.
Initializing a mempool
rte_pktmbuf_pool_create(name, num_bufs, DPDK_MBUF_CACHE_SIZE, 0, mbuf_size, cpu_socket);
Where...
- DPDK_MBUF_CACHE_SIZE is hardware-determined and is set to 315
- mbuf_size is 9000 + RTE_PKTMBUF_HEADROOM (defined by DPDK to be 128) + RTE_ETHER_HDR_LEN + RTE_ETHER_CRC_LEN
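For concreteness (RTE_ETHER_HDR_LEN is 14 and RTE_ETHER_CRC_LEN is 4 in DPDK), the requested buffer size works out to 9000 + 128 + 14 + 4 = 9146 bytes, which is not a multiple of 4096. The answer below rounds it up with RTE_ALIGN_CEIL:

#include <cstdio>
#include <rte_common.h>  // RTE_ALIGN_CEIL

// Illustrative arithmetic only: 9146 rounded up to the next multiple
// of 4096 is 12288.
int main()
{
    unsigned mbuf_size = 9000 + 128 + 14 + 4;            // = 9146
    printf("%u\n", RTE_ALIGN_CEIL(mbuf_size, 4096u));    // prints 12288
}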
Answer 1
Score: 0
See the following code snippets, which provide a solution. Be sure to read all the way to the bottom before trying to implement something similar in your project. Also, please note that all critical error handling has been removed for the sake of brevity, and should be added back into any similar implementation.
register_external_buffers() allocates the external memory areas in huge pages and registers them with DPDK.
#include <sys/mman.h>    // mmap
#include <rte_common.h>  // RTE_ALIGN_CEIL
#include <rte_memory.h>  // rte_extmem_register, RTE_PGSIZE_1G
#include <rte_dev.h>     // rte_dev_dma_map
#include <rte_mbuf.h>    // rte_pktmbuf_extmem

unsigned register_external_buffers(rte_device *device, uint32_t num_mbufs, uint16_t mbuf_size,
                                   unsigned socket, rte_pktmbuf_extmem **ext_mem)
{
    rte_pktmbuf_extmem *extmem_array;  // Array of external memory descriptors
    unsigned elements_per_zone;        // Memory is reserved and registered in zones
    unsigned n_zones;                  // Number of zones needed to accommodate all mbufs
    uint16_t element_size;             // Size, in bytes, of one mbuf element

    // Round each element up to a multiple of 4096 so that every buffer in a
    // zone starts on a 4k boundary.
    element_size = RTE_ALIGN_CEIL(mbuf_size, 4096);
    elements_per_zone = RTE_PGSIZE_1G / element_size;
    n_zones = (num_mbufs / elements_per_zone) + ((num_mbufs % elements_per_zone) ? 1 : 0);
    extmem_array = new rte_pktmbuf_extmem[n_zones];
    for (unsigned extmem_index = 0; extmem_index < n_zones; extmem_index++)
    {
        rte_pktmbuf_extmem *current_extmem = extmem_array + extmem_index;
        // Reserve one huge-page-backed zone, locked and pre-faulted. Note that
        // this uses the system's default huge page size; pass MAP_HUGE_1GB
        // explicitly if the default is not 1G.
        current_extmem->buf_ptr = mmap(NULL, RTE_PGSIZE_1G, PROT_READ | PROT_WRITE,
                MAP_HUGETLB | MAP_SHARED | MAP_ANONYMOUS | MAP_POPULATE | MAP_LOCKED, -1, 0);
        // Use the virtual address as the IOVA, matching the rte_dev_dma_map()
        // call below (the original posted code set this to NULL).
        current_extmem->buf_iova = (rte_iova_t) current_extmem->buf_ptr;
        current_extmem->buf_len = RTE_PGSIZE_1G;
        current_extmem->elt_size = element_size;
        // Make the zone known to DPDK, then map it for DMA on the device.
        rte_extmem_register(current_extmem->buf_ptr, current_extmem->buf_len, NULL, 0, RTE_PGSIZE_1G);
        rte_dev_dma_map(device, current_extmem->buf_ptr, (rte_iova_t) current_extmem->buf_ptr, current_extmem->buf_len);
    }
    *ext_mem = extmem_array;
    return n_zones;
}
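Since the note above says the critical error handling was stripped, here is a hedged sketch of the loop body with the checks restored; setup_zone is a helper name introduced for illustration, not part of the original answer:

// Sketch of the checks removed for brevity; on failure the caller should
// also unwind any zones that were already set up.
static bool setup_zone(rte_device *device, rte_pktmbuf_extmem *z, uint16_t element_size)
{
    z->buf_ptr = mmap(NULL, RTE_PGSIZE_1G, PROT_READ | PROT_WRITE,
            MAP_HUGETLB | MAP_SHARED | MAP_ANONYMOUS | MAP_POPULATE | MAP_LOCKED, -1, 0);
    if (z->buf_ptr == MAP_FAILED)
        return false;  // e.g. no huge pages available
    z->buf_iova = (rte_iova_t) z->buf_ptr;
    z->buf_len = RTE_PGSIZE_1G;
    z->elt_size = element_size;
    if (rte_extmem_register(z->buf_ptr, z->buf_len, NULL, 0, RTE_PGSIZE_1G) < 0)
        return false;  // rte_errno holds the cause
    if (rte_dev_dma_map(device, z->buf_ptr, (rte_iova_t) z->buf_ptr, z->buf_len) < 0)
        return false;
    return true;
}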
Then register_external_buffers might be used as follows:
rte_eth_dev_info dev_info;
rte_pktmbuf_extmem *extmem;  // filled in by register_external_buffers()
rte_eth_dev_info_get(port_id, &dev_info);
unsigned length = register_external_buffers(dev_info.device, num_bufs, mbuf_size, cpu_socket, &extmem);
m_rx_pktbuf_pools.at(cpu_socket) = rte_pktmbuf_pool_create_extbuf(name, num_bufs, DPDK_MBUF_CACHE_SIZE, 0, mbuf_size, cpu_socket, extmem, length);
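The answer does not show teardown. A sketch of the matching cleanup, mirroring register_external_buffers() (unregister_external_buffers is a hypothetical name, and it must run only after the mempool using these zones has been freed):

// Undo, per zone, what register_external_buffers() set up, then free the
// descriptor array itself.
void unregister_external_buffers(rte_device *device, rte_pktmbuf_extmem *ext_mem, unsigned n_zones)
{
    for (unsigned i = 0; i < n_zones; i++)
    {
        rte_pktmbuf_extmem *z = ext_mem + i;
        rte_dev_dma_unmap(device, z->buf_ptr, (rte_iova_t) z->buf_ptr, z->buf_len);
        rte_extmem_unregister(z->buf_ptr, z->buf_len);
        munmap(z->buf_ptr, z->buf_len);
    }
    delete[] ext_mem;
}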
While this does result in all mbuf data being stored and aligned in external memory areas, they are aligned to huge pages, not the typical 4k pages. This means that while the initial problem was solved, the solution is not very practical for this use case, as the number of page boundaries is very limited.