Can I align DPDK rte_mbufs to enable direct file IO (O_DIRECT) on Linux?

Question

Background

I am writing an application that requires me to stream data over a network interface and write it to disk at a very high throughput. The network and file IO components were implemented separately, and both are able to independently achieve the throughput required for the project. The networking side leverages DPDK (more relevant) and the file IO side leverages io_uring (less relevant). To achieve the high file IO throughput that I need, I must use direct IO (O_DIRECT); this is true regardless of the technology used to achieve the file IO. Using the page cache simply is not an option. The application must be zero-copy from the NIC to the NVMes we are using for storage.
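
For context, O_DIRECT requires the file offset, the transfer length, and the user buffer address to all be suitably aligned (typically to the device's logical block size; 4096 bytes is a common safe choice on NVMe). The following is a minimal sketch of queueing such a write through io_uring; the helper name queue_direct_write and the hard-coded 4096-byte alignment are illustrative assumptions, not part of the original application:

#include &lt;cerrno&gt;
#include &lt;cstdint&gt;
#include &lt;fcntl.h&gt;
#include &lt;liburing.h&gt;

// Hypothetical helper: queue a single O_DIRECT write on an io_uring instance.
// The buffer address, length, and file offset must all be aligned; 4096 bytes
// is assumed here. The file descriptor is expected to have been opened with
// O_WRONLY | O_DIRECT.
static int queue_direct_write(struct io_uring *ring, int fd, const void *buf, size_t len, off_t offset)
{
    const uintptr_t align = 4096;
    if (reinterpret_cast&lt;uintptr_t&gt;(buf) % align || len % align || offset % align)
        return -EINVAL; // the kernel rejects misaligned O_DIRECT requests

    struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
    if (sqe == nullptr)
        return -EBUSY; // submission queue is full

    io_uring_prep_write(sqe, fd, buf, len, offset);
    return io_uring_submit(ring); // number of SQEs submitted, or a negative error
}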

The problem

I have been unable to align the DPDK message buffers (rte_mbuf) to enable direct IO. This severely limits my file IO throughput, and if the alignment is not possible, I will likely need to find an alternative to DPDK, which I would of course like to avoid. Does anyone know how this memory alignment can be achieved? The message buffers should be aligned to addresses that are multiples of 4096.
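
To make the requirement concrete, a given mbuf can be tested as follows; this is a small sketch using the standard rte_pktmbuf_mtod accessor, and the function name is illustrative:

#include &lt;cstdint&gt;
#include &lt;rte_mbuf.h&gt;

// Returns true if this mbuf's packet data starts on a 4 KiB boundary and could
// therefore be handed to an O_DIRECT write without an intermediate copy.
static inline bool mbuf_data_is_4k_aligned(const rte_mbuf *m)
{
    const uintptr_t addr = reinterpret_cast&lt;uintptr_t&gt;(rte_pktmbuf_mtod(m, const char *));
    return (addr % 4096) == 0;
}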

Code Snippets

There are a number of ways to set up the DPDK mempools (rte_mempool) and message buffers. Right now, I am using rte_pktmbuf_pool_create() (as seen below), which creates a mempool and allocates the message buffers in a single function call, but I am open to going with a different approach if it helps me to get the alignment I need.

Initializing a mempool

rte_pktmbuf_pool_create(name, num_bufs, DPDK_MBUF_CACHE_SIZE, 0, mbuf_size, cpu_socket);

Where...

  • DPDK_MBUF_CACHE_SIZE is hardware-determined and is set to 315
  • mbuf_size is 9000 + RTE_PKTMBUF_HEADROOM (defined by DPDK to be 128) + RTE_ETHER_HDR_LEN + RTE_ETHER_CRC_LEN, as put together in the sketch after this list
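
Spelled out, those pool parameters look roughly like the following (the 9000-byte jumbo payload and the cache size of 315 come from this particular setup; the DPDK constants are the library's defaults):

#include &lt;cstdint&gt;
#include &lt;rte_ether.h&gt;   // RTE_ETHER_HDR_LEN (14), RTE_ETHER_CRC_LEN (4)
#include &lt;rte_mbuf.h&gt;    // RTE_PKTMBUF_HEADROOM (128 by default)

static constexpr unsigned DPDK_MBUF_CACHE_SIZE = 315; // hardware-determined for this setup
static constexpr uint16_t mbuf_size = 9000 + RTE_PKTMBUF_HEADROOM + RTE_ETHER_HDR_LEN + RTE_ETHER_CRC_LEN;
// With the default 128-byte headroom this comes to 9146 bytes. Note that it is
// not a multiple of 4096, so buffers packed back-to-back at this size cannot
// all start on 4 KiB boundaries.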

Answer 1

Score: 0

See the following code snippets, which provide a solution. Be sure to read all the way to the bottom before trying to implement something similar in your project. Also, please note that all critical error handling has been removed for the sake of brevity, and should be added back into any similar implementation.


register_external_buffers() allocates the external memory areas in huge pages and registers them with DPDK.

#include &lt;sys/mman.h&gt;    // mmap
#include &lt;rte_common.h&gt;  // RTE_ALIGN_CEIL
#include &lt;rte_dev.h&gt;     // rte_dev_dma_map
#include &lt;rte_memory.h&gt;  // rte_extmem_register, RTE_PGSIZE_1G
#include &lt;rte_mbuf.h&gt;    // rte_pktmbuf_extmem

unsigned register_external_buffers(rte_device* device, uint32_t num_mbufs, uint16_t mbuf_size, unsigned socket, rte_pktmbuf_extmem **ext_mem)
{
    rte_pktmbuf_extmem *extmem_array; // Array of external memory descriptors
    unsigned elements_per_zone; // Memory is reserved and registered in 1 GB zones
    unsigned n_zones; // Number of zones needed to accommodate all mbufs
    uint16_t element_size; // Size, in bytes, of one mbuf element

    element_size = RTE_ALIGN_CEIL(mbuf_size, 4096); // round each element up to a 4 KiB multiple
    elements_per_zone = RTE_PGSIZE_1G / element_size;
    n_zones = (num_mbufs / elements_per_zone) + ((num_mbufs % elements_per_zone) ? 1 : 0);
    extmem_array = new rte_pktmbuf_extmem[n_zones];

    for (unsigned extmem_index = 0; extmem_index < n_zones; extmem_index++)
    {
        rte_pktmbuf_extmem *current_extmem = extmem_array + extmem_index;
        // Reserve one pinned, pre-faulted 1 GB huge page per zone
        // (error handling, e.g. checking for MAP_FAILED, omitted for brevity).
        current_extmem->buf_ptr = mmap(NULL, RTE_PGSIZE_1G, PROT_READ | PROT_WRITE, MAP_HUGETLB | MAP_SHARED | MAP_ANONYMOUS | MAP_POPULATE | MAP_LOCKED, -1, 0);
        current_extmem->buf_iova = 0; // rte_iova_t is an integer, not a pointer; the device mapping is established below
        current_extmem->buf_len = RTE_PGSIZE_1G;
        current_extmem->elt_size = element_size;

        // Make the zone known to DPDK, then DMA-map it for the NIC using the VA as the IOVA.
        rte_extmem_register(current_extmem->buf_ptr, current_extmem->buf_len, NULL, 0, RTE_PGSIZE_1G);
        rte_dev_dma_map(device, current_extmem->buf_ptr, (rte_iova_t) current_extmem->buf_ptr, current_extmem->buf_len);
    }
    *ext_mem = extmem_array;
    return n_zones;
}

Then register_external_buffers might be used as follows:

rte_eth_dev_info dev_info;
rte_eth_dev_info_get(port_id, &dev_info);

rte_pktmbuf_extmem *extmem = nullptr;
unsigned length = register_external_buffers(dev_info.device, num_bufs, mbuf_size, cpu_socket, &extmem);

m_rx_pktbuf_pools.at(cpu_socket) = rte_pktmbuf_pool_create_extbuf(name, num_bufs, DPDK_MBUF_CACHE_SIZE, 0, mbuf_size, cpu_socket, extmem, length);
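
For completeness, the zones created above would eventually need a matching teardown once the mempool using them has been freed (e.g. with rte_mempool_free()). The following is a possible sketch mirroring register_external_buffers(); the name unregister_external_buffers() is an assumption, not part of the original code:

#include &lt;sys/mman.h&gt;
#include &lt;rte_dev.h&gt;
#include &lt;rte_memory.h&gt;
#include &lt;rte_mbuf.h&gt;

// Hypothetical counterpart to register_external_buffers(): undo the DMA mapping
// and DPDK registration for every zone, then release the huge pages.
// Call only after the mempool that uses this memory has been freed.
void unregister_external_buffers(rte_device *device, rte_pktmbuf_extmem *ext_mem, unsigned n_zones)
{
    for (unsigned i = 0; i < n_zones; i++)
    {
        rte_pktmbuf_extmem *current = ext_mem + i;
        rte_dev_dma_unmap(device, current->buf_ptr, (rte_iova_t) current->buf_ptr, current->buf_len);
        rte_extmem_unregister(current->buf_ptr, current->buf_len);
        munmap(current->buf_ptr, current->buf_len);
    }
    delete[] ext_mem;
}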

While this does result in all mbuf data being stored and aligned in external memory areas, the buffers are aligned to huge-page boundaries rather than the typical 4 KiB pages. So while the initial problem was solved, the solution is not very practical for this use case, as the number of usable page boundaries is very limited.
