How can I serve ML models quickly and with low latency?

Question

Assume a user connects via a WebSocket connection to a server, which serves a personalized TypeScript function based on a personalized JSON file.

So when a user connects,

  • the personalized JSON file is loaded from an S3-like bucket (around 60-100 MB per user),
  • while the user types, the submitted TypeScript/JavaScript/Python code is executed, which returns some string as a reply and updates the JSON-like data structure,
  • when the user disconnects, the JSON gets persisted back to the S3-like bucket (a minimal sketch of this lifecycle follows the list).
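A minimal sketch of that connect / execute / disconnect lifecycle, assuming the `ws` package for the WebSocket server; `loadUserJson`, `saveUserJson` and `runUserCode` are hypothetical in-memory stand-ins for the real bucket access and sandboxed code execution, not an existing API:

```typescript
import { WebSocketServer } from 'ws';

// Hypothetical stand-ins for the real S3-like bucket access and the sandboxed
// execution of user-supplied code; in-memory stubs so the sketch stays runnable.
async function loadUserJson(userId: string): Promise<Record<string, unknown>> {
  return { userId, counter: 0 };                  // stand-in for the 60-100 MB document
}
async function saveUserJson(userId: string, doc: Record<string, unknown>): Promise<void> {
  console.log(`persisting JSON for ${userId} (${Object.keys(doc).length} keys)`);
}
function runUserCode(code: string, doc: Record<string, unknown>) {
  // stand-in for sandboxed TypeScript/JavaScript/Python execution
  return { reply: `ran ${code.length} bytes of user code`, doc: { ...doc, lastRun: Date.now() } };
}

const wss = new WebSocketServer({ port: 8080 });

wss.on('connection', async (socket, request) => {
  // 1) On connect: pull the personalized JSON into RAM (60-100 MB per user).
  const userId = new URL(request.url ?? '/', 'http://localhost').searchParams.get('user') ?? 'anonymous';
  let doc = await loadUserJson(userId);

  // 2) On every message: execute the submitted code, reply with a string,
  //    and keep the updated JSON-like structure in memory.
  socket.on('message', (data) => {
    const result = runUserCode(data.toString(), doc);
    doc = result.doc;
    socket.send(result.reply);
  });

  // 3) On disconnect: persist the JSON back to the S3-like bucket.
  socket.on('close', () => {
    void saveUserJson(userId, doc);
  });
});
```

The real system would replace the stubs with the bucket SDK and a proper sandbox; the sketch only shows where the load, execute and persist steps sit in the connection lifecycle.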

In total, you can think of about 10,000 users, so roughly 600 GB in total.
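A quick back-of-the-envelope check on those numbers; the 1 Gbps effective bandwidth used below is an illustrative assumption, not a measurement:

```typescript
// Rough arithmetic for the stated workload.
const users = 10_000;
const perUserMB = { low: 60, high: 100 };

// Aggregate storage: 600 GB at the low end, about 1 TB at the high end.
const totalGB = {
  low: (users * perUserMB.low) / 1000,      // 600 GB
  high: (users * perUserMB.high) / 1000,    // 1000 GB, i.e. about 1 TB
};

// Time just to transfer one user's JSON on connect, at an assumed bandwidth.
const assumedGbps = 1;                                      // illustrative assumption
const mbPerSecond = (assumedGbps * 1000) / 8;               // ~125 MB/s
const fetchSecondsOnConnect = perUserMB.high / mbPerSecond; // ~0.8 s for 100 MB

console.log(totalGB, `${fetchSecondsOnConnect.toFixed(2)} s per connect`);
```

Under that assumption the per-connect transfer alone is close to a second, which is worth keeping in mind against the requirements below.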

It should

  • spin up fast for a user,
  • be very scalable given the number of users (so that we do not waste money), and
  • have a global latency of a few tens of ms.

Is that possible? If so, what architecture seems to be the most fitting?

Answer 1

Score: 1

Q: "Is that possible?"

让我们简要概述一下单用户单交易的端到端延迟预算构成:

  1. 如果位于同一位置,用户可能需要大约 1 [毫秒],但在将数据包发送到现场的RTO连接中,延迟可能高达 150+ [毫秒](在此,我们为简便起见忽略了所有套接字初始化和设置协商的成本)。

  2. 服务器可能需要花费大约 25+ [毫秒] 或更多时间,以从RAM中“读取”授权用户特定的JSON格式字符串,该字符串是SER/DES-ed字符串的第一个寻址/索引,仍然是 key:value 对的字符串表示形式(在此,我们为简便起见忽略了NUMA生态系统的非独占使用的所有附加成本,这些成本用于实际查找、物理读取和交叉NUMA传输 60~100 MB 的授权用户特定数据,从远程的大约千兆字节大小的非RAM存储到本地CPU核心RAM区域的最终目的地)。

  3. JSON解码器可能会在 60~100 MB 数据字典上进行重复的 key:value 测试,并耗费不同数量的时间。

  4. ML模型可能会在 .predict() 方法的内部评估上耗费不同数量的时间。

  5. 服务器将花费额外时间来组装用户的回复。

  6. 网络再次添加传输延迟,原则上类似于在项目1中经历的延迟。

  7. 服务器接下来将花费额外时间对每个用户和每个事件进行特定的修改,在RAM中保留的JSON编码的 60~100 MB 数据字典中(如果用户体验延迟是设计优先级,此部分应始终发生在上述项目之后)。

  8. 服务器将接下来花费额外的时间在跨NUMA生态系统的数据传输和存储的相反方向上。与项目2相似,这次数据流可以享受非关键/异步/缓存/延迟掩码的物理资源使用模式,这在项目2中并不适用,除非从本地CPU核心的RAM表示开始,重新SER-串行化成字符串,然后跨所有跨NUMA生态系统的互连,一直到最后的冷存储物理存储设备(这几乎肯定不会发生在这里)。

(小计 用于单用户单交易单预测的 ... [毫秒])

让我们简要概述一下当多用户多交易的现实进入ZOO时会出现什么问题:

a. 所有到目前为止乐观(被假设为独占的)资源将开始在处理性能/传输吞吐量方面下降,这将增加或增加实际达到的延迟,因为并发请求现在将导致进入阻塞状态(无论是在微观级别,如CPU核心的LRU缓存补给延迟,还是在宏观级别,如排队等待本地ML模型的.predict() 方法运行的QPI扩展访问时间到CPU核心非本地RAM区域,这在非共享的单用户单交易中是不存在的,因此永远不要期望资源公平分配)。

b. 一切在项目7和8中的延迟(ALAP)写入都将成为端到端延迟关键路径的一部分,因为JSON编码的 60~100 MB 数据写回必须尽快完成,而不是尽可能晚,因为永远不知道,同一用户的另一个请求何时到来,下一个射击必须重新获取已更新的JSON数据,以避免失去此用户特定JSON数据顺序的强制演进序列的强制更新。

(小计 对于约10k+多用户多交易多预测,将很难保持在几十 [毫秒] 内)


架构?

嗯,鉴于上述的计算策略,似乎没有架构可以“拯救”所有这些请求的主要低效性。

对于那些必须采用超低延迟设计的行业领域,核心设计原则是避免任何增加端到端延迟的不必要源。

  • 二进制紧凑的BLOB占主导地位(JSON字符串在所有阶段都很昂贵,从存储,到所有网络传输的流量,再到重复的序列化/反序列化重新处理)。

  • 不足的内存计算扩展使大型设计需要将ML模型移近到生态系统边缘,而不是位于NUMA生态系统核心内的单一CPU/RAM块/缓存消耗者。

(看起来复杂吗?是的,这是复杂的,分布式计算(超)低延迟是一个技术上的难题,而不是某种“黄金子弹”架构的自由选择。)

英文:

>Q: "Is that possible?"

Let's make a sketch of a single-user single-transaction end-to-end latency budget composition ( a rough numeric tally of these items follows the list ) :

  1. User may spend from about 1 [ms] if colocated, yet up to 150+ [ms], for sending a packet over the live, RTO connection ( Here we ignore all socket initiation & setup negotiations for simplicity )

  2. Server may spend anything above 25+ [ms] for "reading" an auth'd-user specific JSON-formatted string from RAM, upon a first seek/index into the SER/DES-ed string, which is still only a string representation of the key:value pairs ( Here we ignore, for simplicity, all the add-on costs of non-exclusive use of the NUMA ecosystem, spent on actually finding, physically reading and cross-NUMA transporting those 60 ~ 100 MB of auth'd-user specific data from a remote, about TB-sized off-RAM storage into their final destination inside a local CPU-core RAM area )

  3. JSON-decoder may spend any amounts of additional time on repetitive key:value tests over the 60 ~ 100 MB data dictionary

  4. ML-model may spend any amounts of additional time on .predict()-method's internal evaluation

  5. Server will spend some additional time for assembling a reply to the user

  6. Network will again add transport latency, principally similar to the one experienced under item 1 above

  7. Server will next spend some additional time for a per-user & per-incident specific modification of the in-RAM, per-user maintained, JSON-encoded 60 ~ 100 MB data dictionary ( This part ought to always happen after the items above, if UX latency is a design priority )

  8. Server will next spend some additional time on the opposite direction of cross-NUMA ecosystem data transport & storage. While mirroring item 2, this time the data-flow may enjoy non-critical / async / cached / latency-masked, deferred usage patterns of the physical resources, which was not the case under item 2, where no pre-caching will happen unless some TB-sized, exclusive-use, never-evicted cache footprints are present and reserved end-to-end, alongside the whole data transport trajectory from the local CPU-core in-RAM representation, through re-SER-ialisation into a string, over all the cross-NUMA ecosystem interconnects, down to the very last cold-storage physical storage device ( which is almost surely not going to happen here )

( subtotal ... [ms] for a single-user single-transaction single-prediction )
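A rough numeric tally of the items above; items 1 and 2 reuse the figures from the list, the rest are purely illustrative placeholders, not measurements:

```typescript
// Assumed, illustrative per-item latencies [ms] for the single-user,
// single-transaction budget sketched in items 1-8 above.
const budgetMs = {
  networkRequest: 1,        // item 1: ~1 ms colocated, up to 150+ ms intercontinental
  fetchUserJsonToRam: 25,   // item 2: 25+ ms, before any cross-NUMA add-on costs
  jsonDecodeAndLookups: 30, // item 3: placeholder for a 60-100 MB dictionary
  modelPredict: 10,         // item 4: placeholder for .predict()
  assembleReply: 1,         // item 5
  networkReply: 1,          // item 6
  updateInRamJson: 5,       // item 7
  deferredWriteBack: 0,     // item 8: off the critical path only while single-user
};

const subtotal = Object.values(budgetMs).reduce((sum, ms) => sum + ms, 0);
console.log(`single-user single-transaction subtotal ~ ${subtotal} ms`);
// Even with these optimistic placeholders the subtotal already sits in the
// tens of [ms], before any of the multi-user contention effects in a./b. below.
```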

Let's make a sketch of what else goes wrong once the many-users many-transactions reality gets into the ZOO :

a.<br>All so far optimistic ( having been assumed as exclusive ) resources will start to degrade in processing performance / transport throughputs, which will add and/or increase actually achieved latencies, because concurrent requests will now result in entering blocking states ( both on micro-level like CPU-core LRU cache resupply delays, cross-QPI extended access times to CPU-core non-local RAM areas and macro-level like enqueuing all calls to wait before a local ML-model .predict()-method is free to run, none of which were present in the non-shared single-user single-transaction with unlimited exclusive resources usage above, so never expect a fair split of resources )

b.<br>Everything what was "permissive" for a deferred ( ALAP ) write in the items 7 & 8 above, will now become a part of the end-to-end latency critical-path, as also the JSON-encoded 60 ~ 100 MB data write-back has to be completed ASAP, not ALAP, as one never knows, how soon another request from the same user will arrive and any next shot has to re-fetch an already updated JSON-data for any next request ( perhaps even some user-specific serialisation of sequence of requests will have to get implemented, so as to avoid loosing the mandatory order of self-evolution of this very same user-specific JSON-data sequential self-updates )

( subtotal for about 10k+ many-users many-transactions many-predictions will IMHO hardly remain inside a few tens of [ms] )
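A minimal sketch of the per-user request serialisation mentioned in item b.: every operation for a given user is chained onto one promise, so an in-RAM update and its ASAP write-back can never reorder. This is a generic pattern, not a specific library; `applyUpdate` and `writeBackJson` are hypothetical stand-ins.

```typescript
// Per-user serialisation: all operations for one user run strictly in order.
const userQueues = new Map<string, Promise<unknown>>();

function enqueueForUser<T>(userId: string, op: () => Promise<T>): Promise<T> {
  const tail = userQueues.get(userId) ?? Promise.resolve();
  // Chain behind whatever is already pending for this user; swallow a previous
  // failure so one bad operation does not poison the rest of the chain.
  const next = tail.catch(() => undefined).then(op);
  userQueues.set(userId, next);
  return next;
}

// Hypothetical stand-ins for the real in-RAM update and the S3-like write-back.
async function applyUpdate(userId: string, patch: object): Promise<void> {
  console.log('update', userId, patch);
}
async function writeBackJson(userId: string): Promise<void> {
  console.log('write-back', userId);
}

// Usage: the write-back is issued ASAP, yet can never overtake a later update
// for the same user, because both go through the same per-user queue.
async function handleEvent(userId: string, patch: object): Promise<void> {
  await enqueueForUser(userId, () => applyUpdate(userId, patch));
  await enqueueForUser(userId, () => writeBackJson(userId));
}
```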


Architecture?

Well, given the computation strategy sketched in the O/P, there seems to be no architecture that could "save" all the principal inefficiencies requested there.

For industry segments where ultra-low latency designs are a must, the core design principle is to avoid any unnecessary source of increased end-to-end latency.

  • binary-compact BLOBs rule ( JSON-strings are hellishly expensive at every stage, from storage, through all network transport flows, to the repetitive SER-/DES-erialisation re-processing; a small JSON-vs-binary illustration follows this list )

  • poor in-RAM computing scaling makes big designs move the ML-models closer to the ecosystem periphery, rather than keeping a singular CPU/RAM-blocker / CACHE-depleter inside the core of the NUMA ecosystem
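As a small illustration of the "binary-compact BLOBs rule" bullet, here is a purely numeric feature block round-tripped as a JSON string versus as a raw typed-array BLOB, using only Node built-ins; exact sizes and timings will vary by machine and run:

```typescript
import { performance } from 'node:perf_hooks';

// 1 million float features: as a JSON string vs. as a binary-compact BLOB.
const features = Float64Array.from({ length: 1_000_000 }, () => Math.random());

// JSON representation: text that has to be re-parsed on every use.
const t0 = performance.now();
const jsonText = JSON.stringify(Array.from(features));
const parsedBack = JSON.parse(jsonText) as number[];
const t1 = performance.now();

// Binary representation: the bytes are the data, "decoding" is just a view.
const blob = Buffer.from(features.buffer);
const restored = new Float64Array(blob.buffer, blob.byteOffset, blob.byteLength / 8);
const t2 = performance.now();

console.log(`JSON:   ${(jsonText.length / 1e6).toFixed(1)} MB of text, ` +
            `round-trip ${(t1 - t0).toFixed(0)} ms, sample ${parsedBack[0]}`);
console.log(`binary: ${(blob.byteLength / 1e6).toFixed(1)} MB of bytes, ` +
            `round-trip ${(t2 - t1).toFixed(0)} ms, sample ${restored[0]}`);
```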

( Does it seem complex? Yes, it is. Complex & heterogeneous distributed computing for (ultra-)low latency is a technically hard domain, not a free choice of some "golden bullet" architecture )
