HuggingFace Inference Endpoints extremely slow performance
Question
I compute vector embeddings for text paragraphs using the all-MiniLM-L6-v2 model at HuggingFace. Since the free endpoint wasn't always responsive enough and I need to be able to scale, I deployed the model to HuggingFace Inference Endpoints. To begin with, I chose the cheapest endpoint.
To my surprise, a single request to compute 35 embeddings took more than 7 seconds (according to the log at HuggingFace). Based on the suggestion of HuggingFace support, I tried to upgrade to 2 CPUs and it got even slower (to tell the truth, I am not sure why they thought that a single request would benefit from another CPU). Next, I tried GPU. The request now takes 2 seconds.
I must be missing something, because it seems impossible that one would pay >$400/month to serve a single request in 2 seconds rather than thousands of requests per second, but I don't see what it could be.
I submit the requests using the command in the following format:
curl https://xxxxxxxxxxxxxx.us-east-1.aws.endpoints.huggingface.cloud -X POST -d '{"inputs": ["My paragraphs are of about 200 words on average", "Another paragraph", etc.]}' -H 'Authorization: Bearer xxxxxxxxxxxxxxxxxxxxxxxxxx' -H 'Content-Type: application/json'
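For reference, here is an equivalent request from Python: a minimal sketch using the requests library, with the endpoint URL and token kept as the placeholders above. Note that timing it client-side also includes network latency on top of the compute time shown in HuggingFace's logs.

import time
import requests

# Placeholders, exactly as in the curl command above
ENDPOINT = "https://xxxxxxxxxxxxxx.us-east-1.aws.endpoints.huggingface.cloud"
HEADERS = {
    "Authorization": "Bearer xxxxxxxxxxxxxxxxxxxxxxxxxx",
    "Content-Type": "application/json",
}
payload = {"inputs": ["My paragraphs are of about 200 words on average",
                      "Another paragraph"]}

start = time.perf_counter()
response = requests.post(ENDPOINT, json=payload, headers=HEADERS)
elapsed = time.perf_counter() - start
print(f"HTTP {response.status_code}, {elapsed:.2f}s round trip")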
What could I be missing?
P.S. For the GPU, it does get much better once warmed up, achieving 100 ms, i.e. 35 embeddings per request works out to about 350 embeddings per second. However, this particular model achieves 14,200 embeddings per second on an A100. Granted, I wasn't running it on an A100, but 350 embeddings per second is still far too slow.
Answer 1
Score: 1
To test single-core CPU efficiency, I used the following script:
from sentence_transformers import SentenceTransformer
import time

# Embed 10 identical short sentences per iteration, on CPU only
sentences = ["This is an example sentence each sentence is converted"] * 10
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2', device='cpu')

for i in range(100):
    start_time = time.time()
    embeddings = model.encode(sentences)
    end = time.time()
    print("Time taken: ", end - start_time)
And ran it with the following command, which pins the process to CPU core 0:
taskset -c 0 python soquestion.py
This embeds the 10 sentences in the following times (in seconds):
...
Time taken: 0.035448551177978516
Time taken: 0.035162925720214844
Time taken: 0.03574204444885254
Time taken: 0.035799264907836914
Time taken: 0.03513455390930176
Time taken: 0.03690838813781738
Time taken: 0.035082340240478516
Time taken: 0.035216331481933594
Time taken: 0.0348513126373291
...
But if I use all of my cores:
...
Time taken: 0.016519546508789062
Time taken: 0.01624751091003418
Time taken: 0.017212390899658203
Time taken: 0.016582727432250977
Time taken: 0.019397735595703125
Time taken: 0.016611814498901367
Time taken: 0.017941713333129883
Time taken: 0.01743769645690918
...
So I would say core count affects speed: going by the timings above, roughly 285 embeddings per second on one core versus about 590 on all cores. I'm using an AMD Ryzen 5 5000, which may or may not be significantly slower than the Intel Xeon Ice Lake CPUs Hugging Face provides (they don't really tell you the exact model, and performance varies a lot...).
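Another way to emulate a small endpoint's CPU allocation, without taskset, is to cap PyTorch's thread pool from inside the script. A minimal sketch, assuming the model runs on PyTorch's CPU backend:

import time
import torch
from sentence_transformers import SentenceTransformer

# Assumption: limiting intra-op threads roughly approximates a 1-vCPU instance
torch.set_num_threads(1)

sentences = ["This is an example sentence each sentence is converted"] * 10
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2', device='cpu')

start = time.time()
embeddings = model.encode(sentences)
print("Time taken: ", time.time() - start)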
However, I can say that your instances are insufficient memory-wise, because the pricing docs state:
- aws small: $0.06/hr, 1 vCPU, 2 GB RAM, Intel Xeon - Ice Lake
- aws medium: $0.12/hr, 2 vCPUs, 4 GB RAM, Intel Xeon - Ice Lake
- aws large: $0.24/hr, 4 vCPUs, 8 GB RAM, Intel Xeon - Ice Lake
- aws xlarge: $0.48/hr, 8 vCPUs, 16 GB RAM, Intel Xeon - Ice Lake
- azure small: $0.06/hr, 1 vCPU, 2 GB RAM, Intel Xeon
- azure medium: $0.12/hr, 2 vCPUs, 4 GB RAM, Intel Xeon
- azure large: $0.24/hr, 4 vCPUs, 8 GB RAM, Intel Xeon
- azure xlarge: $0.48/hr, 8 vCPUs, 16 GB RAM, Intel Xeon
And you mentioned using 1 to 2 vCPUs, which come with 2-4 GB of RAM. I investigated how much RAM this process uses with:
/usr/bin/time -v python soquestion.py |& grep resident
Maximum resident set size (kbytes): 981724
Average resident set size (kbytes): 0
That is about 1 GB: a lot compared to the CPU instances, very little compared to the GPU instances. I would suggest considering an upgrade of your instances, although I came across this question, which describes struggles even with 4 GB of RAM.
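If you prefer to check peak memory from inside Python rather than via /usr/bin/time, the standard-library resource module reports the same figure. A minimal sketch, assuming Linux, where ru_maxrss is given in kilobytes:

import resource
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2', device='cpu')
embeddings = model.encode(["This is an example sentence"])

# Peak resident set size of this process; kilobytes on Linux
peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print("Maximum resident set size (kbytes):", peak_kb)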