Unpredictable multithreading behavior using HuggingFace and FastAPI with Uvicorn workers
Question
I am running inference on a Hugging Face model using FastAPI and Uvicorn.
The code looks roughly like this:
from fastapi import FastAPI

app = FastAPI()

@app.post("/inference")
async def func(text: str):
    # huggingfacepipeline is the loaded Hugging Face pipeline object (created elsewhere)
    output = huggingfacepipeline(text)
    return ...
I start the server like this:
uvicorn app:app --host 0.0.0.0 --port 8080 --workers 4
The server has enough GPU memory (80 GB).
What I expect to happen is that each of the 4 workers gets its own GPU memory space and there are 4 CPU forks of the main process, one for each worker. I can check the GPU memory allocation using nvidia-smi, so there should be 4 CPU forks and 4 processes on the GPU.
This happens like clockwork when I use a smaller model (like GPT Neo 125m). But when I use a larger model (like GPT-J in 16-bit), the behavior is often unpredictable: sometimes there are 4 CPU forks but only 3 processes on the GPU, even though there is enough free memory left over; sometimes there is only 1 process on the GPU and 4 CPU forks.
What could be causing this and how do I diagnose further?
Answer 1
Score: 0
When using multiple workers, each worker gets its own copy of the model on the GPU. Loading a model onto the GPU is a memory-intensive task, and loading N models at once leads to frequent timeout errors. These errors can be seen in the output of dmesg.
Uvicorn's support for workers is limited: when a worker times out, Uvicorn does not keep trying to restart it. As a result, fewer copies of the model than the number of workers often end up loaded on the GPU.
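If you want to confirm how many copies actually made it onto the GPU, one rough check is to have each worker log its PID and allocated GPU memory after loading the model, then compare the count with nvidia-smi. This is only a sketch, not the asker's actual code; the pipeline task, model name, and startup hook are illustrative assumptions:

import os

import torch
from fastapi import FastAPI
from transformers import pipeline

app = FastAPI()

@app.on_event("startup")
def load_model():
    # every worker process runs this hook and loads its own copy of the model
    global huggingfacepipeline
    huggingfacepipeline = pipeline("text-generation", model="EleutherAI/gpt-neo-125m", device=0)
    allocated_gb = torch.cuda.memory_allocated(0) / 1e9
    print(f"worker pid={os.getpid()}: model loaded, {allocated_gb:.1f} GB allocated on GPU 0")

If fewer lines are printed than the number of workers started, the missing workers never finished loading their model.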
The timeout errors are mentioned explicitly when Gunicorn is used. Using Gunicorn with 1) Uvicorn worker processes (because FastAPI is async) and 2) a high value for the --timeout option takes care of the problem.
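For reference, the launch command could look something like this (the 300-second timeout is an arbitrary example; choose a value comfortably longer than the model's load time):
gunicorn app:app -k uvicorn.workers.UvicornWorker --workers 4 --bind 0.0.0.0:8080 --timeout 300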