Faster way to use saved Pytorch model (bypassing import torch?)
Question
I'm using a Slurm Workload Manager on a server, and import torch takes around 30-40 seconds. The IT people running it said they couldn't do much to improve it and that it was just hardware-related (maybe they missed something? But I've searched the internet before asking them and couldn't find much either). By comparison, import numpy takes around 1 second.
I would like to know if there is a way to use the saved weights of a pytorch model to ONLY predict an output for a given input without importing torch (so no need to import everything related to gradients, etc.). Theoretically, it is just matrix multiplications (I think?), so it should be feasible using only numpy? I need to do this several times in different jobs, so I cannot cache / pass around the imported torch, which is why I'm actively looking for a solution (but generally speaking, taking something from 30-40 seconds down to a few is pretty cool anyway).
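To illustrate what I have in mind, here is a rough, untested sketch: pay the torch import once in a one-off export script that dumps the state_dict to a plain .npz file, so that each job afterwards only needs numpy (file names here are hypothetical):

# export_weights.py -- run once; the only place torch is imported
import torch
import numpy as np

state = torch.load(my_path, map_location="cpu")  # my_path as used in torch.save below
np.savez("weights.npz", **{k: t.numpy() for k, t in state.items()})

# predict.py -- run per job; numpy only
import numpy as np

w = np.load("weights.npz")

def linear(x, name):
    # same computation as torch.nn.Linear: y = x @ W^T + b
    return x @ w[name + ".weight"].T + w[name + ".bias"]

def relu(x):
    return np.maximum(x, 0.0)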
If that matters, here is the architecture of my model:
ActionNN(
(conv_1): Conv2d(5, 16, kernel_size=(3, 3), stride=(1, 1))
(conv_2): Conv2d(16, 32, kernel_size=(3, 3), stride=(1, 1))
(conv_3): Conv2d(32, 64, kernel_size=(3, 3), stride=(1, 1))
(norm_layer_1): InstanceNorm2d(16, eps=1e-05, momentum=0.1, affine=False, track_running_stats=False)
(norm_layer_2): InstanceNorm2d(32, eps=1e-05, momentum=0.1, affine=False, track_running_stats=False)
(norm_layer_3): InstanceNorm2d(64, eps=1e-05, momentum=0.1, affine=False, track_running_stats=False)
(gap): AdaptiveAvgPool2d(output_size=(1, 1))
(mlp): Sequential(
(0): Linear(in_features=71, out_features=128, bias=True)
(1): ReLU()
)
(layer): Linear(in_features=128, out_features=210, bias=True)
(layer): Linear(in_features=128, out_features=210, bias=True)
(layer): Linear(in_features=128, out_features=210, bias=True)
(layer): Linear(in_features=128, out_features=210, bias=True)
(layer): Linear(in_features=128, out_features=28, bias=True)
(layer): Linear(in_features=128, out_features=28, bias=True)
(layer): Linear(in_features=128, out_features=28, bias=True)
(sigmoid): Sigmoid()
(tanh): Tanh()
)
Number of parameters: 152284
If it were only fully connected layers, it would be "pretty easy", but because my network is a tiny bit more complex, I'm not sure how I should do it.
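For instance, my understanding is that the convolution, normalization and pooling parts above could be written roughly like this in numpy (an untested sketch operating on a single (C, H, W) array, stride 1, no padding, affine=False, matching the printed layers):

import numpy as np

def conv2d(x, weight, bias):
    # x: (C_in, H, W), weight: (C_out, C_in, kh, kw), stride 1, no padding
    c_out, c_in, kh, kw = weight.shape
    h_out, w_out = x.shape[1] - kh + 1, x.shape[2] - kw + 1
    out = np.empty((c_out, h_out, w_out))
    for i in range(h_out):
        for j in range(w_out):
            patch = x[:, i:i + kh, j:j + kw]
            # sum over C_in, kh, kw for every output channel
            out[:, i, j] = np.tensordot(weight, patch, axes=3) + bias
    return out

def instance_norm(x, eps=1e-5):
    # per-channel normalization over H and W (InstanceNorm2d with affine=False)
    mean = x.mean(axis=(1, 2), keepdims=True)
    var = x.var(axis=(1, 2), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def global_avg_pool(x):
    # AdaptiveAvgPool2d((1, 1)): one value per channel
    return x.mean(axis=(1, 2))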
I saved the parameters using torch.save(my_network.state_dict(), my_path).
Since my script takes 35 seconds in total on average (import torch included), I would be able to run it in a second or two on average, which would be great.
Here is my profiling of import torch:
1226310 function calls (1209639 primitive calls) in 49.994 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
1273 21.590 0.017 21.590 0.017 {method 'read' of '_io.BufferedReader' objects}
5276 12.145 0.002 12.145 0.002 {built-in method posix.stat}
1273 7.427 0.006 7.427 0.006 {built-in method io.open_code}
45/25 5.631 0.125 9.939 0.398 {built-in method _imp.create_dynamic}
2 0.564 0.282 0.564 0.282 {built-in method _ctypes.dlopen}
1273 0.288 0.000 0.288 0.000 {built-in method marshal.loads}
17 0.286 0.017 0.286 0.017 {method 'readline' of '_io.BufferedReader' objects}
2809/2753 0.098 0.000 0.546 0.000 {built-in method builtins.__build_class__}
1620/1 0.062 0.000 49.997 49.997 {built-in method builtins.exec}
50145 0.051 0.000 0.119 0.000 {built-in method builtins.getattr}
1159 0.048 0.000 0.115 0.000 inspect.py:3245(signature)
424 0.048 0.000 0.113 0.000 assumptions.py:596(__init__)
13 0.039 0.003 0.039 0.003 {built-in method io.open}
1411 0.035 0.000 0.045 0.000 library.py:71(impl)
1663 0.034 0.000 12.209 0.007 <frozen importlib._bootstrap_external>:1536(find_spec)
Answer 1
Score: 2
There is an easy way to save on the import time: spin up a server, import torch only once at startup, and load the model only once.
Use Flask or, better yet, FastAPI, and spin up a simple HTTP server that runs inference on an HTTP call.
The server will take 40 seconds to start, but after that any inference call will only take the time to connect and run inference.
from fastapi import Request, FastAPI
import torch

model = torch.load(<your model here>)  # load the model once, at startup
model.eval()

app = FastAPI()

@app.post("/predict")
async def inference(request: Request):
    data = await request.json()  # request.json() is a coroutine and must be awaited
    with torch.no_grad():        # inference only, no gradients needed
        prediction = model(torch.tensor(data, dtype=torch.float32))
    return {"predictions": prediction.tolist()}
Call the server with whatever client you like by posting data to http://<your-host>:<port>/predict.
See https://fastapi.tiangolo.com/tutorial/first-steps/ for more details.
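For example, a minimal client (hypothetical host, port and payload shape, to be adapted to whatever the model expects) could be:

import requests

payload = [[0.0] * 71]  # hypothetical input shape
resp = requests.post("http://localhost:8000/predict", json=payload)
print(resp.json()["predictions"])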
Comments