Zip and remove folder to reduce a Docker layer of a Dockerfile
Question
Considering that I have a huge Docker image, that I don't have the time to dig into all the best Docker practices for image-reduction techniques, and that I know which part of the image is specifically huge, I wanted to zip the huge folder, delete the original folder, and simply unzip the folder when the image boots.
...
RUN cd /mypath/ && <COMMAND_THAT_GENERATE_HUGE_DIR_BOB> && zip -r bob.zip bob && rm -rd bob
...
In some ways this is working, since my bob.zip is about 60% of the size of the original bob dir. However, probably due to the way Docker handles layers in the background, even though bob.zip is smaller, the Docker layer appears to be:
(bob.zip size) + (original bob dir size)
instead of simply:
(bob.zip size)
Is there a flag or something I can use to get the expected layer size?
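For context, here is a minimal sketch of the unzip-on-boot step described above; the entrypoint script name is hypothetical, and it assumes the archive was created from /mypath as in the RUN line (the script would be wired up via ENTRYPOINT in the Dockerfile):

#!/bin/sh
# docker-entrypoint.sh (hypothetical): unpack the archive on first start
set -e
if [ ! -d /mypath/bob ]; then
    unzip -q /mypath/bob.zip -d /mypath/
fi
exec "$@"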
Answer 1
Score: 2
A Dockerfile RUN line always makes the image larger. A Docker image is built from layers, and what a RUN line does is start from a previous layer, run a command, and remember the filesystem changes as a new layer. As an extreme example, RUN rm -rf / will actually result in an image somewhat larger than the preceding step, even though there are no files left in the image.
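To see this concretely, you can build a throwaway image and inspect its per-layer sizes with docker history; a minimal demonstration (the layer-demo tag is hypothetical):

FROM ubuntu
# Layer 1: create a 100 MB file
RUN dd if=/dev/zero of=/big.bin bs=1M count=100
# Layer 2: delete it; the file is gone from the filesystem,
# but the 100 MB layer underneath is still part of the image
RUN rm /big.bin

Building this and running docker history layer-demo shows the dd layer at roughly 100 MB, even though the final filesystem is no bigger than the base image's.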
You hint at a multi-stage build, and that's one valid approach here. The important thing is to compress the data in a stage separate from the final runtime stage:
FROM ubuntu AS data
# RUN apt-get update && apt-get install ...
WORKDIR /app
COPY ./ ./
RUN ./command_that_generate_huge_dir_bob
RUN zip -r bob.zip bob # I might use `tar czf bob.tar.gz bob`
FROM ...
...
COPY --from=data /app/bob.zip ./
...
The important detail here is that the uncompressed data never actually exists in the final image; you're only copying in the zipped file.
The approach you show in the question of generating the data, compressing it, and then deleting the raw data all inside a single RUN command should work as well, though it can be a little more finicky.
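A sketch of that single-RUN variant, reusing the placeholder command from the question; the key point is that all three steps share one RUN, so the uncompressed directory is never committed to a layer:

RUN cd /mypath/ \
    && ./command_that_generate_huge_dir_bob \
    && zip -r bob.zip bob \
    && rm -rf bob

If any of these steps were split into its own RUN line, the raw bob directory would be baked into an intermediate layer, and the later deletion would no longer save any space.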
If you do COPY in the entire host directory tree (as I've done in this example), one thing to check is that this isn't accidentally including a copy of the uncompressed data. You can include the data directory in a .dockerignore file to ensure this.
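For example, a .dockerignore entry along these lines (the path is hypothetical) keeps the raw data out of the build context entirely:

# .dockerignore: exclude the uncompressed data from the build context
bob/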
If this data really is static data, one other option is to supply it to the container at run time. I've used this approach in the past with a very large NLP model that simply didn't fit in an image. This makes distributing and deploying the application more complex, but it does make the image size more reasonable (I've found a practical limit for an image is about 1 GB; otherwise things like docker pull start spontaneously failing). Make sure your image doesn't contain the data directory at all, then bind-mount it when you run the container:
# on the host
./command_that_generate_huge_dir_bob
docker run ... -v "$PWD/bob:/app/bob" ...
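If the container only ever reads this data, you might also mount it read-only, which prevents the application from accidentally modifying the shared host copy:

docker run ... -v "$PWD/bob:/app/bob:ro" ...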
Comments