Why does the Gradle cache contain dependencies multiple times?
Question
We upload our Gradle caches as ZIP files to S3 via GitLab pipeline jobs. Unpacking one of those ZIP files (which contains only the `.gradle` folder) showed that many dependencies are included more than once in exactly the same version: once in `jars-9` and once in `modules-2`.
Why does this happen, and how can we avoid it? Because of this, our CI caches are 20-30% bigger than they need to be, especially for big dependencies like the Kotlin compiler.
The size differences between the JARs come down to whether JAR compression is on or off; the files are identical content-wise.
The official explanation of how the `.gradle` folder is structured did not help.
Answer 1
Score: 4
Gradle's dependency cache is designed for efficiency and reliability. It consists of two main storage types, both located under `$GRADLE_USER_HOME/caches`:
- A file-based store of downloaded artifacts, including binaries such as JARs and raw downloaded metadata such as POM and Ivy files. The storage path of a downloaded artifact includes its SHA1 checksum, which means that two artifacts with the same name but different content can both be cached.
- A binary store of resolved module metadata, including the results of resolving dynamic versions, module descriptors, and artifacts.
The `jars-*` and `modules-2` directories you are seeing in the Gradle cache pertain to these two different types of storage.
The `jars-*` directory likely refers to the file-based store of downloaded artifacts. Each artifact stored in this directory includes the SHA1 checksum in its storage path. This design allows Gradle to cache two artifacts with the same name but different content, and it also ensures that the same artifact is not downloaded multiple times if it is already present in the cache with the same SHA1 checksum.
The `modules-2` directory, on the other hand, likely refers to the binary store of resolved module metadata. This directory keeps a record of various aspects of dependency resolution in binary format, including the results of resolving dynamic versions to concrete versions, the resolved module metadata for a particular module, and the resolved artifact metadata for a particular artifact.
These two directories are distinct because they serve different purposes and store different types of data.
As discussed with Vampire in the comments, I added the following, which revises the mapping above:
> When Gradle resolves a dependency, it downloads the dependency's JAR file and stores it in the `modules-2` directory.
>
> If a classpath transformation is applied to the dependency, the result of the transformation (which could be identical to the original if an identity transformation is used) is stored in a `jars-*` directory. This allows Gradle to cache the result of the transformation, so it does not need to perform the transformation again if the same dependency is used with the same transformation in the future.
>
> And that would explain the identical content.
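To check the "identical content" claim for a specific pair of JARs, one option is to compare the entries inside the archives rather than the archive files themselves. The sketch below is only an illustration: the function names `jar_fingerprint` and `same_content` are made up for this example, and it relies on the CRC-32 and uncompressed size recorded for each ZIP entry, which do not change with the compression setting.

```python
import zipfile


def jar_fingerprint(jar_path):
    """Map each file entry to (CRC-32, uncompressed size).

    Both values describe the uncompressed data, so they stay the same
    whether the JAR was written with compression on or off.
    """
    with zipfile.ZipFile(jar_path) as jar:
        return {
            info.filename: (info.CRC, info.file_size)
            for info in jar.infolist()
            if not info.is_dir()
        }


def same_content(jar_a, jar_b):
    """Return True if both JARs hold the same entries with the same data checksums."""
    return jar_fingerprint(jar_a) == jar_fingerprint(jar_b)
```

Calling `same_content(path_in_jars_dir, path_in_modules_dir)` on two copies of the same dependency should return `True` if they differ only in compression. A CRC-32 match is strong evidence of identical data, not a cryptographic proof.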
In terms of reducing the size of your CI caches, the multiple instances of the same dependency in different cache directories might not be avoidable given the way Gradle's cache works.
However, you might be able to configure your CI/CD pipeline to only cache the necessary directories or files, or use techniques like incremental builds to minimize the amount of data that needs to be cached.
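As a rough sketch of caching only selected directories, the script below zips a Gradle user home while skipping top-level cache directories that match a pattern such as `jars-*`. The function name `archive_gradle_home` and its parameters are hypothetical, and the sketch assumes that Gradle regenerates the skipped directories on the next build; that assumption should be verified in your own pipeline before relying on it.

```python
import fnmatch
import os
import zipfile


def archive_gradle_home(gradle_home, zip_path, skip_patterns=("jars-*",)):
    """Zip gradle_home, leaving out directories directly under caches/
    whose names match one of skip_patterns.

    Assumption: the skipped directories hold derived data that Gradle
    can rebuild. zip_path must be located outside gradle_home.
    """
    gradle_home = os.path.abspath(gradle_home)
    caches_dir = os.path.join(gradle_home, "caches")
    parent = os.path.dirname(gradle_home)
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for root, dirs, files in os.walk(gradle_home):
            if root == caches_dir:
                # Prune matching directories in place so os.walk skips them.
                dirs[:] = [
                    d for d in dirs
                    if not any(fnmatch.fnmatch(d, p) for p in skip_patterns)
                ]
            for name in files:
                full_path = os.path.join(root, name)
                # Store paths relative to the parent, e.g. ".gradle/caches/...".
                archive.write(full_path, os.path.relpath(full_path, parent))
```

The archive keeps the `.gradle` folder as its top-level entry, matching the layout described in the question, so it can be unpacked in the same place as before.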
You might also consider cleaning the Gradle cache manually or programmatically on a regular basis to remove unused or obsolete files.
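Gradle also performs its own periodic cleanup of unused cache entries. If you want to see what a manual, age-based cleanup would touch before deleting anything, a dry-run listing along these lines may help. The function name `find_stale_files` is hypothetical, and using the file modification time as the staleness signal is an assumption of this sketch, not the way Gradle itself tracks usage.

```python
import os
import time


def find_stale_files(cache_dir, max_age_days, now=None):
    """List files under cache_dir whose modification time is older than
    max_age_days. Nothing is deleted; the result is only a report.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # 86400 seconds per day
    stale = []
    for root, _dirs, files in os.walk(cache_dir):
        for name in files:
            path = os.path.join(root, name)
            if os.path.getmtime(path) < cutoff:
                stale.append(path)
    return sorted(stale)
```

Reviewing the returned list first is safer than deleting straight away, since removing individual files from a cache that Gradle manages may leave it in an inconsistent state.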