What is the fastest way to extract tar files inside a tar file using Go?

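Question

I have a tar file that contains multiple tar files. I currently extract the inner tars recursively with the tar Reader, walking over the entries manually. This process is heavy and slow, especially for large tar files that contain thousands of files and directories.

I didn't find a good package that does this recursive extraction quickly. I also tried running the command tar -xf file.tar --same-owner on the inner tars, but ran into a permissions problem (which happens only on Mac).

My question is:
Is there a way to parallelize the manual extraction so that the inner tars are extracted in parallel?

I have a method for the extraction task that I'm trying to make parallel: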

    var wg sync.WaitGroup
    wg.Add(len(tarFiles))
    for {
        header, err := tarBallReader.Next()
        if err != nil {
            break
        }
        go extractFileAsync(parentFolder, header, tarBallReader, depth, &wg)
    }
    wg.Wait()

After adding the goroutines, the files get corrupted and the process gets stuck in an endless loop.

Example of the main tar's content:

    1d2755f3375860aaaf2b5f0474692df2e0d4329569c1e8187595bf4b3bf3f3b9/
    1d2755f3375860aaaf2b5f0474692df2e0d4329569c1e8187595bf4b3bf3f3b9/VERSION
    1d2755f3375860aaaf2b5f0474692df2e0d4329569c1e8187595bf4b3bf3f3b9/json
    1d2755f3375860aaaf2b5f0474692df2e0d4329569c1e8187595bf4b3bf3f3b9/layer.tar
    348188998f2a69b4ac0ca96b42990292eef67c0abfa05412e2fb7857645f4280/
    348188998f2a69b4ac0ca96b42990292eef67c0abfa05412e2fb7857645f4280/VERSION
    348188998f2a69b4ac0ca96b42990292eef67c0abfa05412e2fb7857645f4280/json
    348188998f2a69b4ac0ca96b42990292eef67c0abfa05412e2fb7857645f4280/layer.tar
    54c027bf04447fdb035ddc13a6ae5493a3f997bdd3577607b0980954522efb9e.json
    9dd3c29af50daaf86744a8ade86ecf12f6a5a6ffc27a5a7398628e4a21770ee3/
    9dd3c29af50daaf86744a8ade86ecf12f6a5a6ffc27a5a7398628e4a21770ee3/VERSION
    9dd3c29af50daaf86744a8ade86ecf12f6a5a6ffc27a5a7398628e4a21770ee3/json
    9dd3c29af50daaf86744a8ade86ecf12f6a5a6ffc27a5a7398628e4a21770ee3/layer.tar
    b6c49400b643245cdbe17b7a7eb14f0f7def5a93326b99560241715c1e95502e/
    b6c49400b643245cdbe17b7a7eb14f0f7def5a93326b99560241715c1e95502e/VERSION
    b6c49400b643245cdbe17b7a7eb14f0f7def5a93326b99560241715c1e95502e/json
    b6c49400b643245cdbe17b7a7eb14f0f7def5a93326b99560241715c1e95502e/layer.tar
    c662ec0dc487910e7b76b2a4d67ab1a9ca63ce1784f636c2637b41d6c7ac5a1e/
    c662ec0dc487910e7b76b2a4d67ab1a9ca63ce1784f636c2637b41d6c7ac5a1e/VERSION
    c662ec0dc487910e7b76b2a4d67ab1a9ca63ce1784f636c2637b41d6c7ac5a1e/json
    c662ec0dc487910e7b76b2a4d67ab1a9ca63ce1784f636c2637b41d6c7ac5a1e/layer.tar
    da87454b77f6ac7fab1f465c10a07a1eb4b46df8058d98892794618cac8eacdc/
    da87454b77f6ac7fab1f465c10a07a1eb4b46df8058d98892794618cac8eacdc/VERSION
    da87454b77f6ac7fab1f465c10a07a1eb4b46df8058d98892794618cac8eacdc/json
    da87454b77f6ac7fab1f465c10a07a1eb4b46df8058d98892794618cac8eacdc/layer.tar
    ea1c2adfdc777d8746e50ad3e679789893a991606739c9bc7e01f273fa0b6e12/
    ea1c2adfdc777d8746e50ad3e679789893a991606739c9bc7e01f273fa0b6e12/VERSION
    ea1c2adfdc777d8746e50ad3e679789893a991606739c9bc7e01f273fa0b6e12/json
    ea1c2adfdc777d8746e50ad3e679789893a991606739c9bc7e01f273fa0b6e12/layer.tar
    f3b6608e814053048d79e519be79f654a2e9364dfdc8fb87b71e2fc57bbff115/
    f3b6608e814053048d79e519be79f654a2e9364dfdc8fb87b71e2fc57bbff115/VERSION
    f3b6608e814053048d79e519be79f654a2e9364dfdc8fb87b71e2fc57bbff115/json
    f3b6608e814053048d79e519be79f654a2e9364dfdc8fb87b71e2fc57bbff115/layer.tar
    manifest.json
    repositories

Alternatively, run docker save <image>:<tag> -o image.tar and inspect the contents of the resulting tar.

Answer 1

Score: 1

Your code probably hangs on wg.Wait() because the number of wg.Done() calls made during execution does not equal len(tarFiles).

This should work:

    var wg sync.WaitGroup
    // wg.Add(len(tarFiles)) // removed: this count may not match the number of goroutines started
    for {
        header, err := tarBallReader.Next()
        if err != nil {
            break
        }
        wg.Add(1) // one Add per goroutine that is actually started
        go extractFileAsync(parentFolder, header, tarBallReader, depth, &wg)
    }
    wg.Wait()

    func extractFileAsync(...) {
        defer wg.Done()
        // some code
    }

UPD: corrected a possible race condition. Thanks, @craigb.

Here is my solution to a similar problem (simplified):

    package main

    import (
        "archive/tar"
        "fmt"
        "io"
        "os"
        "path/filepath"
        "strings"
        "sync"
    )

    // Semaphore pairs a WaitGroup with a buffered channel that caps concurrency.
    type Semaphore struct {
        Wg sync.WaitGroup
        Ch chan int
    }

    // Limit on the number of simultaneously running goroutines.
    // Depends on the number of processor cores, storage performance, amount of RAM, etc.
    const grMax = 10

    const tarFileName = "docker_image.tar"
    const dstDir = "output/docker"

    func extractTar(tarFileName string, dstDir string) error {
        f, err := os.Open(tarFileName)
        if err != nil {
            return err
        }
        defer f.Close()

        sem := Semaphore{}
        sem.Ch = make(chan int, grMax)

        err = Untar(dstDir, f, &sem, true)

        fmt.Println("extractTar: wait for complete")
        sem.Wg.Wait() // let goroutines that were already started finish, even on error
        return err
    }

    func Untar(dst string, r io.Reader, sem *Semaphore, godeep bool) error {
        tr := tar.NewReader(r)

        for {
            header, err := tr.Next()
            switch {
            case err == io.EOF:
                return nil
            case err != nil:
                return err
            }

            // the target location where the dir/file should be created
            target := filepath.Join(dst, header.Name)

            switch header.Typeflag {
            // if it's a dir and it doesn't exist, create it
            case tar.TypeDir:
                if err := os.MkdirAll(target, 0755); err != nil {
                    return err
                }

            // if it's a file, create it
            case tar.TypeReg:
                if err := saveFile(tr, target, os.FileMode(header.Mode)); err != nil {
                    return err
                }

                // if it's a tar file and we are on the top level, extract it
                if filepath.Ext(target) == ".tar" && godeep {
                    // the file is unpacked to a directory with the file name (without extension);
                    // created before acquiring the semaphore so an error here leaks nothing
                    newDir := filepath.Join(dst, strings.TrimSuffix(header.Name, ".tar"))
                    if err := os.MkdirAll(newDir, 0755); err != nil {
                        return err
                    }

                    sem.Wg.Add(1)
                    // a buffered channel is used to limit the number of simultaneously running goroutines
                    sem.Ch <- 1

                    go func(target string, newDir string, sem *Semaphore) {
                        defer sem.Wg.Done()
                        defer func() { <-sem.Ch }()

                        fmt.Println("start goroutine, chan length:", len(sem.Ch))
                        fmt.Println("START:", target)

                        // open the inner tar file that was just written to disk
                        ft, err := os.Open(target)
                        if err != nil {
                            fmt.Println(err)
                            return
                        }
                        defer ft.Close()

                        // godeep is false here to avoid unpacking archives inside the current archive
                        if err := Untar(newDir, ft, sem, false); err != nil {
                            fmt.Println(err)
                            return
                        }
                        fmt.Println("DONE:", target)
                    }(target, newDir, sem)
                }
            }
        }
    }

    func saveFile(r io.Reader, target string, mode os.FileMode) error {
        // O_TRUNC so that re-running the extraction does not leave stale bytes behind
        f, err := os.OpenFile(target, os.O_CREATE|os.O_RDWR|os.O_TRUNC, mode)
        if err != nil {
            return err
        }
        defer f.Close()

        if _, err := io.Copy(f, r); err != nil {
            return err
        }
        return nil
    }

    func main() {
        err := extractTar(tarFileName, dstDir)
        if err != nil {
            fmt.Println(err)
        }
    }

huangapple
  • Posted on November 4, 2022, 00:30:03
  • When reposting, please keep the link to this article: https://go.coder-hub.com/74306502.html