How to decompress/deflate PDF Stream

Question

I'm working with the 2016-W4 PDF, which has 2 large streams (pages 1 & 2) along with a bunch of other objects and smaller streams. I'm trying to deflate the stream(s) to work with the source data, but am struggling: all I get are corrupt-input and invalid-checksum errors.

I've written a test script to help debug, and have pulled out smaller streams from the file to test with.

Here are 2 streams from the original pdf, along with their length objects:

stream 1:

149 0 obj
<< /Length 150 0 R /Filter /FlateDecode /Type /XObject /Subtype /Form /FormType
1 /BBox [0 0 8 8] /Resources 151 0 R >>
stream
x+TT(T0B ,JUWÈS0Ð37±402V(NF–JSþ¶
«
endstream
endobj
150 0 obj
42
endobj

stream 2:

142 0 obj
<< /Length 143 0 R /Filter /FlateDecode /Type /XObject /Subtype /Form /FormType
1 /BBox [0 0 0 0] /Resources 144 0 R >>
stream
x+Tçã
endstream
endobj
143 0 obj
11
endobj

I copied just the stream contents into new files within Vim (excluding the carriage returns after stream and before endstream).

I've tried both:

  • compress/flate (rfc-1951) – (removing the first 2 bytes (CMF, FLG))
  • compress/zlib (rfc-1950)
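
For reference, the relationship between the two formats can be sketched as follows. This is only an illustration under stated assumptions, not the code from the question: `checkZlibHeader`, `zlibCompress` and `inflateRaw` are made-up helper names, the payload is synthetic, and the stream is assumed to carry no preset dictionary (FDICT unset). It validates the two RFC 1950 header bytes and shows that what sits between the 2-byte header and the 4-byte Adler-32 trailer is a raw RFC 1951 stream that compress/flate can read.

```go
package main

import (
	"bytes"
	"compress/flate"
	"compress/zlib"
	"fmt"
	"io"
)

// checkZlibHeader (hypothetical helper) validates the CMF and FLG bytes
// of an RFC 1950 stream.
func checkZlibHeader(cmf, flg byte) error {
	if cm := cmf & 0x0f; cm != 8 {
		return fmt.Errorf("unsupported compression method %d", cm)
	}
	if (uint(cmf)*256+uint(flg))%31 != 0 {
		return fmt.Errorf("FCHECK failed for header [%d, %d]", cmf, flg)
	}
	return nil
}

// zlibCompress wraps payload in a zlib (RFC 1950) container.
func zlibCompress(payload []byte) []byte {
	var buf bytes.Buffer
	w := zlib.NewWriter(&buf)
	w.Write(payload) // writes to a bytes.Buffer do not fail
	w.Close()
	return buf.Bytes()
}

// inflateRaw strips the zlib framing by hand (2-byte header, 4-byte
// Adler-32 trailer) and decodes the remaining raw deflate body.
// Assumption: no preset dictionary is present.
func inflateRaw(z []byte) ([]byte, error) {
	if len(z) < 6 {
		return nil, fmt.Errorf("stream too short: %d bytes", len(z))
	}
	if err := checkZlibHeader(z[0], z[1]); err != nil {
		return nil, err
	}
	r := flate.NewReader(bytes.NewReader(z[2 : len(z)-4]))
	defer r.Close()
	return io.ReadAll(r)
}

func main() {
	z := zlibCompress([]byte("q 0 0 8 8 re W n Q"))
	out, err := inflateRaw(z)
	fmt.Printf("%q %v\n", out, err)
}
```

Note that the raw path skips the Adler-32 check entirely, so it can "succeed" on data that compress/zlib would reject.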

I've converted the streams to []byte for the below:

package main

import (
	"bytes"
	"compress/flate"
	"compress/zlib"
	"fmt"
	"io"
	"os"
)

var (
	flateReaderFn = func(r io.Reader) (io.ReadCloser, error) { return flate.NewReader(r), nil }
	zlibReaderFn  = func(r io.Reader) (io.ReadCloser, error) { return zlib.NewReader(r) }
)

func deflate(b []byte, skip, length int, newReader func(io.Reader) (io.ReadCloser, error)) {
	// rfc-1950
	// --------
	//   First 2 bytes
	//   [120, 1] - CMF, FLG
	//
	//   CMF: 120
	//     0111 1000
	//     ↑    ↑
	//     |    CM(8) = deflate compression method
	//     CINFO(7)   = 32k LZ77 window size
	//
	//   FLG: 1
	//     0001 ← FCHECK
	//            (CMF*256 + FLG) % 31 == 0
	//             120 * 256 + 1 = 30721
	//                             30721 % 31 == 0

	stream := bytes.NewReader(b[skip:length])
	r, err := newReader(stream)
	if err != nil {
		fmt.Println("\nfailed to create reader,", err)
		return
	}
	defer r.Close() // release the reader on every return path

	n, err := io.Copy(os.Stdout, r)
	if err != nil {
		if n > 0 {
			fmt.Print("\n")
		}
		fmt.Println("\nfailed to write contents from reader,", err)
		return
	}
	fmt.Printf("%d bytes written\n", n)
}

func main() {
	//readerFn, skip := flateReaderFn, 2 // compress/flate RFC-1951, ignore first 2 bytes
	readerFn, skip := zlibReaderFn, 0 // compress/zlib RFC-1950, ignore nothing

	//                                                                                                ⤹ This is where the error occurs: `flate: corrupt input before offset 19`.
	stream1 := []byte{120, 1, 43, 84, 8, 84, 40, 84, 48, 0, 66, 11, 32, 44, 74, 85, 8, 87, 195, 136, 83, 48, 195, 144, 51, 55, 194, 177, 52, 48, 50, 86, 40, 78, 70, 194, 150, 74, 83, 8, 4, 0, 195, 190, 194, 182, 10, 194, 171, 10}
	stream2 := []byte{120, 1, 43, 84, 8, 4, 0, 1, 195, 167, 0, 195, 163, 10}

	fmt.Println("----------------------------------------\nStream 1:")
	deflate(stream1, skip, 42, readerFn) // flate: corrupt input before offset 19

	fmt.Println("----------------------------------------\nStream 2:")
	deflate(stream2, skip, 11, readerFn) // invalid checksum
}

I'm sure I'm doing something wrong somewhere, I just can't quite see it.

(The pdf does open in a viewer)

Answer 1

Score: 4

Binary data should never be copied out of, or saved from, a text editor. There might be cases when this succeeds, but that just adds fuel to the fire.

Your data that you eventually "mined out" from the PDF is most likely not identical to the actual data that is in the PDF. You should take the data from a hex editor (e.g. try hecate for something new), or write a simple app that saves it (which strictly handles the file as binary).

Hint #1:

The binary data is displayed spread across multiple lines. Binary data does not contain carriage returns; that's a textual control. If it does, that means the editor interpreted it as text, and so some codes / characters were "consumed" to start a new line. Multiple sequences may be interpreted as the same newline (e.g. \n, \r\n). By excluding them, you have already lost data; by including them, you might already have a different sequence. And if the data was interpreted and displayed as text, more problems may arise, as there are more control characters, and some characters may not appear when displayed.
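
One plausible way such distortion arises (an assumption; the thread does not confirm the exact mechanism) is that the editor treats each byte as a Latin-1 character and writes the buffer back out as UTF-8, so every byte at or above 0x80 turns into two bytes. The sketch below uses a hypothetical helper, `latin1ToUTF8`, to mimic that. Byte 200 becomes the pair 195, 136, which is the same pair that follows 87 ('W') in the question's `stream1`.

```go
package main

import "fmt"

// latin1ToUTF8 (hypothetical helper) mimics an editor that reads each
// byte as a Latin-1 character and saves the text as UTF-8.
func latin1ToUTF8(in []byte) []byte {
	out := make([]byte, 0, len(in))
	for _, b := range in {
		// rune(b) is the code point U+0000..U+00FF; converting it to a
		// string yields its UTF-8 encoding (1 or 2 bytes).
		out = append(out, string(rune(b))...)
	}
	return out
}

func main() {
	original := []byte{87, 200, 83}     // 'W', 0xC8, 'S'
	fmt.Println(latin1ToUTF8(original)) // [87 195 136 83]
}
```

Once the byte count and values have changed like this, the deflate bit stream no longer lines up and the decoder fails partway through.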

Hint #2:

When flateReaderFn is used, decoding the 2nd example succeeds (completes without an error). This means "you were barking up the right tree", but success depends on what the actual data is and to what extent it was "distorted" by the text editor.

Answer 2

Score: 2

Okay, confession time...

I was so caught up in trying to understand deflate that I completely overlooked the fact that Vim wasn't saving the stream contents correctly into new files. So I spent quite a bit of time reading the RFCs and digging through the internals of the Go compress/... packages, assuming the problem was with my code.

Shortly after I posted my question I tried reading the PDF as a whole, finding the stream/endstream locations, and pushing that through deflate. As soon as I saw the content scroll through the screen I realized my dumb mistake.

+1 @icza, that was exactly my issue.

It was good in the end, as I have a much better understanding of the whole process than if it had just worked the first go-around.

Answer 3

Score: 0

Extracting objects from PDF can be tricky depending on the filters used. The filter can also have additional options which need to be handled correctly.

For anyone who wants to extract an object without handling the low-level details of the process, a single object can be fetched from a PDF and decoded as follows:

package main

import (
	"fmt"
	"os"

	"github.com/unidoc/unipdf/v3/core"
	"github.com/unidoc/unipdf/v3/model"
)

func main() {
	objNum := 149 // Get object 149
	err := inspectPdfObject("input.pdf", objNum)
	if err != nil {
		fmt.Printf("Error: %v\n", err)
		os.Exit(1)
	}
}

func inspectPdfObject(inputPath string, objNum int) error {
	f, err := os.Open(inputPath)
	if err != nil {
		return err
	}

	defer f.Close()

	pdfReader, err := model.NewPdfReader(f)
	if err != nil {
		return err
	}

	isEncrypted, err := pdfReader.IsEncrypted()
	if err != nil {
		return err
	}

	if isEncrypted {
		// If encrypted, try decrypting with an empty password.
		// Can also specify a user/owner password here by modifying the line below.
		auth, err := pdfReader.Decrypt([]byte(""))
		if err != nil {
			fmt.Printf("Decryption error: %v\n", err)
			return err
		}
		if !auth {
			fmt.Println("This file is encrypted with an opening password. Modify the code to specify the password.")
			return nil
		}
	}

	obj, err := pdfReader.GetIndirectObjectByNumber(objNum)
	if err != nil {
		return err
	}

	fmt.Printf("Object %d: %s\n", objNum, obj.String())

	// Stream objects are decoded through their filters; other indirect
	// objects are printed as-is.
	if stream, is := obj.(*core.PdfObjectStream); is {
		decoded, err := core.DecodeStream(stream)
		if err != nil {
			return err
		}
		fmt.Printf("Decoded:\n%s", decoded)
	} else if indObj, is := obj.(*core.PdfIndirectObject); is {
		fmt.Printf("%T\n", indObj.PdfObject)
		fmt.Printf("%s\n", indObj.PdfObject.String())
	}

	return nil
}

A full example: pdf_get_object.go

Disclosure: I am the original developer of UniPDF.

huangapple
  • Posted on 2017-02-21 06:37:52
  • Please keep this link when reposting: https://go.coder-hub.com/42355485.html