2020年1月6日 22:40:47go评论100阅读模式

英文:

PyPDF4 - Exported PDF file size too big

问题

我有一个大约有7000页、479 MB的PDF文件。
我创建了一个Python脚本，使用PyPDF4来提取只包含特定单词的特定页面。
脚本可以运行，但新的PDF文件，尽管只包含原始7000页中的650页，现在比原始文件更大（确切地说是498 MB）。

是否有降低新PDF文件大小的方法？

我使用的脚本如下：

from PyPDF4 import PdfFileWriter, PdfFileReader
import os
import re

output = PdfFileWriter()

input = PdfFileReader(open('Binder.pdf', 'rb')) # 打开输入文件

for i in range(0, input.getNumPages()):
    content = ""
    content += input.getPage(i).extractText() + "\n"

    # 格式1
    RS = re.search('FIGURE', content)
    RS1 = #... 仅提供一个示例搜索。我有更多，但与问题无关。
    #....

    # 格式2
    RS20 = re.search('FIG.', content)
    RS21 = #... 仅提供一个示例搜索。我有更多，但与问题无关。
    #....

    if (all(v is not None for v in [RS, RS1, RS2, RS3, RS4, RS5, RS6, RS7, RS8, RS9]) or all(v is not None for v in [RS20, RS21, RS22, RS23, RS24, RS25, RS26, RS27, RS28, RS29, RS30, RS30])):
        p = input.getPage(i)
        output.addPage(p)

# 保存页面到新的PDF文件
with open('ExtractedPages.pdf', 'wb') as f:
    output.write(f)

希望这对你有帮助。如果需要进一步减小文件大小，可以考虑使用其他工具来优化PDF文件。

英文:

I have a PDF file of around 7000 pages and 479 MB.
I have create a python script using PyPDF4 to extract only specific pages if the pages contain specific words.
The script works but the new PDF file, even though it has only 650 pages from the original 7000, now has more MB that the original file (498 MB to be exactly).

Is there any way to lower the filesize of the new PDF?

The script I used:

from PyPDF4 import PdfFileWriter, PdfFileReader
import os
import re


output = PdfFileWriter()

input = PdfFileReader(open(&#39;Binder.pdf&#39;, &#39;rb&#39;)) # open input

for i in range(0, input.getNumPages()):
    content = &quot;&quot;
    content += input.getPage(i).extractText() + &quot;\n&quot;
    
    
    #Format 1
    RS = re.search(&#39;FIGURE&#39;, content)
    RS1 = #... Only one search given as example. I have more, but are irrelevant for the question.
    #....
    
    # Format 2
    RS20 = re.search(&#39;FIG.&#39;, content)
    RS21 = #... Only one search given as example. I have more, but are irrelevant for the question.
    #....
    
    if (all(v is not None for v in [RS, RS1, RS2, RS3, RS4, RS5, RS6, RS7, RS8, RS9]) or all(v is not None for v in [RS20, RS21, RS22, RS23, RS24, RS25, RS26, RS27, RS28, RS29, RS30, RS30])):
        p = input.getPage(i)
        output.addPage(p)
    
#Save pages to new PDF file
with open(&#39;ExtractedPages.pdf&#39;, &#39;wb&#39;) as f:
    output.write(f)

答案1

得分: 10

以下是已翻译好的部分：

在经过大量搜索后找到了一些解决方案。
导出的PDF文件唯一的问题是它是未压缩的uncompressed。所以我需要一个压缩PDF文件的解决方案：

PyPDF2和/或PyPDF4没有压缩PDF的选项。
PyPDF2有compressContentStreams()方法，但不起作用。
找到了一些其他声称可以压缩PDF的解决方案，但对我来说都不起作用（将它们添加在这里以防它们对其他人有效）：
pylovepdf；pdfsizeopt；pdfc。
对我有效的第一个解决方案是Adobe Acrobat专业版。它将大小从498 MB减小到2.99 MB。
[最佳解决方案] 作为一个可行的开源解决方案，我找到了coherentpdf。
对于Windows，您可以下载预构建的PDF挤压工具。
然后在cmd中：

cpdfsqueeze.exe input.pdf output.pdf

实际上，这比Adobe Acrobat还要压缩PDF。从498 MB减小到2.48 MB。从原始大小压缩到0.5%。我认为这是最佳解决方案，因为它可以添加到您的Python代码中。

**编辑：**找到了另一个具有GUI的免费解决方案。PDFsam。您可以在一个PDF文件上使用合并功能，并在高级设置中确保选中了“Compress Output”。这将从498减小到3.2 MB。

英文:

After a lot of searching found some solutions.
The only problem with the exported PDF file was that it was uncompressed. So I needed a solution to compress a PDF file:

PyPDF2 and/or PyPDF4 do not have an option to compress PDFs.
PyPDF2 had the compressContentStreams() method, which doesn't work.
Found a few other solutions that claim to compress PDFs, but none worked for me (adding them here just in case they work for others):
pylovepdf ; pdfsizeopt ; pdfc
The first solution that worked for me was Adobe Acrobat professional. It reduced the size from 498 MB to 2.99 MB.
[Best Solution] As an alternative, open source solution that works, I found coherentpdf.
For Windows you can download the pre-built PDF squeezer tool.
Then in cmd:

cpdfsqueeze.exe input.pdf output.pdf

This actually compressed the PDF even more than Adobe Acrobat. From 498 MB to 2.48 MB. Compressed to 0.5% from original. I think this is the best solution as it can be added to your Python Code.

Edit: Found another Free solution that has a GUI also. PDFsam. You can use the Merge feature on one PDF file, and in the advanced Settings make sure you have the Compress Output checked. This compressed from 498 to 3.2 MB.

答案2

得分: 4

在Linux中，您可以使用ps2pdf工具来压缩生成的PDF文件，该工具是ghostscript套件的一部分。
安装ghostscript：

$ sudo apt-get install ghostscript

运行以下命令以减小大型PDF文件的大小：

$ ps2pdf large.pdf compressed.pdf

当我尝试这样做时，我没有发现任何质量损失。

英文:

In Linux, you can compress the resulting pdf file using ps2pdf tool, which is a part of ghostscript suite.
Install ghostscript:

$ sudo apt-get install ghostscript

Run the following command to reduce the size of a large pdf file

$ ps2pdf large.pdf compressed.pdf

When I tried this, I did not find any loss in quality.

答案3

得分: 2

如果您可以接受在PDF中失去任何链接，尝试在保存文件之前调用PdfFileWriter.removeLinks()函数。我遇到了同样的问题，但在保存之前调用此函数，将文件大小从44.7MB减小到仅1.09MB。

英文:

If you're okay with losing any links in the PDF, try calling the PdfFileWriter.removeLinks() function before you save the file. I was having the same issue, but calling this function before I saved brought my file size down from 44.7MB to just 1.09MB.

通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库，让每个人都能够通过互相帮助和分享经验来进步。

PyPDF4 – 导出的PDF文件大小过大

问题

答案1

答案2

答案3

将格式 ‘year-month-dayT11:04:11.373Z’ 在 Python 中转换为只有 y/m/d。

从字典中检索键和值，未找到值

Pandas主要版本中DataFrame操作的更改？

在Python中使用Selenium点击按钮时出错。

What's the correct way to type hint an empty list as a literal in python?

如何在Highcharts Gantt中更改本地化的星期名称

如何在同一个流中使用多个过滤器和映射函数？

如何使用Map/Set来将代码优化到O(n)？

.NET MAUI Android在GitHub Actions上构建失败，错误代码为1。

如何在Playwright视觉比较中屏蔽多个定位器？

在C++中，可以使用可变模板参数来检索类型的内部类型。

selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: stale element not found

Creating and opening a URL to log in to Website via Basic Auth with Robot Framework/Selenium (Python)

AG Grid 在上下文菜单中以大文本形式打开

发表评论