May 21, 2023 08:22:04

PyPDF2 unable to compress pdf

Question

I want to show an embedded PDF in a Streamlit app, which can only display PDFs smaller than 2 MB.

So I am trying to compress the PDF file that a user uploads via st.file_uploader in the Streamlit app, using the PyPDF2 package. Here's the code I used:

from PyPDF2 import PdfReader, PdfWriter
from io import BytesIO

def compress_pdf(pdf_file, target_size):
    # Load PDF using PyPDF2
    pdf_reader = PdfReader(pdf_file)

    # Compress PDF using PyPDF2
    output_pdf = BytesIO()
    pdf_writer = PdfWriter()
    for page in pdf_reader.pages:
        page.compress_content_streams()  # This is CPU intensive!
        pdf_writer.add_page(page)

    # Get the compressed PDF bytes
    pdf_writer.write(output_pdf)
    compressed_pdf_bytes = output_pdf.getvalue()
    print(len(compressed_pdf_bytes)) # Check output of compressed pdf

    return compressed_pdf_bytes

The function receives the file uploaded by the user in Streamlit, as shown here:

import streamlit as st

uploaded_file = st.sidebar.file_uploader("Upload a file", key="uploaded_file")
print(st.session_state.uploaded_file)

if uploaded_file is not None:
    compressed_pdf_bytes = compress_pdf(uploaded_file, 2000000)

Even after doing all that, I see the file isn't compressed AT ALL.

This is the output in the terminal. As you can see, the size of the uploaded file and the length of compressed_pdf_bytes are almost the same.

#Output of actual file uploaded.
UploadedFile(id=6, name='abc.pdf', type='application/pdf', size=4588407)

#output of compressed file in bytes
4472714

Answer 1

Score: 1

If you want to try PyMuPDF, install it with python -m pip install pymupdf.
Then do this:

import fitz  # pymupdf

def compress_pdf(pdf_file, target_size):
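    # Load the PDF from memory. Assumption: pdf_file is a file-like object
    # (such as Streamlit's UploadedFile), so its bytes are passed via stream=.
    doc = fitz.open(stream=pdf_file.read(), filetype="pdf")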
    compressed_pdf_bytes = doc.tobytes(
        deflate=True,
        garbage=4,
    )
    print(len(compressed_pdf_bytes)) # Check output of compressed pdf
    return compressed_pdf_bytes

The above solution is lossless!
You can also try subsetting fonts: if your PDFs contain fonts that are not subsetted, the gain may be significant.

If your 2 MB limit is a hard one and you are ready to sacrifice content, PyMuPDF can also remove images.
With some more programming effort, you can also replace images with their grayscale equivalents.

Here is a snippet to remove all images from a PDF:

doc = fitz.open(pdf_file)
for page in doc:
    for xref in [item[0] for item in page.get_images()]:
        page.delete_image(xref)
doc.save("no-images.pdf", garbage=4, deflate=True)
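The compress_pdf function above accepts target_size but never checks it. Here is a minimal sketch of how a hard limit could be enforced: try each reduction step from least to most destructive and stop at the first result that fits. The names first_under_limit and demo_strategies are hypothetical, and the lambdas are only stand-ins for real steps such as the lossless save and the image removal shown above.

```python
from typing import Callable, Iterable, Optional, Tuple

Strategy = Tuple[str, Callable[[bytes], bytes]]


def first_under_limit(
    pdf_bytes: bytes,
    strategies: Iterable[Strategy],
    target_size: int,
) -> Tuple[Optional[str], bytes]:
    """Return (strategy name, output) for the first strategy whose output fits.

    Strategies are tried in the given order, each on the original bytes.
    If the input already fits, it is returned unchanged with the name "original".
    If nothing fits, the smallest result seen is returned with the name None.
    """
    if len(pdf_bytes) <= target_size:
        return "original", pdf_bytes

    smallest = pdf_bytes
    for name, reduce_fn in strategies:
        candidate = reduce_fn(pdf_bytes)
        if len(candidate) <= target_size:
            return name, candidate
        if len(candidate) < len(smallest):
            smallest = candidate
    return None, smallest


# Stand-in strategies for illustration only; real ones would call PyMuPDF.
demo_strategies = [
    ("lossless", lambda data: data[: len(data) * 9 // 10]),
    ("no-images", lambda data: data[: len(data) // 4]),
]
name, result = first_under_limit(b"x" * 4_000_000, demo_strategies, 2_000_000)
print(name, len(result))  # prints: no-images 1000000
```

Ordering the strategies from lossless to lossy means content is only sacrificed when the gentler steps fail to reach the limit. A None name tells the caller that even the most aggressive step was not enough.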

Answer 2

Score: 1

You can try cpdf -squeeze in.pdf -o out.pdf, which compresses losslessly.

Generally speaking, if your file has been created competently, there should be little left to squeeze, especially if it is mostly images and fonts.
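The point about images and fonts can be illustrated with the standard library alone. PDF streams are commonly stored with Flate (zlib) compression, and image data such as JPEG is already compressed, so a second pass gains almost nothing. In this sketch the random bytes are only a stand-in for already-compressed image data, and the repeated text operators are a stand-in for an uncompressed page content stream. Neither comes from a real PDF.

```python
import random
import zlib

# Stand-in for an uncompressed page content stream: repetitive text operators.
content_stream = b"BT /F1 12 Tf 72 720 Td (Hello world) Tj ET\n" * 2000

# Stand-in for already-compressed image data (e.g. JPEG): high-entropy bytes.
rng = random.Random(0)
image_like = bytes(rng.getrandbits(8) for _ in range(100_000))


def ratio(data: bytes) -> float:
    """Compressed size divided by original size (lower means more gain)."""
    return len(zlib.compress(data, 9)) / len(data)


print(f"content stream ratio: {ratio(content_stream):.3f}")
print(f"image-like ratio:     {ratio(image_like):.3f}")
```

If most of a file's bytes behave like the second case, lossless tools have little left to remove, which would be consistent with the roughly 2.5% reduction reported in the question (4588407 to 4472714 bytes).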

Answer 3

Score: 0

It is not always possible to compress a file further. At some point it reaches its minimal form.

The pypdf docs about file size give some important hints. They especially point out that removing images can reduce the file size.
