PyPDF2 unable to compress PDF

Question

I want to show an embedded PDF in a Streamlit app, which only displays PDFs smaller than 2 MB.

So I am trying to compress the PDF file that a user uploads via st.file_uploader in the Streamlit app, using the PyPDF2 package. Here's the code I used:

from PyPDF2 import PdfReader, PdfWriter
from io import BytesIO

def compress_pdf(pdf_file, target_size):
    # Load PDF using PyPDF2
    pdf_reader = PdfReader(pdf_file)

    # Compress PDF using PyPDF2
    output_pdf = BytesIO()
    pdf_writer = PdfWriter()
    for page in pdf_reader.pages:
        page.compress_content_streams()  # This is CPU intensive!
        pdf_writer.add_page(page)

    # Get the compressed PDF bytes
    pdf_writer.write(output_pdf)
    compressed_pdf_bytes = output_pdf.getvalue()
    print(len(compressed_pdf_bytes)) # Check output of compressed pdf

    return compressed_pdf_bytes

The above function takes in the file uploaded by the user on streamlit as below:

import streamlit as st

uploaded_file = st.sidebar.file_uploader("Upload a file", key="uploaded_file")
print(st.session_state.uploaded_file)  # Shows the UploadedFile, including its size

if uploaded_file is not None:
    compressed_pdf_bytes = compress_pdf(uploaded_file, 2000000)

Even after doing all that, I see the file isn't compressed AT ALL.

This is the output in the terminal. As you can see, the size of the uploaded file and the length of compressed_pdf_bytes are almost the same.

#Output of actual file uploaded.
UploadedFile(id=6, name='abc.pdf', type='application/pdf', size=4588407)

#output of compressed file in bytes
4472714

Answer 1

Score: 1

If you want to try PyMuPDF, install it with python -m pip install pymupdf.
Then do this:

import fitz  # pymupdf

def compress_pdf(pdf_file, target_size):
    # Load PDF using pymupdf
    doc = fitz.open(pdf_file)
    compressed_pdf_bytes = doc.tobytes(
        deflate=True,
        garbage=4,
    )
    print(len(compressed_pdf_bytes)) # Check output of compressed pdf
    return compressed_pdf_bytes

The above solution is lossless!
You can also try to build subset fonts - if your PDFs contain fonts that are not subsetted, the gain may be significant.
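
Note that in the question the input is a Streamlit upload rather than a path on disk. Here is a minimal sketch of the same lossless approach for in-memory data, assuming the raw bytes of the upload are passed in. The names within_limit and compress_pdf_bytes and the subset flag are illustrative and not part of PyMuPDF, and subset_fonts() additionally needs the fontTools package:

```python
def within_limit(data: bytes, target_size: int) -> bool:
    """Return True if the payload is strictly smaller than target_size bytes."""
    return len(data) < target_size


def compress_pdf_bytes(data: bytes, subset: bool = False) -> bytes:
    """Losslessly re-save an in-memory PDF with PyMuPDF (hypothetical helper)."""
    import fitz  # pymupdf; imported here so within_limit works without it

    # Open the document from memory instead of from a file name.
    doc = fitz.open(stream=data, filetype="pdf")
    try:
        if subset:
            doc.subset_fonts()  # assumption: fontTools is installed
        # Same options as in the answer: drop unused objects, deflate streams.
        return doc.tobytes(garbage=4, deflate=True)
    finally:
        doc.close()
```

In the Streamlit app this could be called as compress_pdf_bytes(uploaded_file.getvalue()), and within_limit(result, 2_000_000) then tells you whether the lossless pass was enough.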

If your 2 MB limit is a hard one and you are ready to sacrifice content, PyMuPDF can also remove images.
With some more programming effort, you can also replace images with their grayscale equivalents.

Here is a snippet to remove all images from a PDF:

doc = fitz.open(pdf_file)
for page in doc:
    for xref in [item[0] for item in page.get_images()]:
        page.delete_image(xref)
doc.save("no-images.pdf", garbage=4, deflate=True)

Answer 2

Score: 1

You can try cpdf -squeeze in.pdf -o out.pdf, which compresses losslessly.

Generally speaking, if your file has been created competently, there should be little space to squeeze, especially if it is mostly images and fonts.
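
If you would rather call cpdf from Python, a minimal sketch using subprocess follows. It assumes the cpdf executable is installed and on PATH; the helper names squeeze_command and squeeze are illustrative only.

```python
import shutil
import subprocess


def squeeze_command(src: str, dst: str) -> list[str]:
    """Build the argument list for: cpdf -squeeze src -o dst."""
    return ["cpdf", "-squeeze", src, "-o", dst]


def squeeze(src: str, dst: str) -> None:
    """Run cpdf -squeeze; raises if cpdf is missing or exits with an error."""
    if shutil.which("cpdf") is None:
        raise FileNotFoundError("cpdf executable not found on PATH")
    subprocess.run(squeeze_command(src, dst), check=True)
```

Since cpdf works on files, an uploaded PDF would first have to be written to a temporary file before calling squeeze.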

Answer 3

Score: 0

It is not always possible to compress the file. At some point, it has its minimal form.

The pypdf docs about file size give some important hints. They especially point out that removing images can reduce the file size.
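
To see why a second compression pass gains so little, here is a small sketch using only Python's zlib module, which implements the same Deflate method commonly used for PDF streams. The sample data is made up for illustration: the text shrinks noticeably the first time, but compressing the already-compressed bytes again does not make them meaningfully smaller.

```python
import random
import zlib

# Made-up sample text that loosely resembles PDF drawing operators.
rng = random.Random(0)
words = [b"BT", b"ET", b"Tf", b"Td", b"Tj", b"re", b"f", b"m", b"l", b"S"]
raw = b" ".join(
    rng.choice(words) + b" " + str(rng.randint(0, 9999)).encode()
    for _ in range(20000)
)

once = zlib.compress(raw, 9)    # first pass: clear gain
twice = zlib.compress(once, 9)  # second pass: input is already compressed

print(len(raw), len(once), len(twice))
```

A PDF that is mostly images and embedded fonts is in the position of the second pass: most of its bytes are compressed already, so only lossy steps such as removing or downsizing images can shrink it much further.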

Posted by huangapple on May 21, 2023, 08:22:04
Source link: https://go.coder-hub.com/76297824.html