PyPDF2 unable to compress pdf
Question
I want to show an embedded PDF in a Streamlit app, which can only display PDFs smaller than 2 MB.
So I am trying to compress the PDF file that a user uploads via st.file_uploader, using the PyPDF2 package. Here's the code I used:
    from PyPDF2 import PdfReader, PdfWriter
    from io import BytesIO

    def compress_pdf(pdf_file, target_size):
        # Load PDF using PyPDF2
        pdf_reader = PdfReader(pdf_file)

        # Compress PDF using PyPDF2
        output_pdf = BytesIO()
        pdf_writer = PdfWriter()
        for page in pdf_reader.pages:
            page.compress_content_streams()  # This is CPU intensive!
            pdf_writer.add_page(page)

        # Get the compressed PDF bytes
        pdf_writer.write(output_pdf)
        compressed_pdf_bytes = output_pdf.getvalue()
        print(len(compressed_pdf_bytes))  # Check the size of the compressed PDF
        return compressed_pdf_bytes
The function receives the file the user uploads in Streamlit, as below:
    uploaded_file = st.sidebar.file_uploader("Upload a file", key="uploaded_file")
    print(st.session_state.uploaded_file)
    if uploaded_file is not None:
        compressed_pdf_bytes = compress_pdf(uploaded_file, 2000000)
Even after doing all that, the file isn't compressed AT ALL.
As the terminal output below shows, the size of the uploaded file (from print(st.session_state.uploaded_file)) and the length of compressed_pdf_bytes are almost the same.
    # Output for the uploaded file
    UploadedFile(id=6, name='abc.pdf', type='application/pdf', size=4588407)
    # Size of the compressed file in bytes
    4472714
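To put those two numbers in perspective, here is a small sketch of the arithmetic. The helper name size_report is made up for illustration, and the 2,000,000-byte limit is simply the value passed to compress_pdf above.

```python
def size_report(original_size: int, compressed_size: int, limit: int = 2_000_000) -> dict:
    """Summarize how much a compression pass saved and whether the result fits the limit."""
    saved = original_size - compressed_size
    return {
        "saved_bytes": saved,
        "saved_percent": round(100 * saved / original_size, 2),
        "fits_limit": compressed_size < limit,
    }


# Sizes taken from the terminal output in the question
report = size_report(4_588_407, 4_472_714)
print(report)
```

The pass saved roughly 2.5 % of the file, so the result is still more than twice the allowed size.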
Answer 1
Score: 1
If you want to try PyMuPDF, install it first:
    python -m pip install pymupdf
Then do this:
    import fitz  # pymupdf

    def compress_pdf(pdf_file, target_size):
        # Load PDF using pymupdf
        doc = fitz.open(pdf_file)
        compressed_pdf_bytes = doc.tobytes(
            deflate=True,
            garbage=4,
        )
        print(len(compressed_pdf_bytes))  # Check the size of the compressed PDF
        return compressed_pdf_bytes
The above solution is lossless!
You can also try to build subset fonts: if your PDFs contain fonts that are not subsetted, the gain may be significant.
If your 2 MB limit is a hard one and you are ready to sacrifice content, PyMuPDF can also remove images.
With some more programming effort, you can also replace images with their gray-scale equivalents.
Here is a snippet to remove all images from a PDF:
    doc = fitz.open(pdf_file)
    for page in doc:
        for xref in [item[0] for item in page.get_images()]:
            page.delete_image(xref)
    doc.save("no-images.pdf", garbage=4, deflate=True)
Answer 2
Score: 1
You can try
    cpdf -squeeze in.pdf -o out.pdf
which is lossless compression.
Generally speaking, if your file has been created competently, there should be little left to squeeze, especially if it is mostly images and fonts.
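If you would rather call cpdf from Python than from a shell, a thin wrapper around the command above might look like this. It is only a sketch: it assumes the cpdf executable is on PATH, and both function names are made up for illustration.

```python
import subprocess
from pathlib import Path


def cpdf_squeeze_command(src: str, dst: str) -> list[str]:
    """Build the argument list for the cpdf -squeeze call shown above."""
    return ["cpdf", "-squeeze", src, "-o", dst]


def squeeze_with_cpdf(src: str, dst: str) -> int:
    """Run cpdf and return the size of the output file in bytes."""
    # check=True raises CalledProcessError if cpdf exits with a non-zero status
    subprocess.run(cpdf_squeeze_command(src, dst), check=True)
    return Path(dst).stat().st_size
```

Since an upload lives in memory, it would have to be written to a temporary file first.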
Answer 3
Score: 0
It is not always possible to compress a file further. At some point it reaches its minimal form.
The pypdf docs about file size give some important hints. In particular, they point out that removing images can reduce the file size.