module 'PyPDF2' has no attribute 'ContentStream' error
Question
I am trying to run the following code to replace text inside a PDF file:

    import os
    import re
    import PyPDF2
    from io import StringIO

    # Define a function to replace text in a PDF file
    def replace_text_in_pdf(input_pdf_path, output_pdf_path, search_text, replace_text):
        # Open the input PDF file in read-binary mode
        with open(input_pdf_path, 'rb') as input_file:
            # Create a PDF reader object
            pdf_reader = PyPDF2.PdfReader(input_file)
            # Create a PDF writer object
            pdf_writer = PyPDF2.PdfWriter()
            # Iterate through each page of the PDF
            for page_num in range(len(pdf_reader.pages)):
                # Get the page object
                page = pdf_reader.pages[page_num]
                # Get the text content of the page
                text = page.extract_text()
                # Replace the search text with the replace text
                new_text = re.sub(search_text, replace_text, text)
                # Create a new page with the replaced text
                new_page = PyPDF2.PageObject.create_blank_page(None, page.mediabox.width, page.mediabox.height)
                new_page.merge_page(page)  # Copy the original page content to the new page
                new_page.add_transformation(PyPDF2.Transformation().translate(0, 0).scale(1, 1))  # Reset the transformation matrix
                # Begin the text object
                new_page._text = PyPDF2.ContentStream(new_page.pdf)
                new_page._text.beginText()
                # Set the font and font size
                new_page._text.setFont("Helvetica", 12)
                # Draw the new text on the page
                x, y = 100, 100  # Replace with the desired position of the new text
                new_page._text.setFontSize(12)
                new_page._text.textLine(x, y, new_text)
                # End the text object
                new_page._text.endText()
                # Add the new page to the PDF writer object
                pdf_writer.addPage(new_page)
            # Save the new PDF file
            with open(output_pdf_path, 'wb') as output_file:
                pdf_writer.write(output_file)

    # Call the function to replace text in a PDF file
    input_pdf_path = r'D:\file1.pdf'  # Replace with your input PDF file path
    output_pdf_path = r'D:\file1_replaced.pdf'  # Replace with your output PDF file path
    search_text = '<FirstName>'  # Replace with the text you want to replace
    replace_text = 'John'  # Replace with the text you want to replace it with
    replace_text_in_pdf(input_pdf_path, output_pdf_path, search_text, replace_text)

However, the line `new_page._text = PyPDF2.ContentStream(new_page.pdf)` gives me the following error: `module 'PyPDF2' has no attribute 'ContentStream'`.
Can someone help me fix it?
Answer 1
Score: 1
The PyPDF2 documentation lists no `ContentStream` attribute on the top-level module, so you can't use `PyPDF2.ContentStream` directly to create a new text object.
> The PyPDF2 module does not have the attribute ContentStream because it is an internal class, not part of the public API, so you cannot import it directly. You need to use PdfFileWriter's _addObject method to add a new ContentStream object, and then use PdfFileWriter's _updateObject method to update the contents of the page.
You can refer to this Stack Overflow answer, which has sample code for adding a watermark to a PDF; you can adapt it to replace text instead.
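
If you are unsure where a name such as `ContentStream` lives in the version you have installed, a small probe like the one below may help. It is only a sketch: `locate_name` is a hypothetical helper, not part of PyPDF2, and the submodule names passed to it are guesses to try rather than documented API.

```python
import importlib


def locate_name(package, name, submodules=()):
    """Return the dotted module paths under `package` that expose `name`."""
    found = []
    for dotted in (package, *(f"{package}.{sub}" for sub in submodules)):
        try:
            module = importlib.import_module(dotted)
        except ImportError:
            continue  # package or submodule is missing in this environment
        if hasattr(module, name):
            found.append(dotted)
    return found


# Hypothetical probe: prints an empty list if PyPDF2 is not installed
# or if none of the guessed modules expose the name.
print(locate_name("PyPDF2", "ContentStream", ["generic", "pdf"]))
```

Whatever the probe reports, keep in mind that a class found in a submodule is not necessarily meant for direct use, which is the point made in the quote above.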
Answer 2
Score: 0
You get an `AttributeError` here for a simple reason: the library you are using is not designed to modify and write PDF files the way you are doing it.
> pypdf is a free and open source pure-python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files. It can also add custom data, viewing options, and passwords to PDF files. pypdf can retrieve text and metadata from PDFs as well.
This is true for pypdf, PyPDF2 and also for PyPDF3.
So, the main focus of this library is not modifying text inside the PDF. It may still be possible by following the examples already mentioned here. You can try them out and see whether they help. I see some obstacles if you do not have real text, though.
It is absolutely unclear how you came to this code snippet. The `ContentStream` object simply does not exist (at least not with a `begin_text()` attribute). Presumably it is a piece of code from another library, or possibly from this fork, which provides `ContentStream` under `pdf`, i.e. `PyPDF4.pdf.ContentStream`. In any case, as far as I can see, none of the PyPDF variants provide it together with `begin_text()`.
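
Since several variants and forks are in play, it can help to check which of them are actually installed before debugging further. The following is a minimal sketch using only the standard library; `installed_versions` is a hypothetical helper and the distribution names in the list are simply the ones mentioned in this thread.

```python
from importlib import metadata


def installed_versions(distributions):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for dist in distributions:
        try:
            versions[dist] = metadata.version(dist)
        except metadata.PackageNotFoundError:
            versions[dist] = None  # not installed in this environment
    return versions


# Report which of the variants mentioned in this thread are present.
for dist, version in installed_versions(["pypdf", "PyPDF2", "PyPDF3", "PyPDF4"]).items():
    print(f"{dist}: {version or 'not installed'}")
```

If more than one of them shows up, code copied from different sources may be mixing their APIs.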
To finally fix your code, you have several possibilities. Try the solution already mentioned in this SO answer, like this:

    for page in pdf_reader.pages:
        data = page.get_contents().get_data()
        # bytes.replace returns a new object, so the result must be assigned back
        data = data.replace(search_text.encode("utf-8"), replace_text.encode("utf-8"))
        # Assumption: set_data() on this stream updates the page; verify with your pypdf version
        page.get_contents().set_data(data)
        pdf_writer.add_page(page)

Or try to achieve your goal with something other than pypdf(2) alone. Here are some other possibilities you can try out:
Just as a side note: PyPDF2 is going back to its roots, i.e. pypdf has been maintained again since version 3.1.0 (see notes). So hopefully there will be no more confusion in the future about the different versions and forks.
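
One detail of the replacement loop in this answer is easy to miss: `bytes.replace` returns a new object and leaves the original untouched, so the result has to be kept. The sketch below shows this on a made-up content-stream fragment; `replace_in_stream` is a hypothetical helper, and it assumes the text is stored as a plain literal, which real PDFs often do not do (text may be compressed, split, or font-encoded).

```python
def replace_in_stream(data: bytes, search_text: str, replace_text: str) -> bytes:
    """Return a copy of `data` with every occurrence of search_text replaced."""
    # bytes objects are immutable: replace() builds and returns a new value.
    return data.replace(search_text.encode("utf-8"), replace_text.encode("utf-8"))


# Made-up fragment for illustration only, not taken from a real file.
sample = b"BT /F1 12 Tf 100 700 Td (Hello <FirstName>) Tj ET"
updated = replace_in_stream(sample, "<FirstName>", "John")

print(updated)  # b'BT /F1 12 Tf 100 700 Td (Hello John) Tj ET'
print(sample)   # the original bytes are unchanged
```

If the placeholder never appears literally in the decoded stream, this approach will silently change nothing, which is one of the obstacles mentioned above.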