Highlight text in a PDF file without using search_for()
Question
I would like to highlight text in my PDF file using the PyMuPDF library.
The search_for() method returns the locations of the searched words.
The problem is that this method ignores spaces and upper/lower case, and it works only for ASCII characters.
Is there a way to get the location/coordinates without using search_for()?
My code:
    import re
    import fitz  # PyMuPDF

    # `text` holds the document text extracted earlier (not shown)
    pattern = re.compile(r'(\[V2G2-\d{3}\])(\s{1,}\w(.+?)\. )')
    matched = []
    for m in re.finditer(pattern, text):
        matched.append(m.group())

    def doHighlight():
        pdf_document = fitz.open("ISO_15.pdf")
        for page in pdf_document:
            for item in matched:
                # search_for() is the Page method; quads=True returns quads
                search_instances = page.search_for(item, quads=True)
                for q in search_instances:
                    highlight = page.add_highlight_annot(q)
                    # RGB(127, 255, 255)
                    highlight.set_colors({"stroke": (0.5, 1, 1), "fill": (0.75, 0.8, 0.95)})
                    highlight.update()
        pdf_document.save("output.pdf")
It ignores the second sentence because of the spaces between the words.
Answer 1
Score: 1
Using the search method is just one way to get the coordinates required for highlighting. You can also use any of the page.get_text() variants that return text coordinates. Looking at your example, the "blocks" variant may be sufficient, or a combination of "words" and "blocks" extractions.
page.get_text("blocks") returns a list of items like (x0, y0, x1, y1, "line1\nline2\n...", blocknumber, blocktype). The first 4 items in the tuple are the coordinates of the enveloping rectangle.
page.get_text("words") returns a list of words (strings containing no spaces) with similar items: (x0, y0, x1, y1, "wordstring", blocknumber, linenumber, wordnumber).
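To illustrate the tuple layout described above, here is a minimal sketch that groups "words" items by block number and joins them with single spaces, so that a regular expression no longer depends on how the PDF spaces its words. The sample tuples are hand-made stand-ins for real get_text("words") output, block_texts is a hypothetical helper, and the pattern is a simplified variant of the one in the question.

```python
import re
from collections import defaultdict

# Hand-made sample in the "words" layout:
# (x0, y0, x1, y1, "wordstring", blocknumber, linenumber, wordnumber)
sample_words = [
    (72.0, 100.0, 130.0, 112.0, "[V2G2-001]", 0, 0, 0),
    (140.0, 100.0, 160.0, 112.0, "The", 0, 0, 1),
    (165.0, 100.0, 200.0, 112.0, "SECC", 0, 0, 2),
    (72.0, 114.0, 100.0, 126.0, "shall", 0, 1, 0),
    (105.0, 114.0, 150.0, 126.0, "respond.", 0, 1, 1),
    (72.0, 200.0, 120.0, 212.0, "Unrelated", 1, 0, 0),
    (125.0, 200.0, 160.0, 212.0, "text.", 1, 0, 1),
]


def block_texts(words):
    """Join the words of each block with single spaces, keyed by block number."""
    grouped = defaultdict(list)
    # Sort by (block, line, word) so the words come out in reading order.
    for w in sorted(words, key=lambda w: (w[5], w[6], w[7])):
        grouped[w[5]].append(w[4])
    return {blockn: " ".join(strings) for blockn, strings in grouped.items()}


# Simplified variant of the question's pattern (an assumption for this sketch).
pattern = re.compile(r"\[V2G2-\d{3}\]\s+\w.+?\.")

texts = block_texts(sample_words)
matching_blocks = [n for n, t in texts.items() if pattern.search(t)]
print(matching_blocks)
```

The block numbers found this way could then be used to look up the enveloping rectangles in the "blocks" extraction.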
You could inspect the "words" for items matching the regex pattern and then highlight the respective block. This can probably even be done without regular expressions. Here is a snippet that may serve your intention:
    import fitz  # PyMuPDF

    def matches(word):
        """Return True if the word looks like a tag such as [V2G2-001]."""
        return word.startswith("[V2G2-") and word.endswith(("]", "]."))

    def add_highlight(page, rect):
        """Highlight annots have no fill color"""
        annot = page.add_highlight_annot(rect)
        annot.set_colors(stroke=(0.5, 1, 1))
        annot.update()

    flags = fitz.TEXTFLAGS_TEXT  # need identical flags for all extractions
    doc = fitz.open("ISO_15.pdf")  # file name taken from the question
    for page in doc:
        blocks = page.get_text("blocks", flags=flags)
        words = page.get_text("words", flags=flags)
        for word in words:
            blockn = word[-3]  # block number
            if matches(word[4]):
                block = blocks[blockn]  # get the containing block
                block_rect = fitz.Rect(block[:4])
                add_highlight(page, block_rect)
    doc.save("output.pdf")
So the approach used here is: check if a block contains a matching word. If so, highlight it.
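If highlighting the whole block is too coarse, a possible refinement is to start at the matching word and stop at the first following word that ends with a period, building one rectangle per text line. The sketch below works on plain tuples in the "words" layout, so it runs without a PDF. The sample data is hand-made, and sentence_line_rects, is_tag and ends_sentence are hypothetical helpers, not part of PyMuPDF.

```python
def is_tag(word):
    """Same test as matches() in the snippet above."""
    return word.startswith("[V2G2-") and word.endswith(("]", "]."))


def ends_sentence(word):
    """Assumption: a sentence ends at the first word with a trailing period."""
    return word.endswith(".")


def sentence_line_rects(words, is_start=is_tag, is_end=ends_sentence):
    """Return one (x0, y0, x1, y1) rectangle per text line for each run of
    words from a start word up to the first later word that ends a sentence."""
    rects = {}  # (run index, line number) -> [x0, y0, x1, y1]
    run, run_block, active = -1, None, False
    ordered = sorted(words, key=lambda w: (w[5], w[6], w[7]))
    for x0, y0, x1, y1, text, blockn, linen, _ in ordered:
        if active and blockn != run_block:
            active = False  # never let a run leak into the next block
        starting = not active and is_start(text)
        if starting:
            active, run, run_block = True, run + 1, blockn
        if not active:
            continue
        key = (run, linen)
        if key in rects:
            r = rects[key]  # grow the rectangle of this line
            r[0], r[1] = min(r[0], x0), min(r[1], y0)
            r[2], r[3] = max(r[2], x1), max(r[3], y1)
        else:
            rects[key] = [x0, y0, x1, y1]
        # The start word itself may end in "]." and must not close the run.
        if not starting and is_end(text):
            active = False
    return [tuple(r) for r in rects.values()]


# Hand-made sample: (x0, y0, x1, y1, word, block number, line number, word number)
sample_words = [
    (72.0, 100.0, 130.0, 112.0, "[V2G2-001]", 0, 0, 0),
    (140.0, 100.0, 160.0, 112.0, "The", 0, 0, 1),
    (165.0, 100.0, 200.0, 112.0, "SECC", 0, 0, 2),
    (72.0, 114.0, 100.0, 126.0, "shall", 0, 1, 0),
    (105.0, 114.0, 150.0, 126.0, "respond.", 0, 1, 1),
    (155.0, 114.0, 190.0, 126.0, "Further", 0, 1, 2),
    (195.0, 114.0, 230.0, 126.0, "notes.", 0, 1, 3),
    (72.0, 200.0, 120.0, 212.0, "Unrelated", 1, 0, 0),
]

line_rects = sentence_line_rects(sample_words)
print(line_rects)
```

With real data, each returned tuple could be wrapped in fitz.Rect and passed to the add_highlight helper shown earlier, in place of the block rectangle.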