February 10, 2023 04:35:49

Highlight text in a PDF file without using search_for()

Question

I would like to highlight text in my PDF file using the PyMuPDF library. The search_for() method returns the locations of the searched words. The problem is that this method ignores spaces and upper/lower case, and it works only for ASCII characters.

Is there any way to get the location/coordinates without using search_for()?

My code:

import re
import fitz  # PyMuPDF

# `text` is assumed to hold the document's extracted text;
# the original post does not show how it is obtained.
pattern = re.compile(r'(\[V2G2-\d{3}\])(\s{1,}\w(.+?)\.  )')
matched = [m.group() for m in pattern.finditer(text)]

def doHighlight():
    pdf_document = fitz.open("ISO_15.pdf")

    for page in pdf_document:
        for item in matched:
            # search_for() is the Page method; quads=True returns Quad objects
            search_instances = page.search_for(item, quads=True)

            for q in search_instances:
                highlight = page.add_highlight_annot(q)
                # RGB(127, 255, 255)
                highlight.set_colors({"stroke": (0.5, 1, 1), "fill": (0.75, 0.8, 0.95)})
                highlight.update()
    pdf_document.save("output.pdf")

It ignores the second sentence because of the spaces between the words.

Answer 1

Score: 1

Using the search method is just one way to obtain the coordinates required for highlighting. You can also use any of the page.get_text() variants that return text coordinates. Looking at your example, the "blocks" variant may be sufficient, or a combination of the "words" and "blocks" extractions.

page.get_text("blocks") returns a list of items like (x0, y0, x1, y1, "line1\nline2\n, ...", blocknumber, blocktype). The first 4 items in the tuple are the coordinates of the enveloping rectangle.
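To make the tuple layout concrete, here is a minimal sketch that wraps such items in a named tuple and keeps only the text blocks. It works on plain tuples, so it runs without PyMuPDF. The names Block, text_blocks and sample_blocks are made up for illustration, and the sample values are invented rather than real extraction output.

```python
from typing import NamedTuple


class Block(NamedTuple):
    """One item of the "blocks" extraction, in the order described above."""
    x0: float
    y0: float
    x1: float
    y1: float
    text: str
    block_no: int
    block_type: int  # 0 = text block, 1 = image block


def text_blocks(raw_blocks):
    """Wrap raw tuples as Block objects and drop non-text blocks."""
    wrapped = (Block(*item) for item in raw_blocks)
    return [b for b in wrapped if b.block_type == 0]


# Invented sample data in the documented shape (not real output).
sample_blocks = [
    (72.0, 90.0, 520.0, 118.0, "[V2G2-001] The first line\nand the second.\n", 0, 0),
    (72.0, 130.0, 300.0, 330.0, "<image placeholder>", 1, 1),
]

blocks_only = text_blocks(sample_blocks)
first_bbox = tuple(blocks_only[0][:4])       # enveloping rectangle
first_lines = blocks_only[0].text.splitlines()
```

The first four fields are what you would pass to fitz.Rect() to build the rectangle for an annotation.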

With page.get_text("words") you can also extract a list of words (strings containing no spaces), with similar items: (x0, y0, x1, y1, "wordstring", blocknumber, linenumber, wordnumber).
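Because every word carries its block and line numbers, the text of a line can be rebuilt with exactly one space between words, regardless of how the spacing looks in the PDF. A small sketch of that idea, again on plain tuples with invented sample values (lines_from_words and sample_words are hypothetical names):

```python
from collections import defaultdict


def lines_from_words(words):
    """Rebuild each text line from word tuples, joined by single spaces."""
    grouped = defaultdict(list)
    for x0, y0, x1, y1, text, block_no, line_no, word_no in words:
        grouped[(block_no, line_no)].append((word_no, text))
    return {
        key: " ".join(text for _, text in sorted(items))
        for key, items in grouped.items()
    }


# Invented sample data in the documented shape (not real output).
sample_words = [
    (72.0, 90.0, 130.0, 102.0, "[V2G2-001]", 0, 0, 0),
    (140.0, 90.0, 160.0, 102.0, "The", 0, 0, 1),
    (170.0, 90.0, 200.0, 102.0, "SECC", 0, 0, 2),
    (72.0, 104.0, 100.0, 116.0, "shall", 0, 1, 0),
    (110.0, 104.0, 145.0, 116.0, "reply.", 0, 1, 1),
]

rebuilt = lines_from_words(sample_words)
```

Text normalized like this can be matched with a regular expression without worrying about multiple spaces or line breaks between words.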

You could inspect the "words" for items matching the regex pattern and then highlight the respective block. This can probably even be done without regular expressions. Here is a snippet that may serve your purpose:

import fitz  # PyMuPDF

def matches(word):
    """Return True for tags such as "[V2G2-123]" or "[V2G2-123]."."""
    return word.startswith("[V2G2-") and word.endswith(("]", "]."))

def add_highlight(page, rect):
    """Highlight annots have no fill color"""
    annot = page.add_highlight_annot(rect)
    annot.set_colors(stroke=(0.5, 1, 1))
    annot.update()

doc = fitz.open("ISO_15.pdf")  # file name taken from the question
flags = fitz.TEXTFLAGS_TEXT  # need identical flags for all extractions
for page in doc:
    blocks = page.get_text("blocks", flags=flags)
    words = page.get_text("words", flags=flags)
    done = set()  # block numbers already highlighted on this page
    for word in words:
        blockn = word[5]  # block number
        if blockn not in done and matches(word[4]):
            block = blocks[blockn]  # get the containing block
            block_rect = fitz.Rect(block[:4])
            add_highlight(page, block_rect)
            done.add(blockn)
doc.save("output.pdf")  # output name taken from the question

So the approach used here is: check if a block contains a matching word. If so, highlight it.
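If highlighting the whole block is too coarse, a variation is to join the words of each block with single spaces, run a regular expression over that string, and map each match back to the words it covers. The following is only a sketch of that idea, not part of the snippet above. It uses plain tuples so it runs without PyMuPDF. PATTERN is a simplified stand-in for the question's regex (it accepts any whitespace after the final period), and match_rects and sample_words are hypothetical names with invented values.

```python
import re

# Simplified stand-in for the question's pattern (an assumption):
# a tag such as "[V2G2-001]" followed by text up to the next sentence-ending period.
PATTERN = re.compile(r"\[V2G2-\d{3}\]\s+\w.+?\.(?=\s|$)")


def match_rects(words, pattern=PATTERN):
    """Return (matched_text, word_rectangles) for every regex match."""
    by_block = {}
    for w in words:
        by_block.setdefault(w[5], []).append(w)  # w[5] is the block number

    results = []
    for block_words in by_block.values():
        # Join words with single spaces and remember each word's character span.
        spans, parts, pos = [], [], 0
        for w in block_words:
            spans.append((pos, pos + len(w[4])))
            parts.append(w[4])
            pos += len(w[4]) + 1
        text = " ".join(parts)

        for m in pattern.finditer(text):
            # Keep the rectangles of all words overlapping the match.
            rects = [
                tuple(w[:4])
                for w, (start, end) in zip(block_words, spans)
                if start < m.end() and end > m.start()
            ]
            results.append((m.group(), rects))
    return results


# Invented sample data in the "words" shape (not real output).
sample_words = [
    (72.0, 90.0, 130.0, 102.0, "[V2G2-001]", 0, 0, 0),
    (140.0, 90.0, 160.0, 102.0, "The", 0, 0, 1),
    (170.0, 90.0, 200.0, 102.0, "SECC", 0, 0, 2),
    (72.0, 104.0, 100.0, 116.0, "shall", 0, 1, 0),
    (110.0, 104.0, 145.0, 116.0, "reply.", 0, 1, 1),
    (155.0, 104.0, 190.0, 116.0, "Other", 0, 1, 2),
    (200.0, 104.0, 230.0, 116.0, "text.", 0, 1, 3),
]

found = match_rects(sample_words)
```

With real data, each rectangle could be wrapped in fitz.Rect() and passed to add_highlight() from the snippet above, so that only the matched sentence is highlighted instead of the entire block.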
