Highlight python-docx with regex and spacy.

Question

I want to highlight text matching a regex pattern in all .docx files in a folder using python-docx. I can do this with a plain regex (the commented-out line in the code below).

The problem arises when I try to do the same with spaCy's nlp.

    from docx import Document
    from docx.enum.text import WD_COLOR_INDEX
    import pandas as pd
    import os
    import re
    import spacy

    nlp = spacy.load("en_core_web_sm")
    path = r"/home/coder/Documents/"
    doc1 = Document('test.docx')
    doc = nlp(doc1)
    # re_highlight = re.compile(r"[1-9][0-9]*|0") # This one works.
    re_highlight = [token for token in doc if token.like_num == "TRUE"]
    for filename in os.listdir(path):
        if filename.endswith(".docx"):
            file = "/home/writer/Documents/" + filename
            print(file)
            for para in doc.paragraphs:
                text = para.text
                if len(re_highlight.findall(text)) > 0:
                    matches = re_highlight.finditer(text)
                    para.text = ''
                    p3 = 0
                    for match in matches:
                        p1 = p3
                        p2, p3 = match.span()
                        para.add_run(text[p1:p2])
                        run = para.add_run(text[p2:p3])
                        run.font.highlight_color = WD_COLOR_INDEX.YELLOW
                    para.add_run(text[p3:])
            doc.save(file)

Error:

raise ValueError(Errors.E1041.format(type=type(doc_like)))
ValueError: [E1041] Expected a string, Doc, or bytes as input, but got: <class 'docx.document.Document'>

I realize that doc, being an nlp object, would not have doc.paragraphs. How can I solve this problem?

Kindly help.

Answer 1

Score: 1

You can't call nlp(doc1) with doc1 being a Document object; you have to extract the text parts and work with them. I'd suggest something like the following instead (it worked here for a sample file):

    import re
    from pathlib import Path

    import spacy
    from docx import Document
    from docx.enum.text import WD_COLOR_INDEX

    nlp = spacy.load("en_core_web_sm")


    def highlight(text):
        tokens = (token.text for token in nlp(text) if token.like_num)
        return re.compile("|".join(sorted(tokens, key=len, reverse=True)))


    path_in = Path("/home/coder/Documents/")  # Input folder
    path_out = Path("/home/writer/Documents/")  # Output folder
    for file in path_in.glob("*.docx"):
        print(f"Processing file '{file}' ... ", end="")
        doc = Document(file)
        for para in doc.paragraphs:
            text = para.text
            para.text = ""
            p3 = 0
            for match in highlight(text).finditer(text):
                p1 = p3
                p2, p3 = match.span()
                para.add_run(text[p1:p2])
                run = para.add_run(text[p2:p3])
                run.font.highlight_color = WD_COLOR_INDEX.YELLOW
            para.add_run(text[p3:])
        doc.save(path_out / file.name)
        print("done.")
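The slicing loop above could also be driven directly by token offsets rather than by a regex rebuilt from the token texts. The following is only a minimal sketch under assumptions, not the answerer's method: `numeric_spans` and `split_segments` are hypothetical helper names, and the only spaCy attributes relied on are `token.idx` (the character offset of the token in the text), `token.text`, and `token.like_num`. The helpers are duck-typed, so they are demonstrated with a stand-in token type instead of a loaded model.

```python
from collections import namedtuple

# Stand-in for a spaCy token (assumption for this demo); a real Token
# exposes the same three attributes.
FakeToken = namedtuple("FakeToken", ["text", "idx", "like_num"])


def numeric_spans(tokens):
    """Return (start, end) character offsets of number-like tokens."""
    return [(t.idx, t.idx + len(t.text)) for t in tokens if t.like_num]


def split_segments(text, spans):
    """Split text into (segment, highlighted) pairs covering the whole string.

    spans must be sorted and non-overlapping, as token offsets are.
    """
    segments = []
    pos = 0
    for start, end in spans:
        if start > pos:
            segments.append((text[pos:start], False))
        segments.append((text[start:end], True))
        pos = end
    if pos < len(text):
        segments.append((text[pos:], False))
    return segments


text = "I often buy ten apples for 10 dollars"
tokens = [
    FakeToken("ten", 12, True),
    FakeToken("10", 27, True),
]
segments = split_segments(text, numeric_spans(tokens))
```

With spaCy, `tokens` would be `nlp(text)`; with python-docx, each pair would become one `para.add_run(segment)`, with `run.font.highlight_color = WD_COLOR_INDEX.YELLOW` set when the flag is true. Because the offsets come from the tokenizer, "ten" inside "often" is never touched.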

There's a chance of accidental highlighting: the pattern is built from the token texts, so a number word such as "ten" would also match inside "often". If that happens, you could try to use

    def highlight(text):
        tokens = (token.text for token in nlp(text) if token.like_num)
        pat = r"\b(?:" + "|".join(sorted(tokens, key=len, reverse=True)) + r")\b"
        return re.compile(pat)

instead.
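Two further edge cases may be worth guarding against when the pattern is assembled from token texts: a token such as `1.5` contains a regex metacharacter, and a paragraph without number-like tokens yields an empty alternation that matches at every position. A minimal sketch, assuming a hypothetical helper `build_pattern` that receives the token texts as plain strings (it is not part of the answer above):

```python
import re


def build_pattern(token_texts):
    """Compile an alternation of literal tokens, or return None if there are none."""
    unique = sorted(set(token_texts), key=len, reverse=True)
    if not unique:
        return None  # the caller should leave the paragraph untouched
    # re.escape makes the "." in "1.5" match a literal dot only.
    alternation = "|".join(re.escape(t) for t in unique)
    # Lookarounds rather than \b, so a token need not start or end
    # with a word character for the boundary check to hold.
    return re.compile(r"(?<!\w)(?:" + alternation + r")(?!\w)")


pattern = build_pattern(["ten", "1.5", "10"])
found = pattern.findall("often ten 1x5 1.5 10 100")
```

In the paragraph loop, a `None` result would mean skipping the paragraph before `para.text` is cleared, which also leaves its existing runs and their formatting intact.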

huangapple
  • Published on April 11, 2023, 08:30:37
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75981636.html