Python – How to Pull URLs From EML Files with BeautifulSoup

# Question

I'm trying to read an EML file and then pull all URLs within it.

I've got two methods: `body_to_text()`, which gets the body from the EML with either BytesParser or Soup; and `find_links()`, which takes the body and uses a regex to find the URLs.

I've got it working for most samples I've tried; however, when using Soup to parse the non-multipart files, I run into a problem when the sample contains end-of-line equals signs.

```python
def body_to_text(self):
    with open(self.email, "rb") as email_file:
        email_message = email.message_from_binary_file(email_file)

    if email_message.is_multipart():
        with open(self.email, "rb") as fp:
            msg = BytesParser(policy=policy.default).parse(fp)
        try:
            body_text = msg.get_body(preferencelist=("plain",)).get_content().strip()
        except AttributeError:
            print("No body found")
        else:
            body_text = body_text.replace("\n", "")
            if body_text == "":
                print("No body found")
            else:
                self.find_links(body_text)
    else:
        body_html = email_message.get_payload()
        soup = BeautifulSoup(body_html, "lxml")
        self.find_links(soup)

def find_links(self, scan_text):
    WEB_URL_REGEX = r"""(?i)\b((?:https?:(?:/{1,3}|[a-z0-9%])|[a-z0-9.\-]+[.](?:com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|name|post|pro|tel|travel|xxx|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|Ja|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw)/)(?:[^\s()<>{}\[\]]+|\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\))+(?:\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’])|(?:(?<!@)[a-z0-9]+(?:[.\-][a-z0-9]+)*[.](?:com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|name|post|pro|tel|travel|xxx|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|Ja|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw)\b/?(?!@)))"""
    links = re.findall(WEB_URL_REGEX, str(scan_text))
    links = list(dict.fromkeys(links))
    print(f"{len(links)} links found")
    print(links)
```

`print(body_html)` gives:

> ```
> <a href=3D"http://fawper.xyz/corruptly/676197486/trout/gen=
> eralizing/1683814388/upgather/disjoin" style=3D"-webkit-text-size-adjust:no=
> ne;text-decoration:none;"> <font style=3D"-webkit-text-size-adjust:none;fon=
> t-size:15px;
> ```

And `print(soup)` gives:

> ```
> href='3D"http://fawper.xyz/corruptly/676197486/trout/gen=' ne="" style='3D"-webkit-text-size-adjust:no='> <font style='3D"-webkit-text-size-adjust:none;fon=' t-size:15px=""
> ```

So then `find_links` outputs:

> ```
> 'http://fawper.xyz/corruptly/676197486/trout/gen='
> ```

When I want it to output:

> ```
> 'http://fawper.xyz/corruptly/676197486/trout/generalizing/1683814388/upgather/disjoin'
> ```

I've tried using html.parser and html5lib in place of lxml, but that didn't solve it. Could it be the encoding of the specific email that I'm parsing?
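For context, the trailing `=` signs in the sample are quoted-printable soft line breaks (and `=3D` is a quoted-printable escape for a literal `=`). A minimal standalone check, using only the stdlib `quopri` module and a shortened version of the snippet above, shows that decoding rejoins the split URL:

```python
import quopri

# A shortened copy of the quoted-printable HTML from the question:
# "=\n" is a soft line break, "=3D" encodes a literal "=".
raw = (b'<a href=3D"http://fawper.xyz/corruptly/676197486/trout/gen=\n'
       b'eralizing/1683814388/upgather/disjoin" style=3D"text-decoration:none;">')

# quopri.decodestring() removes soft line breaks and expands "=XX" escapes.
decoded = quopri.decodestring(raw).decode("utf-8")
print(decoded)
```

After decoding, the anchor contains the full URL on one line, which is why the regex only ever needs to see decoded text.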
# Answer 1

**Score**: 1

Swapping the soup block for a part of [lastchancexi's answer](https://stackoverflow.com/a/71416428/21936606), which uses the email module to get the payload based on the content type, gave me the desired output.
```python
def body_to_text(self):
    text = ""
    html = ""
    with open(self.email, "rb") as email_file:
        email_message = email.message_from_binary_file(email_file)
    if not email_message.is_multipart():
        content_type = email_message.get_content_type()
        if content_type == "text/plain":
            text += str(email_message.get_payload(decode=True))
            self.find_urls(text)
        elif content_type == "text/html":
            html += str(email_message.get_payload(decode=True))
            self.find_urls(html)
    else:
        with open(self.email, "rb") as fp:
            msg = BytesParser(policy=policy.default).parse(fp)
        try:
            body_text = msg.get_body(preferencelist=("plain",)).get_content().strip()
        except AttributeError:
            print("No body found")
        else:
            body_text = body_text.replace("\n", "")
            if body_text == "":
                print("No body found")
            else:
                self.find_urls(body_text)
```
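The key difference is `get_payload(decode=True)`, which undoes the Content-Transfer-Encoding before the text reaches the regex. A small self-contained sketch (the EML bytes below are hand-built for illustration, modelled on the sample in the question) demonstrates the effect:

```python
import email

# A minimal quoted-printable EML: note the "=" soft line break that
# splits the URL across two lines in the raw message.
raw_eml = (
    b"MIME-Version: 1.0\r\n"
    b"Content-Type: text/html; charset=utf-8\r\n"
    b"Content-Transfer-Encoding: quoted-printable\r\n"
    b"\r\n"
    b'<a href=3D"http://fawper.xyz/corruptly/676197486/trout/gen=\r\n'
    b'eralizing/1683814388/upgather/disjoin">link</a>\r\n'
)

email_message = email.message_from_bytes(raw_eml)

# decode=True applies the quoted-printable decoding declared in the
# Content-Transfer-Encoding header, so the URL comes back in one piece.
html = email_message.get_payload(decode=True).decode("utf-8")
print(html)
```

Without `decode=True`, `get_payload()` returns the still-encoded text, which is exactly what BeautifulSoup was choking on in the question.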
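Once the payload is decoded, the URL-matching side can be much simpler than the long TLD-based pattern in the question. A minimal sketch (a hypothetical helper, not the asker's code) pairing `re.findall()` with the same `dict.fromkeys()` de-duplication idiom used in `find_links()`:

```python
import re

def find_urls(scan_text):
    # Simplified stand-in for the long TLD-based pattern: match http(s)
    # URLs up to the next whitespace, quote, or angle bracket.
    url_regex = r"""https?://[^\s"'<>]+"""
    links = re.findall(url_regex, scan_text)
    # dict.fromkeys() removes duplicates while preserving order.
    links = list(dict.fromkeys(links))
    print(f"{len(links)} links found")
    return links

urls = find_urls(
    '<a href="http://fawper.xyz/a/b">x</a> <a href="http://fawper.xyz/a/b">y</a>'
)
```

This misses bare domains without a scheme, which the original regex also tries to catch, so it is a starting point rather than a drop-in replacement.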

huangapple
  • Published on 2023-05-22 05:40:29
  • Please keep this link when reposting: https://go.coder-hub.com/76302036.html