January 7, 2020 00:33:06

BeautifulSoup parse XML with HTML content

Question

I have an XML file (formally XBRL) in which some of the tags contain escaped HTML. I'd like to parse the document as XML and then extract the HTML from these tags.
However, it appears that the escaped characters are somehow deleted by BeautifulSoup. When I read mytag.text, none of the escaped characters (e.g. &lt;) are present anymore. For instance:

'&lt;' in raw_text # True
'&lt;' in str(BeautifulSoup(raw_text, 'xml')) # False

I tried to create a simple example that reproduces the issue, but I couldn't: the simple example I wanted to provide works without any problem:

raw_text = '<xmltag><t>&lt;p&gt;test&lt;/p&gt;</t></xmltag>'
soup = BeautifulSoup(raw_text, 'xml')
'&lt;' in str(soup) # True
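
For comparison, here is a minimal sketch of the same check using only the standard library's xml.etree.ElementTree instead of BeautifulSoup. The parser is swapped in purely for illustration, and the sample string assumes properly closed tags. It suggests that an XML parser usually decodes `&lt;` when it exposes element text and escapes it again on serialization:

```python
import xml.etree.ElementTree as ET

# Sample document: the <t> element holds escaped HTML (assumed well-formed).
raw_text = "<xmltag><t>&lt;p&gt;test&lt;/p&gt;</t></xmltag>"

root = ET.fromstring(raw_text)

# The parser decodes entity references, so the text is real markup.
inner = root.find("t").text
print(inner)

# Serializing the tree escapes the markup characters again.
serialized = ET.tostring(root, encoding="unicode")
print("&lt;" in serialized)
```

With this parser the decoded text keeps its angle brackets, so text that comes back without them (as in the real file) points to something specific to that file or parser setup rather than to XML entity handling in general.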

So you can find the file that I am parsing here: https://drive.google.com/open?id=1lQz1Tfy8u7TBvatP8-QjlnzUi6rNUR79
The code I am using is:

from bs4 import BeautifulSoup

with open('test.xml', 'r') as fp:
    raw_text = fp.read()
soup = BeautifulSoup(raw_text, 'xml')
mytag = soup.find('QuarterlyFinancialInformationTextBlock')
print(mytag.text[:100])
# prints:            div div style="margin-left:0pt;margin-righ
# original file:     &lt;div&gt; &lt;div style=

Answer 1

Score: 0

Try using another parser for XBRL, e.g. python-xbrl.

Check this link: XBRL parser written in Python.

Answer 2

Score: 0

A solution using SimplifiedDoc:

from simplified_scrapy.simplified_doc import SimplifiedDoc
doc = SimplifiedDoc('<xmltag><t>&lt;p&gt;test&lt;/p&gt;</t></xmltag>')
print(doc.t.html)
print(doc.xmltag.t.html)
print(doc.t.unescape())

Result:

&lt;p&gt;test&lt;/p&gt;
&lt;p&gt;test&lt;/p&gt;
<p>test</p>
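
If adding a dependency is not an option, the unescape step can probably be sketched with the standard library alone. This is an illustrative alternative, not part of the answer above: `html.unescape` turns the escaped string back into markup, and `TextCollector` is a hypothetical helper class (not from any library) that gathers the text nodes of the resulting HTML.

```python
from html import unescape
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical helper: collects the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called once for each run of text between tags.
        self.chunks.append(data)


# Escaped HTML as it would appear inside the XML element.
escaped = "&lt;p&gt;test&lt;/p&gt;"

# Step 1: turn the entity references back into markup.
markup = unescape(escaped)
print(markup)

# Step 2: parse the recovered markup as HTML and extract its text.
collector = TextCollector()
collector.feed(markup)
collector.close()
text = "".join(collector.chunks)
print(text)
```

Whether step 1 is needed depends on the XML parser: one that already decodes entities in element text hands back the markup directly, and only step 2 applies.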
