BeautifulSoup parse XML with HTML content
Question
I have an XML file (formally XBRL) in which some of the tags contain escaped HTML. I'd like to parse the document as XML, and then extract the HTML from these tags.
However, it appears that the escaped characters are somehow deleted by BeautifulSoup. When I read mytag.text, none of the escaped characters (e.g. &lt;) are present anymore. For instance:
'&lt;' in raw_text  # True
'&lt;' in str(BeautifulSoup(raw_text, 'xml'))  # False
I tried to build a minimal example that reproduces the issue, but could not: the simple example I came up with works without any problem:
raw_text = '<xmltag><t>&lt;p&gt;test&lt;/p&gt;</t></xmltag>'
soup = BeautifulSoup(raw_text, 'xml')
'&lt;' in str(soup)  # True
So you can find the file that I am parsing here: https://drive.google.com/open?id=1lQz1Tfy8u7TBvatP8-QjlnzUi6rNUR79
The code I am using is:
from bs4 import BeautifulSoup

with open('test.xml', 'r') as fp:
    raw_text = fp.read()
soup = BeautifulSoup(raw_text, 'xml')
mytag = soup.find('QuarterlyFinancialInformationTextBlock')
print(mytag.text[:100])
# prints: div div style="margin-left:0pt;margin-righ
# original file: &lt;div&gt; &lt;div style=
Answer 1
Score: 0
Try another parser for XBRL, for example python-xbrl.
See this link: XBRL parser written in Python
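If switching parsers is an option, the same idea can be sketched with the standard library's xml.etree.ElementTree instead of python-xbrl. This is only an illustration under assumptions: the sample document, its namespace URI, and the helper find_by_local_name are made up for the example and are not taken from the file in the question.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the XBRL file: a namespaced element whose
# content is escaped HTML. The namespace URI is invented for this sketch.
raw_text = (
    '<xbrl xmlns:us-gaap="http://example.com/us-gaap">'
    '<us-gaap:QuarterlyFinancialInformationTextBlock>'
    '&lt;div&gt;&lt;p&gt;test&lt;/p&gt;&lt;/div&gt;'
    '</us-gaap:QuarterlyFinancialInformationTextBlock>'
    '</xbrl>'
)

root = ET.fromstring(raw_text)


def find_by_local_name(root, local_name):
    """Return the first element whose tag, ignoring any namespace, equals local_name."""
    for elem in root.iter():
        # ElementTree stores namespaced tags as '{uri}local'
        if elem.tag.rsplit('}', 1)[-1] == local_name:
            return elem
    return None


tag = find_by_local_name(root, 'QuarterlyFinancialInformationTextBlock')

# The XML parser decodes entities, so .text holds real HTML markup.
inner_html = tag.text

# Serializing the element escapes the markup again.
serialized = ET.tostring(tag, encoding='unicode')

print(inner_html)
print('&lt;' in serialized)
```

Here the decoded HTML is read from the element's text, while the escaped form only reappears when the element is serialized back to XML.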
Answer 2
Score: 0
A solution using SimplifiedDoc (from the simplified_scrapy package):
from simplified_scrapy.simplified_doc import SimplifiedDoc
doc = SimplifiedDoc('<xmltag><t>&lt;p&gt;test&lt;/p&gt;</t></xmltag>')
print(doc.t.html)
print(doc.xmltag.t.html)
print(doc.t.unescape())
Result:
&lt;p&gt;test&lt;/p&gt;
&lt;p&gt;test&lt;/p&gt;
<p>test</p>
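The unescape step shown above can also be sketched without a third-party package, using the standard library's html module. This is a minimal illustration, not a replacement for SimplifiedDoc; the TextCollector class is a hypothetical helper written for this example.

```python
import html
from html.parser import HTMLParser

escaped = '&lt;p&gt;test&lt;/p&gt;'   # the form stored in the raw XML file
unescaped = html.unescape(escaped)    # entities decoded into real markup

# Escaping again restores the original form (quote=False leaves quotes alone).
re_escaped = html.escape(unescaped, quote=False)


class TextCollector(HTMLParser):
    """Hypothetical helper: collects the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


collector = TextCollector()
collector.feed(unescaped)
collector.close()
plain_text = ''.join(collector.chunks)

print(unescaped)
print(re_escaped)
print(plain_text)
```

Once the fragment is unescaped, it can be handed to any HTML parser; html.parser is used here only to keep the sketch dependency-free.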