BeautifulSoup parse XML with HTML content

Question

I have an XML file (formally, XBRL) in which some of the tags contain escaped HTML. I'd like to parse the document as XML and then extract the HTML from these tags.
However, BeautifulSoup appears to drop the escaped characters. When I read mytag.text, the characters that were escaped in the file (e.g. &lt;) are missing. For instance:

'&lt;' in raw_text # True
'&lt;' in str(BeautifulSoup(raw_text, 'xml')) # False

I tried to create a simple example that reproduces the issue, but could not: the simple example below works as expected.

raw_text = '<xmltag><t>&lt;p&gt;test&lt;/p&gt;</t></xmltag>'
soup = BeautifulSoup(raw_text, 'xml')
'&lt;' in str(soup) # True
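
For comparison, here is a minimal sketch of what an XML parser is normally expected to do with escaped HTML. It uses the standard library's xml.etree.ElementTree instead of BeautifulSoup, purely so that it runs without third-party packages. The variable names inner_html and round_trip are illustrative. The element text comes back with the entities decoded, and serializing the tree escapes them again:

```python
import xml.etree.ElementTree as ET

# Same shape as the small example: the <t> element holds escaped HTML.
raw_text = '<xmltag><t>&lt;p&gt;test&lt;/p&gt;</t></xmltag>'

root = ET.fromstring(raw_text)

# The parser decodes the entities, so the element text is real HTML markup.
inner_html = root.find('t').text
print(inner_html)   # <p>test</p>

# Serializing the tree escapes the text again.
round_trip = ET.tostring(root, encoding='unicode')
print(round_trip)   # <xmltag><t>&lt;p&gt;test&lt;/p&gt;</t></xmltag>
```

If the angle brackets disappear entirely, as in the output reported for the full file, the parser is doing something other than this decode and re-encode round trip.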

The file that I am parsing is here: https://drive.google.com/open?id=1lQz1Tfy8u7TBvatP8-QjlnzUi6rNUR79
The code I am using is:

from bs4 import BeautifulSoup

# Read the whole XBRL document as text
with open('test.xml', 'r') as fp:
    raw_text = fp.read()
# The 'xml' feature requires lxml to be installed
soup = BeautifulSoup(raw_text, 'xml')
mytag = soup.find('QuarterlyFinancialInformationTextBlock')
print(mytag.text[:100])
# prints:            div div style="margin-left:0pt;margin-righ
# original file:     &lt;div&gt; &lt;div style=
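
One way to cross-check the result is to run the same extraction with a different parser. The sketch below uses only the standard library (xml.etree.ElementTree and html.parser), not BeautifulSoup. SAMPLE is a made-up stand-in for the real file, and the namespace URI, the helper find_by_local_name, and the class TextCollector are hypothetical names introduced for illustration. It assumes the target element is namespaced, as is usual in XBRL:

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical stand-in for the XBRL file; the namespace URI is invented.
SAMPLE = (
    '<xbrl xmlns:us-gaap="http://example.com/us-gaap">'
    '<us-gaap:QuarterlyFinancialInformationTextBlock>'
    '&lt;div&gt; &lt;div style="margin-left:0pt;"&gt;Q1 results&lt;/div&gt; &lt;/div&gt;'
    '</us-gaap:QuarterlyFinancialInformationTextBlock>'
    '</xbrl>'
)


def find_by_local_name(root, local_name):
    """Return the first element whose tag matches local_name, ignoring the namespace."""
    for elem in root.iter():
        # ElementTree stores namespaced tags as '{uri}local'.
        if isinstance(elem.tag, str) and elem.tag.rsplit('}', 1)[-1] == local_name:
            return elem
    return None


class TextCollector(HTMLParser):
    """Collect the non-blank text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())


root = ET.fromstring(SAMPLE)
tag = find_by_local_name(root, 'QuarterlyFinancialInformationTextBlock')

# The XML parser has already decoded the entities: this is the embedded HTML.
html_fragment = tag.text
print(html_fragment[:17])   # <div> <div style=

# Second pass: treat the extracted string as HTML.
collector = TextCollector()
collector.feed(html_fragment)
collector.close()
print(collector.chunks)     # ['Q1 results']
```

If this approach keeps the angle brackets on the real file while the BeautifulSoup call above does not, the problem probably lies in the parser setup rather than in the document.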

Answer 1

Score: 0

Try a different parser built for XBRL, e.g. python-xbrl.

Check this link: Xbrl parser written in Python

Answer 2

Score: 0

A solution using SimplifiedDoc (from the simplified_scrapy package):

from simplified_scrapy.simplified_doc import SimplifiedDoc
doc = SimplifiedDoc('<xmltag><t>&lt;p&gt;test&lt;/p&gt;</t></xmltag>')
print(doc.t.html)
print(doc.xmltag.t.html)
print(doc.t.unescape())

Result:

&lt;p&gt;test&lt;/p&gt;
&lt;p&gt;test&lt;/p&gt;
<p>test</p>
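
The unescape step on its own can also be done with the standard library. The sketch below uses html.unescape as a stand-in for the library's unescape() call; it is a comparable operation on this input, not a claim about how SimplifiedDoc implements it:

```python
import html

# Text as it appears inside the XML element, entities still escaped.
escaped = '&lt;p&gt;test&lt;/p&gt;'

# html.unescape converts character references back to the characters they name.
unescaped = html.unescape(escaped)
print(unescaped)   # <p>test</p>
```

This helps when a parser returns element content with the entities still escaped. A parser that has already decoded them needs no extra step.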

huangapple
  • Published on January 7, 2020, 00:33:06
  • When reposting, please keep the link to this article: https://go.coder-hub.com/59615697.html