Converting adoc to Markdown while preserving LaTeX-style math equations
Question
I have a group of adoc documents that I'm converting to Markdown. For most of them, the following has worked:
``` sh
asciidoc -b docbook -o temp.xml <infile>
pandoc -f docbook -t markdown_strict --atx-headers --mathjax temp.xml -o <outfile>
```
I then run a few regexes to clean up broken image links and fix the headers. However, this doesn't work for the inline math equations. In the adoc they are written as `latexmath:[$some_equation_here$]`, sometimes without the dollar signs for multi-line equations.
When this is turned into DocBook XML, the equation seems to be preserved, in this form:
``` xml
<inlineequation>
<alt><![CDATA[$some_equation_here$]]></alt>
<inlinemediaobject><textobject><phrase></phrase></textobject></inlinemediaobject>
</inlineequation>
```
But when pandoc converts it to Markdown, it ignores these blocks of XML. How can I keep the equations in a Markdown-readable format (`$some_equation_here$`) during the pandoc conversion? The `--mathjax` option doesn't seem to help here.
I tried a separate Python regex, `re.sub(r'latexmath:\[\$?(.*?)\$?\]', r'$\g<1>$', file_contents)`, to keep the `$`, but it produces some double-escaped text that has to be fixed manually, and it sometimes leaves extra `/sup` tags. Doing something similar on the XML file gave similar results.
Answer 1
Score: 0
Looking at the pandoc code, it seems that the DocBook reader expects the formula to be in a `<mathphrase>` element below `<inlineequation>`. Thus, replacing the `<alt>` tags with `<mathphrase>` is enough for pandoc to pick up the equation. This yields invalid DocBook XML in general, as `<inlineequation>` should contain either a `<mathphrase>` or an `<inlinemediaobject>`, but that doesn't matter for pandoc.
``` sh
cat << EOF | pandoc --from=docbook --to markdown --lua-filter=unwrap-math.lua
<para>
<inlineequation>
<mathphrase><![CDATA[$some_equation_here$]]></mathphrase>
<inlinemediaobject><textobject><phrase></phrase></textobject></inlinemediaobject>
</inlineequation>
</para>
EOF
```
This prints:
```
$some_equation_here$
```
Note that pandoc inserts the dollar signs itself, so the ones already present in the source should be removed as well. The command above uses a Lua filter to remove them; `unwrap-math.lua` contains:
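``` lua
-- Strip one leading and one trailing "$" from every math element;
-- pandoc's Markdown writer adds its own delimiters.
function Math (mth)
  mth.text = mth.text:gsub('^%$', ''):gsub('%$$', '')
  return mth
end
```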
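The tag replacement described above could be applied to the whole `temp.xml` before running pandoc. The following is a minimal sketch, not part of the original answer: the function name `alt_to_mathphrase` is hypothetical, and it assumes the `<alt>` element directly follows the opening `<inlineequation>` tag, as in the XML shown in the question. It works on the raw text so that the CDATA sections stay untouched.

```python
import re

# Assumption: <alt> is the first child of <inlineequation>, as in the
# asciidoc output shown in the question. Other <alt> elements are left alone.
_ALT_IN_EQUATION = re.compile(
    r"(<inlineequation\b[^>]*>\s*)<alt>(.*?)</alt>",
    re.DOTALL,  # let the equation body span several lines
)


def alt_to_mathphrase(docbook_xml: str) -> str:
    """Rename <alt> to <mathphrase> inside <inlineequation> elements."""
    return _ALT_IN_EQUATION.sub(r"\1<mathphrase>\2</mathphrase>", docbook_xml)


# Possible usage on the intermediate file (path taken from the question):
#   from pathlib import Path
#   path = Path("temp.xml")
#   path.write_text(alt_to_mathphrase(path.read_text(encoding="utf-8")),
#                   encoding="utf-8")
```

After this step, the pandoc call with `--lua-filter=unwrap-math.lua` shown above should see the equations as math. Equations in other wrappers are not covered by this sketch and would need their tag names added to the pattern.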