Converting adoc to markdown while preserving latex style math equations

Question

I have a group of adoc documents that I'm converting to Markdown. For most of them I've been able to convert them with:

``` sh
asciidoc -b docbook -o temp.xml <infile>
pandoc -f docbook -t markdown_strict --atx-headers --mathjax temp.xml -o <outfile>
```

followed by some regex to clean up broken image links and fix the headers. However, this doesn't work for the inline math equations. In the adoc they use the syntax `latexmath:[$some_equation_here$]`, sometimes without the dollar signs for multi-line equations.

When this is converted to DocBook XML, the equation seems to be preserved, in this format:

    <inlineequation>
    <alt><![CDATA[$some_equation_here$]]></alt>
    <inlinemediaobject><textobject><phrase></phrase></textobject></inlinemediaobject>
    </inlineequation>

But when pandoc converts the XML to Markdown, it ignores these blocks. How can I keep the equations in a Markdown-readable format (`$some_equation_here$`) during the pandoc conversion? The `--mathjax` option doesn't seem to help here.

I tried a separate Python regex, `re.sub(r'latexmath:\[\$?(.*?)\$?\]', r'$\g<1>$', file_contents)`, to keep the `$`, but it produces some double-escaped text that then has to be fixed manually, and it doesn't fully work: it sometimes leaves extra /sup tags. Trying something similar on the XML file gave similar results.

Answer 1

Score: 0

Looking at the pandoc code, it seems that the DocBook reader expects the formula to be in a `<mathphrase>` element below `<inlineequation>`. Thus, replacing the `<alt>` tags with `<mathphrase>` is enough to get the equation picked up by pandoc. This yields invalid DocBook XML in general, as an `<inlineequation>` should contain either `<mathphrase>` or `<inlinemediaobject>` elements, not both, but that doesn't matter for pandoc.

    # Quote the heredoc delimiter so the shell does not expand $some_equation_here
    cat << 'EOF' | pandoc --from=docbook --to markdown --lua-filter=unwrap-math.lua
    <para>
      <inlineequation>
        <mathphrase><![CDATA[$some_equation_here$]]></mathphrase>
        <inlinemediaobject><textobject><phrase></phrase></textobject></inlinemediaobject>
      </inlineequation>
    </para>
    EOF

This prints:

    $some_equation_here$

Note that pandoc adds the dollar signs itself, so the ones already present in the formula text should be removed. The command above uses a Lua filter to do that; `unwrap-math.lua` contains:

    -- Strip one leading and one trailing dollar sign from each math element.
    function Math (mth)
      mth.text = mth.text:gsub('^%$', ''):gsub('%$$', '')
      return mth
    end
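
The `<alt>` to `<mathphrase>` rename described in this answer can be scripted before calling pandoc. Here is a minimal Python sketch of that step, not a tested part of the pipeline: it assumes the equations look like the `<inlineequation>` block from the question, and the names `alt_to_mathphrase`, `convert_file`, and `temp-fixed.xml` are made up for illustration.

```python
import re
from pathlib import Path

# One <inlineequation>...</inlineequation> block, possibly spanning lines.
INLINE_EQ = re.compile(r"<inlineequation\b[^>]*>.*?</inlineequation>", re.DOTALL)
# The <alt> element inside such a block; group 1 is its body (e.g. the CDATA).
ALT = re.compile(r"<alt\b[^>]*>(.*?)</alt>", re.DOTALL)


def alt_to_mathphrase(docbook_xml: str) -> str:
    """Rename <alt> to <mathphrase>, but only inside <inlineequation> blocks."""

    def fix_block(block):
        # A function replacement inserts the body literally, so LaTeX
        # backslashes are not treated as regex escape sequences.
        return ALT.sub(
            lambda alt: "<mathphrase>" + alt.group(1) + "</mathphrase>",
            block.group(0),
        )

    return INLINE_EQ.sub(fix_block, docbook_xml)


def convert_file(src: str, dst: str) -> None:
    """Read a DocBook file, apply the rename, and write the result to dst."""
    text = Path(src).read_text(encoding="utf-8")
    Path(dst).write_text(alt_to_mathphrase(text), encoding="utf-8")
```

Calling `convert_file("temp.xml", "temp-fixed.xml")` between the `asciidoc` and `pandoc` steps and then running pandoc on `temp-fixed.xml` with `--lua-filter=unwrap-math.lua` should give the behaviour shown above. The sketch leaves the dollar signs alone, since the Lua filter already strips them.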

huangapple
  • Posted on April 20, 2023, 03:09:09
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76058051.html