Preserving a multi-line string as is when round-tripping in ruamel.yaml

Question

Suppose I have a file like this:

test:
    long: "This is a sample text
      across two lines."

When I load the file and dump it back without making any changes, the document becomes:

test:
    long: "This is a sample text\
      \ across two lines."

While this is correct and doesn't change the actual value, in huge YAML files it creates a lot of spurious diffs, which makes the real changes hard to spot.

This is the code I have used so far:

import sys
import ruamel.yaml
from pathlib import Path

yaml = ruamel.yaml.YAML()  # defaults to round-trip
yaml.allow_duplicate_keys = True
yaml.preserve_quotes = True
yaml.explicit_start = True
file_name = "ca.yml"

with open(file_name) as fp:
    data = yaml.load(fp)

with open(file_name, 'w') as fp:
    yaml.dump(data, fp)

Could someone help me understand whether there are settings I can use to achieve this or, if that is not possible, any workarounds that do the same?

Answer 1

Score: 1

The fix shown at the end of this answer was added to ruamel.yaml 0.17.23.

I cannot recreate your output, so something seems to be missing from your example. In my tests the backslashes went missing, which I expected,
as I don't recall any special code for handling newlines in a double-quoted scalar; AFAICT that
was only added for folded block style scalars. But that was not the problem.

There are a few things that are strange to me:

  • your output is indented as if .indent(mapping=4) was set on your YAML instance
    but your code doesn't reflect that.
  • your code sets .explicit_start = True, but your output doesn't reflect that.
  • your output wraps (around column 30), but there is no code for that.

Playing around a bit, I could reproduce your output by setting .width to a value in the range 27-32.
I also found that if you don't set preserve_quotes, the output has no backslashes (but no
quotes either):

import sys
import ruamel.yaml

yaml_str = """\
test:
    long: "This is a sample text
      across two lines."
"""

for pq in [True, False]:
    yaml = ruamel.yaml.YAML()  # defaults to round-trip
    yaml.preserve_quotes = pq
    yaml.indent(mapping=4)
    yaml.width = 27
    yaml.allow_duplicate_keys = True
    yaml.explicit_start = True

    data = yaml.load(yaml_str)
    # check there are no hidden spaces or newlines in the loaded data
    assert data["test"]["long"] == 'This is a sample text across two lines.'
    yaml.dump(data, sys.stdout)

which gives:

---
test:
    long: "This is a sample text\
        \ across two lines."
---
test:
    long: This is a sample text
        across two lines.

So this seems to have to do specifically with the code that dumps strings with style '"'.

BTW, I recommend not overwriting the input during this kind of testing. Instead, write
the input from the code if you need to do file-to-file loading/dumping, or use
string input and sys.stdout output (when doing visual inspection).
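
For the visual-inspection case, one possible way to spot spurious changes without touching the input file is to diff the original text against the dumped text. The sketch below uses only the standard library's difflib. The helper name round_trip_diff is hypothetical, and the two strings are copied from the question rather than produced by a real dump (in practice the dumped text could come from dumping into an io.StringIO buffer instead of a file).

```python
import difflib


def round_trip_diff(original: str, dumped: str) -> str:
    """Return a unified diff between the input text and the dumped text.

    An empty string means the round trip left the text unchanged.
    """
    diff = difflib.unified_diff(
        original.splitlines(),
        dumped.splitlines(),
        fromfile="input",
        tofile="round-tripped",
        lineterm="",
    )
    return "\n".join(diff)


# Input and output as reported in the question (not generated here).
original = 'test:\n    long: "This is a sample text\n      across two lines."\n'
dumped = 'test:\n    long: "This is a sample text\\\n      \\ across two lines."\n'

print(round_trip_diff(original, dumped))
```

Keeping the input string in the test script means the same input is used on every run, which is not the case once a file has been overwritten by an earlier dump.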

This backslash garbage is caused by code forked from PyYAML years ago:

import sys
import yaml  # PyYAML

data = yaml.safe_load(yaml_str)
assert data["test"]["long"] == 'This is a sample text across two lines.'
yaml.safe_dump(data, sys.stdout, indent=4, width=27, default_style='"')

which gives:

"test":
    "long": "This is a sample\
        \ text across two lines."

and that leads to the code for write_double_quoted in emitter.py:

class MyEmitter(ruamel.yaml.emitter.Emitter):
    def write_double_quoted(self, text, split=True):
        if self.root_context:
            if self.requested_indent is not None:
                self.write_line_break()
                if self.requested_indent != 0:
                    self.write_indent()
        self.write_indicator(u'"', True)
        start = end = 0
        while end <= len(text):
            ch = None
            if end < len(text):
                ch = text[end]
            if (
                ch is None
                or ch in u'"\\\x85\u2028\u2029\uFEFF'
                or not (
                    u'\x20' <= ch <= u'\x7E'
                    or (
                        self.allow_unicode
                        and (u'\xA0' <= ch <= u'\uD7FF' or u'\uE000' <= ch <= u'\uFFFD')
                    )
                )
            ):
                if start < end:
                    data = text[start:end]
                    self.column += len(data)
                    if bool(self.encoding):
                        data = data.encode(self.encoding)
                    self.stream.write(data)
                    start = end
                if ch is not None:
                    if ch in self.ESCAPE_REPLACEMENTS:
                        data = u'\\' + self.ESCAPE_REPLACEMENTS[ch]
                    elif ch <= u'\xFF':
                        data = u'\\x%02X' % ord(ch)
                    elif ch <= u'\uFFFF':
                        data = u'\\u%04X' % ord(ch)
                    else:
                        data = u'\\U%08X' % ord(ch)
                    self.column += len(data)
                    if bool(self.encoding):
                        data = data.encode(self.encoding)
                    self.stream.write(data)
                    start = end + 1
            if (
                0 < end < len(text) - 1
                and (ch == u' ' or start >= end)
                and self.column + (end - start) > self.best_width
                and split
            ):
                # data = text[start:end] + u'\\'  # <<< replaced with following two lines
                need_backquote = text[end] == u' ' and (len(text) > end) and text[end+1] == u' '
                data = text[start:end] + (u'\\' if need_backquote else u'')
                if start < end:
                    start = end
                self.column += len(data)
                if bool(self.encoding):
                    data = data.encode(self.encoding)
                self.stream.write(data)
                self.write_indent()
                self.whitespace = False
                self.indention = False
                if text[start] == u' ':
                    if not need_backquote:
                        # remove leading space, it will load from the newline
                        start += 1
                    # data = u'\\'    # <<< replaced with following line
                    data = u'\\' if need_backquote else u''
                    self.column += len(data)
                    if bool(self.encoding):
                        data = data.encode(self.encoding)
                    self.stream.write(data)
            end += 1
        self.write_indicator(u'"', False)

yaml = ruamel.yaml.YAML()
yaml.Emitter = MyEmitter
yaml.preserve_quotes = True
yaml.indent(mapping=4)
yaml.width = 27

data = yaml.load(yaml_str)
assert data["test"]["long"] == 'This is a sample text across two lines.'
yaml.dump(data, sys.stdout)


which gives:

test:
    long: "This is a sample text
        across two lines."

Which looks like what you want.
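
The two changed lines boil down to a single decision at the point where the line is broken. The sketch below isolates that decision as a simplified model, not the emitter's actual code: it ignores column tracking, indentation, encoding and all other escapes. The function name split_double_quoted is hypothetical (the patch itself calls the flag need_backquote), and the break index 21 was picked by hand as the position of the space after "text".

```python
from typing import Tuple


def split_double_quoted(text: str, end: int) -> Tuple[str, str]:
    """Model of the patched line break at text[end], which must be a space.

    Returns the two physical lines, without indentation or surrounding quotes.
    """
    # Backslashes are only needed when a second space follows the break point.
    need_backslash = (
        text[end] == " " and end + 1 < len(text) and text[end + 1] == " "
    )
    if need_backslash:
        # Escaped line break, then an escaped space to protect the run of spaces.
        return text[:end] + "\\", "\\" + text[end:]
    # Plain fold: the loader turns the line break back into the single space.
    return text[:end], text[end + 1:]


single = "This is a sample text across two lines."
double = "This is a sample text  across two lines."

print(split_double_quoted(single, 21))
print(split_double_quoted(double, 21))
```

With a single space at the break point the two lines carry no backslashes; with two spaces the model falls back to the conservative form that the unpatched code always produced.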

The code block around the two changed lines generates a correctly loadable string, as you noted. It just deals
very conservatively with potential multiple spaces around the point where a newline is inserted. That
is correct for PyYAML, which has no pretensions to preserve the original YAML document, but incorrect
for ruamel.yaml. Without those backslashes, extra
spaces would otherwise disappear during loading:

yaml_str = 'test:\n    long:\n      "This is a sample text  across two lines."'

yaml = ruamel.yaml.YAML()
yaml.Emitter = MyEmitter
yaml.preserve_quotes = True
yaml.indent(mapping=4)
yaml.width = 27

data = yaml.load(yaml_str)
assert data["test"]["long"] == 'This is a sample text  across two lines.'
yaml.dump(data, sys.stdout)

gives:

test:
    long: "This is a sample text\
        \  across two lines."

because of the double spaces.
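
To see why the backslashes matter only for runs of spaces, here is a simplified model of how a loader folds a multi-line double-quoted scalar. It is a sketch under stated assumptions, not ruamel.yaml's scanner: the name fold_double_quoted and the ESCAPES table are hypothetical, only a handful of escapes are handled, and the input is the text between the quotes. The rules it models are the YAML ones: an unescaped line break together with the whitespace around it folds into one space, while an escaped line break is dropped, along with the following line's indentation.

```python
# Only the escapes needed for this illustration (an assumption, not the full YAML set).
ESCAPES = {"\\": "\\", '"': '"', " ": " ", "n": "\n", "t": "\t"}


def fold_double_quoted(body: str) -> str:
    """Return the value a loader would produce for the text between the quotes."""
    out = []
    i, n = 0, len(body)
    while i < n:
        ch = body[i]
        if ch == "\\":
            nxt = body[i + 1]
            if nxt == "\n":
                # Escaped line break: drop it and the next line's indentation.
                i += 2
                while i < n and body[i] in " \t":
                    i += 1
            else:
                out.append(ESCAPES[nxt])
                i += 2
        elif ch in " \t\n":
            # Collect the whole run of whitespace and line breaks.
            j = i
            while j < n and body[j] in " \t\n":
                j += 1
            run = body[i:j]
            breaks = run.count("\n")
            if breaks == 0:
                out.append(run)  # spaces inside a line are kept as they are
            elif breaks == 1:
                out.append(" ")  # a single line break folds into one space
            else:
                out.append("\n" * (breaks - 1))  # empty lines become newlines
            i = j
        else:
            out.append(ch)
            i += 1
    return "".join(out)


folded = fold_double_quoted('This is a sample text\n        across two lines.')
escaped = fold_double_quoted('This is a sample text\\\n        \\ across two lines.')
kept = fold_double_quoted('This is a sample text\\\n        \\  across two lines.')
lost = fold_double_quoted('This is a sample text\n         across two lines.')

print(folded == escaped)  # both spellings of the single-space value load identically
print(repr(kept))         # the double space survives thanks to the backslashes
print(repr(lost))         # without them the second space is folded away
```

In this model the plain fold and the backslash form are interchangeable for a single space, which is why dropping the backslashes there is safe, while the double-space value only survives in the backslash form.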

It doesn't look like this change to write_double_quoted has other side effects, but it has not been tested further.

You should take care when using allow_duplicate_keys: it will change your output if the document contains duplicate keys,
and possibly not with the same semantics as another program loading the original document.

You should also consider using the .yaml extension on files containing YAML documents, assuming
the other programs using this document can handle that. That
has been the recommended extension since at least September 2006, so I hope some others have updated their code
since then.

huangapple
  • Published on 2023-03-04 04:07:09
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75631454.html