Parsing a textfile and splitting it efficiently

Question

I would like to parse a text file which looks more or less like this:

TYPE1=123
TYPE2="SOMETEXT"
TYPE3="SOMETEXT_BUT
ON_MULTIPLE_
LINES"
TYPE4=456

If a value spans multiple lines, it is always enclosed in quotation marks. If it fits on one line, it may or may not be quoted. Unfortunately this does not depend on whether the value is a number or a string: an unquoted string is possible too, because the format is not very consistent.

I'm currently figuring out how to split the entries by type and parse them efficiently. I could do a readlines and split each line on "=". That would work for everything except TYPE3 in the example above, because its value spans multiple lines.

So I'm thinking about reading the whole file into a String and then applying a regex, e.g. (.*)=("([^"]*)"|.*\n), which would result in the first capturing group always being the type and the last capturing group the value. I just fear that for larger files this might be too slow and cause issues.

Is there a better/more efficient way to solve this parsing problem?

Answer 1

Score: 1

I came up with this straightforward read-through of the lines. I'm not sure it's more efficient than loading the whole file and using Regex, but it could be useful for huge files since it only reads one line at a time.

import java.io.File

// Reads KEY=VALUE entries; a quoted value may continue over several lines.
fun readCustomPropertiesFile(file: File): Map<String, String> {
    val map = mutableMapOf<String, String>()
    var entry = ""
    var entryComplete = true
    file.forEachLine { line ->
        // A line that starts a new entry must contain the '=' separator.
        if (entryComplete && '=' !in line) {
            println("Line is invalid: $line")
            return@forEachLine
        }
        // Either start a new entry or append to the unfinished quoted one.
        entry = if (entryComplete) line else "$entry\n$line"
        val (key, value) = entry.split('=', limit = 2)
        val startQuote = value.startsWith('"')
        // Need at least two characters, so a lone opening quote is not also taken as the closing one.
        val endQuote = value.length >= 2 && value.endsWith('"')
        entryComplete = !startQuote || endQuote
        if (entryComplete) {
            // Strip the surrounding quotes from quoted values.
            map[key] = if (startQuote) value.substring(1, value.length - 1) else value
        }
    }
    return map
}

Answer 2

Score: 1

Your format is so close to .properties, with = and a slightly different form of multi-line properties, that I would adapt the data and use Properties. Using UTF-8 would need a small adaptation, and then you are done, with production quality.
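A minimal sketch of that idea in Kotlin, assuming the data has already been adapted to .properties syntax (multi-line values written with a trailing backslash on each continued line). The function names loadProperties and loadPropertiesFile are made up for illustration. Passing a Reader is the small adaptation for UTF-8, because Properties.load(InputStream) assumes ISO-8859-1.

```kotlin
import java.io.File
import java.io.Reader
import java.util.Properties

// Hypothetical helper: loads text that is already in .properties syntax.
// load(Reader) receives decoded characters, so the reader decides the charset.
fun loadProperties(reader: Reader): Properties =
    Properties().apply { reader.use { load(it) } }

// Hypothetical helper: opens the file as UTF-8 instead of the ISO-8859-1
// that load(InputStream) would assume.
fun loadPropertiesFile(file: File): Properties =
    loadProperties(file.reader(Charsets.UTF_8))
```

Two things to keep in mind: Properties does not strip quotation marks, so a value such as "SOMETEXT" keeps its quotes unless they are removed during adaptation, and a backslash continuation joins the lines without a newline between them.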

Answer 3

Score: 0

I think your idea of reading the file into a string and then applying a regex would work fine. Some points:

You don't need (and probably don't want) the \n in the pattern.

  • You probably want only 123 as the value of TYPE1, not 123\n.
  • . doesn't match \n, so .* stops matching if/when it hits a \n anyway.
  • And if it happens that the file ends without a newline, the pattern-with-\n will fail to match, but the pattern-without will succeed.

If it's possible that a string value can contain an =, then (.*)= isn't going to work. E.g., if a line is TYPE2="SOME=TEXT", then the (.*) will match TYPE2="SOME, which you presumably don't want. You can fix this by using ([^=]*)= or maybe (\w*)=, depending on the particulars of the format.
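The points above could be combined roughly as follows. This is a sketch under assumptions, not a definitive parser: it assumes keys consist only of word characters, and the names ENTRY and parseEntries are made up for illustration. It uses \w+ for the key rather than [^=]*, because [^=] also matches newline characters and would pull the preceding line break into the key.

```kotlin
// Assumed pattern: a word-character key, '=', then either a quoted value
// (which may span several lines) or the rest of the line. No trailing \n.
val ENTRY = Regex("""(\w+)=(?:"([^"]*)"|(.*))""")

// Hypothetical helper: maps each key to its value, without the quotes.
fun parseEntries(text: String): Map<String, String> =
    ENTRY.findAll(text).associate { m ->
        val key = m.groupValues[1]
        // Group 2 took part in the match for quoted values, group 3 otherwise.
        val value = m.groups[2]?.value ?: m.groupValues[3]
        key to value
    }
```

Calling parseEntries(file.readText()) would parse the whole file in one pass. Because the quoted alternative is tried first, an = inside a quoted value stays part of the value.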

huangapple
  • Published 2020-09-01 22:21:44
  • When reposting, please keep the link to this article: https://go.coder-hub.com/63689603.html