Haskell regex for matching all content within { }

Question

I am writing a simple Haskell program that should split a string into a list of strings by extracting the substrings enclosed in curly braces { }.

For instance, given the string:

"{1, 2}, {3, 4}, {5, 6}"

It would create a list like:

["1, 2", "3, 4", "5, 6"]

I am not concerned with edge cases at the moment, as the input string will always have the correct number of braces in the correct places.

I presume regex is the go-to tool for this, but I'm not well versed in writing regex (read: barely at all). Whenever I've needed regex in the past, I've searched the internet and brute-forced it by trial and error until it worked, and then forgotten what I did by the next time I needed it (usually months later).

Anyway, I'm using the regex-tdfa package to compile and execute my regex in my little test program:

import Text.Regex.TDFA ((=~))

graphRegex :: String
graphRegex = "\\{(.*?)\\}"

main :: IO ()
main = do
  let input = "{1, 1}, {2, 2}"
  let output = input =~ graphRegex :: String
  print output

I've tried leaning on standard regex references to come up with a working regex, but several special characters fail to compile with regex-tdfa, and the documentation for that package is a bit lacking for a regex noob such as myself.

The regex compile error, which occurs at run time:

haskell-exe: Explict error in module Text.Regex.TDFA.String : Text.Regex.TDFA.String died: parseRegex for Text.Regex.TDFA.String failed:"\{(.*?)\}" (line 1, column 6):
unexpected "?"
expecting empty () or anchor ^ or $, an atom, "|" or ")"
CallStack (from HasCallStack):
  error, called at ./Text/Regex/TDFA/Common.hs:29:3 in regex-tdfa-1.2.3.2-JBmdRfKVuE0JoC1GcCugsT:Text.Regex.TDFA.Common

Could anyone shed some light on how to use that module and how I should logically break the problem down into a regex?

EDIT:

OK, I've taken Nick Reed's advice and used the built-in regex. I managed to get a regex that compiles and finds almost the match I need:

"\\{(.*?)\\}"

But the resulting list is: ["{1, 1}, {2, 2}"]

This still includes the { } and, instead of finding the individual matches, has matched the entire input string. Does anyone know how to split on the { } and omit them from the result?

EDIT 2:

The following regex appears to work for my very specific use case:

(\w, \w)

It captures pairs of alphanumeric characters separated by a comma and a space.

Answer 1

Score: 1

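Following @chepner's comment above, you can emulate a non-greedy match with the negated character class [^}]*, which matches everything up to the next closing brace.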

$ ghci
GHCi, version 8.6.5: http://www.haskell.org/ghc/  :? for help
λ> import Text.Regex.Posix
λ> graphRegex = "\\{([^}]*)\\}"
λ> input = "ab {1, 1}, xy  {2, 2} cd"
λ> outputs = getAllTextMatches $ (input =~ graphRegex) :: [String]
λ> outputs
["{1, 1}","{2, 2}"]
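
To see what the pattern is doing, it can help to spell out the same logic by hand: find an opening brace, take every character that is not a closing brace, then skip the closing brace and repeat. The sketch below is only an illustration using Prelude functions, not part of any regex library; the name betweenBraces is made up for this example, and it returns the text inside the braces rather than the whole match.

```haskell
-- Hypothetical helper (not from regex-posix or regex-tdfa).
-- Mirrors the pattern \{([^}]*)\}: for each '{', collect characters
-- up to (but not including) the next '}'.
betweenBraces :: String -> [String]
betweenBraces s =
  case dropWhile (/= '{') s of
    []         -> []                                 -- no opening brace left
    (_ : rest) ->
      case break (== '}') rest of
        (inner, _ : more) -> inner : betweenBraces more
        (_, [])           -> []                      -- unmatched '{': stop
```

In GHCi, betweenBraces "ab {1, 1}, xy  {2, 2} cd" evaluates to ["1, 1","2, 2"].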

EDIT:

For the sake of completeness, here is a way to produce the list of brace-delimited strings without the curly braces themselves, as the OP originally asked.

This can be done by adapting @Rudy Matela's answer to a similar question. You need to force the result type of the =~ operator to [[String]]. Given the way the regex is written, each inner list then represents one match, and its second element is the text of the capture group, that is, the string without its surrounding curly braces. Like this:

import Text.Regex.Posix ((=~))

-- Return the text inside each {...} pair, without the braces.
extractWordsInCurlyBraces :: String -> [String]
extractWordsInCurlyBraces str =
    let re1   = "\\{([^}]*)\\}"
        -- Each inner list is one match: [whole match, capture group 1].
        strLs = (str =~ re1) :: [[String]]
    in  map (head . tail) strLs

main :: IO ()
main = do
    let input   = "begin {1, 2}, {3, 4}, mid {5, 6} end"
        cbWords = extractWordsInCurlyBraces input
    putStrLn $ "input   = " ++ show input
    putStrLn $ "cbWords = " ++ show cbWords

Program output:

input   = "begin {1, 2}, {3, 4}, mid {5, 6} end"
cbWords = ["1, 2","3, 4","5, 6"]
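
One caveat: head . tail is partial, so it throws if an inner list has fewer than two elements (for example, if the pattern were changed to have no capture group). A pattern-matching list comprehension skips such entries instead. The sketch below works directly on the [[String]] value and does not call the regex library; firstGroups is a made-up name, and it assumes each inner list has the shape [whole match, group 1, ...].

```haskell
-- Hypothetical helper: keep the first capture group of every match.
-- Assumes each inner list looks like [wholeMatch, group1, ...];
-- lists with fewer than two elements are silently dropped.
firstGroups :: [[String]] -> [String]
firstGroups matches = [g | (_ : g : _) <- matches]
```

With this in scope, map (head . tail) strLs in the function above could be written as firstGroups strLs.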
Answer 2

Score: 1

> I presume regex is the go to tool for doing this,

In Haskell we can use regex, and we also have monadic parser libraries like Megaparsec for pattern matching.

Here's how you could split this string using Megaparsec parsers and the splitCap function from the replace-megaparsec package.

import Text.Megaparsec
import Text.Megaparsec.Char
import Text.Megaparsec.Char.Lexer
import Replace.Megaparsec
import Data.Either
import Data.Void

let curlybrace :: Parsec Void String String
    curlybrace = do
        _ <- char '{'
        fst <$> anyTill (char '}')

rights $ splitCap curlybrace "{1, 2}, {3, 4}, {5, 6}"
["1, 2","3, 4","5, 6"]

The nice thing about monadic parsers is that we can not only pattern match, we can also parse the structure of the pattern matches. Based on your example it looks like you might be interested in that.

let curlypair :: Parsec Void String (Integer, Integer)
    curlypair = do
        _ <- char '{'
        num1 <- decimal
        _ <- some $ oneOf " ,"
        num2 <- decimal
        _ <- char '}'
        pure (num1, num2)
        
rights $ splitCap curlypair "{1, 2}, {3, 4}, {5, 6}"
[(1,2),(3,4),(5,6)]
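
If pulling in Megaparsec is not an option, the same idea can be sketched with ReadP, a small parser-combinator module that ships with base. This is a different library from the one used above, and the names number, pairP, pairsP and parsePairs are made up for this example. Unlike splitCap, this sketch does not skip over unrelated text: it assumes the whole input is a comma-separated list of {n, m} pairs and returns Nothing otherwise.

```haskell
import Data.Char (isDigit)
import Text.ParserCombinators.ReadP

-- Parse an unsigned decimal integer.
number :: ReadP Integer
number = read <$> munch1 isDigit

-- Parse one "{n, m}" pair; spaces after the comma are optional.
pairP :: ReadP (Integer, Integer)
pairP = do
  _ <- char '{'
  a <- number
  _ <- char ','
  skipSpaces
  b <- number
  _ <- char '}'
  pure (a, b)

-- Parse a comma-separated list of pairs that covers the whole input.
pairsP :: ReadP [(Integer, Integer)]
pairsP = sepBy pairP (char ',' *> skipSpaces) <* eof

-- Hypothetical entry point: Nothing if the input is not exactly a list of pairs.
parsePairs :: String -> Maybe [(Integer, Integer)]
parsePairs s = case readP_to_S pairsP s of
  [(ps, "")] -> Just ps
  _          -> Nothing
```

For the input from the question, parsePairs "{1, 2}, {3, 4}, {5, 6}" gives Just [(1,2),(3,4),(5,6)].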

We can also get the non-matching string context surrounding the pattern matches.

splitCap curlypair "{1, 2}, {3, 4}, {5, 6}"
[Right (1,2),Left ", ",Right (3,4),Left ", ",Right (5,6)]
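
To make the Left/Right shape concrete, here is a hand-written sketch that produces the same kind of alternating list for plain brace-delimited text, using only Prelude functions. It is not how splitCap is implemented; splitBraces is a made-up name, and the sketch assumes braces are not nested.

```haskell
-- Hypothetical helper: Left = text outside braces, Right = text inside braces.
-- Empty stretches of outside text are omitted from the result.
splitBraces :: String -> [Either String String]
splitBraces "" = []
splitBraces s =
  case break (== '{') s of
    (before, "") -> [Left before]              -- no more opening braces
    (before, _ : rest) ->
      case break (== '}') rest of
        (inner, _ : more) -> outside before ++ Right inner : splitBraces more
        (_, "")           -> [Left s]          -- unmatched '{': keep as plain text
  where
    outside "" = []
    outside t  = [Left t]
```

Here splitBraces "{1, 2}, {3, 4}, {5, 6}" evaluates to [Right "1, 2",Left ", ",Right "3, 4",Left ", ",Right "5, 6"].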
huangapple
  • Posted on January 3, 2020, 23:47:02
  • Please keep this link when reposting: https://go.coder-hub.com/59581434.html