

Haskell regex for matching all content within { }





I am writing a simple Haskell program that should split a string into a list of strings by finding the substrings contained within curly braces { }.

For instance, given the string:

"{1, 2}, {3, 4}, {5, 6}"

It would create a list like:

["1, 2", "3, 4", "5, 6"]

I am not concerned with edge cases at the moment, as the input string will always have the correct number of braces in the correct places.

I presume regex is the go to tool for doing this, but I'm not very well versed (read barely at all) with writing regex. Whenever I've needed to use regex in the past, I tend to search the internet/brute force trial and error it until I get it and then I forget what I did next time I come to use regex (as this is usually months apart).

Anyway, I'm using the regex-tdfa module to compile and execute my regex in my little test program:

graphRegex = "\\{(.*?)\\}"

main :: IO ()
main = do
  let input = "{1, 1}, {2, 2}"
  let output = input =~ graphRegex :: String
  print output

I've tried leaning on standard regex sources to attempt to generate a working regex, but several special characters are failing to compile with the regex-tdfa compiler and the documentation for that module is a bit lacking for a regex noob such as myself.

The regex compile error, which is happening at run time:

haskell-exe: Explict error in module Text.Regex.TDFA.String : Text.Regex.TDFA.String died: parseRegex for Text.Regex.TDFA.String failed:"\{(.*?)\}" (line 1, column 6):
unexpected "?"
expecting empty () or anchor ^ or $, an atom, "|" or ")"
CallStack (from HasCallStack):
  error, called at ./Text/Regex/TDFA/Common.hs:29:3 in regex-tdfa-

Could anyone shed any light on how to use that module and how I should be logically breaking down the problem into a regex?


OK, I've taken Nick Reed's advice and used the built-in regex. I managed to get a regex that compiles and finds almost the match I need:


But the resulting list is: ["{1, 1}, {2, 2}"]

This still includes the { } and, instead of finding the individual matches, matches the entire input string. Does anyone know how to split on the { } and omit them from the result?


The following regex appears to work for my very specific use case:

(\w, \w)

It captures pairs of single word characters (letters, digits or underscore) separated by a comma and a space.


Score: 1

Following @chepner's comment above, a negated character class ([^}]*) stands in for a non-greedy match, which POSIX regexes do not support.

$ ghci
GHCi, version 8.6.5: http://www.haskell.org/ghc/  :? for help
 λ> import Text.Regex.Posix
 λ> graphRegex = "\\{([^}]*)\\}"
 λ> input = "ab {1, 1}, xy  {2, 2} cd"
 λ> outputs = getAllTextMatches $ (input =~ graphRegex) :: [String]
 λ> outputs
["{1, 1}","{2, 2}"]
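The session above uses Text.Regex.Posix, while the question uses regex-tdfa. As a minimal sketch, assuming the regex-tdfa package is installed, the same approach should carry over, since both packages implement POSIX matching behind the shared `=~` interface. The names `greedyRegex`, `boundedRegex` and `allMatches` are illustrative only and do not come from the thread.

```haskell
import Text.Regex.TDFA ((=~), getAllTextMatches)

-- Greedy pattern: POSIX matching prefers the longest match, so ".*"
-- runs on to the last '}' in the input.
greedyRegex :: String
greedyRegex = "\\{.*\\}"

-- Negated character class: "[^}]*" cannot cross a '}', so every
-- brace pair becomes a separate match.
boundedRegex :: String
boundedRegex = "\\{[^}]*\\}"

-- All non-overlapping full matches of a regex in the input string.
allMatches :: String -> String -> [String]
allMatches regex input = getAllTextMatches (input =~ regex)
```

For the input "{1, 1}, {2, 2}", `allMatches greedyRegex` returns the whole string as a single match, which is the behaviour described in the question's edit, while `allMatches boundedRegex` returns the two brace pairs separately.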


And for the sake of completeness, here is a way to produce the list of strings delimited by curly braces without the braces themselves, as the OP originally asked.

This can be obtained by adapting @Rudy Matela's answer to a similar question. One needs to force the result type of the =~ operator to [[String]]. In that case, given the way the regex is written, each inner list represents one match: its first element is the whole match and its second element is the captured text without the surrounding curly braces. Like this:

import  Text.Regex.Posix ( (=~) )

extractWordsInCurlyBraces :: String -> [String]
extractWordsInCurlyBraces str =
    let  re1   = "\\{([^}]*)\\}"
         strLs = (str =~ re1) :: [[String]]
    in  map (head . tail) strLs

main = do
    let input   = "begin {1, 2}, {3, 4}, mid {5, 6} end"
        cbWords = extractWordsInCurlyBraces input
    putStrLn $ "input   = " ++ show input
    putStrLn $ "cbWords = " ++ show cbWords

Program output:

input   = "begin {1, 2}, {3, 4}, mid {5, 6} end"
cbWords = ["1, 2","3, 4","5, 6"]
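The `head . tail` step in the program above relies on every match list having at least two elements. A possible variant, sketched here under the assumption that regex-tdfa (the package from the question) is used instead of regex-posix, selects the capture group by pattern matching instead. The names `braceRegex` and `extractInner` are hypothetical.

```haskell
import Text.Regex.TDFA ((=~))

-- One capture group: the text between a '{' and the next '}'.
braceRegex :: String
braceRegex = "\\{([^}]*)\\}"

-- Each match arrives as [wholeMatch, group1]. The list pattern keeps
-- only group1 and skips any list of a different shape instead of
-- failing at run time.
extractInner :: String -> [String]
extractInner str =
  [inner | [_, inner] <- (str =~ braceRegex :: [[String]])]
```

Applied to the input of the program above, `extractInner` should give the same list as `cbWords`.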



Score: 1

> I presume regex is the go to tool for doing this,

In Haskell we can use regex, and we also have monadic parser libraries like Megaparsec for pattern matching.

Here's how you could split this string using Megaparsec parsers and the splitCap function.

import Text.Megaparsec
import Text.Megaparsec.Char
import Text.Megaparsec.Char.Lexer
import Replace.Megaparsec
import Data.Either
import Data.Void

let curlybrace :: Parsec Void String String
    curlybrace = do
        _ <- char '{'
        fst <$> anyTill (char '}')

rights $ splitCap curlybrace "{1, 2}, {3, 4}, {5, 6}"
["1, 2","3, 4","5, 6"]

The nice thing about monadic parsers is that we can not only pattern match, we can also parse the structure of the pattern matches. Based on your example it looks like you might be interested in that.

let curlypair :: Parsec Void String (Integer, Integer)
    curlypair = do
        _ <- char '{'
        num1 <- decimal
        _ <- some $ oneOf " ,"
        num2 <- decimal
        _ <- char '}'
        pure (num1, num2)

rights $ splitCap curlypair "{1, 2}, {3, 4}, {5, 6}"
[(1,2),(3,4),(5,6)]

We can also get the non-matching string context surrounding the pattern matches.

splitCap curlypair "{1, 2}, {3, 4}, {5, 6}"
[Right (1,2),Left ", ",Right (3,4),Left ", ",Right (5,6)]
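The snippets above are written for GHCi. As a rough sketch, assuming the megaparsec and replace-megaparsec packages are available, the pair parser could be collected into a source file like this. The `Parser` alias and the `parsePairs` helper are illustrative names that do not appear in the answer.

```haskell
import Data.Either (rights)
import Data.Void (Void)
import Replace.Megaparsec (splitCap)
import Text.Megaparsec (Parsec, oneOf, some)
import Text.Megaparsec.Char (char)
import Text.Megaparsec.Char.Lexer (decimal)

-- Parser over String input with no custom error component.
type Parser = Parsec Void String

-- Parses one "{n, m}" group into a pair of integers.
curlyPair :: Parser (Integer, Integer)
curlyPair = do
  _ <- char '{'
  num1 <- decimal
  _ <- some (oneOf " ,")  -- one or more separator characters
  num2 <- decimal
  _ <- char '}'
  pure (num1, num2)

-- Keeps only the parsed pairs and drops the text between them.
parsePairs :: String -> [(Integer, Integer)]
parsePairs = rights . splitCap curlyPair
```

Loading the file in GHCi and evaluating `parsePairs "{1, 2}, {3, 4}, {5, 6}"` should reproduce the list of pairs from the session above.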


  • Posted on January 3, 2020 at 23:47:02
  • When reposting, please keep the link to this article: https://go.coder-hub.com/59581434.html



