How can I access the whole input from an arbitrary parser in a sequence?

Question

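You are working on a DNS message parser built mainly on the megaparsec library. The problem arises when handling compression in the Record section: resolving a compressed name requires the whole input, but by the time the third parser runs, part of the input has already been consumed, so the parser reaches end of input early and fails. Some possible approaches:

  1. Use M.lookAhead: Although you mention that it would involve moving the offset back and forth, M.lookAhead can still be a valid approach in some cases. You could apply it to the first two parsers to inspect the upcoming content without consuming input.

  2. Manage the position manually: Save the current position before decoding the compressed part and restore it afterwards. Note that in megaparsec M.getOffset and M.setOffset only read and write the offset number stored in the parser state; they do not move through the input. The remaining input itself has to be saved and restored with M.getInput and M.setInput (or M.getParserState and M.setParserState).

  3. Consider other parsing libraries: If megaparsec is not flexible enough, libraries such as attoparsec or parsec may offer an API better suited to this case.

  4. Refactor the code: Splitting the record parser into several sub-parsers can make the compression case easier to handle.

Which of these fits best depends on your requirements and code structure; they can also be combined.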

Original question:

I'm working through a DNS message parser. I have defined the following using megaparsec:

Header

data DNSHeader = DNSHeader
  { hid :: !Word16,
    hflags :: !Word16,
    hnumQuestions :: !Word16,
    hnumAnswers :: !Word16,
    hnumAuthorities :: !Word16,
    hnumAdditionals :: !Word16
  }
  deriving stock (Show)

-- >>> _debugBuilderOutput $ header2Bytes (DNSHeader 1 2 3 4 5 6)
-- "000100020003000400050006"
header2Bytes :: DNSHeader -> Builder
header2Bytes h =
  word16BE (hid h)
    <> word16BE (hflags h)
    <> word16BE (hnumQuestions h)
    <> word16BE (hnumAnswers h)
    <> word16BE (hnumAuthorities h)
    <> word16BE (hnumAdditionals h)

parseHeader :: M.Parsec Void B.ByteString DNSHeader
parseHeader =
  DNSHeader
    <$> M.word16be
    <*> M.word16be
    <*> M.word16be
    <*> M.word16be
    <*> M.word16be
    <*> M.word16be

Question

data DNSQuestion = DNSQuestion
  { qname :: B.ByteString,
    qtype :: !Word16,
    qclass :: !Word16
  }
  deriving stock (Show)

parseQuestion :: M.Parsec Void B.ByteString DNSQuestion
parseQuestion =
  DNSQuestion
    <$> decodeDNSNameSimple
    <*> M.word16be
    <*> M.word16be

decodeDNSNameSimple :: M.Parsec Void B.ByteString B.ByteString
decodeDNSNameSimple = do
  len <- M.word8
  if len == 0
    then pure mempty
    else do
      name <- B.pack <$> replicateM (fromIntegral len) M.word8
      rest <- decodeDNSNameSimple
      pure $ name <> (if B.null rest then mempty else "." <> rest)

Record

data DNSRecord = DNSRecord
  { rname :: B.ByteString,
    rtype :: !Word16,
    rclass :: !Word16,
    rttl :: !Word32,
    rdataLength :: !Word16,
    rdata :: B.ByteString
  }
  deriving stock (Show)

parseRecord :: M.Parsec Void B.ByteString DNSRecord
parseRecord = do
  name <- decodeDNSName
  rtype <- M.word16be
  rclass <- M.word16be
  rttl <- M.word32be
  rdataLength <- M.word16be
  rdata <- B.pack <$> replicateM (fromIntegral rdataLength) M.word8
  pure $ DNSRecord name rtype rclass rttl rdataLength rdata

decodeDNSName :: M.Parsec Void B.ByteString B.ByteString
decodeDNSName = do
  len <- M.word8
  if len == 0
    then pure mempty
    else
      if (len .&. 0b1100_0000) == 0b1100_0000
        then decodeCompressedDNSName len
        else do
          name <- B.pack <$> replicateM (fromIntegral len) M.word8
          rest <- decodeDNSName
          pure $ name <> (if B.null rest then mempty else "." <> rest)

decodeCompressedDNSName :: Word8 -> M.Parsec Void B.ByteString B.ByteString
decodeCompressedDNSName l = do
  offset' <- M.word8
  let bytes = ((fromIntegral l :: Word16) .&. 0b0011_1111) `shiftL` 8
      pointer = bytes .|. (fromIntegral offset' :: Word16)
  currentPos <- M.getOffset
  -- TODO: get to the offset defined by pointer (considering the whole input)
  -- M.setOffset (fromIntegral pointer) ???
  result <- decodeDNSName
  M.setOffset currentPos
  pure result

Parsing header, question and record

parseDNSResponse :: M.Parsec Void B.ByteString (DNSHeader, DNSQuestion, DNSRecord)
parseDNSResponse = do
  header <- parseHeader
  question <- parseQuestion
  record <- parseRecord
  pure (header, question, record)

I'm currently stuck in the last function of the Record section, which handles possible compression (the two high bits of len are set, i.e. len .&. 0b1100_0000 == 0b1100_0000). The pointer is an offset into the whole input, but by the time I get to the third parser, part of the input has already been consumed, so I reach end of input early and parsing fails. Wrapping the first two parsers in M.lookAhead would require going back and forth with the offsets. I have tried a few things without success and I'm getting a bit lost. Am I heading in the right direction? Do you have a recommendation?
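To make the compression check concrete, here is a minimal sketch of the bit arithmetic on its own, separate from any parser. It assumes the RFC 1035 layout: a byte with both high bits set starts a two-byte pointer, and the remaining 14 bits are an offset from the start of the message. The helper names `isPointer` and `pointerOffset` are hypothetical and not part of the code above.

```haskell
import Data.Bits (shiftL, (.&.), (.|.))
import Data.Word (Word16, Word8)

-- | True when both high bits of a length byte are set, which marks a
-- compression pointer rather than an ordinary label length.
isPointer :: Word8 -> Bool
isPointer b = b .&. 0xC0 == 0xC0

-- | Combine the low 6 bits of the first byte with the second byte into
-- a 14-bit offset counted from the start of the DNS message.
pointerOffset :: Word8 -> Word8 -> Int
pointerOffset hi lo =
  fromIntegral (((fromIntegral hi :: Word16) .&. 0x3F) `shiftL` 8 .|. fromIntegral lo)
```

For the bytes 0xC0 0x0C, `pointerOffset 0xC0 0x0C` evaluates to 12, which is the first byte after the 12-byte header, so the offset only makes sense relative to the whole message and not to the remaining input.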

This is the example I'm trying to parse:

_exampleResponse :: B.ByteString
_exampleResponse = Base16.decodeLenient "e35d8180000100010000000003777777076578616d706c6503636f6d0000010001c00c000100010000508900045db8d822"

Header and question parsing works otherwise.

I could consider moving to other methods or parsers (attoparsec, etc) if they have an API more aligned with my use case, so feel free to suggest alternatives.
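One possible alternative method, sketched here under stated assumptions rather than taken from the original post, is to skip parser combinators for names and index directly into the full message. Because every lookup goes through the whole `ByteString`, a pointer is just another index. The names `byteAt` and `decodeNameAt` and the cap of 16 pointer jumps are hypothetical choices; the reserved label types with high bits 01 or 10 are not treated specially.

```haskell
import Data.Bits (shiftL, (.&.), (.|.))
import qualified Data.ByteString as B
import Data.Word (Word8)

-- | Bounds-checked indexing into the message.
byteAt :: B.ByteString -> Int -> Maybe Word8
byteAt bs i
  | i >= 0 && i < B.length bs = Just (B.index bs i)
  | otherwise = Nothing

-- | Decode the name that starts at offset 'start' of the full message.
-- Returns the dotted name and the offset just past the name at the
-- place where it was read (a pointer occupies exactly two bytes there).
decodeNameAt :: B.ByteString -> Int -> Maybe (B.ByteString, Int)
decodeNameAt msg start = do
  (labels, end) <- go maxJumps start
  pure (B.intercalate (B.singleton 46) labels, end) -- 46 is '.'
  where
    maxJumps = 16 :: Int -- assumed cap so that pointer loops terminate

    go :: Int -> Int -> Maybe ([B.ByteString], Int)
    go jumps pos = do
      len <- byteAt msg pos
      if len == 0
        then Just ([], pos + 1)
        else if len .&. 0xC0 == 0xC0
          then do
            lo <- byteAt msg (pos + 1)
            let target = (fromIntegral (len .&. 0x3F) `shiftL` 8) .|. fromIntegral lo
            if jumps <= 0
              then Nothing
              else do
                (labels, _) <- go (jumps - 1) target
                Just (labels, pos + 2)
          else do
            let n = fromIntegral len
                label = B.take n (B.drop (pos + 1) msg)
            if B.length label /= n
              then Nothing -- label runs past the end of the message
              else do
                (rest, end) <- go jumps (pos + 1 + n)
                Just (label : rest, end)
```

The trade-off is that offsets have to be threaded by hand for the fixed-width fields as well, which is the bookkeeping a parser library otherwise does for you.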

Answer 1

Score: 1

Turns out you can do this with megaparsec, as it allows manipulating parser state without consuming the inputs in the process. Not sure if there are other alternatives that offer this.

Calling getInput at the very beginning of your parser gives you the whole input. You can pass it to a modified parseRecord, which can then swap the parser's input mid-parse with setInput, decode the name the pointer refers to, and restore the previous input afterwards:

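import Control.Monad (replicateM)
import Data.Bits (shiftL, (.&.), (.|.))
import Data.ByteString qualified as B
import Data.Void (Void)
import Data.Word (Word16, Word8)
import Text.Megaparsec qualified as M
import Text.Megaparsec.Byte.Binary qualified as M -- word8, word16be, word32be

type DNSParser a = M.Parsec Void B.ByteString a -- For readability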

parseDNSPacket :: DNSParser DNSPacket
parseDNSPacket = do
  fullInput <- M.getInput                  -- 1. Capture the whole input before anything is consumed
  let parseRecord' = parseRecord fullInput -- 2. Pass it to the record parser
  header <- parseHeader
  -- ...
  answers <- replicateM (fromIntegral $ hnumAnswers header) parseRecord'
  -- ... ...

parseRecord :: B.ByteString -> DNSParser DNSRecord
parseRecord fullInput = do
  name <- decodeDNSName fullInput
  -- ... ...

decodeDNSName :: B.ByteString -> DNSParser B.ByteString
decodeDNSName input = do
  len <- M.word8
  -- ...
      if (len .&. 0b1100_0000) == 0b1100_0000
        then decodeCompressedDNSName input len
        else do
  -- ... ...

decodeCompressedDNSName :: B.ByteString -> Word8 -> DNSParser B.ByteString
decodeCompressedDNSName input len = do
  offset' <- fromIntegral <$> M.word8
  let bytes = ((fromIntegral len :: Word16) .&. 0b0011_1111) `shiftL` 8
      pointer = fromIntegral (bytes .|. offset')
  currentInput <- M.getInput  -- 3. Save the current (remaining) input
  M.setInput input            -- 4. Set input to the full input passed in as `input`
  M.skipCount pointer M.word8 -- 5. Skip ahead to the offset the pointer refers to
  name <- decodeDNSName input --    and decode the name found there
  M.setInput currentInput     -- 6. Restore the previous input
  pure name                   -- 7. Profit!

Now it passes simple lookup tests!
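For reference, the steps above can be put together into a small self-contained sketch. This is a variation, not the exact code from the answer: it saves and restores the whole parser state with `M.getParserState` and `M.setParserState` (so the offset is restored along with the input), jumps with `B.drop` instead of `M.skipCount`, and reads labels with `M.takeP`. The wrapper `nameAt` is a hypothetical helper for trying it out, and there is no guard against pointer loops.

```haskell
import Data.Bits (shiftL, (.&.), (.|.))
import qualified Data.ByteString as B
import Data.Void (Void)
import qualified Text.Megaparsec as M
import qualified Text.Megaparsec.Byte.Binary as MB

type DNSParser a = M.Parsec Void B.ByteString a

-- | Parse a possibly compressed name. 'fullInput' is the whole message,
-- captured with 'M.getInput' before any byte was consumed.
decodeDNSName :: B.ByteString -> DNSParser B.ByteString
decodeDNSName fullInput = do
  len <- MB.word8
  if len == 0
    then pure B.empty
    else if len .&. 0xC0 == 0xC0
      then do
        lo <- MB.word8
        let pointer = (fromIntegral (len .&. 0x3F) `shiftL` 8) .|. fromIntegral lo :: Int
        saved <- M.getParserState             -- remaining input and offset
        M.setInput (B.drop pointer fullInput) -- jump to the pointer target
        name <- decodeDNSName fullInput
        M.setParserState saved                -- continue after the pointer
        pure name
      else do
        label <- M.takeP Nothing (fromIntegral len)
        rest <- decodeDNSName fullInput
        pure (if B.null rest then label else label <> B.singleton 46 <> rest) -- 46 is '.'

-- | Hypothetical helper: skip 'start' bytes of a message, decode one name,
-- and report how many bytes of input are left afterwards.
nameAt :: Int -> B.ByteString -> Maybe (B.ByteString, Int)
nameAt start msg = either (const Nothing) Just (M.parse p "" msg)
  where
    p = do
      fullInput <- M.getInput
      M.skipCount start MB.word8
      name <- decodeDNSName fullInput
      rest <- M.getInput
      pure (name, B.length rest)
```

Restoring the full state rather than only the input keeps the offset consistent with the position in the message, which matters mainly for error reporting.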

huangapple
  • Posted on May 15, 2023, 06:09:33
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76249884.html