Last week, we discussed the utility of writing simple tools from scratch. We started writing a simple HTTP server by defining our request and response types. Today, we’ll write a parser for the HTTP request.
We’ll use the Megaparsec library throughout this article, without dwelling on too many details of its usage. If you want to learn all about parsing in Haskell from the ground up, including all the details of using this library, check out our course, Solve.hs. Module 4 in particular will teach you all about parsing!
Outlining our Parser
In our Haskell LeetCode series, we discussed top-down a…
Last week, we discussed the utility of writing simple tools from scratch. We started writing a simple HTTP server by defining our request and response types. Today, we’ll write a parser for the HTTP request.
We’ll use the Megaparsec library throughout this article, without dwelling on too many details of its usage. If you want to learn all about parsing in Haskell from the ground up, including all the details of using this library, check out our course, Solve.hs. Module 4 in particular will teach you all about parsing!
Outlining our Parser
In our Haskell LeetCode series, we discussed top-down and bottom-up implementations of functions. In a top-down implementation, we can start with the high-level aspects of our program, leaving many parts as undefined temporarily. Then we’ll fill in the details as we go along. In a bottom-up implementation, we first figure out the smaller details and combine them until we get our complete program.
In reality, you can often use both of these together! To write this parser, we’ll begin with a top-down outline, then we’ll fill in some helper functions that we know will be useful. Finally we’ll fill in the “middle” of our program.
As a reminder, we are following this RFC guide on HTTP Version 1.1. Here are all the Haskell types we wrote last week to describe an HTTP Request:
import Data.Word (Word8)
import qualified Data.HashMap.Lazy as HM
import Data.ByteString.Lazy (ByteString)
data HttpMethod =
HttpOptions | HttpGet | HttpHead | HttpPost | HttpPut |
HttpDelete | HttpTrace | HttpConnect
deriving (Show, Eq)
newtype HttpHeaders = HttpHeaders
(HM.HashMap ByteString ByteString)
deriving (Show, Eq)
data HttpRequest = HttpRequest
{ requestMethod :: HttpMethod
, requestUri :: ByteString
, requestHttpVersion :: (Word8, Word8)
, requestHeaders :: HttpHeaders
, requestBody :: Maybe ByteString
}
deriving (Show, Eq)
First, let’s define the type for our parsing function. Our Megaparsec Stream type will be ByteString, and we’ll be operating in the IO monad on our server. So the type should look something like this:
parseHttpRequest :: ParsecT Void ByteString IO HttpRequest
parseHttpRequest = undefined
We’ll be writing many type signatures with this parameterized ParsecT type though. So let’s simplify this with an alias:
type Parser a = ParsecT Void ByteString IO a
parseHttpRequest :: Parser HttpRequest
parseHttpRequest = undefined
Now that nice thing about monadically parsing a type is that, as an outline, we can parse each of the fields in order and combine them, looking something like this:
-- Not a final version!
parseHttpRequest :: Parser HttpRequest
parseHttpRequest = do
m <- parseMethod
u <- parseUri
v <- parseVersion
h <- parseHeaders
b <- parseBody
return $ HttpRequest m u v h b
In the spirit of top-down implementation, we can then define functions for each of these fields:
parseMethod :: Parser HttpMethod
parseUri :: Parser ByteString
parseHttpVersion :: Parser (Word8, Word8)
parseHeaders :: Parser HttpHeaders
parseBody :: Parser (Maybe ByteString)
-- Not a final version!
parseHttpRequest :: Parser HttpRequest
parseHttpRequest = do
m <- parseMethod
u <- parseUri
v <- parseVersion
h <- parseHeaders
b <- parseBody
return $ HttpRequest m u v h b
This is already a great start, but we don’t want to forget about some of the formatting elements of the request. For example, there are spaces between the method and URI, and then the URI and version. There are CRLF sequences between the version and headers, and the headers and body. So we may wish to define helpers that specifically parse those sequences.
sp :: Parser ()
crlf’ :: Parser ()
-- Not a final version!
parseHttpRequest :: Parser HttpRequest
parseHttpRequest = do
m <- parseMethod
sp
u <- parseUri
sp
v <- parseVersion
crlf
h <- parseHeaders
crlf
b <- parseBody
return $ HttpRequest m u v h b
Now we’ve got a definite outline, as well as some motivation for helper functions. This still isn’t our final parseHttpRequest function, but it’s given us sufficient information to switch to bottom-up implementation to write some of these helpers.
Helper Functions
One of the important roles of helper functions is to smooth out rough edges of library functions. In our case, we want to define a lot of helpers using characters (Char type). However, since we’re parsing bytestrings, the “token” of our parsing is actually Word8. So we’ll write a simple function to convert a character to Word8, using Data.Char (ord):
import Data.Char
o :: Char -> Word8
o = fromIntegral . ord
Now we can use this to fill in sp, writing a simple Parser for a single space character. We use single from the Parsec library:
sp :: Parser ()
sp = void $ single (o ' ')
With URIs, we want to parse any character except a space. So we can use the opposite function anySingleBut to do that:
nonSp :: Parser Word8
nonSp = anySingleBut (o ' ')
We also observe that the CRLF sequence is common, so we’ll write a parser that gives us these two characters:
crlf' :: Parser ()
crlf' = void $ do
single (o '\r')
single (o '\n')
For the HTTP version, we’ll want to parse a digit and convert it to Word8. This involves two tricks. First, we use satisfy with a predicate to only capture characters 0-9. Then we subtract 48 from them! This is because we want the number, not the ASCII code. Since 48 is the ASCII code for 0, this subtraction gives us the actual digit value.
digitW :: Parser Word8
digitW = (\x -> x - (o '0')) <$> satisfy (\c -> c >= (o '0') && c <= (o '9'))
Finally, I’ll share an important helper that isn’t so intuitive. Our parsers will capture individual Word8 characters, often producing the type [Word8] when we combine them. We want to easily turn a Parser of this list into a Parser of a ByteString. To do this, we want to use Data.ByteString.pack like so:
import qualified Data.ByteString.Lazy as BS
mkBS :: Parser [Word8] -> Parser ByteString
mkBS = fmap BS.pack
We can combine our helpers to capture even more patterns! For example, here’s a parser that will capture all characters up until a CRLF sequence, and then consume the CRLF. It uses o, crlf’ and mkBS, as well as the someTill combinator from Megaparsec.
parseTillCrlf :: Parser ByteString
parseTillCrlf = mkBS (someTill (anySingleBut (o '\r')) crlf')
That’s all the helpers we’ll need! Now let’s consider the individual components of the HTTP Request, one by one.
Parsing a Method
We’ll start with the first part of our request line, the “method”. We have a finite number of methods as defined by this enum:
data HttpMethod =
HttpOptions | HttpGet | HttpHead | HttpPost | HttpPut |
HttpDelete | HttpTrace | HttpConnect
deriving (Show, Eq)
This means we can define all the pairs of strings to parse with the proper constructor:
parseMethod :: Parser HttpMethod
parseMethod = ...
where
pairs :: [(ByteString, HttpMethod)]
pairs =
[ ("OPTIONS", HttpOptions)
, ("GET", HttpGet)
, ("HEAD", HttpHead)
, ("POST", HttpPost)
, ("PUT", HttpPut)
, ("DELETE", HttpDelete)
, ("TRACE", HttpTrace)
, ("CONNECT", HttpConnect)
]
...
Now we can write a function that will turn one of these pairs into a Parser. This parser will use string to parse the string, and then return the method constructor:
parseMethod :: Parser HttpMethod
parseMethod = ...
where
pairs :: [(ByteString, HttpMethod)]
pairs = ...
p :: (ByteString, HttpMethod) -> Parser HttpMethod
p (s, m) = string s >> return m
Finally, we use map to create a list of Parser HttpMethod items, and then we use the choice combinator from Megaparsec to parse one of these options:
parseMethod :: Parser HttpMethod
parseMethod = choice parsers
where
pairs :: [(ByteString, HttpMethod)]
pairs =
[ ("OPTIONS", HttpOptions)
, ("GET", HttpGet)
, ("HEAD", HttpHead)
, ("POST", HttpPost)
, ("PUT", HttpPut)
, ("DELETE", HttpDelete)
, ("TRACE", HttpTrace)
, ("CONNECT", HttpConnect)
]
p :: (ByteString, HttpMethod) -> Parser HttpMethod
p (s, m) = string s >> return m
parsers :: [Parser HttpMethod]
parsers = map p pairs
And now our method parser is complete!
Parsing the URI
Parsing the URI is quite easy with our helpers:
parseUri :: Parser ByteString
parseUri = mkBS $ someTill nonSp sp
This definition has 3 steps:
- Parse a series of “non-space” characters
- End the parsing once we see a “space” character
- Convert this to a
ByteString.
The someTill combinator from Megaparsec does most of the work here. What is notable is that this parser will consume the space separator, but does not include it as part of the result. So we won’t need to include this separator in parseHttpRequest.
Parsing the Version
Parsing the version is even simpler as a step-by-step process.
- Parse the string
HTTP/ - Parse the major version digit. (We’ll limit to 1 digit to keep things simple)
- Parse the period separator
- Parse the minor version digit.
We already know about the string function from Megaparsec, and we have our digitW helper, so this is easy:
parseHttpVersion :: Parser (Word8, Word8)
parseHttpVersion = do
string "HTTP/"
d1 <- digitW
single (o '.')
d2 <- digitW
return (d1, d2)
At this point, it’s worth checking in on our main request function. Here’s what is implemented so far:
parseHttpRequest :: Parser HttpRequest
parseHttpRequest = do
m <- parseMethod
sp
u <- parseUri
v <- parseHttpVersion
crlf'
h <- undefined
b <- undefined
return $ HttpRequest m u v h b
We still have a sp separator between the method and URI. But we don’t need one between the URI and version since parseUri consumes that space. But we’ll still keep the crlf’ between the version and the headers.
Parsing Headers
Now we need to parse the headers, bearing in mind that there could be 0 of them. In this case, the headers section consists of a single CRLF sequence. The key trick here is to view the header parser as a recursive loop, maintaining the state of the header map we’ve parsed so far. So its type signature will actually look like this:
parseHeaders :: HM.HashMap ByteString ByteString -> Parser HttpHeaders
To implement this function, we need to consider two different cases. In the first case, we see a CRLF. This means we are at the end of the headers, because a header key cannot start (or even contain) a carriage return. We’ll wrap our map into HttpHeaders and return it. Thus we can start our function like this:
parseHeaders :: HM.HashMap ByteString ByteString -> Parser HttpHeaders
parseHeaders prev = crlf' >> return (HttpHeaders prev)
But if we don’t encounter a CRLF, we need to actually parse the key-value pair. We use the <|> combinator to give ourselves a second option if the CRLF parser fails.
parseHeaders :: HM.HashMap ByteString ByteString -> Parser HttpHeaders
parseHeaders prev = (crlf' >> return (HttpHeaders prev)) <|> do
...
We’ll then parse all the characters until a colon : for the header name, followed by parseTillCrlf for the value characters. (Recall that each individual header line has its own CRLF). The “name” portion uses the same pattern as parseTillCrlf, except with a colon):
parseHeaders :: HM.HashMap ByteString ByteString -> Parser HttpHeaders
parseHeaders prev = (crlf' >> return (HttpHeaders prev)) <|> do
headerName <- mkBS (someTill (anySingleBut (o ':')) (single $ o ':'))
headerBody <- parseTillCrlf
...
Now we store this key/value pair in our HashMap, and then we recurse with our new map:
parseHeaders :: HM.HashMap ByteString ByteString -> Parser HttpHeaders
parseHeaders prev = (crlf' >> return (HttpHeaders prev)) <|> do
headerName <- mkBS (someTill (anySingleBut (o ':')) (single $ o ':'))
headerBody <- parseTillCrlf
parseHeaders (HM.insert headerName headerBody prev)
Our request parser now looks like this:
parseHttpRequest :: Parser HttpRequest
parseHttpRequest = do
m <- parseMethod
sp
u <- parseUri
v <- parseHttpVersion
crlf'
h <- parseHeaders HM.empty
...
return $ HttpRequest m u v h undefined
The headers parser consumes the CRLF separating the headers from the body, so we don’t need that separator there.
Parsing the Body
Now we’ll write a simple body parser. This will have a similar pattern to the header parser, except that we are checking for the “end of the input” using the eof parser, rather than checking for a carriage return.
parseBody :: Parser (Maybe ByteString)
parseBody = (eof >> return Nothing) <|> do
b <- mkBS (manyTill anySingle eof)
return $ Just b
We use the alternative operator <|> so that we can return Nothing in the case of an empty body, and Just in the case of a body.
This approach would be dubious on a production server, since another incoming request could prevent us from reading eof from the socket. We’d include the second request as part of the first’s body! In reality, you would actually want to use the Content-Length header in the request to parse a specific number of characters in the body.
With this, we can now complete our request parser!
parseHttpRequest :: Parser HttpRequest
parseHttpRequest = do
m <- parseMethod
sp
u <- parseUri
v <- parseHttpVersion
crlf'
h <- parseHeaders HM.empty
b <- parseBody
return $ HttpRequest m u v h b
Complete Parser Code
Here’s all the code we worked on this week:
o :: Char -> Word8
o c = fromIntegral (ord c)
sp :: Parser ()
sp = void $ single (o ' ')
nonSp :: Parser Word8
nonSp = anySingleBut (o ' ')
crlf' :: Parser ()
crlf' = void $ do
single (o '\r')
single (o '\n')
digitW :: Parser Word8
digitW = (\x -> x - (o '0')) <$> satisfy (\c -> c >= (o '0') && c <= (o '9'))
mkBS :: Parser [Word8] -> Parser ByteString
mkBS = fmap BS.pack
parseTillCrlf :: Parser ByteString
parseTillCrlf = mkBS (someTill (anySingleBut (o '\r')) crlf')
parseMethod :: Parser HttpMethod
parseMethod = choice parsers
where
pairs :: [(ByteString, HttpMethod)]
pairs =
[ ("OPTIONS", HttpOptions)
, ("GET", HttpGet)
, ("HEAD", HttpHead)
, ("POST", HttpPost)
, ("PUT", HttpPut)
, ("DELETE", HttpDelete)
, ("TRACE", HttpTrace)
, ("CONNECT", HttpConnect)
]
p :: (ByteString, HttpMethod) -> Parser HttpMethod
p (s, m) = string s >> return m
parsers :: [Parser HttpMethod]
parsers = map p pairs
parseUri :: Parser ByteString
parseUri = mkBS $ someTill nonSp sp
parseHttpVersion :: Parser (Word8, Word8)
parseHttpVersion = do
string "HTTP/"
d1 <- digitW
single (o '.')
d2 <- digitW
return (d1, d2)
parseHeaders :: HM.HashMap ByteString ByteString -> Parser HttpHeaders
parseHeaders prev = (crlf' >> return (HttpHeaders prev)) <|> do
headerName <- mkBS (someTill (anySingleBut (o ':')) (single $ o ':'))
headerBody <- parseTillCrlf
parseHeaders (HM.insert headerName headerBody prev)
parseBody :: Parser (Maybe ByteString)
parseBody = (eof >> return Nothing) <|> do
b <- mkBS (manyTill anySingle eof)
return $ Just b
parseHttpRequest :: Parser HttpRequest
parseHttpRequest = do
m <- parseMethod
sp
u <- parseUri
v <- parseHttpVersion
crlf'
h <- parseHeaders HM.empty
b <- parseBody
return $ HttpRequest m u v h b
Conclusion
Next week, we’ll take this parser and incorporate it into a simple but functional HTTP server! We’ll just need to add some serialization logic and network mechanics.
If you want to learn how to use Megaparsec to parse anything in Haskell, sign up for our course, Solve.hs! Module 4 of this course will teach you how to write a parser from scratch, as well as deal with common data formats, including HTML!