Reducing bugs with tight types

A few years ago a customer found a nasty bug in some of our Python software. The short version of the problem was that an out-of-range value was being passed from the user interface to low-level software. The details are a little more complicated but worth a brief examination, as they demonstrate how specification mismatches can occur.

The program was designed to allow a user to enter a SIP URI – a video endpoint address – which the program would connect to. Typically this would be a domain name or occasionally an IP address. On this occasion the user entered a domain name, a colon and then a telephone number, for complicated but understandable reasons.[1]

A SIP URI is defined, in part, as

 

SIP-URI = "sip:" [ userinfo ] hostport uri-parameters [ headers ]
...
hostport = host [ ":" port ]
port = 1*DIGIT

Note the optional :port at the end of the hostport definition. This is intended to represent the TCP/IP port number, and in the text-based SIP spec it is defined as one or more numeric characters. Unfortunately, in the TCP/IP spec it is defined as an unsigned 16-bit integer, giving a range of 0 – 65,535. The telephone number our beloved user was entering far exceeded that value.
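The gap between the two definitions can be made concrete with a short Python sketch. The function names is_valid_sip_port and is_valid_tcp_port are hypothetical, the regular expression covers only the port = 1*DIGIT rule quoted above rather than the full SIP grammar, and the telephone number is a made-up stand-in for what the user typed.

```python
import re

# port = 1*DIGIT from the SIP grammar: one or more ASCII digits, no upper bound.
SIP_PORT = re.compile(r"[0-9]+")


def is_valid_sip_port(text: str) -> bool:
    """True if text matches the SIP grammar rule for a port."""
    return SIP_PORT.fullmatch(text) is not None


def is_valid_tcp_port(text: str) -> bool:
    """True if text is also representable as an unsigned 16-bit TCP port."""
    return is_valid_sip_port(text) and 0 <= int(text) <= 0xFFFF


# The last candidate is an invented telephone number used for illustration.
for candidate in ("5060", "65535", "65536", "01632960123"):
    print(candidate, is_valid_sip_port(candidate), is_valid_tcp_port(candidate))
```

Anything that passes the first check but fails the second falls into the gap between the two specifications.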

Our bug was that, although we ensured that port was valid according to the SIP spec, we did not also enforce validity according to the TCP spec until the attempt to make a TCP connection to the URI. This caused a low-level OverflowError to be thrown. At this stage there was nothing useful the program could do with that error; the low-level code had no context in which to handle it and the high-level code had no appropriate error handler: validation and acceptance of the user-generated configuration was long over. There was a gap in our validation that matched the gap between the two specifications.

There’s arguably another bug here: the Python documentation for socket.create_connection refers to a (host, port) tuple but makes no mention of what port is. That information is to be found under the documentation for socket families, where it states:

A pair (host, port) is used for the AF_INET address family, where host is a string representing either a hostname in internet domain notation like ‘daring.cwi.nl’ or an IPv4 address like ‘100.50.200.5’, and port is an integer.

That “[…] and port is an integer” statement hides the requirement that port fits into an unsigned 16-bit integer. Without networking domain knowledge you’re left to try it and see.
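Trying it and seeing looks something like the sketch below. It calls connect on a plain socket with a loopback address rather than going through socket.create_connection, on the assumption that CPython range-checks the port while converting the address tuple, before anything is sent on the network. The helper name try_connect is hypothetical.

```python
import socket


def try_connect(host: str, port: int) -> str:
    """Attempt a TCP connection and report which kind of error, if any, occurred."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(1.0)
        try:
            sock.connect((host, port))
        except OverflowError:
            # The port did not fit in 16 bits, so the address tuple was rejected.
            return "overflow"
        except OSError:
            # An ordinary network failure, e.g. connection refused or timed out.
            return "network error"
        return "connected"


print(try_connect("127.0.0.1", 123456789))
```

Note that OverflowError is not an OSError, so an error handler written with only network failures in mind will not catch it.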

Our fix at the time was to add more validation code around acceptance of user-generated URIs and to beef up our test cases. Bug fix applied, happy customer, and several years later we have had no recurrence of this issue.

That sounds like a fairly uninteresting war story. But the worry is contained in that final sentence: “no recurrence”. There are many ways our software receives user-generated URIs, many of them from third-party devices out of our control. So there are many possible points of failure, each of which has to be checked. As we have already proved, this process is error-prone.

What can we do to improve this situation, to make it impossible for the error to occur? The original code was written in Python, but with my current interest in Haskell it’s not surprising that my solution is through the type system. The underlying idea is to make impossible states impossible to represent. If there’s no way of saying port = 123456789 then the problem is solved.

Many years ago I was writing software in Pascal and came across Pascal’s idea of subrange types. This allows you to constrain a value to a subrange of a particular data type. Simple cases of violation can be detected at compile time, others at runtime with a runtime error being thrown if the optional runtime checking is enabled.

 

{$RangeChecks On}
program Subrange;

var
        port : 1 .. 65535;
        myInteger : integer;

begin
        myInteger := 123456789;
        port := myInteger;  // throws a runtime error
        // port := 123456789; // Compile time error
end.

This approach doesn’t provide much improvement over the original Python version. It will simply throw a runtime error when an out-of-range user-generated value is assigned to the port.[2] That error has to be anticipated and handled by the higher-level software.

Returning to my interest in solving this with Haskell: the language doesn’t support subrange types,[3] but the normal workaround is to provide a type with a hidden constructor and a factory function which validates the input.

The problem then becomes what to do if the input is invalid. Throwing an exception just returns us to the problem of how to make sure the high-level code handles the problem, as with the Pascal example above.

Haskell’s idiomatic answer to this is to return a Maybe Port. The caller then pattern matches on the success and failure branches. The Haskell compiler will warn of incomplete matches if the failure case is not handled. But that raises the question of how to handle failure.

 

import Data.Word

newtype Port = Port
  {
    port :: Word16
  }
  deriving (Bounded, Eq, Ord, Show)

maybePort :: Integral i => i -> Maybe Port
maybePort i 
  | withinBounds = Just $ Port $ fromIntegral i
  | otherwise = Nothing
  where
    withinBounds  =
      toInteger i >= toInteger (minBound::Word16) &&
      toInteger i <= toInteger (maxBound::Word16)


connect :: Port -> IO ()
-- putStrLn rather than print, so the message appears without quotes.
connect _ = putStrLn "This is where we make a TCP connection"


-- For now we'll just assume that the input String to exampleUsage
-- consists only of digits.  That obviously opens us up to other
-- errors if that String is malformed so again we'll want a smart
-- constructor.  Skipping that for now and just aliasing String to get
-- the general point across.
type DigitString = String

exampleUsage :: DigitString -> IO ()
exampleUsage userInput =
  -- Read as an unbounded Integer so oversized input reaches the range check.
  case maybePort (read userInput :: Integer) of
    Nothing -> putStrLn "Handle incorrect user input"
    Just validPort -> connect validPort

Imagine a three-layer architecture: the code getting the user’s input, some business logic and the code making the TCP connection request. The author of the TCP connection code hopefully has a good knowledge of the requirements for a port to be valid and captures them in the definition of Port, which has to be used in order to invoke the connect function. The business logic doesn’t know how to handle an invalid port so requires the user interface layer to provide a valid Port. At the UI level, handling an invalid port number is actually a valid state – it’s something that can occur and we should be prepared to handle. Whether or not a port number is valid is easy to determine – try to construct a Port instance from it. Since the user interface layer is close to the source of the problem it is much easier to handle: reject the configuration.
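For comparison, here is roughly how those three layers might look back in Python, the language of the original bug. This is a sketch only: Port.parse, dial and accept_configuration are hypothetical names, and since Python cannot hide a constructor the class also guards direct construction. A static type checker such as mypy takes the compiler's role of flagging code that passes the Optional result on without handling None.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Port:
    """A TCP port number known to fit in an unsigned 16-bit integer."""

    value: int

    def __post_init__(self) -> None:
        # Python cannot hide the constructor, so guard direct construction too.
        if not 0 <= self.value <= 0xFFFF:
            raise ValueError(f"port out of range: {self.value}")

    @classmethod
    def parse(cls, text: str) -> Optional["Port"]:
        """Smart constructor: return None instead of raising on bad input."""
        if not (text.isascii() and text.isdigit()):
            return None
        number = int(text)
        return cls(number) if 0 <= number <= 0xFFFF else None


def connect(port: Port) -> str:
    """Lowest layer: stands in for the code making the TCP connection."""
    return f"connecting to port {port.value}"


def dial(port: Port) -> str:
    """Business logic: only ever sees a Port that has already been validated."""
    return connect(port)


def accept_configuration(user_input: str) -> str:
    """UI layer: the one place where an invalid port is an expected state."""
    port = Port.parse(user_input)
    if port is None:
        return "configuration rejected: port must be between 0 and 65535"
    return dial(port)


print(accept_configuration("5060"))
print(accept_configuration("123456789"))
```

Only accept_configuration has to know what an invalid port looks like; the two lower layers cannot be handed one without someone deliberately bypassing the type.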

It’s obviously still possible to fail: you may choose to match on the failure case and explicitly convert it to an exception, but at least you’ll feel bad about doing that.

When I decided to start pulling on the thread of this particular problem, I thought I ought to look into how various Haskell libraries handle this particular case. I came across a few modules that implemented different port definitions.

There are implementations in terms of Maybe Int or Maybe Integer which repeat the error of the Python implementation in a language famous for its desire to avoid primitive obsession. More promising is a lower-level module, Network.Socket, used in the popular Yesod web framework. This wraps a Word16, so faithfully echoing the TCP spec:

newtype PortNumber = PortNum Word16 deriving (Eq, Ord, Num, Enum, Bounded, Real, Integral)

My concern here is that the constructor is exposed so there is no explicit check on the value used for the port and Word16 will silently wrap if given too large a value:

 

λ> 123456789 :: PortNumber
52501

This will result in a silent failure of the connection attempt, not even a low level exception. Whilst it could be argued that the programmer has been explicitly informed that they are putting a value into a 16 bit word and should be aware of the risks, I prefer to provide basic safety rails (private constructor) and appropriate tools (range checking smart constructor).
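The value 52501 is simply the low 16 bits of the original number. A quick check in Python, using ctypes.c_uint16 as a stand-in for Word16 (an assumption made for illustration, not a description of what the Haskell library does internally):

```python
import ctypes

requested = 123456789

# Keep only the low 16 bits, which is what wrapping into a 16-bit word amounts to.
wrapped = requested % 2**16

# ctypes.c_uint16 does no overflow checking and truncates in the same way.
assert ctypes.c_uint16(requested).value == wrapped

print(wrapped)  # 52501
```

So the connection attempt would be made to a perfectly legal port, just not one the user ever asked for.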

As we’ve seen, tightening up on the types used to express a value can eliminate the potential for invalid values. This is effectively a compile-time version of Bertrand Meyer’s pre-conditions in his Design by Contract approach.[4] For a balancing view on not going overboard on type safety, Michael Snoyman’s blog is worth reading.

Footnotes:

1

They were having trouble connecting – the networks were intentionally partitioned for security reasons. In an attempt to connect they tried a number of variations on non-standard dial strings allowed by some endpoints.

2

It’s actually worse than that. Some Pascal implementations, such as Free Pascal, don’t check for range overflow by default. You think you’ve got a safety net only to discover it is not connected to anything.

3

Liquid Haskell supports refinement types, which are a superset of subrange types, but that’s sledgehammer-to-crack-a-nut territory.

4

  • Meyer, B. (2009). Object-Oriented Software Construction. Prentice Hall.