FsRegEx


On the Futility of Email Regex Validation

Using regular expressions to validate email addresses is notoriously error-prone. The address format specification is complex enough that it is hard to construct a machine-readable grammar that correctly distinguishes valid addresses from invalid ones.

Evolution of the Internet Message Format

RFC 561 Standardizing Network Mail Headers, September 5, 1973

RFC 680 Message Transmission Protocol, April 30, 1975

RFC 724 Proposed Official Standard for the Format of ARPA Network Messages, May 12, 1977

RFC 733 STANDARD FOR THE FORMAT OF ARPA NETWORK TEXT MESSAGES, November 21, 1977

Obsoletes: RFC 561, RFC 680, RFC 724

RFC 822 STANDARD FOR THE FORMAT OF ARPA INTERNET TEXT MESSAGES, August 13, 1982

Obsoletes: RFC 733

RFC 2822 Internet Message Format, April 2001

Obsoletes: RFC 822

RFC 5322 Internet Message Format, October 2008

Obsoletes: RFC 2822

Testing Regular Expressions that Parse Email Addresses

The tests are derived from the test suite published with RFC 822 Email Address Parser in PHP, which we converted to run under FsRegEx. The author of the PHP parser claims a much better success rate than we achieved with any regular expression, although still not a perfect one. The site also provides more background on the "muddy" internet specification.

The suite consists of 280 validation tests, which we converted to Expecto tests here. Run them with the Email.Tests console app.

We chose some email parsing regular expressions available on the internet for testing:

Simple, from emailregex.com.

Moderate, from emailregex.com.

Complex, from emailregex.com.

Ultimate, from Mail::RFC822::Address: regexp-based address validation. This regular expression was machine generated.

Note that emailregex.com headlines "Email Address Regular Expression That 99.99% Works". The site offers many different email address regular expressions. Apparently we never found the one that works 99.99% on our test suite.
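The tallying behind the results can be sketched roughly as follows. This is a minimal Python illustration, not the Expecto suite itself: the pattern, the helper names and the four sample cases are all hypothetical, and it assumes that a "false negative" means a valid address that the checker rejected.

```python
import re
from typing import Callable, Iterable, NamedTuple, Tuple


class Tally(NamedTuple):
    passed: int
    failed: int
    false_negatives: int


def tally(is_valid: Callable[[str], bool],
          cases: Iterable[Tuple[str, bool]]) -> Tally:
    """Count how often a checker agrees with the expected verdicts."""
    passed = failed = false_negatives = 0
    for address, expected in cases:
        actual = is_valid(address)
        if actual == expected:
            passed += 1
        else:
            failed += 1
            # Assumption: a false negative is a valid address that was rejected.
            if expected and not actual:
                false_negatives += 1
    return Tally(passed, failed, false_negatives)


# Hypothetical pattern for illustration only; it is NOT one of the
# expressions tested in this article.
ILLUSTRATIVE_PATTERN = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")


def matches_pattern(address: str) -> bool:
    return ILLUSTRATIVE_PATTERN.fullmatch(address) is not None


# Hypothetical cases, not taken from the 280-test suite.
ILLUSTRATIVE_CASES = [
    ("user@example.com", True),
    ("user@localhost", True),   # a domain without a dot is syntactically valid
    ("no-at-sign", False),
    ("a@b@c.com", False),
]

result = tally(matches_pattern, ILLUSTRATIVE_CASES)
print(result)
```

Even this small sample shows how a plausible-looking pattern rejects a syntactically valid address: `user@localhost` is counted as both a failure and a false negative.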

We also tested a much simpler checker that uses no regular expression. It simply tests the string for a single at sign that is not wrapped in quotes and is neither the first nor the last character in the string. The checker is available here.
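A check of this kind might look like the following Python sketch. It is not the FsRegEx implementation: the function name is hypothetical, "quotes" is assumed to mean double quotes, and skipping backslash-escaped characters is an added assumption that the description above does not mention.

```python
def has_single_unquoted_at_sign(address: str) -> bool:
    """Return True if exactly one '@' lies outside double quotes
    and it is neither the first nor the last character."""
    in_quotes = False
    escaped = False
    at_positions = []
    for index, char in enumerate(address):
        if escaped:
            # Assumption: a character following a backslash is taken literally.
            escaped = False
            continue
        if char == "\\":
            escaped = True
        elif char == '"':
            in_quotes = not in_quotes
        elif char == "@" and not in_quotes:
            at_positions.append(index)
    if len(at_positions) != 1:
        return False
    position = at_positions[0]
    return 0 < position < len(address) - 1


print(has_single_unquoted_at_sign("user@example.com"))
print(has_single_unquoted_at_sign('"a@b"@example.com'))
print(has_single_unquoted_at_sign("a@b@c"))
```

The first two calls accept the address, because the at sign inside the quoted local part is ignored; the third rejects it because two unquoted at signs are present.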

Results

Simple:

132 passed, 148 failed, 98 false negatives

Moderate:

190 passed, 90 failed, 72 false negatives

Complex:

172 passed, 108 failed, 60 false negatives

Ultimate:

170 passed, 110 failed, 11 false negatives

One At Sign:

162 passed, 118 failed, 1 false negative (test166)
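To compare the checkers at a glance, the counts above can be turned into pass rates. The short Python sketch below uses only the figures reported in this section; the names `REPORTED` and `pass_rate` are illustrative.

```python
TOTAL_TESTS = 280

# (passed, failed, false negatives) as reported above.
REPORTED = {
    "Simple": (132, 148, 98),
    "Moderate": (190, 90, 72),
    "Complex": (172, 108, 60),
    "Ultimate": (170, 110, 11),
    "One At Sign": (162, 118, 1),
}


def pass_rate(passed: int, total: int = TOTAL_TESTS) -> float:
    """Percentage of tests passed, rounded to one decimal place."""
    return round(100 * passed / total, 1)


for name, (passed, failed, false_negatives) in REPORTED.items():
    # Every row should account for all 280 tests.
    assert passed + failed == TOTAL_TESTS
    print(f"{name}: {pass_rate(passed)}% passed, "
          f"{false_negatives} of {failed} failures were false negatives")
```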

Conclusion

The evolution of the Internet email address specification and the way it is documented make it impossible to settle on a completely deterministic grammar covering all cases. Given that, no parser can be complete, let alone one based on a regular expression. That said, the test results do not reflect how likely any of the 280 test cases is to be encountered in the wild.

Simply determining if there is a correctly placed at sign is perhaps the best filter for most practical purposes, given the importance of this criterion. (Note that even our attempt at rejecting multiple at signs resulted in a false negative.)
