On the Futility of Email Regex Validation
Using regular expressions to validate email addresses is notoriously error-prone. The address format specification is complex enough that it is hard to construct a machine-readable grammar that correctly distinguishes valid from invalid addresses.
Evolution of the Internet Message Format
RFC 561 Standardizing Network Mail Headers, September 5, 1973
RFC 680 Message Transmission Protocol, April 30, 1975
RFC 724 Proposed Official Standard for the Format of ARPA Network Messages, May 12, 1977
RFC 733 STANDARD FOR THE FORMAT OF ARPA NETWORK TEXT MESSAGES, November 21, 1977
Obsoletes: RFC 561, RFC 680, RFC 724
RFC 822 STANDARD FOR THE FORMAT OF ARPA INTERNET TEXT MESSAGES, August 13, 1982
Obsoletes: RFC 733
RFC 2822 Internet Message Format, April 2001
Obsoletes: RFC 822
RFC 5322 Internet Message Format, October 2008
Obsoletes: RFC 2822
Testing Regular Expressions that Parse Email Addresses
The tests are derived from a suite of tests available at RFC 822 Email Address Parser in PHP, which we have converted to run under FsRegEx. The PHP parser author claims a much better success rate than we were able to achieve with any regular expression, although still not perfect. The site also provides more background on the "muddy" internet specification.
The suite consists of 280 validation tests, which we converted to Expecto tests. Run them with the Email.Tests console app.
We chose some email parsing regular expressions available on the internet for testing:
Simple, from emailregex.com.
Moderate, from emailregex.com.
Complex, from emailregex.com.
Ultimate, from Mail::RFC822::Address: regexp-based address validation. This regular expression was machine generated.
Note that emailregex.com headlines "Email Address Regular Expression That 99.99% Works". The site offers many different email address regular expressions. We did not find one that comes close to 99.99% on our test suite.
We also tested a much simpler checker that does not use a regular expression. It checks that the string contains a single at sign that is not wrapped in quotes and is neither the first nor the last character in the string.
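The at sign check described above can be sketched roughly as follows. This is an illustration in Python, not the checker used in the tests. The function name has_one_bare_at_sign is hypothetical, and the handling of backslash escapes is an assumption that the description does not mention.

```python
def has_one_bare_at_sign(address: str) -> bool:
    """Return True if exactly one '@' lies outside double quotes and
    that '@' is neither the first nor the last character."""
    in_quotes = False
    escaped = False
    bare_at_positions = []

    for index, char in enumerate(address):
        if escaped:
            # Assumption: a character after a backslash is taken literally.
            escaped = False
            continue
        if char == "\\":
            escaped = True
        elif char == '"':
            in_quotes = not in_quotes
        elif char == "@" and not in_quotes:
            bare_at_positions.append(index)

    if len(bare_at_positions) != 1:
        return False

    position = bare_at_positions[0]
    return 0 < position < len(address) - 1
```

Under these assumptions, user@example.com and "a@b"@example.com are accepted, while @example.com, user@, and a@b@c are rejected.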
Results
Simple:
132 passed, 148 failed, 98 false negatives
Moderate:
190 passed, 90 failed, 72 false negatives
Complex:
172 passed, 108 failed, 60 false negatives
Ultimate:
170 passed, 110 failed, 11 false negatives
One At Sign:
162 passed, 118 failed, 1 false negative (test166)
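For readers who want to reproduce this kind of tally, here is a minimal sketch in Python. It assumes that each test case pairs an address with its expected validity, and that a "false negative" means a valid address the checker rejected. That definition is our reading of the figures above, not something stated by the test suite. The names Tally, tally, naive_validator, and SAMPLE_CASES are hypothetical, and the sample cases are made up for illustration rather than taken from the 280 tests.

```python
from typing import Callable, Iterable, NamedTuple, Tuple


class Tally(NamedTuple):
    passed: int
    failed: int
    false_negatives: int


def tally(validator: Callable[[str], bool],
          cases: Iterable[Tuple[str, bool]]) -> Tally:
    """Count passes, failures, and (assumed) false negatives."""
    passed = failed = false_negatives = 0
    for address, expected_valid in cases:
        if validator(address) == expected_valid:
            passed += 1
        else:
            failed += 1
            if expected_valid:
                # Assumption: a false negative is a valid address rejected.
                false_negatives += 1
    return Tally(passed, failed, false_negatives)


def naive_validator(address: str) -> bool:
    """Illustrative checker: accept only strings with exactly one '@'."""
    return address.count("@") == 1


# Made-up cases for illustration: (address, expected validity).
SAMPLE_CASES = [
    ("user@example.com", True),
    ("no-at-sign", False),
    ('"quoted@local"@example.com', True),
    ("a@b@c", False),
]

print(tally(naive_validator, SAMPLE_CASES))
# prints: Tally(passed=3, failed=1, false_negatives=1)
```

In this toy run, the one failure is a valid address with an at sign inside a quoted local part. It shows how a rule that rejects multiple at signs can produce a false negative.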
Conclusion
The evolution of the Internet email address specification and the way it is documented make it impossible to settle on a completely deterministic grammar covering all cases. Given that, no complete parser is possible, let alone one based on a regular expression. That said, the test results do not reflect how likely any of the 280 test cases are to be encountered in the wild.
Simply determining if there is a correctly placed at sign is perhaps the best filter for most practical purposes, given the importance of this criterion. (Note that even our attempt at rejecting multiple at signs resulted in a false negative.)