On assumptions and formats

In .NET (and in any other framework, for that matter), it’s better never to assume anything – check twice instead.
Let’s take an example: do you think the following unit test will always pass?

[TestMethod]
public void ShortDateLength()
{
    DateTime d = new DateTime(2015, 01, 18);
    string dateString = d.ToString("yyyyMMdd");

    Assert.AreEqual(8, dateString.Length);
}

… well, usually it will, but once in a blue moon it will fail 🙂 .
All it takes is a user changing the regional settings of the computer, or adding the following three lines of code:

[TestMethod]
public void ShortDateLength()
{
    // The three added lines: force a Hebrew culture with the Hebrew calendar
    // on the current thread before formatting the date.
    CultureInfo c = new CultureInfo("he-IL", false);
    c.DateTimeFormat.Calendar = new HebrewCalendar();
    Thread.CurrentThread.CurrentCulture = c;

    DateTime d = new DateTime(2015, 01, 18);
    string dateString = d.ToString("yyyyMMdd");

    Assert.AreEqual(8, dateString.Length);
}

Yes, in some cultures and calendars dates are not represented with Arabic numerals, and the year might not fit in 4 characters.
In the above case, dateString will have the following ‘unexpected’ value:
תשע”הד’כ”ז

– a solid 10 chars in length.
When does this matter? When dateString is going to be displayed in the UI, probably not – in that case I want it formatted in the format chosen by the end-user.
However, if the DateTime value is going to be serialized to a text file or sent to a web service, I want to be sure I will be able to decode it later.
In such cases it’s better to replace the ToString call above with:
string dateString = d.ToString("yyyyMMdd", CultureInfo.InvariantCulture);
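
To make the round trip explicit, here is a minimal sketch (mine, not from the original post; the test name is just illustrative) that serializes and re-parses the value with the invariant culture, so the thread culture never enters the picture – it should pass even with the Hebrew-calendar trick shown above:

[TestMethod]
public void InvariantCultureRoundTrip()
{
    DateTime d = new DateTime(2015, 01, 18);

    // Always produces "20150118", regardless of the thread culture or calendar.
    string dateString = d.ToString("yyyyMMdd", CultureInfo.InvariantCulture);

    // Parsing with the same format and culture recovers the original value.
    DateTime parsed = DateTime.ParseExact(dateString, "yyyyMMdd", CultureInfo.InvariantCulture);

    Assert.AreEqual(d, parsed);
}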

Somewhat related – which of the following tests do you think will pass?

const string digits1 = "5678";
Assert.IsTrue(Regex.IsMatch(digits1, @"^\d+$"));

const string digits2 = "୮౪୩"; // decimal digits from the Oriya and Telugu scripts
Assert.IsTrue(Regex.IsMatch(digits2, @"^\d+$"));

Unexpectedly for some, both will pass 🙂
\d in .NET matches any decimal digit, and is Unicode-aware (https://msdn.microsoft.com/en-us/library/20bw873z(v=vs.110).aspx#DigitCharacter).
୮౪୩ are… digits in some scripts (http://www.fileformat.info/info/unicode/category/Nd/list.htm).
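
To see how .NET itself classifies those characters, here is a small sketch (again mine, not from the post): every character of ୮౪୩ reports the DecimalDigitNumber Unicode category, which is exactly what \d matches (it needs a using for System.Globalization):

[TestMethod]
public void ExoticDigitsAreStillDigits()
{
    foreach (char ch in "୮౪୩")
    {
        // Each character counts as a 'digit' as far as .NET is concerned...
        Assert.IsTrue(char.IsDigit(ch));

        // ...because it belongs to the Nd (DecimalDigitNumber) Unicode category.
        Assert.AreEqual(UnicodeCategory.DecimalDigitNumber,
                        CharUnicodeInfo.GetUnicodeCategory(ch));
    }
}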

Again, why is this relevant? Because many developers use \d as a quick way to validate that input contains only the characters 0,1,2,…,9 – well, ‘digit’ can mean more than that.
Some will say that in their application there is no risk of an end-user entering something like ୮౪୩ – true, unless the input comes from a misbehaving client application calling a web service, and it just happens to send, by accident, one of the following byte sequences (hex values):
EF BB BF E0 AD AE
or
E0 B1 AA
or
E0 AD A9
– decoded as UTF-8, these are… digits.
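
If the input is really supposed to contain only the characters 0-9, one option is to spell out the character class explicitly; another is to use ECMAScript-compliant matching, where \d is restricted to [0-9]. A short sketch (mine, the test name is illustrative):

[TestMethod]
public void OnlyAsciiDigitsAreAccepted()
{
    const string exotic = "୮౪୩";

    // An explicit character class rejects the non-ASCII digits...
    Assert.IsFalse(Regex.IsMatch(exotic, "^[0-9]+$"));

    // ...and so does \d under ECMAScript-compliant matching.
    Assert.IsFalse(Regex.IsMatch(exotic, @"^\d+$", RegexOptions.ECMAScript));

    // Ordinary ASCII input still matches both patterns.
    Assert.IsTrue(Regex.IsMatch("5678", "^[0-9]+$"));
    Assert.IsTrue(Regex.IsMatch("5678", @"^\d+$", RegexOptions.ECMAScript));
}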
