When the length of a database field can be misleading

The length of a database field, at least on MS SQL Server, is not what many people think it is.

Let’s say that I have a 10-character field, nvarchar(10):
[screenshot: table structure]
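A minimal sketch of that structure (the table and column names here are assumptions, not the originals):

    -- A table with a single nvarchar(10) column.
    CREATE TABLE dbo.LengthDemo
    (
        Id   INT IDENTITY(1, 1) PRIMARY KEY,
        Name NVARCHAR(10) NULL
    );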

Surely we can insert a row with 5 chars in this field:
[screenshot: inserting 5 characters]
and the result is as expected:
[screenshot: successful run]
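Assuming the table sketched above, this step looks roughly like:

    -- Insert five characters into the nvarchar(10) column.
    INSERT INTO dbo.LengthDemo (Name)
    VALUES (N'𝐀𝐁𝐂𝐃𝐄');
    -- (1 row(s) affected)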

Let’s try to insert a 6th character; it should work fine, right?
[screenshot: adding a 6th character]

Well, not really 🙂

[screenshot: run failure]
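The failing statement looks roughly like this (same assumed table):

    -- Try to insert six characters: this fails.
    INSERT INTO dbo.LengthDemo (Name)
    VALUES (N'𝐀𝐁𝐂𝐃𝐄𝐅');
    -- Msg 8152: String or binary data would be truncated.
    -- The statement has been terminated.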

What happened?

By default, SQL Server uses UCS-2 encoding for nvarchar columns.
UCS-2 represents each character with 16 bits (2 bytes) – 65,536 characters should be enough for everybody, right? 🙂
Well, not exactly 🙂 Since 2001, many more characters have been added to the Unicode standard, reaching a total of 120,737 today (2015, Unicode 8.0). These clearly can’t all be represented in only 2 bytes, so 3 or 4 are needed – in UTF-16, characters outside the Basic Multilingual Plane take a pair of 2-byte code units.

In our case, A, B, C, D… are not letters from the Latin alphabet, but… ‘MATHEMATICAL BOLD CAPITAL A, B, C…’ (U+1D400 and following): http://unicode.org/cldr/utility/character.jsp?a=1D400

In UTF-16, each of these characters is represented in 4 bytes, as a surrogate pair; for 𝐀 (U+1D400) that pair is 0xD835 0xDC00 (hex).
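You can see both the surrogate pair and how SQL Server counts it directly in T-SQL; a small sketch, assuming a default (non-SC) collation:

    -- Build U+1D400 (MATHEMATICAL BOLD CAPITAL A) from its two
    -- UTF-16 code units; NCHAR() takes the integer code unit values
    -- 0xD835 = 55349 and 0xDC00 = 56320 (assumes a non-SC collation).
    DECLARE @a NVARCHAR(10) = NCHAR(55349) + NCHAR(56320);

    SELECT @a             AS Glyph,      -- renders as 𝐀
           LEN(@a)        AS CharCount,  -- 2: counted as two characters
           DATALENGTH(@a) AS ByteCount;  -- 4 bytes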
MS SQL Server will happily accept it, but by default will count it as 2 characters. The same happens in the .NET Framework, which will return a length of 12 for the above string:
[screenshots: getting the string length in .NET]
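A minimal C# sketch of the same check (the variable names are assumptions):

    using System;
    using System.Globalization;

    class Program
    {
        static void Main()
        {
            // Six visible characters, each stored as a UTF-16 surrogate pair.
            string s = "𝐀𝐁𝐂𝐃𝐄𝐅";

            // String.Length counts UTF-16 code units, not code points: 12.
            Console.WriteLine(s.Length);

            // StringInfo counts text elements instead: 6.
            Console.WriteLine(new StringInfo(s).LengthInTextElements);
        }
    }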
