Note to self: Western European Windows ASCII extensions come from code page 1252

I’ve just spent more hours than were available trying to work out how to get the StreamReader in my book-parsing C# code to recognise characters beyond ASCII. Here are some notes so I don’t spend so long next time!

  1. ASCII is a seven-bit encoding, so the characters it covers have codes 0 / 00 / 0000000 through to 127 / 7F / 1111111 (decimal / hexadecimal / binary). Wikipedia has a useful lookup table here: http://en.wikipedia.org/wiki/ASCII
  2. I’ve been reading in a text file that contains characters beyond this range, for example the em dash (—), the ellipsis (…), curly quotation marks (“”, not the straight ones), etc.
  3. Since computers usually store each ASCII character in an eight-bit byte, the spare values (0x80 through 0xFF) are used to encode such additional characters. Which characters those values stand for depends on the code page in use.
  4. But how do I tell which code page I am using? I could see the symbols when I opened the text files in Word, Notepad, or Visual Studio. I could even tell the hex representation (from browsing ‘Insert Symbol’ in Word or opening the text file in Visual Studio’s binary editor), but which ASCII extension was it, and how could I get my C# code to use it?
  5. The library documentation for the .NET System.Text.Encoding class contains a long list of supported encodings and their associated code pages. Halfway down was a hopeful-looking “Windows-1252 Western European (Windows)”.
  6. Now things heated up. Wikipedia has a page showing the encodings in code page Windows-1252 and ‘Bingo!’, the character I was looking for (em dash) had the encoding I was seeing (0x97).
  7. Finally, a quick search for “csharp StreamReader 1252” revealed the code I needed. I replaced
         using (this.bookStream = new StreamReader(fileName))
    with
         using (this.bookStream = new StreamReader(fileName, Encoding.GetEncoding(1252)))
    and all works fine!
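
For next time, here is a minimal sketch of why the one-argument constructor went wrong and what the fix changes. When no encoding is given, StreamReader assumes UTF-8, and a lone 0x97 byte is not valid UTF-8, so it gets swapped for the replacement character. The class name CodePageDemo and the helper ReadAll are made up for illustration, and the RegisterProvider line assumes a newer runtime (.NET Core 3.0 or later), where code page 1252 has to be registered before GetEncoding(1252) will work. On the classic .NET Framework I was using, that line should not be needed.

```csharp
using System;
using System.IO;
using System.Text;

public static class CodePageDemo
{
    // Hypothetical helper: decode raw bytes through a StreamReader
    // using whichever encoding the caller supplies.
    public static string ReadAll(byte[] raw, Encoding encoding)
    {
        using (var reader = new StreamReader(new MemoryStream(raw), encoding))
        {
            return reader.ReadToEnd();
        }
    }

    public static void Main()
    {
        // Assumption: running on .NET Core / .NET 5+, where code page
        // encodings must be registered before GetEncoding(1252) succeeds.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        // 'a', then the byte I was seeing for the em dash, then 'b'.
        byte[] raw = { 0x61, 0x97, 0x62 };

        string asUtf8 = ReadAll(raw, Encoding.UTF8);
        string as1252 = ReadAll(raw, Encoding.GetEncoding(1252));

        Console.WriteLine("UTF-8: U+{0:X4}", (int)asUtf8[1]); // U+FFFD (replacement character)
        Console.WriteLine("1252:  U+{0:X4}", (int)as1252[1]); // U+2014 (em dash)
    }
}
```

The same check works for the other characters from my file: under code page 1252, 0x85 is the ellipsis and 0x93 / 0x94 are the curly double quotes.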