I’ve just spent more hours than were available trying to work out how to get the StreamReader in my book-parsing C# code to recognise characters beyond ASCII. Here are some notes so I don’t spend so long next time!
- ASCII is seven bit so the characters covered have encodings 0 / 00 / 0000000 through to 127 / 7F / 1111111 (decimal / hexadecimal / binary). Wikipedia has a useful lookup table here: http://en.wikipedia.org/wiki/ASCII
- I’ve been reading in a text file that contains characters beyond this range. For example em dash (—), ellipsis (…), quotation marks (“”, not the straight ones), etc.
- Since computers usually store each ASCII character in eight bits, the spare range (0x80 through 0xFF) is used to encode such additional characters. Which character lands on which value depends on the ASCII extension, or “code page”, in use.
- But how do I tell which code page I am using? I could see the symbols when I opened the text files in Word, Notepad, or Visual Studio. I could even tell the hex representation (from browsing ‘Insert Symbol’ in Word or opening the text file in Visual Studio’s binary editor), but which ASCII extension was it, and how could I get my C# code to use it?
- The library documentation for the .NET System.Text.Encoding class contains a long list of supported encodings and their associated code pages. Half-way down was a hopeful-looking “Windows-1252 Western European (Windows)”.
- Now things heated up. Wikipedia has a page showing the encodings in code page Windows-1252 and ‘Bingo!’, the character I was looking for (em dash) had the encoding I was seeing (0x97).
- Finally a quick search for “csharp StreamReader 1252” revealed the code I needed. I replaced
using (this.bookStream = new StreamReader(fileName))
with
using (this.bookStream = new StreamReader(fileName, Encoding.GetEncoding(1252)))
and all works fine!
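
For next time, here is a minimal sketch of the whole thing in one place. It writes the bytes 'a', 0x97, 'b' to a temporary file and reads them back twice: once as UTF-8 (what `new StreamReader(fileName)` assumes when there is no byte order mark) and once as Windows-1252. The class name `Cp1252Demo` and the helpers `GetWindows1252` and `ReadAll` are made up for illustration. The `RegisterProvider` line is an assumption about the runtime: on .NET Core / .NET 5+ the code-page encodings have to be registered before `Encoding.GetEncoding(1252)` will work, whereas on the classic .NET Framework that line should not be needed.

```csharp
using System;
using System.IO;
using System.Text;

public static class Cp1252Demo
{
    // Hypothetical helper: returns the Windows-1252 encoding.
    public static Encoding GetWindows1252()
    {
        // Assumption: modern .NET (Core 3.0+ / 5+), where code-page encodings
        // must be registered before use. On .NET Framework, drop this line.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        return Encoding.GetEncoding(1252);
    }

    // Hypothetical helper: reads a whole file, decoding it with the given encoding.
    public static string ReadAll(string fileName, Encoding encoding)
    {
        using (var reader = new StreamReader(fileName, encoding))
        {
            return reader.ReadToEnd();
        }
    }

    public static void Main()
    {
        // Three raw bytes: 'a', 0x97 (em dash in Windows-1252), 'b'.
        string path = Path.GetTempFileName();
        File.WriteAllBytes(path, new byte[] { 0x61, 0x97, 0x62 });

        // A lone 0x97 is not valid UTF-8, so it decodes to the
        // replacement character U+FFFD instead of an em dash.
        string asUtf8 = ReadAll(path, Encoding.UTF8);
        string as1252 = ReadAll(path, GetWindows1252());

        Console.WriteLine("UTF-8: U+{0:X4}", (int)asUtf8[1]);  // U+FFFD
        Console.WriteLine("1252:  U+{0:X4}", (int)as1252[1]);  // U+2014

        File.Delete(path);
    }
}
```

If the second line prints U+2014, the file really is Windows-1252 and the StreamReader change above is the right fix. If it prints something else, the text probably uses a different code page from the list in the Encoding documentation.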