Wednesday, February 24, 2010 3:18 PM
by Peter Tyrrell
MS Word uses characters from the Windows-1252 character set that have no representation in ASCII or ISO-8859-1. This is often a pain in the butt. Special characters include:
- the… ellipsis
- ‘smart’ “quotes”
- en – dash and em — dash
- dagger † and double dagger ‡
- and more, but these are most common.
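To see which of these characters a given string actually contains, one option is to flag every character whose code point falls above U+00FF, the upper limit of ISO-8859-1. The sketch below is only an illustration: the name `findNonLatin1Chars` is made up for this example, and it reports any character outside ISO-8859-1, not just the ones Word inserts.

```javascript
// Hypothetical helper (not part of the original post): returns the distinct
// characters in `text` whose code point is above U+00FF, i.e. outside ISO-8859-1.
function findNonLatin1Chars(text) {
    var found = [];
    for (var ch of text) { // for...of iterates by code point
        if (ch.codePointAt(0) > 0xFF && found.indexOf(ch) === -1) {
            found.push(ch);
        }
    }
    return found;
}

// Sample input: curly double quotes, an en dash, a curly apostrophe and an ellipsis.
console.log(findNonLatin1Chars("\u201CHello\u201D \u2013 it\u2019s\u2026"));
```

Accented letters such as é sit inside ISO-8859-1 and are left alone by this check.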
If you want to replace them with ASCII cognates, here's a function to do that. (Daggers don't have cognates as far as I know.)
JavaScript
/// Replaces commonly used Windows-1252 characters that do not exist in ASCII with ASCII cognates.
var replaceWordChars = function(text) {
    var s = text;
    // smart single quotes and apostrophe
    // (no "|" inside the character class: it would match a literal pipe)
    s = s.replace(/[\u2018\u2019\u201A]/g, "'");
    // smart double quotes
    s = s.replace(/[\u201C\u201D\u201E]/g, "\"");
    // ellipsis
    s = s.replace(/\u2026/g, "...");
    // en and em dashes
    s = s.replace(/[\u2013\u2014]/g, "-");
    // circumflex
    s = s.replace(/\u02C6/g, "^");
    // open angle bracket
    s = s.replace(/\u2039/g, "<");
    // close angle bracket
    s = s.replace(/\u203A/g, ">");
    // small tilde and non-breaking space, both replaced with a plain space
    s = s.replace(/[\u02DC\u00A0]/g, " ");
    return s;
};
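The same mappings can also be written as a lookup table applied in a single pass over the string, which keeps each character and its replacement side by side. This is a sketch of that alternative, not a replacement for the function above; the names `WORD_CHAR_MAP`, `WORD_CHAR_PATTERN` and `replaceWordCharsSinglePass` are invented for the example.

```javascript
// Hypothetical lookup table: each key is a character to replace,
// each value is its ASCII stand-in (same mappings as the function above).
var WORD_CHAR_MAP = {
    "\u2018": "'", "\u2019": "'", "\u201A": "'",     // smart single quotes
    "\u201C": "\"", "\u201D": "\"", "\u201E": "\"",  // smart double quotes
    "\u2026": "...",                                 // ellipsis
    "\u2013": "-", "\u2014": "-",                    // en and em dashes
    "\u02C6": "^",                                   // circumflex
    "\u2039": "<", "\u203A": ">",                    // angle brackets
    "\u02DC": " ", "\u00A0": " "                     // small tilde, non-breaking space
};

// One character class listing every key in the table.
var WORD_CHAR_PATTERN = /[\u2018\u2019\u201A\u201C\u201D\u201E\u2026\u2013\u2014\u02C6\u2039\u203A\u02DC\u00A0]/g;

function replaceWordCharsSinglePass(text) {
    // The callback receives each matched character and looks up its replacement.
    return text.replace(WORD_CHAR_PATTERN, function (ch) {
        return WORD_CHAR_MAP[ch];
    });
}
```

Adding a character then means touching two places, the table and the pattern, so the two have to be kept in sync by hand.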
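C# extension method
using System.Text.RegularExpressions;

// Extension methods must be declared in a static class; the class name here is arbitrary.
public static class StringExtensions
{
    /// Replaces commonly used Windows-1252 characters that do not exist in ASCII with ASCII cognates.
    public static string ReplaceWordChars(this string text)
    {
        var s = text;
        // smart single quotes and apostrophe
        // (no "|" inside the character class: it would match a literal pipe)
        s = Regex.Replace(s, "[\u2018\u2019\u201A]", "'");
        // smart double quotes
        s = Regex.Replace(s, "[\u201C\u201D\u201E]", "\"");
        // ellipsis
        s = Regex.Replace(s, "\u2026", "...");
        // en and em dashes
        s = Regex.Replace(s, "[\u2013\u2014]", "-");
        // circumflex
        s = Regex.Replace(s, "\u02C6", "^");
        // open angle bracket
        s = Regex.Replace(s, "\u2039", "<");
        // close angle bracket
        s = Regex.Replace(s, "\u203A", ">");
        // small tilde and non-breaking space, both replaced with a plain space
        s = Regex.Replace(s, "[\u02DC\u00A0]", " ");
        return s;
    }
}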