Andornot Blog

Wednesday, February 24, 2010 3:18 PM

Replace MS Word special characters in javascript and C#

by Peter Tyrrell

MS Word uses characters from the Windows-1252 character encoding set which are not represented in ASCII or ISO-8859-1. This is often a pain in the butt. Special characters include:

  • the… ellipsis
  • ‘smart’ “quotes”
  • en – dash and em — dash
  • dagger † and double dagger ‡
  • and more, but these are most common.

If you want to replace them with ASCII cognates, here's a function to do that. (Daggers don't have cognates as far as I know.)

Javascript

/// Replaces commonly-used Windows 1252 encoded chars that do not exist in ASCII or ISO-8859-1 with ISO-8859-1 cognates.
var replaceWordChars = function(text) {
    var s = text;
    // smart single quotes and apostrophe
    s = s.replace(/[\u2018|\u2019|\u201A]/g, "\'");
    // smart double quotes
    s = s.replace(/[\u201C|\u201D|\u201E]/g, "\"");
    // ellipsis
    s = s.replace(/\u2026/g, "...");
    // dashes
    s = s.replace(/[\u2013|\u2014]/g, "-");
    // circumflex
    s = s.replace(/\u02C6/g, "^");
    // open angle bracket
    s = s.replace(/\u2039/g, "<");
    // close angle bracket
    s = s.replace(/\u203A/g, ">");
    // spaces
    s = s.replace(/[\u02DC|\u00A0]/g, " ");
    
    return s;
}

C# extension method

public static string ReplaceWordChars(this string text)
        {
            var s = text;
            // smart single quotes and apostrophe
            s = Regex.Replace(s, "[\u2018|\u2019|\u201A]", "'");
            // smart double quotes
            s = Regex.Replace(s, "[\u201C|\u201D|\u201E]", "\"");
            // ellipsis
            s = Regex.Replace(s, "\u2026", "...");
            // dashes
            s = Regex.Replace(s, "[\u2013|\u2014]", "-");
            // circumflex
            s = Regex.Replace(s, "\u02C6", "^");
            // open angle bracket
            s = Regex.Replace(s, "\u2039", "<");
            // close angle bracket
            s = Regex.Replace(s, "\u203A", ">");
            // spaces
            s = Regex.Replace(s, "[\u02DC|\u00A0]", " ");
            
            return s;
        }
blog comments powered by Disqus

Month List