Getting correctly encoded HTML from Microsoft Word in a .NET implemented COM Add-in

In the context of a COM Add-in for Microsoft Word 2003, I wanted to convert selected content in a Word document to an HTML string for shipment up to a web service.

The first approach was, in C#:

                  Selection.Cut();
                  IDataObject cdo = Clipboard.GetDataObject();
                  body = (string) cdo.GetData(DataFormats.Html);

This fails on characters such as the copyright symbol and smart quotes. It appears that the wrong encoding is applied to the clipboard data on its way to becoming what should be a Unicode string.

To get a bit more control, the second approach saved a temporary document as FilteredHTML and then read it back in as a string.

Looking at the content of the temporary files, it appeared that they were encoded using the windows-1252 character set. Since reading them back with the UTF-8 and Unicode encodings had failed, I leaped to the conclusion that the code page 1252 encoding might do the trick, and voilà:

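                  // Decode the temporary file as windows-1252 (code page 1252).
                  // A StreamReader reads the whole file; a single FileStream.Read
                  // call is not guaranteed to fill the buffer.
                  using (StreamReader reader = new StreamReader(Path.Combine(path, filename), Encoding.GetEncoding(1252))) {
                        html = reader.ReadToEnd();
                  }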

 

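This produces a Unicode string, “html”, in which symbols such as €£¥©®™±≠≤≥αβ come out right. (The last few, ≠≤≥αβ, are not part of windows-1252 itself, so Word presumably writes them to the file in an escaped form such as numeric character references.)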
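
To illustrate why the choice of encoding matters, here is a minimal, self-contained sketch that is not part of the add-in. The class name and the sample bytes are made up for the example; the bytes stand in for what a windows-1252 file would contain for a copyright symbol and a pair of smart quotes. The RegisterProvider line is an assumption about the runtime: newer .NET versions need it to make code page 1252 available, while on the older .NET Framework the code page is built in and the line can be dropped.

```csharp
using System;
using System.Text;

class Cp1252Demo
{
    static void Main()
    {
        // Needed on newer .NET runtimes; unnecessary on the older .NET Framework.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        Encoding cp1252 = Encoding.GetEncoding(1252);

        // Hypothetical file content in windows-1252:
        // 0xA9 is the copyright symbol, 0x93 and 0x94 are curly quotes.
        byte[] bytes = { 0xA9, 0x20, 0x93, 0x68, 0x69, 0x94 };

        // Decoding with the matching code page recovers the characters.
        string right = cp1252.GetString(bytes);

        // The same bytes are not valid UTF-8, so each high byte is
        // replaced with U+FFFD, the Unicode replacement character.
        string wrong = Encoding.UTF8.GetString(bytes);

        Console.WriteLine(right == "\u00A9 \u201Chi\u201D"); // True
        Console.WriteLine(wrong.IndexOf('\uFFFD') >= 0);     // True
    }
}
```

The comparison uses escaped code points rather than literal symbols so that the result does not depend on how the source file itself is encoded.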

 
