2006-05-02

Workaround for DataObject Html corruption

Summary

Some months ago I found that when html data is extracted from a DataObject during a Drag&Drop or Clipboard operation, every non-ASCII character in that data is corrupted by the .NET Framework version 2.0.50727.42. I reported this bug to Microsoft and now, prompted by this thread, I present a temporary workaround based on p/invoke until the CLR team fixes the problem. Here's the link to the DataObject Bug - Microsoft Feedback page that I filed. A simple project is attached to this explanation; see its source code for more insight into the bug.

Background information

When we copy or drag a fragment of a web page from Internet Explorer, a WebBrowser control (or any other control hosting MSHTML), the fragment is stored in an IDataObject in several formats. Typically, one of these is plain text and another is "HTML Clipboard Format". The second format retains the original formatting of the copied data: bold words, tables and even CSS styles. How is this accomplished?
If we copy part of a sentence, MSHTML surrounds it with the correct opening and closing html tags even if they are not present in our selection; more than that, it actually creates a new tiny web page that is a subset of the original page, adding the corresponding head, styles, body and comments around our fragment. After that, the data is stored in the DataObject.
As an example, if we drop a fragment of an html page on a TextBox with a Drag&Drop operation, we can check for this format in the DragEnter event. In this snippet we check whether the DataObject contains html formatted data and, if it does, we display the "copy" icon under the mouse. The Framework's static DataFormats class provides predefined clipboard format names; in this case we need DataFormats.Html:

private void textBox1_DragEnter(object sender, DragEventArgs e)
{
     if (e.Data.GetDataPresent(DataFormats.Html) == true)
         e.Effect = DragDropEffects.Copy;
}

When the mouse button is released on the TextBox, the DragDrop event is fired and we can extract the data:
private void textBox1_DragDrop(object sender, DragEventArgs e)
{
    if (e.Data.GetDataPresent(DataFormats.Html) == true)
    {
        string droppedHtml = e.Data.GetData(DataFormats.Html) as string;

        textBox1.Text = droppedHtml;
    }
}

After extracting the string from the DataObject we parse it. This is an example of a raw string:
Version:1.0
  StartHTML:000000174
  EndHTML:000000337
  StartFragment:000000289
  EndFragment:000000301
  StartSelection:000000289
  EndSelection:000000301
  SourceURL:about:blank
  <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
  
  <HTML><HEAD></HEAD>
  
  <BODY><!--StartFragment-->Dropped text<!--EndFragment--></BODY>
  </HTML>

Notice the headers before the actual html data, including a useful SourceURL parameter. The meaning of all these items is defined here: The HTML Clipboard Format. Here is a short explanation quoted from that page:
Version vv - Version number of the clipboard. Starting version is 0.9.
StartHTML - Byte count from the beginning of the clipboard to the start of the context, or -1 if no context.
EndHTML - Byte count from the beginning of the clipboard to the end of the context, or -1 if no context.
StartFragment - Byte count from the beginning of the clipboard to the start of the fragment.
EndFragment - Byte count from the beginning of the clipboard to the end of the fragment.
StartSelection - Byte count from the beginning of the clipboard to the start of the selection.
EndSelection - Byte count from the beginning of the clipboard to the end of the selection.
The most important fact found in the Html Clipboard format specification is the text encoding: it's always UTF-8. Here is the paragraph:
The only character set supported by the clipboard is Unicode in its UTF-8 encoding. Because the first characters of UTF-8 and ASCII match, the description is always ASCII, but the bytes of the context (starting at StartHTML) may use any other characters coded in UTF-8. Ends of lines may be represented in a clipboard format header as Carriage Return (CR), carriage return/line feed (CR/LF), or Line Feed (LF).
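Because the header offsets count bytes, not characters, the fragment has to be cut out of the raw bytes before decoding. The following is only a minimal sketch of that idea, assuming the raw UTF-8 bytes are already available as a byte[]; the class name HtmlClipboardFragment and its methods are hypothetical and not part of any framework API.

```csharp
using System;
using System.Globalization;
using System.Text;

// Hypothetical helper for illustration; not part of the original post.
public static class HtmlClipboardFragment
{
    // Reads a numeric header field such as "StartFragment:000000289".
    // Returns -1 when the field is missing or is not a number.
    private static int ReadOffset(string text, string name)
    {
        int start = text.IndexOf(name + ":", StringComparison.Ordinal);
        if (start < 0) return -1;
        start += name.Length + 1;

        // The value runs up to the end of the line (CR or LF).
        int end = start;
        while (end < text.Length && text[end] != '\r' && text[end] != '\n') end++;

        int value;
        bool ok = int.TryParse(text.Substring(start, end - start).Trim(),
            NumberStyles.Integer, CultureInfo.InvariantCulture, out value);
        return ok ? value : -1;
    }

    // Slices the fragment out of the raw bytes first, then decodes it as UTF-8.
    public static string Extract(byte[] raw)
    {
        if (raw == null) return null;

        // ASCII decoding yields exactly one char per byte, so string
        // indexes in this copy line up with byte offsets in raw.
        string ascii = Encoding.ASCII.GetString(raw);

        int start = ReadOffset(ascii, "StartFragment");
        int end = ReadOffset(ascii, "EndFragment");
        if (start < 0 || end < start || end > raw.Length) return null;

        return Encoding.UTF8.GetString(raw, start, end - start);
    }
}
```

Calling HtmlClipboardFragment.Extract(rawBytes) returns the text between the two offsets, or null when the header cannot be read. Slicing a string that was already decoded would use character indexes and drift as soon as a multi-byte character appears before the fragment.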

A bug's life

The crucial command in our DataObject operation is the string extraction:
string droppedHtml = e.Data.GetData(DataFormats.Html) as string;

At this point we should get the same html string that was stored in the DataObject. For English-speaking users everything will seem fine in most cases, because English text rarely contains accented letters or other non-ASCII symbols, so the decoding looks deceptively correct. That's why the bug went unnoticed at MS. ASCII and UTF-8 overlap: the first 128 characters are encoded identically in both. UTF-8 files often start with a special preamble called a BOM that identifies the encoding, but the clipboard data starts directly with the Version header, so nothing marks it as UTF-8. What happens if our data has some non-ASCII characters? Let's try to copy or drag some text like: "Euro sign: €".
We end up with something like this (on a system using the Windows-1252 code page):
Euro sign: â‚¬

The Euro symbol is wrongly decoded as three separate chars. Why? In UTF-8 every non-ASCII character is encoded with two or more bytes (the Euro sign takes three: E2 82 AC), and the extraction routine turns each byte into a char of its own.
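The effect can be reproduced without any clipboard at all. This is a small sketch, assuming ISO-8859-1 as a stand-in for the system's single-byte ANSI code page (the actual code page depends on the machine); the class name EuroCorruptionDemo is made up for the example.

```csharp
using System;
using System.Text;

// Hypothetical demo class; not part of the original post.
public static class EuroCorruptionDemo
{
    // Encodes the text as UTF-8, then decodes the bytes with a
    // single-byte encoding, so every byte becomes one char.
    // Assumption: ISO-8859-1 stands in for the system ANSI code page.
    public static string DecodeAsSingleByte(string text)
    {
        byte[] utf8 = Encoding.UTF8.GetBytes(text);
        return Encoding.GetEncoding("iso-8859-1").GetString(utf8);
    }

    // Encodes the text as UTF-8 and decodes it with the matching
    // encoding, which gives back the original text.
    public static string DecodeAsUtf8(string text)
    {
        byte[] utf8 = Encoding.UTF8.GetBytes(text);
        return Encoding.UTF8.GetString(utf8);
    }
}
```

For "Euro sign: €" the single-byte path returns a string that is two chars longer than the input, because the three UTF-8 bytes of the Euro sign come back as three chars, while the UTF-8 path returns the text unchanged.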
To understand what just happened we have to do some digging into the Framework inner mechanisms with the invaluable Reflector tool by Lutz Roeder.
Following the path of the DataObject data-extraction methods we find:
//private object GetDataFromHGLOBLAL(string format, IntPtr hglobal);
//Declaring Type: System.Windows.Forms.DataObject+OleConverter 
//Assembly: System.Windows.Forms, Version=2.0.0.0 
private object GetDataFromHGLOBLAL(string format, IntPtr hglobal)
{
    object obj1 = null;
    if (hglobal != IntPtr.Zero)
    {
        if (format.Equals(DataFormats.Text) || format.Equals(DataFormats.Rtf) || 
          format.Equals(DataFormats.Html) || format.Equals(DataFormats.OemText))
        {
            obj1 = this.ReadStringFromHandle(hglobal, false);
        }
        else if (format.Equals(DataFormats.UnicodeText))
        {
            obj1 = this.ReadStringFromHandle(hglobal, true);
        }
        //[...]
    }
    return obj1;
}
Behold! We can finally see that our html string is treated as a plain ANSI string (one char per byte), just like DataFormats.Text or DataFormats.OemText; the framework code doesn't do the required UTF-8 conversion, no special treatment whatsoever.

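Here is the string reading function. The second parameter of ReadStringFromHandle specifies whether the string is in Unicode (UTF-16) format. When it is false, the sbyte* constructor of string decodes the bytes with the system ANSI code page, not with UTF-8.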
//private string ReadStringFromHandle(IntPtr handle, bool unicode);
//Declaring Type: System.Windows.Forms.DataObject+OleConverter 
//Assembly: System.Windows.Forms, Version=2.0.0.0 
private unsafe string ReadStringFromHandle(IntPtr handle, bool unicode)
{
    string text1 = null;
    IntPtr ptr1 = UnsafeNativeMethods.GlobalLock(new HandleRef(null, handle));
    try
    {
        if (unicode)
        {
            return new string((char*)ptr1);
        }
        text1 = new string((sbyte*)ptr1); //Here is the problem
    }
    finally
    {
        UnsafeNativeMethods.GlobalUnlock(new HandleRef(null, handle));
    }
    return text1;
}

Once the data is extracted from the DataObject as if it were an ANSI string, there is no reliable way to recover the corrupted characters, because relevant information is lost.
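For comparison, this is roughly what a UTF-8 aware read of an unmanaged, zero-terminated buffer could look like. It is only a sketch of the idea and not the framework's code: the class Utf8HandleReader, the method ReadUtf8 and the maxBytes safety limit are all assumptions introduced for illustration.

```csharp
using System;
using System.Runtime.InteropServices;
using System.Text;

// Hypothetical helper; not the framework's implementation.
public static class Utf8HandleReader
{
    // Copies bytes from an unmanaged pointer up to the first zero byte
    // (or up to maxBytes) and decodes them as UTF-8.
    public static string ReadUtf8(IntPtr ptr, int maxBytes)
    {
        if (ptr == IntPtr.Zero) return null;

        // Find the zero terminator without reading past maxBytes.
        int length = 0;
        while (length < maxBytes && Marshal.ReadByte(ptr, length) != 0) length++;

        byte[] buffer = new byte[length];
        Marshal.Copy(ptr, buffer, 0, length);

        // Decoding the whole byte run at once keeps multi-byte characters intact.
        return Encoding.UTF8.GetString(buffer);
    }
}
```

The important difference from the decompiled routine is that the bytes are kept as bytes until the decoder sees all of them, so a three-byte sequence like the Euro sign becomes one char instead of three.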

A solution

At this point I was frustrated but really wanted to find a solution, so I decided to mimic the framework routines while correcting the string extraction section, and that meant calling the underlying Windows APIs through p/invoke. I studied the framework code and then read up on the tools needed to extract data from the clipboard handle. I hope this post will be instructive to others.
Fortunately most of these native Windows structures are defined and ready for use in System.Runtime.InteropServices.ComTypes.
The most important step in my function is the initial cast from a System.Windows.Forms.IDataObject to the lower-level System.Runtime.InteropServices.ComTypes.IDataObject. These interfaces have the same name but completely different functionality:
System.Windows.Forms.IDataObject contains the standard .NET framework high level methods for extracting the data.

namespace System.Windows.Forms
{
    [ComVisible(true)]
    public interface IDataObject
    {
        object GetData(string format);
        object GetData(Type format);
        object GetData(string format, bool autoConvert);
        bool GetDataPresent(string format);
        bool GetDataPresent(Type format);
        bool GetDataPresent(string format, bool autoConvert);
        string[] GetFormats();
        string[] GetFormats(bool autoConvert);
        void SetData(object data);
        void SetData(string format, object data);
        void SetData(Type format, object data);
        void SetData(string format, bool autoConvert, object data);
    }
}

System.Runtime.InteropServices.ComTypes.IDataObject contains low-level methods using native data structures like FORMATETC and STGMEDIUM. With the aid of these we will extract the data directly from Windows' unmanaged world.

namespace System.Runtime.InteropServices.ComTypes
{
    [InterfaceType(1)]
    [Guid("0000010E-0000-0000-C000-000000000046")]
    public interface IDataObject
    {
        int DAdvise(ref FORMATETC pFormatetc, ADVF advf, IAdviseSink adviseSink, out int connection);
        void DUnadvise(int connection);
        int EnumDAdvise(out IEnumSTATDATA enumAdvise);
        IEnumFORMATETC EnumFormatEtc(DATADIR direction);
        int GetCanonicalFormatEtc(ref FORMATETC formatIn, out FORMATETC formatOut);
        void GetData(ref FORMATETC format, out STGMEDIUM medium);
        void GetDataHere(ref FORMATETC format, ref STGMEDIUM medium);
        int QueryGetData(ref FORMATETC format);
        void SetData(ref FORMATETC formatIn, ref STGMEDIUM medium, bool release);
    }
}

Thanks to this cast we now deal with the native Windows API clipboard structures for extracting raw data. I'm currently investigating whether there are advantages in using streams and GlobalFree methods, but for now this function is enough. I'd like to receive some comments on its formal correctness and functionality.

The JX workaround code:

using System;
using System.Text;
using System.Diagnostics;
using System.Windows.Forms;
using System.Runtime.InteropServices;
using System.Runtime.InteropServices.ComTypes;



namespace JX
{
    /// <summary>
    /// By JX - Giuliano Sauro
    /// Version 1.0.200511041900   
    /// This static class extracts DataFormats.Html data from an IDataObject 
    /// obtained from the Clipboard or a Drag and Drop operation.
    /// The method overcomes the UTF-8 corruption problem encountered in .NET Framework version 2.0.50727.42
    /// Version 1.0 of the DataFormats.Html data starts with a header like this:
    /// 
    /// Version:1.0
    /// StartHTML:000000238
    /// EndHTML:000000546
    /// StartFragment:000000353
    /// EndFragment:000000510
    /// StartSelection:000000353
    /// EndSelection:000000510
    /// 
    /// The values refer to the BYTES of the string not the actual graphemes
    /// </summary>
    static class HtmlFromIDataObject
    {
        /// <summary>
        /// Extracts data of type DataFormats.Html from an IDataObject data container
        /// This method shouldn't throw any exception but writes relevant exception information to the debug window
        /// </summary>
        /// <param name="data">IDataObject data container</param>
        /// <returns>A byte[] array with the raw UTF-8 encoded data, or null if the method fails</returns>
        public static byte[] GetHtml(System.Windows.Forms.IDataObject data)
        {
            System.Runtime.InteropServices.ComTypes.IDataObject interopData = data as System.Runtime.InteropServices.ComTypes.IDataObject;

            FORMATETC format = new FORMATETC();          
            format.cfFormat = (short)DataFormats.GetFormat(DataFormats.Html).Id;
            format.dwAspect = DVASPECT.DVASPECT_CONTENT;
            format.lindex = -1;   
            format.tymed = TYMED.TYMED_HGLOBAL;

            STGMEDIUM stgmedium = new STGMEDIUM();
            stgmedium.tymed = TYMED.TYMED_HGLOBAL;
            stgmedium.pUnkForRelease = null;

            int queryResult = 0;

            try
            {
                queryResult = interopData.QueryGetData(ref format);
            }
            catch (Exception exp)
            {
                Debug.WriteLine("HtmlFromIDataObject.GetHtml -> QueryGetData(ref format) threw an exception: " 
                    + Environment.NewLine + exp.ToString());
                return null;
            }

            if (queryResult != 0)
            {
                Debug.WriteLine("HtmlFromIDataObject.GetHtml -> QueryGetData(ref format) returned a code != 0: " 
                    + queryResult.ToString());
                return null;
            }

            try
            {
                interopData.GetData(ref format, out stgmedium);
            }
            catch (Exception exp)
            {
                System.Diagnostics.Debug.WriteLine("HtmlFromIDataObject.GetHtml -> GetData(ref format, out stgmedium) threw this exception: " 
                    + Environment.NewLine + exp.ToString());
                return null;
            }

            if (stgmedium.unionmember == IntPtr.Zero)
            {
                Debug.WriteLine("HtmlFromIDataObject.GetHtml -> stgmedium.unionmember returned an IntPtr pointing to zero");
                return null;
            }

            IntPtr pointer = stgmedium.unionmember;
            
            HandleRef handleRef = new HandleRef(null, pointer);

            byte[] rawArray = null;

            try
            {
                IntPtr ptr1 = GlobalLock(handleRef);

                if (ptr1 == IntPtr.Zero)
                {
                    Debug.WriteLine("HtmlFromIDataObject.GetHtml -> GlobalLock returned a null pointer");
                    return null;
                }

                try
                {
                    // GlobalSize returns a SIZE_T, so it is declared as UIntPtr to stay correct on 64-bit Windows
                    int length = checked((int)GlobalSize(handleRef).ToUInt64());

                    rawArray = new byte[length];

                    Marshal.Copy(ptr1, rawArray, 0, length);
                }
                finally
                {
                    // Unlock only a handle that was successfully locked
                    GlobalUnlock(handleRef);
                }
            }
            catch (Exception exp)
            {
                Debug.WriteLine("HtmlFromIDataObject.GetHtml -> Html Import threw an exception: " + Environment.NewLine + exp.ToString());
            }

            return rawArray;
        }


        [DllImport("kernel32.dll", CharSet = CharSet.Auto, ExactSpelling = true, SetLastError = true)]
        private static extern IntPtr GlobalLock(HandleRef handle);

        [DllImport("kernel32.dll", CharSet = CharSet.Auto, ExactSpelling = true, SetLastError = true)]
        private static extern bool GlobalUnlock(HandleRef handle);

        [DllImport("kernel32.dll", CharSet = CharSet.Auto, ExactSpelling = true, SetLastError = true)]
        private static extern UIntPtr GlobalSize(HandleRef handle);
 
    }
}

The class is static. A data object (like e.Data in a Drag&Drop operation) must be passed to the GetHtml() method, which returns a byte array containing a UTF-8 encoded string, or null if something went wrong.
byte[] rawHtmlBytes = JX.HtmlFromIDataObject.GetHtml(dataObject);

Once we have the raw byte array, the last thing we have to do is decode the string:
string rawHtml = Encoding.UTF8.GetString(rawHtmlBytes);

That's it, now we have a correctly decoded string ready for parsing.
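Two details are worth guarding against before decoding: GetHtml() can return null, and the memory block reported by GlobalSize may be larger than the data stored in it, so the array may end with zero bytes. The helper below is a hedged sketch of one way to handle both; HtmlBytesDecoder and Decode are hypothetical names, and cutting at the first zero byte assumes the html itself never contains one.

```csharp
using System;
using System.Text;

// Hypothetical helper; not part of the original workaround.
public static class HtmlBytesDecoder
{
    // Decodes the raw clipboard bytes as UTF-8, stopping at the first
    // zero byte so that trailing padding does not end up in the string.
    public static string Decode(byte[] rawHtmlBytes)
    {
        if (rawHtmlBytes == null) return null;

        int length = Array.IndexOf(rawHtmlBytes, (byte)0);
        if (length < 0) length = rawHtmlBytes.Length; // no terminator found

        return Encoding.UTF8.GetString(rawHtmlBytes, 0, length);
    }
}
```

In the DragDrop handler this could be used as textBox1.Text = HtmlBytesDecoder.Decode(JX.HtmlFromIDataObject.GetHtml(e.Data)); with a null check on the result before parsing the header.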

Demo

Here is a sample demo application to test my workaround's functionality: download DataObject_HtmlBug_Workaround.zip for Visual Studio 2005
JX - Giuliano Sauro
 



Comments:
Nice workaround.

Thank you, you helped me a lot.

And works fine!!
 
it amazes me that still in 2013 and this bug is yet to be corrected.
I have lost so many hours of debugging just to find out that it's a bug MS should have fixed 7 years ago (at least!)
 
...However, the problem still isn't solved.
If I am putting an HTML text in the clipboard that contains international characters and some other application needs to paste it (Office automation for example), I'm screwed, because the other application can't use this code.
 