About MS Word Generated HTML

Word and HTML

Originally, MS Word enjoyed its position as the leader in the field of word processing. With the rise of the Internet, Microsoft could not bear to be absent from this area. It moved in, not only bundling Internet Explorer into Windows but also trying to turn its word processor, Word, into a web page composer. Thus the worst web page composer and HTML generator emerged.

Starting with Word 97, Microsoft introduced conversion from DOC to HTML through "Save as web page" in the menu. The resulting HTML is not professional, and it loses most of the document's original appearance in Word, but compared with what came later, it is tolerable.

Word 2000 made things worse. Seeing the rise of CSS, Microsoft not only followed the fashion again but abused it: styles overran Word's HTML. The resulting HTML, full of styles and Office-specific tags, made files very fat and complex. Experts could fairly call the code "nasty". Although this HTML keeps the document editable in Word, that is not what most users want. Facing criticism, Microsoft had to release a patch tool, the Office 2000 HTML Filter, which strips out Office-specific tags. But it did nothing about the remaining junk.

In Word 2002 (XP), Microsoft built this function into Word as a new menu item, "Save as Web Page (filtered)". The function was kept, but not improved, in Word 2003 and Word 2007.

Word 2007 added only one new function, "Publish to Blog", which simplifies the path from DOC to the internet for those who want to use Word as their blog editor. It is just a revival of the old Word 97 style HTML.

More than ten years on, MS Word has never become a good web page composer or a good HTML generator. Its way of generating HTML has not improved in about ten years. Will Microsoft improve this in the coming Word 2009? I don't know.

Here is a summary of the save-as-web-page options in each version of Word from 95 to 2007:

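Word version: save-as-web-page options
  • Word 95: none
  • Word 97: Save as HTML
  • Word 2000: Save as Web Page
  • Word 2002 (XP): Save as Web Page; Save as Web Page (filtered)
  • Word 2003: Save as Web Page; Save as Web Page (filtered)
  • Word 2007: Save as Web Page; Save as Web Page (filtered); Publish to blog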

Nasty Word HTML

    For unknown reasons, the quality of the HTML that Word generates has hardly improved. To be exact, the basic way Word generates HTML has not changed since Word 2000: fat, redundant, complex, even foolish.

    Here are some comments on Word-generated HTML from professional experts.

    Word offers two HTML options in its save dialog: "Save as HTML" and "Save as Filtered HTML". In practice, that means you get to choose between totally nasty HTML and slightly less nasty HTML. (Jeff Atwood, founder of Coding Horror)

    Microsoft Word generates terrible, sloppy, bloated, proprietary HTML. It's ugly and near-impossible to hand-edit. (Jotsheet)

    Anyone who's had to convert Word docs to HTML knows it can be a real pain in the you-know-what. (Keith Robinson)

    The "Save as web page"-function in Microsoft Word generates files that are about ten times as large as they should be. (Morten Nilsson, author of Microsoft Word 2000 HTML Mess Cleaner)

    Actually, the HTML generated by MS Word 97 is rather concise; it is just HTML, though redundant HTML. But starting with Word 2000, Word began to generate HTML files with stylesheets, which always contain useless Office-specific tags and even more redundant HTML code. Although the Office HTML Filter and "Save as web page (filtered)" can remove the Office-specific tags, the redundancy, especially the endlessly repeated styles, still remains.
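To make "Office-specific tags" concrete, here is a minimal Python sketch that counts them in a fragment. It rests on assumptions: the sample is a hand-written fragment in the style of Word output rather than a real Word file, and the names OfficeTagCounter and count_office_tags are hypothetical. It treats any tag with a namespace prefix, such as o:p, as Office-specific, which is only a rough heuristic.

```python
from collections import Counter
from html.parser import HTMLParser

# Hypothetical, hand-written fragment in the style of Word-generated HTML.
SAMPLE = (
    '<p class=MsoNormal><span style="font-size:12.0pt">Hello<o:p></o:p></span></p>'
    '<p class=MsoNormal><span style="font-size:12.0pt">World<o:p></o:p></span></p>'
)


class OfficeTagCounter(HTMLParser):
    """Counts start tags, split into namespaced (o:, w:, v: ...) and plain."""

    def __init__(self):
        super().__init__()
        self.office_tags = Counter()
        self.plain_tags = Counter()

    def handle_starttag(self, tag, attrs):
        # Assumption: a colon in the tag name marks an Office-specific tag.
        if ":" in tag:
            self.office_tags[tag] += 1
        else:
            self.plain_tags[tag] += 1


def count_office_tags(html):
    """Return (office_tags, plain_tags) counters for an HTML string."""
    parser = OfficeTagCounter()
    parser.feed(html)
    parser.close()
    return parser.office_tags, parser.plain_tags


office, plain = count_office_tags(SAMPLE)
print("Office-specific tags:", dict(office))
print("Plain HTML tags:", dict(plain))
```

Even after such tags are stripped, the repeated style attributes on the plain tags stay behind, which is the redundancy described above.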

    Take any piece of Word HTML and you will see inline styles in every tag, no matter how many times they are repeated and whether or not they are useful. Every table cell carries the same height and width values, and these values are declared repeatedly in both the styles and the HTML code.
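The repetition can be measured with a short Python sketch. The table below is a hypothetical, hand-written sample that mimics the pattern (the same width in both an attribute and an inline style on every cell), not output copied from Word, and the names InlineStyleCounter, repeated_styles and wasted_chars are made up for illustration.

```python
from collections import Counter
from html.parser import HTMLParser

# Hypothetical sample: every cell repeats the same width and height.
SAMPLE_TABLE = (
    "<table>"
    '<tr><td width=100 style="width:75.0pt;height:15.0pt">A</td>'
    '<td width=100 style="width:75.0pt;height:15.0pt">B</td></tr>'
    '<tr><td width=100 style="width:75.0pt;height:15.0pt">C</td>'
    '<td width=100 style="width:75.0pt;height:15.0pt">D</td></tr>'
    "</table>"
)


class InlineStyleCounter(HTMLParser):
    """Counts how often each inline style string occurs."""

    def __init__(self):
        super().__init__()
        self.styles = Counter()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "style" and value:
                self.styles[value.strip()] += 1


def repeated_styles(html):
    """Return {style string: count} for styles used more than once."""
    parser = InlineStyleCounter()
    parser.feed(html)
    parser.close()
    return {style: n for style, n in parser.styles.items() if n > 1}


def wasted_chars(html):
    """Characters spent on every repeat of a style after its first use."""
    return sum(len(s) * (n - 1) for s, n in repeated_styles(html).items())


print(repeated_styles(SAMPLE_TABLE))
print("Characters spent on repeats:", wasted_chars(SAMPLE_TABLE))
```

In this sample one style string appears on all four cells. A single shared CSS rule could carry it once, which is the kind of saving a cleanup tool aims for.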

    To professional HTML coders, this style of coding is unacceptable, altogether garbage. It not only inflates file size and wastes space, but also makes the HTML file slow to view and, most importantly, slow to transfer over the internet. You may not see the rubbish inside, but it always occupies your space. You may not notice the slowdown on a high-speed network, but your network and CPU still spend a good part of their time transferring and parsing that rubbish.

    For reference, here is a section from the article "Eliminate clutter in Microsoft Word generated HTML files with the Office 2000 HTML Filter" by John Sheesley, which describes why Word HTML became nasty.

    What's wrong with Word?

    Microsoft Word does a great job as a word processor, but it's not very useful for creating HTML documents that you can quickly plug into a Web site. When you save a Word document as HTML, Word adds page-formatting tags that can make the document very large. These page-formatting tags may also cause content management programs and Web sites to behave unexpectedly.

    Microsoft added the special tags to Word's HTML with an eye toward backward compatibility. Microsoft wanted you to be able to save files in HTML complete with all of the tracking, comments, formatting, and other special Word features found in traditional DOC files. If you save a file in HTML and then reload it in Word, theoretically you don't lose anything at all.

    Unfortunately, when you then move a standard Word-generated HTML file to a Web site, bad things can sometimes happen. Formatting tags included in the Word file can conflict with settings on a Web server, causing the document to display incorrectly. Additionally, a browser may misinterpret the tags and display the file incorrectly. The HTML file also contains versioning and authoring information that you may not want to have appearing on a Web site.

    So, why not root that junk out?

    We need tools to clean it out. Fortunately, such tools exist.