HTML to OOXML Pipeline

Table of Contents

1. Pipeline Overview
2. Element Mapping
3. Style Conversion
4. Usage
5. Limitations

Uniword provides an HTML import pipeline that converts HTML content into OOXML document parts. This enables importing web content, rich text from web editors, and HTML-formatted data into Word documents.

1. Pipeline Overview

The HTML-to-OOXML conversion follows these steps:

Parse HTML — The HTML input is parsed into a DOM tree
Map elements — HTML elements are mapped to OOXML equivalents
Convert styles — CSS inline styles and class-based styles are converted to OOXML formatting
Handle images — Base64-encoded or linked images are embedded as document parts
Build document — The converted elements are assembled into the document structure
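
The steps above can be pictured as a chain of small stages. The following is a toy sketch, not Uniword's implementation: the method names (`parse_blocks`, `map_block`, `build_paragraph`, `html_to_body_xml`) are hypothetical, the regex tokenizer only handles flat `<p>` and `<h1>` to `<h6>` blocks, and the `Heading1`-style IDs are assumed to exist in the target document. The style and image stages are omitted, and text is passed through without escaping.

```ruby
# Stage 1 (parse): toy tokenizer for flat <p> / <h1>-<h6> blocks.
# A real importer would use a proper HTML parser instead of a regex.
def parse_blocks(html)
  html.scan(%r{<(p|h[1-6])>(.*?)</\1>}m).map { |tag, text| { tag: tag, text: text } }
end

# Stage 2 (map): pick a paragraph style for each block.
# "Heading1".."Heading6" are assumed style IDs, not confirmed Uniword names.
def map_block(block)
  level = block[:tag][/\Ah([1-6])\z/, 1]
  { style: level ? "Heading#{level}" : nil, text: block[:text] }
end

# Final stage (build): serialize one mapped block as a w:p element.
def build_paragraph(para)
  ppr = para[:style] ? %(<w:pPr><w:pStyle w:val="#{para[:style]}"/></w:pPr>) : ""
  "<w:p>#{ppr}<w:r><w:t>#{para[:text]}</w:t></w:r></w:p>"
end

# Run the stages in order and join the resulting paragraphs.
def html_to_body_xml(html)
  parse_blocks(html).map { |block| build_paragraph(map_block(block)) }.join
end

puts html_to_body_xml("<h1>Title</h1><p>Paragraph text</p>")
```

Keeping each stage as a separate method makes it easier to swap the toy tokenizer for a real parser without touching the mapping or serialization code.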

2. Element Mapping

HTML elements map to OOXML elements as follows:

HTML	OOXML
`<p>`	`<w:p>` (Paragraph)
`<h1>` - `<h6>`	`<w:p>` with heading styles
`<strong>`, `<b>`	`<w:r>` with `<w:b>` (Bold)
`<em>`, `<i>`	`<w:r>` with `<w:i>` (Italic)
`<u>`	`<w:r>` with `<w:u>` (Underline)
`<ul>`	One `<w:p>` per `<li>`, with bullet numbering (`<w:numPr>`)
`<ol>`	One `<w:p>` per `<li>`, with decimal numbering (`<w:numPr>`)
`<table>`	`<w:tbl>` (Table)
`<img>`	`<w:drawing>` (Inline image)
`<a>`	`<w:hyperlink>` (Hyperlink)
`<br>`	`<w:br>` (Break)
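
The table above can be expressed as a simple lookup. This is a minimal sketch under assumptions: `ELEMENT_MAP` and `map_element` are hypothetical names, not part of Uniword's API, and the `Heading1` to `Heading6` style IDs are assumed to be defined in the target document.

```ruby
# Hypothetical lookup table mirroring the mapping table; not Uniword's internals.
ELEMENT_MAP = {
  "p"      => { element: "w:p" },
  "strong" => { element: "w:r", property: "w:b" },
  "b"      => { element: "w:r", property: "w:b" },
  "em"     => { element: "w:r", property: "w:i" },
  "i"      => { element: "w:r", property: "w:i" },
  "u"      => { element: "w:r", property: "w:u" },
  "ul"     => { element: "w:p", list: :bullet },
  "ol"     => { element: "w:p", list: :decimal },
  "table"  => { element: "w:tbl" },
  "img"    => { element: "w:drawing" },
  "a"      => { element: "w:hyperlink" },
  "br"     => { element: "w:br" }
}.freeze

# Returns the mapping for an HTML tag name, or nil when the tag is not in the table.
# Headings are handled separately because the level becomes part of the style ID.
def map_element(tag)
  name = tag.to_s.downcase
  level = name[/\Ah([1-6])\z/, 1]
  level ? { element: "w:p", style: "Heading#{level}" } : ELEMENT_MAP[name]
end

p map_element("h2")
p map_element("strong")
p map_element("video")
```

Returning `nil` for unmapped tags leaves the caller free to decide whether to skip the element or fall back to plain text.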

3. Style Conversion

CSS properties are converted to OOXML formatting properties:

font-size maps to <w:sz> (half-points)
color maps to <w:color> (hex RGB)
font-family maps to <w:rFonts>
text-align maps to <w:jc> (justification)
background-color maps to <w:shd> (shading)
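
As an illustration of the value conversions involved, the sketch below handles three of the properties listed above. The helper names are hypothetical and do not come from Uniword. It assumes font sizes are given in `pt` or `px` (with the CSS ratio of 1px = 0.75pt) and colors in `#RGB` or `#RRGGBB` form; named colors, `rgb()` and relative units are out of scope here.

```ruby
# CSS text-align values and their w:jc equivalents ("justify" becomes "both").
JC_VALUES = {
  "left" => "left", "center" => "center", "right" => "right", "justify" => "both"
}.freeze

# Converts a CSS font-size such as "12pt" or "16px" to half-points for w:sz.
# Returns nil for units this sketch does not handle (em, rem, %).
def font_size_to_half_points(value)
  m = value.to_s.strip.match(/\A(\d+(?:\.\d+)?)(pt|px)\z/i)
  return nil unless m

  points = m[1].to_f
  points *= 0.75 if m[2].downcase == "px" # CSS defines 1px as 0.75pt
  (points * 2).round
end

# Normalizes "#RGB" or "#RRGGBB" to the six-digit hex form used by w:color.
def color_to_hex(value)
  m = value.to_s.strip.match(/\A#(\h{3}|\h{6})\z/)
  return nil unless m

  hex = m[1]
  hex = hex.chars.map { |c| c * 2 }.join if hex.length == 3
  hex.upcase
end

# Maps a CSS text-align value to a w:jc value, or nil if unsupported.
def text_align_to_jc(value)
  JC_VALUES[value.to_s.strip.downcase]
end

puts font_size_to_half_points("12pt")
puts color_to_hex("#f00")
puts text_align_to_jc("justify")
```

The half-point unit is why a 12pt font appears as `24` in `<w:sz>`.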

4. Usage

# Import HTML into a document
doc = Uniword::Document.new
importer = Uniword::HtmlImporter.new(doc)
importer.import('<h1>Title</h1><p>Paragraph text</p>')
doc.save('output.docx')

5. Limitations

Not all HTML features have OOXML equivalents:

CSS positioning (position: absolute) is not directly supported
CSS flexbox/grid layouts have no OOXML counterpart
JavaScript-generated content cannot be imported
Complex nested lists may require post-processing
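
One way to deal with the layout-related limitations is to drop unsupported declarations from inline styles before import. The following is a minimal sketch of that idea, not a Uniword feature: `strip_unsupported` is a hypothetical helper, and it assumes simple `property: value` declarations separated by semicolons (no semicolons inside values).

```ruby
# Removes declarations that have no OOXML counterpart (position, flex/grid display)
# from an inline style string and returns the remaining declarations.
def strip_unsupported(style)
  declarations = style.to_s.split(";").map(&:strip).reject(&:empty?)
  kept = declarations.reject do |decl|
    prop, value = decl.split(":", 2)
    prop = prop.to_s.strip.downcase
    value = value.to_s.strip.downcase
    prop == "position" || (prop == "display" && value.match?(/\A(inline-)?(flex|grid)\z/))
  end
  kept.join("; ")
end

puts strip_unsupported("position: absolute; color: #333; display: flex; font-size: 12pt")
```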