Streaming and Performance

Table of Contents

1. Performance Characteristics
2. Design Features
3. Tips for Large Documents
4. Autoload Benefits
5. See Also

Uniword is optimized for performance with large documents through lazy loading, efficient serialization, and optimized ZIP handling.

1. Performance Characteristics

1.1. Document Size Scaling

Document	XML Size	Nodes	ZIP Parse	XML Parse	Deserialize	Total
Typical letter	~50KB	~500	<0.01s	~0.01s	~0.1s	~0.2s
ISO 8601	295KB	—	<0.01s	~0.02s	~0.5s	~0.6s
ISO 690	4.8MB	~130K	0.02s	0.05s	~20s	~20s
ISO DIS 5878	29.4MB	970K	0.05s	0.39s	varies	varies

ZIP extraction and Nokogiri XML parsing are consistently fast. The bottleneck is in lutaml-model deserialization, which creates Ruby objects for every XML element and attribute.
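
A per-phase breakdown like the one in the table can be collected with a small timing harness. The following is a minimal sketch using only Ruby's standard `Benchmark` module; the `time_phases` helper and the three stand-in lambdas are hypothetical and are not part of Uniword or lutaml-model.

```ruby
require 'benchmark'

# Times each named phase in order, feeding every phase the previous result.
# Returns [timings_hash, final_result].
def time_phases(input, phases)
  timings = {}
  result = phases.reduce(input) do |value, (label, step)|
    output = nil
    timings[label] = Benchmark.realtime { output = step.call(value) }
    output
  end
  [timings, result]
end

# Stand-in phases (hypothetical); swap in real ZIP, XML, and model calls.
phases = {
  zip_parse:   ->(path)  { "<doc path='#{path}'/>" },
  xml_parse:   ->(xml)   { xml.scan(/<(\w+)/).flatten },
  deserialize: ->(names) { names.map { |n| { element: n } } }
}

timings, result = time_phases('large-document.docx', phases)
timings.each { |label, secs| puts format('%-12s %.4fs', label, secs) }
```

In a real measurement the lambdas would wrap the ZIP read, the Nokogiri parse, and the lutaml-model deserialization, which makes it easy to see which phase dominates for a given document.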

1.2. Node Distribution

For the 970K-node ISO DIS 5878 document, the element breakdown is:

Element	Count	Percentage
`<w:tab>`	296,980	30.6%
`<w:rPr>`	63,336	6.5%
`<w:sz>`	55,916	5.8%
`<w:szCs>`	33,780	3.5%
`<w:r>`	33,207	3.4%
`<w:t>`	32,485	3.3%
`<w:jc>`	29,718	3.1%
`<w:divId>`	29,482	3.0%
`<w:p>`	28,283	2.9%
`<w:pPr>`	28,283	2.9%

Tab stops dominate because ISO documents use extensive tab-aligned numbering: the counts above work out to roughly 10 `<w:tab>` elements per paragraph across ~28K paragraphs.
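
The percentages and a tabs-per-paragraph average can be recomputed from the counts in the table. This is a small sketch, assuming a total of exactly 970,000 nodes (the table only gives the rounded figure 970K).

```ruby
# Element counts copied from the table above.
counts = {
  'w:tab' => 296_980, 'w:rPr' => 63_336, 'w:sz' => 55_916,
  'w:szCs' => 33_780, 'w:r' => 33_207, 'w:t' => 32_485,
  'w:jc' => 29_718, 'w:divId' => 29_482, 'w:p' => 28_283, 'w:pPr' => 28_283
}
total_nodes = 970_000 # assumption: the table rounds the total to 970K

# Share of each element as a percentage of all nodes, one decimal place.
shares = counts.transform_values { |n| (100.0 * n / total_nodes).round(1) }

# Average number of <w:tab> elements per <w:p> element.
tabs_per_paragraph = (counts['w:tab'].to_f / counts['w:p']).round(1)

shares.each { |name, pct| puts format('%-8s %5.1f%%', name, pct) }
puts "tabs per paragraph: #{tabs_per_paragraph}"
```

With these counts the average comes to about 10.5 `<w:tab>` elements per `<w:p>` element.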

1.3. Optimizations Applied

The following lutaml-model optimizations have been applied:

#	Optimization
1	Skip `build_input_declaration_plan` for non-root elements
2	Fast array key in `TransformationRegistry` (avoids symbol construction)
3	Fast path in `handle_transform_method` when no transforms present
4	Cached `Namespace#all_uris` as frozen array
5	`allocate_for_deserialization` using `allocate` + `init_deserialization_state`
6	Cache `castable?` at `MappingRule` init; type resolution cached on `Attribute`
7	Cached `extract_register_id` with fast nil-check
8	Fast path skip `ImportTransformer` when rule and attr have no transforms
9	Fast path in `value_map` when options are empty (no nil/empty/omitted)
10	Cache type/namespace class/URI on `Attribute` (`@type_ns_class_cache`, etc.)

Together, these optimizations reduced ISO 690 deserialization from ~52s to ~20s, roughly a 2.6x speedup.
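
To illustrate the idea behind optimization 5, the sketch below shows how `allocate` creates an instance without running `initialize`, so a deserializer can set only the state it needs. The `SketchElement` class is hypothetical; only the method names `allocate_for_deserialization` and `init_deserialization_state` come from the table, and the actual lutaml-model implementation may differ.

```ruby
# Hypothetical model class; not lutaml-model's actual implementation.
class SketchElement
  attr_reader :attributes, :validated

  # Normal construction path: does extra work a deserializer does not need.
  def initialize(attributes = {})
    @attributes = attributes.dup
    @validated = attributes.all? { |key, _| key.is_a?(Symbol) }
  end

  # Fast path: allocate skips initialize, then only minimal state is set.
  def self.allocate_for_deserialization
    instance = allocate
    instance.init_deserialization_state
    instance
  end

  def init_deserialization_state
    @attributes = {}
    @validated = false
  end
end

element = SketchElement.allocate_for_deserialization
element.attributes[:val] = '24'
```

The saving per object is small, but it is paid once per XML element and attribute, which is why it matters for documents with hundreds of thousands of nodes.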

2. Design Features

Lazy loading: The 95% autoload strategy ensures only needed classes are loaded. Opening a simple document does not load the 760 element classes.
Efficient serialization: Lutaml-model provides optimized XML serialization with minimized object allocation and string operations.
Optimized ZIP handling: The rubyzip-based ZIP handler reads and writes DOCX packages efficiently, processing parts on demand.
Thread-safe caching: The CachedTypeResolver uses mutex synchronization to ensure thread-safe type resolution. On MRI, single Hash reads and writes are effectively atomic, but compound operations such as ||= are a read followed by a write, so they need synchronization to avoid wasteful duplicate computation.
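
The mutex-guarded caching pattern can be sketched as follows. `SketchTypeResolver` is a hypothetical stand-in written for illustration; the real CachedTypeResolver may be structured differently.

```ruby
# Hypothetical resolver; the real CachedTypeResolver may differ.
class SketchTypeResolver
  attr_reader :computations

  def initialize
    @cache = {}
    @mutex = Mutex.new
    @computations = 0
  end

  # The check-then-store sequence runs under the mutex, so each key is
  # resolved at most once even when many threads ask for it at once.
  def resolve(name)
    @mutex.synchronize do
      @cache[name] ||= begin
        @computations += 1
        sleep 0.01 # simulate an expensive lookup
        "Resolved::#{name}"
      end
    end
  end
end

resolver = SketchTypeResolver.new
threads = Array.new(8) { Thread.new { resolver.resolve(:paragraph) } }
results = threads.map(&:value)
```

Without the `synchronize` block, several threads could see an empty cache entry at the same time and each run the expensive lookup.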

3. Tips for Large Documents

For documents with thousands of paragraphs or large embedded images:

Use Document.open instead of loading the entire document into memory when only reading
Process paragraphs in batches rather than materializing the entire collection
Consider splitting very large documents into sections

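# Efficient document processing
require 'uniword'

# Placeholder for per-paragraph work; replace with real logic
def process(para)
  para
end

doc = Uniword.load('large-document.docx')

# Iterate paragraphs one at a time instead of building intermediate arrays
doc.paragraphs.each do |para|
  process(para)
end

# Save with optimized serialization
doc.save('output.docx')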
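
The batching tip can be sketched with Ruby's `each_slice`. The `Paragraph` struct and the lazy collection below are hypothetical stand-ins so that the example runs without a document; with a real document, the paragraph collection would take their place.

```ruby
# Stand-in for a paragraph collection; a real document would supply its own.
Paragraph = Struct.new(:text)
paragraphs = (1..10).lazy.map { |i| Paragraph.new("Paragraph #{i}") }

batch_sizes = []
word_count = 0

# each_slice hands over fixed-size groups, so only one batch is held at a time.
paragraphs.each_slice(4) do |batch|
  batch_sizes << batch.size
  word_count += batch.sum { |para| para.text.split.size }
end

puts "batches: #{batch_sizes.inspect}, words: #{word_count}"
```

The batch size is a tuning knob: larger batches mean fewer iterations, smaller ones keep fewer objects alive at once.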

4. Autoload Benefits

The 95% autoload strategy provides measurable startup improvements:

90% fewer classes loaded at startup compared to eager loading
Memory footprint scales with actual document complexity
Unused namespaces (math, charts, presentations) stay unloaded
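
The mechanism behind these points is Ruby's built-in `autoload`, which registers a constant without loading its file until the constant is first referenced. The sketch below uses a temporary file and a hypothetical `SketchDoc::MathElement` class; it does not reflect Uniword's actual file layout.

```ruby
require 'tmpdir'
require 'fileutils'

# Hypothetical namespace and file; stands in for a rarely used element class.
dir = Dir.mktmpdir
path = File.join(dir, 'math_element.rb')
File.write(path, <<~RUBY)
  module SketchDoc
    class MathElement
      def name
        'math'
      end
    end
  end
RUBY

module SketchDoc
end

# Register the constant; the file is not read yet.
SketchDoc.autoload(:MathElement, path)

pending_before = SketchDoc.autoload?(:MathElement) # path while still unloaded
element_name = SketchDoc::MathElement.new.name     # first reference loads the file
pending_after = SketchDoc.autoload?(:MathElement)  # nil once loaded

FileUtils.remove_entry(dir)
```

A document that never references the constant never pays for loading the file, which is how unused namespaces stay out of memory.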

5. See Also

Autoload Strategy — Details on autoload coverage
Architecture — How layers minimize overhead