Uniword is optimized for performance with large documents through lazy loading, efficient serialization, and optimized ZIP handling.

1. Performance Characteristics

1.1. Document Size Scaling

Document          XML Size   Nodes    ZIP Parse   XML Parse   Deserialize   Total
Typical letter    ~50KB      ~500     <0.01s      ~0.01s      ~0.1s         ~0.2s
ISO 8601          295KB      n/a      <0.01s      ~0.02s      ~0.5s         ~0.6s
ISO 690           4.8MB      ~130K    0.02s       0.05s       ~20s          ~20s
ISO DIS 5878      29.4MB     970K     0.05s       0.39s       varies        varies

ZIP extraction and Nokogiri XML parsing are consistently fast. The bottleneck is in lutaml-model deserialization, which creates Ruby objects for every XML element and attribute.
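A breakdown like the one above can be reproduced by timing each stage separately with a monotonic clock. The following is a minimal sketch, not Uniword's own benchmark harness: `time_stages` and `slowest_stage` are hypothetical helpers, and the three lambdas are stand-ins (simple sleeps) for the real ZIP, XML, and deserialization calls.

```ruby
# Time each named stage with a monotonic clock and return a Hash of
# stage name => elapsed seconds. The helper names are hypothetical.
def time_stages(stages)
  stages.each_with_object({}) do |(name, work), timings|
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    work.call
    timings[name] = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
  end
end

# Return the name of the stage that took the longest.
def slowest_stage(timings)
  timings.max_by { |_name, seconds| seconds }.first
end

# Stand-in workloads; replace each lambda with the real stage call.
stages = {
  zip_parse:   -> { sleep 0.001 },
  xml_parse:   -> { sleep 0.002 },
  deserialize: -> { sleep 0.02 }
}

timings = time_stages(stages)
timings.each { |name, seconds| puts format('%-12s %.3fs', name, seconds) }
puts "bottleneck: #{slowest_stage(timings)}"
```

Timing stages individually, rather than only the total, is what makes it visible that deserialization and not parsing accounts for most of the load time.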

1.2. Node Distribution

For the 970K-node ISO DIS 5878 document, the element breakdown is:

Element      Count      Percentage
<w:tab>      296,980    30.6%
<w:rPr>      63,336     6.5%
<w:sz>       55,916     5.8%
<w:szCs>     33,780     3.5%
<w:r>        33,207     3.4%
<w:t>        32,485     3.3%
<w:jc>       29,718     3.1%
<w:divId>    29,482     3.0%
<w:p>        28,283     2.9%
<w:pPr>      28,283     2.9%

Tab stops dominate because ISO documents use extensive tab-aligned numbering: 296,980 <w:tab> elements across 28,283 paragraphs, or roughly 10 tabs per paragraph on average.

1.3. Optimizations Applied

The following lutaml-model optimizations have been applied:

#    Optimization
1    Skip build_input_declaration_plan for non-root elements
2    Fast array key in TransformationRegistry (avoids symbol construction)
3    Fast path in handle_transform_method when no transforms present
4    Cached Namespace#all_uris as frozen array
5    allocate_for_deserialization using allocate + init_deserialization_state
6    Cache castable? at MappingRule init; type resolution cached on Attribute
7    Cached extract_register_id with fast nil-check
8    Fast path skip ImportTransformer when rule and attr have no transforms
9    Fast path in value_map when options are empty (no nil/empty/omitted)
10   Cache type/namespace class/URI on Attribute (@type_ns_class_cache, etc.)

These optimizations reduced ISO 690 deserialization from ~52s to ~20s, roughly a 2.6x speedup.
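Two of the listed techniques, the cached frozen array (4) and allocation that bypasses the constructor (5), can be illustrated in isolation. This is a sketch of the general Ruby idioms only. `SketchElement` and `SketchNamespace` are hypothetical classes, and although the method names mirror those in the list, the bodies are assumptions rather than lutaml-model's actual code.

```ruby
# Hypothetical element class showing allocation without #initialize.
class SketchElement
  @initialize_calls = 0

  class << self
    attr_accessor :initialize_calls

    # Class#allocate creates an instance without running #initialize,
    # so only the state needed for deserialization is set up.
    def allocate_for_deserialization
      instance = allocate
      instance.init_deserialization_state
      instance
    end
  end

  attr_reader :attributes

  # The regular constructor; counted here to show it is skipped.
  def initialize
    self.class.initialize_calls += 1
    @attributes = {}
  end

  def init_deserialization_state
    @attributes = {}
  end
end

# Hypothetical namespace class showing a memoized, frozen result.
class SketchNamespace
  def initialize(*uris)
    @uris = uris
  end

  # Built once, then the same frozen array is returned on every call.
  def all_uris
    @all_uris ||= @uris.dup.freeze
  end
end

element = SketchElement.allocate_for_deserialization
namespace = SketchNamespace.new('urn:example:a', 'urn:example:b')
puts SketchElement.initialize_calls                  # constructor never ran
puts namespace.all_uris.equal?(namespace.all_uris)   # same object each time
```

Both idioms trade a little flexibility for fewer allocations per node, which matters when the same code path runs hundreds of thousands of times per document.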

2. Design Features

Lazy loading

The 95% autoload strategy ensures only needed classes are loaded. Opening a simple document does not load the 760 element classes.

Efficient serialization

Lutaml-model provides optimized XML serialization with minimized object allocation and string operations.

Optimized ZIP handling

The rubyzip-based ZIP handler reads and writes DOCX packages efficiently, processing parts on demand.

Thread-safe caching

The CachedTypeResolver uses mutex synchronization to ensure thread-safe type resolution. On CRuby, individual Hash reads and writes are effectively atomic under the global VM lock, but compound operations such as ||= are a read followed by a write, so they require synchronization to avoid wasteful duplicate computation.
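The mutex-guarded caching pattern described under thread-safe caching can be sketched as follows. `SketchTypeResolver` is a hypothetical class written for illustration; it is not the CachedTypeResolver implementation, whose internals are not shown here.

```ruby
# Hypothetical resolver: a Hash cache whose check-then-store sequence
# is guarded by a Mutex so each key is computed at most once.
class SketchTypeResolver
  def initialize(&resolver)
    @resolver = resolver
    @cache = {}
    @mutex = Mutex.new
  end

  def resolve(name)
    @mutex.synchronize do
      # Lookup and store happen under one lock, unlike a bare ||=.
      @cache.fetch(name) { @cache[name] = @resolver.call(name) }
    end
  end
end

calls = 0
counter_lock = Mutex.new

resolver = SketchTypeResolver.new do |name|
  counter_lock.synchronize { calls += 1 }
  sleep 0.01 # simulate an expensive resolution
  name.to_s.capitalize
end

# Eight threads ask for the same type at once.
results = Array.new(8) { Thread.new { resolver.resolve(:paragraph) } }.map(&:value)

puts calls
puts results.uniq.inspect
```

This sketch holds the lock while computing, which is the simplest correct form. It assumes the resolver block never calls `resolve` on the same instance, since Ruby's Mutex is not reentrant.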

3. Tips for Large Documents

For documents with thousands of paragraphs or large embedded images:

  • Use Document.open instead of loading the entire document into memory when only reading

  • Process paragraphs in batches rather than materializing the entire collection

  • Consider splitting very large documents into sections

# Efficient document processing
require 'uniword'

doc = Uniword.load('large-document.docx')

# Iterate with each rather than chaining map/select, which would
# create intermediate arrays. `process` is a placeholder for your
# own per-paragraph logic.
doc.paragraphs.each do |para|
  process(para)
end

# Save with optimized serialization
doc.save('output.docx')
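The batching tip above can be sketched with Ruby's `Enumerable#each_slice`. The `paragraphs` array here is a stand-in for `doc.paragraphs`, and `process_in_batches` is a hypothetical helper; whether Uniword returns a plain Array or another Enumerable is an assumption, though `each_slice` works on either.

```ruby
# Yield paragraphs to the caller in fixed-size batches so that any
# per-batch results can be flushed before the next batch starts.
# Hypothetical helper, not part of Uniword.
def process_in_batches(paragraphs, batch_size: 1000)
  return enum_for(:process_in_batches, paragraphs, batch_size: batch_size) unless block_given?

  paragraphs.each_slice(batch_size) do |batch|
    yield batch
  end
end

# Stand-in for doc.paragraphs.
paragraphs = (1..10).map { |i| "Paragraph #{i}" }

batch_sizes = []
process_in_batches(paragraphs, batch_size: 4) do |batch|
  batch_sizes << batch.length
end

puts batch_sizes.inspect
```

Batching does not shrink the loaded document itself. It limits how much derived data (extracted text, converted output) is held at once.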

4. Autoload Benefits

The 95% autoload strategy provides measurable startup improvements:

  • 90% fewer classes loaded at startup compared to eager loading

  • Memory footprint scales with actual document complexity

  • Unused namespaces (math, charts, presentations) stay unloaded
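The mechanism behind these benefits is Ruby's built-in `Module#autoload`, which registers a constant together with the file that defines it and defers loading until the constant is first referenced. The sketch below demonstrates this with a temporary file. `SketchFormats` and `Charts` are hypothetical names, not Uniword classes.

```ruby
require 'tmpdir'

module SketchFormats; end

pending_before = nil
pending_after = nil
label = nil

Dir.mktmpdir do |dir|
  # Write a file that defines the class, as a stand-in for a
  # rarely needed namespace such as charts.
  path = File.join(dir, 'sketch_charts.rb')
  File.write(path, <<~RUBY)
    module SketchFormats
      class Charts
        def self.label
          'charts loaded'
        end
      end
    end
  RUBY

  # Register the constant; nothing is loaded yet.
  SketchFormats.autoload(:Charts, path)
  pending_before = SketchFormats.autoload?(:Charts) # path while still unloaded

  # The first reference triggers the load.
  label = SketchFormats::Charts.label
  pending_after = SketchFormats.autoload?(:Charts)  # nil once loaded
end

puts pending_before.nil? ? 'loaded eagerly' : 'deferred until first use'
puts label
```

A document that never references the constant never pays the cost of loading its file, which is why unused namespaces stay out of memory.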

5. See Also