Uniword is optimized for performance with large documents through lazy loading, efficient serialization, and optimized ZIP handling.
1. Performance Characteristics
1.1. Document Size Scaling
| Document | XML Size | Nodes | ZIP Parse | XML Parse | Deserialize | Total |
|---|---|---|---|---|---|---|
| Typical letter | ~50KB | ~500 | <0.01s | ~0.01s | ~0.1s | ~0.2s |
| ISO 8601 | 295KB | n/a | <0.01s | ~0.02s | ~0.5s | ~0.6s |
| ISO 690 | 4.8MB | ~130K | 0.02s | 0.05s | ~20s | ~20s |
| ISO DIS 5878 | 29.4MB | 970K | 0.05s | 0.39s | varies | varies |
ZIP extraction and Nokogiri XML parsing are consistently fast. The bottleneck is in lutaml-model deserialization, which creates Ruby objects for every XML element and attribute.
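To check where time goes for a particular file, each stage can be timed on its own. The following is a minimal sketch using Ruby's standard `Benchmark` module. The `time_phases` helper and the stand-in lambdas are hypothetical and do not reflect Uniword's internals; in practice the three steps would wrap ZIP extraction, XML parsing, and model deserialization.

```ruby
require 'benchmark'

# Runs each named phase in order, passing the previous phase's result to the
# next one, and records wall-clock seconds per phase.
# `time_phases` is a hypothetical helper, not part of Uniword.
def time_phases(input, phases)
  timings = {}
  result = phases.reduce(input) do |value, (name, step)|
    output = nil
    timings[name] = Benchmark.realtime { output = step.call(value) }
    output
  end
  [result, timings]
end

# Stand-in phases; swap in real ZIP, XML, and deserialization calls.
phases = {
  zip_parse:   ->(raw)   { raw.dup },
  xml_parse:   ->(xml)   { xml.scan(/<(\w+)/).flatten },
  deserialize: ->(names) { names.map { |n| { element: n } } }
}

objects, timings = time_phases('<doc><p/><p/></doc>', phases)
timings.each { |name, secs| puts format('%-12s %.6fs', name, secs) }
```

Timing the phases separately makes it easy to confirm whether deserialization, rather than I/O or parsing, dominates for a given document.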
1.2. Node Distribution
For the 970K-node ISO DIS 5878 document, the element breakdown is:
| Element | Count | Percentage |
|---|---|---|
| | 296,980 | 30.6% |
| | 63,336 | 6.5% |
| | 55,916 | 5.8% |
| | 33,780 | 3.5% |
| | 33,207 | 3.4% |
| | 32,485 | 3.3% |
| | 29,718 | 3.1% |
| | 29,482 | 3.0% |
| | 28,283 | 2.9% |
| | 28,283 | 2.9% |
Tab stops dominate because ISO documents use extensive tab-aligned numbering (~30 tabs per paragraph across ~27K paragraphs).
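A breakdown like this can be produced by tallying element names and converting counts to percentages. The sketch below assumes the element names have already been collected into an array (for example by walking the parsed XML tree); `element_breakdown` and the sample data are hypothetical.

```ruby
# Returns [name, count, percent] rows, most frequent first.
# `element_breakdown` is a hypothetical helper, not part of Uniword.
def element_breakdown(names, top: 10)
  total = names.size.to_f
  names.tally
       .sort_by { |_name, count| -count }
       .first(top)
       .map { |name, count| [name, count, (100.0 * count / total).round(1)] }
end

# Made-up sample standing in for the element names of a real document.
sample = %w[tab tab tab r t tab p r tab t]
element_breakdown(sample).each do |name, count, pct|
  puts format('%-6s %6d %5.1f%%', name, count, pct)
end
```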
1.3. Optimizations Applied
The following lutaml-model optimizations have been applied:
| # | Optimization |
|---|---|
| 1 | Skip |
| 2 | Fast array key in |
| 3 | Fast path in |
| 4 | Cached |
| 5 | |
| 6 | Cache |
| 7 | Cached |
| 8 | Fast path skip |
| 9 | Fast path in |
| 10 | Cache type/namespace class/URI on |
These optimizations reduced ISO 690 deserialization from ~52s to ~20s.
2. Design Features
- Lazy loading: The 95% autoload strategy ensures only needed classes are loaded. Opening a simple document does not load the 760 element classes.
- Efficient serialization: Lutaml-model provides optimized XML serialization with minimized object allocation and string operations.
- Optimized ZIP handling: The rubyzip-based ZIP handler reads and writes DOCX packages efficiently, processing parts on demand.
- Thread-safe caching: The `CachedTypeResolver` uses mutex synchronization to ensure thread-safe type resolution. In CRuby, single Hash reads and writes are effectively atomic, but compound operations such as `||=` are not, so they need synchronization to avoid wasteful duplicate computation.
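The check-then-store problem behind `||=` can be illustrated with a small mutex-guarded cache. This is a minimal sketch of the general technique, not the actual `CachedTypeResolver` implementation; the `SynchronizedCache` class and its `fetch` method are hypothetical names.

```ruby
# A hypothetical mutex-guarded memoizing cache (not Uniword's actual class).
class SynchronizedCache
  def initialize
    @store = {}
    @mutex = Mutex.new
  end

  # Returns the cached value for key, computing it at most once.
  # The lookup and the store happen under one lock, so two threads
  # cannot both miss and both compute the same value.
  def fetch(key)
    @mutex.synchronize do
      @store.fetch(key) { @store[key] = yield(key) }
    end
  end
end

cache = SynchronizedCache.new
computations = 0

threads = 8.times.map do
  Thread.new do
    cache.fetch(:paragraph) do |key|
      computations += 1 # runs inside the lock, so the increment is safe
      "resolved #{key}"
    end
  end
end

results = threads.map(&:value)
puts computations
puts results.uniq.inspect
```

Even with eight threads requesting the same key, the block runs only once. The trade-off in this sketch is that the lock is held while the value is computed, which serializes all misses; that is acceptable when computations are short.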
3. Tips for Large Documents
For documents with thousands of paragraphs or large embedded images:
- Use Document.open instead of loading the entire document into memory when only reading
- Process paragraphs in batches rather than materializing the entire collection
- Consider splitting very large documents into sections
```ruby
# Efficient document processing
doc = Uniword.load('large-document.docx')

# Process paragraphs without creating intermediate arrays
doc.paragraphs.each do |para|
  process(para)
end

# Save with optimized serialization
doc.save('output.docx')
```
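Batch processing can be sketched with `Enumerable#each_slice`, which hands out fixed-size groups without first building the whole collection. The `in_batches` helper and the lazy stand-in collection below are hypothetical; the sketch assumes only that the paragraph collection is an `Enumerable`.

```ruby
# Yields items in fixed-size batches without building the full array first.
# `in_batches` is a hypothetical helper; any Enumerable works as input.
def in_batches(items, size, &block)
  items.each_slice(size, &block)
end

# Stand-in for a paragraph collection: a lazy enumerator over 10 items.
paragraphs = (1..10).lazy.map { |i| "paragraph #{i}" }

in_batches(paragraphs, 4) do |batch|
  puts "processing #{batch.size} paragraphs, starting at #{batch.first}"
end
```

Only one batch is held in memory at a time, so peak memory depends on the batch size rather than on the document length.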
4. Autoload Benefits
The 95% autoload strategy provides measurable startup improvements:
- 90% fewer classes loaded at startup compared to eager loading
- Memory footprint scales with actual document complexity
- Unused namespaces (math, charts, presentations) stay unloaded
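The mechanism behind these benefits is Ruby's built-in `Module#autoload`, which registers a file for a constant and defers the `require` until the constant is first referenced. The sketch below demonstrates this with a throwaway file; the `LazyDemo` module, the `Chart` class, and the file name are hypothetical and unrelated to Uniword's real class layout.

```ruby
require 'tmpdir'
require 'fileutils'

module LazyDemo
end

# Write a throwaway source file that defines LazyDemo::Chart.
dir = Dir.mktmpdir
path = File.join(dir, 'lazy_demo_chart.rb')
File.write(path, <<~SOURCE)
  module LazyDemo
    class Chart
    end
  end
SOURCE

# Register the constant; nothing is loaded yet.
LazyDemo.autoload(:Chart, path)

before = LazyDemo.autoload?(:Chart) # the registered path while still unloaded
chart_class = LazyDemo::Chart       # first reference triggers the require
after = LazyDemo.autoload?(:Chart)  # nil once the file has been loaded

puts "pending before first use: #{!before.nil?}"
puts "pending after first use:  #{!after.nil?}"

FileUtils.remove_entry(dir)
```

A constant that is never referenced never has its file read, which is why unused namespaces cost nothing at startup.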
5. See Also
- Autoload Strategy: details on autoload coverage
- Architecture: how layers minimize overhead