Converting Basic DITA Topics to SSML/SAPI for TTS Output
1. Introduction
These pages contain a collection of DITA topics and the XSL transforms
that can be used to generate SAPI/SSML equivalents. The contents are
as follows:
- Sample Task and Concept topics from DITA Open Toolkit 1.3.1
- XSLT files for DITA Task->SAPI/SSML and DITA Concept->SAPI/SSML
- SAPI and SSML files generated by the transform

Sample files are available for download: DITA-To-SAPI-SSML.zip
DITA to SAPI:
The version of SAPI supported on Windows XP is 5.1. SAPI is not a markup language, but it supports an XML-based language for
TTS. The SAPI documentation references SABLE (one of SSML’s
ancestors), but the SAPI XML format is not compatible with it. Most of the differences relate to
minor changes in terminology.
DITA to SSML:
Speech Synthesis Markup Language (SSML) is a W3C
standard. It is an XML-based markup language that gives authors of
synthesizable text control over aspects of the synthesized speech,
including pronunciation, volume, pitch, and rate. SSML is supported by
SAPI version 5.3, which is supplied with Windows Vista.
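To make the kind of control SSML offers concrete, the following is a minimal sketch in Python that builds a small SSML document with a `prosody` element. The function name `build_ssml`, the sample sentence, and the chosen rate and volume values are assumptions for illustration; they are not taken from the downloadable sample files.

```python
import xml.etree.ElementTree as ET

SSML_NS = "http://www.w3.org/2001/10/synthesis"

def build_ssml(text, rate="slow", volume="loud"):
    """Wrap text in a minimal SSML 1.0 document with prosody controls.

    build_ssml is a hypothetical helper written for this example.
    """
    speak = ET.Element("speak", {
        "version": "1.0",
        "xmlns": SSML_NS,
        "xml:lang": "en-US",
    })
    # <prosody> carries the rate and volume settings for the enclosed text.
    prosody = ET.SubElement(speak, "prosody", {"rate": rate, "volume": volume})
    prosody.text = text
    return ET.tostring(speak, encoding="unicode")

ssml = build_ssml("Remove the old oil filter.")
print(ssml)
```

The resulting string is one complete `speak` document; whether a given engine honors every `prosody` attribute depends on the engine.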
2. XSLT Transform Details
The sample tasks in the DITA OT are short and very simple. Topics of this
type can be presented effectively using speech. The general approach to
transforming DITA concept topics to SAPI/SSML is as follows:
2.1. Retain DITA structural tags
This is important because these tags are useful for structural, DOM-style
navigation of the SSML. Applications thus have the option of submitting
the SSML material to a TTS engine in chunks, as opposed to submitting a
complete document. The retained DITA tags are not recognized as SSML and
will be ignored by SAPI-compatible engines.
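The chunked submission described above could look roughly like the following Python sketch. The `SAMPLE` document is a hypothetical example of transform output (it is not copied from the sample files), and `step_chunks` is an illustrative helper name. The sketch walks the retained DITA `step` elements and wraps each one in its own `speak` element.

```python
import xml.etree.ElementTree as ET

# Hypothetical transform output: DITA <title>, <steps>, and <step> tags
# are retained around the SSML content.
SAMPLE = """<speak version="1.0" xml:lang="en-US">
  <title><mark name="title"/>Changing the oil<break time="700ms"/></title>
  <steps>
    <step><mark name="step1"/>Remove the old oil filter.</step>
    <step><mark name="step2"/>Drain the old oil.</step>
  </steps>
</speak>"""

def step_chunks(ssml_text):
    """Yield one stand-alone <speak> document per retained DITA <step>."""
    root = ET.fromstring(ssml_text)
    for step in root.iter("step"):
        # Copy the root attributes so each chunk keeps version and language.
        speak = ET.Element("speak", dict(root.attrib))
        speak.append(step)
        yield ET.tostring(speak, encoding="unicode")

chunks = list(step_chunks(SAMPLE))
for chunk in chunks:
    print(chunk)
```

An application could pass each chunk to the TTS engine separately, for example to wait for user confirmation between steps.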
2.2. Create additional SAPI/SSML elements
Some additional SAPI/SSML elements are required for effective presentation
of the material using speech. These include the following:
- Add structural mark events: SSML
“mark” (equivalent to SAPI “bookmark”) events can be particularly useful
in building interactive speech-enabled user assistance systems. They
enable the assistance system, and potentially other application software,
to be aware of the current position of the auditory assistance output
stream. These events are an important element in facilitating more dynamic
interaction between the assistance system and an application, e.g. dynamic
highlighting of user interface elements in synchronization with speech
output. As a result, the transform adds mark event elements related to the
current structural element (e.g. title, taskstep, etc.) to the SSML output.
- Add mark elements for sonification: It might be useful to play sounds to
represent certain structural elements, e.g. play a non-speech alert sound
as a cue for the listener before presenting the speech material, play a
sound associated with a bullet point, etc.
- Add structural pauses: The
appropriate layout of elements in visual material, and the surrounding
white space, are important in facilitating comprehension of a text.
Likewise, appropriate pauses are an important element in being able to
understand speech. As part of a general text structure analysis, a TTS
engine will apply rules for inserting pauses when processing a text.
However, some additional explicit pauses are required for titles, section
headers, etc. The XSLT files define xsl variables for the various
structural pauses, with some default values. For more information on
specifying appropriate pause values, and on understanding
the role of pauses in speech, see:
  - Cohen, M., Giangola, J., Balogh, J. (2004). Voice User Interface Design. ISBN: 0321185765. Pages 196-199.
  - Pitt, I., Edwards, A. (2002). Design of Speech Based Devices. ISBN: 1852334363.
  - Goldman-Eisler, F. (1972). Pauses, Clauses, Sentences. Language and Speech 15: 103-113.
  - Duez, D. (1982). Silent and Non-Silent Pauses in Three Speech Styles. Language and Speech 25(1): 11-28.
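The three additions above (structural marks, sonification marks, and structural pauses) can be sketched together in Python. This is an illustration of the idea rather than a port of the XSLT files: the DITA task, the mark names, and the pause lengths `TITLE_PAUSE` and `STEP_PAUSE` are all assumed values standing in for the xsl variables and their defaults.

```python
import xml.etree.ElementTree as ET

# Assumed pause lengths, standing in for the xsl variables in the transforms.
TITLE_PAUSE = "700ms"
STEP_PAUSE = "400ms"

# Hypothetical DITA task used only for this example.
DITA_TASK = """<task id="changeoil">
  <title>Changing the oil</title>
  <taskbody>
    <steps>
      <step><cmd>Remove the old oil filter.</cmd></step>
      <step><cmd>Drain the old oil.</cmd></step>
    </steps>
  </taskbody>
</task>"""

def add_marked(parent, tag, mark_name, text, pause):
    """Append a retained DITA element holding a mark, the text, and a break."""
    elem = ET.SubElement(parent, tag)
    mark = ET.SubElement(elem, "mark", {"name": mark_name})
    mark.tail = text  # the spoken text follows the mark event
    ET.SubElement(elem, "break", {"time": pause})
    return elem

def task_to_ssml(dita_text):
    """Build SSML with structural marks, a sonification mark, and pauses."""
    task = ET.fromstring(dita_text)
    speak = ET.Element("speak", {"version": "1.0", "xml:lang": "en-US"})
    # Sonification cue: the application may map this mark to a non-speech sound.
    ET.SubElement(speak, "mark", {"name": "sound.topic_start"})
    add_marked(speak, "title", "title", task.findtext("title"), TITLE_PAUSE)
    steps = ET.SubElement(speak, "steps")
    for number, step in enumerate(task.iter("step"), start=1):
        add_marked(steps, "step", f"step{number}", step.findtext("cmd"), STEP_PAUSE)
    return ET.tostring(speak, encoding="unicode")

ssml_out = task_to_ssml(DITA_TASK)
print(ssml_out)
```

For SAPI output, the corresponding elements would be `bookmark` in place of `mark` and `silence` (with a millisecond value) in place of `break`. The pause values here are placeholders; the references above discuss how to choose suitable ones.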
2.3. To do list
Some additional work needs to be done to map certain visual
elements to an effective auditory equivalent. These include:
- Highlighted text: Use the SAPI <emph> tag (<emphasis> in SSML).
- Lists: Create introductory text to state how many items are in a list
(e.g. “List of 5 elements”) and to indicate the final element (e.g. “and finally, ”)
- Tables ….
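The list treatment proposed above could be prototyped along the following lines. This is only a sketch of the idea in Python, not part of the existing transforms; the helper name `announce_list` and the sample items are assumptions.

```python
def announce_list(items):
    """Return spoken-text fragments for a list.

    Adds an introduction stating the item count and marks the final item
    with "and finally, ". announce_list is a hypothetical helper.
    """
    if not items:
        return []
    noun = "element" if len(items) == 1 else "elements"
    spoken = [f"List of {len(items)} {noun}."]
    for position, item in enumerate(items, start=1):
        # Only cue the last item when there is more than one item.
        is_last = position == len(items) and len(items) > 1
        spoken.append(f"and finally, {item}" if is_last else item)
    return spoken

fragments = announce_list(["Oil", "Filter", "Wrench"])
print(fragments)
```

Each fragment could then be wrapped in its own element with a mark and a pause, in the same way as the titles and steps.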
3. Comments?
Any comments/feedback/etc?
aidankehoe at gmail dot com