Converting Basic DITA Topics to SSML/SAPI for TTS Output

1. Introduction

These pages contain a collection of DITA topics, together with XSL transforms that can be used to generate SAPI/SSML equivalents. The contents are as follows:

  • Sample Task and Concept topics from DITA Open Toolkit 1.3.1
  • XSLT files for DITA Task->SAPI/SSML and DITA Concept->SAPI/SSML
  • SAPI and SSML files generated by the transforms

Sample files are available for download here: DITA-To-SAPI-SSML.zip

 

DITA to SAPI: The version of SAPI supported on Windows XP is 5.1. SAPI is not itself a markup language, but it supports an XML-based markup for TTS. The SAPI documentation references SABLE (one of SSML’s ancestors), but the SAPI markup is not compatible with it. Most of the differences relate to minor changes in terminology.

 

DITA to SSML: Speech Synthesis Markup Language (SSML) is a W3C standard. It is an XML-based markup language that gives authors of synthesizable text control over aspects of the synthesized speech, including pronunciation, volume, pitch and rate. SSML is supported by SAPI version 5.3, which is supplied with Windows Vista.
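
The two target formats express the same ideas with different element names. The following is a minimal Python sketch, not part of the sample files, that renames three common SSML elements to what I understand to be their SAPI 5 XML counterparts (mark to bookmark, break to silence, emphasis to emph). The function name ssml_to_sapi is hypothetical, only millisecond break times are handled, and everything else (including the root element) is left untouched.

```python
import xml.etree.ElementTree as ET


def ssml_to_sapi(element):
    """Rename a few SSML elements in place to SAPI 5 XML equivalents.

    Assumption: this covers only <mark>, <break time="...ms"> and
    <emphasis>; it is an illustration, not a complete converter.
    """
    for node in element.iter():
        if node.tag == "mark":
            # SSML <mark name="x"/>  ->  SAPI <bookmark mark="x"/>
            node.tag = "bookmark"
            node.set("mark", node.attrib.pop("name", ""))
        elif node.tag == "break" and node.get("time", "").endswith("ms"):
            # SSML <break time="400ms"/>  ->  SAPI <silence msec="400"/>
            node.tag = "silence"
            node.set("msec", node.attrib.pop("time")[:-2])
        elif node.tag == "emphasis":
            # SSML <emphasis>  ->  SAPI <emph>
            node.tag = "emph"
    return element


ssml = ET.fromstring(
    '<speak><mark name="step-1"/>Remove the <emphasis>drain plug</emphasis>.'
    '<break time="400ms"/></speak>'
)
sapi = ssml_to_sapi(ssml)
```

In the sample files this mapping is done by separate XSLT files per target format; the sketch only shows how small the naming differences are.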

 

2. XSLT Transform Details

The sample tasks in the DITA OT are short and very simple, and topics of this type can be presented effectively using speech. The general approach to transforming DITA concept topics to SAPI/SSML is as follows:

2.1. Retain DITA structural tags

This is important because these tags allow structural, DOM-style navigation of the SSML. Applications therefore have the option of submitting the SSML material to a TTS engine in chunks, as opposed to submitting a complete document. The retained DITA tags are not recognized as SSML, and SAPI-compatible engines will ignore them.
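
As a rough illustration of the chunked submission described above, the Python sketch below walks a transformed document and wraps each retained DITA step element in its own speak element. The sample markup, the chunks function and the simplified speak wrapper (no namespace or language attributes) are assumptions made for the example; they are not taken from the sample files.

```python
import copy
import xml.etree.ElementTree as ET

# Hypothetical transform output: SSML with the DITA structural tags retained.
SAMPLE = """<speak version="1.0">
  <task id="changeoil">
    <title>Changing the oil</title>
    <taskbody>
      <steps>
        <step><cmd>Remove the drain plug.</cmd></step>
        <step><cmd>Drain the old oil.</cmd></step>
      </steps>
    </taskbody>
  </task>
</speak>"""


def chunks(ssml_text, tag="step"):
    """Yield one standalone <speak> document per retained DITA element."""
    root = ET.fromstring(ssml_text)
    for element in root.iter(tag):
        piece = copy.deepcopy(element)
        piece.tail = None  # drop the indentation that followed the element
        speak = ET.Element("speak", {"version": "1.0"})
        speak.append(piece)
        yield ET.tostring(speak, encoding="unicode")


step_chunks = list(chunks(SAMPLE))
```

Each string in step_chunks could then be passed to the TTS engine separately, for example one step at a time as the user works through the task.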

2.2. Create additional SAPI/SSML elements

There are some additional SAPI/SSML elements that are required for effective presentation of the material using speech. These include the following:

 

  • Add structural mark events: SSML “mark” events (equivalent to SAPI “bookmark” events) can be particularly useful in building interactive speech-enabled user assistance systems. They enable the assistance system, and potentially other application software, to be aware of the current position in the auditory output stream. These events are an important element in facilitating more dynamic interaction between the assistance system and an application, e.g. highlighting user interface elements in synchronization with the speech output. The transform therefore adds mark elements related to the current structural element (e.g. title, task step, etc.) to the SSML output.

 

  • Add mark elements for sonification: It might be useful to play sounds to represent certain structural elements, e.g. play a non-speech alert as a cue for the listener before the speech material is presented, or play a sound associated with each bullet point.

 

  • Add structural pauses: The layout of elements in visual material, and the surrounding white space, are important in facilitating comprehension of a text. Likewise, appropriate pauses are an important element in understanding speech. As part of its general text structure analysis, a TTS engine applies rules for inserting pauses when processing a text. However, some additional explicit pauses are required for titles, section headers, etc. The XSLT files define xsl variables for the various structural pauses, with some default values. For more information on specifying appropriate pause values, and on the role of pauses in speech, see:

 

    • Cohen, M., Giangola, J. and Balogh, J. (2004). Voice User Interface Design. ISBN: 0321185765. Pages 196-199.
    • Pitt, I. and Edwards, A. (2002). Design of Speech-Based Devices. ISBN: 1852334363.
    • Goldman-Eisler, F. (1972). Pauses, Clauses, Sentences. Language and Speech 15: 103-113.
    • Duez, D. (1982). Silent and Non-Silent Pauses in Three Speech Styles. Language and Speech 25(1): 11-28.
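
The mark and pause insertion described in this section can be sketched as follows. This is a Python illustration of the idea rather than the XSLT from the sample files: the PAUSE_MS values, the mark naming scheme (element name plus a counter) and the function name are all assumptions chosen for the example, not the defaults defined in the XSLT variables.

```python
import xml.etree.ElementTree as ET

# Hypothetical pause lengths in milliseconds, keyed by DITA element name.
PAUSE_MS = {"title": 750, "step": 400}


def add_marks_and_pauses(root, pauses=PAUSE_MS):
    """Insert an SSML <mark/> before, and a <break/> after, each listed element."""
    counts = {}
    # Snapshot the tree first so newly inserted elements are not revisited.
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag not in pauses:
                continue
            counts[child.tag] = counts.get(child.tag, 0) + 1
            index = list(parent).index(child)
            mark = ET.Element("mark", {"name": f"{child.tag}-{counts[child.tag]}"})
            pause = ET.Element("break", {"time": f"{pauses[child.tag]}ms"})
            # Keep any text that followed the element after the pause.
            pause.tail, child.tail = child.tail, None
            parent.insert(index, mark)
            parent.insert(index + 2, pause)
    return root


doc = ET.fromstring(
    "<speak><task><title>Changing the oil</title><steps>"
    "<step><cmd>Remove the drain plug.</cmd></step>"
    "<step><cmd>Drain the old oil.</cmd></step>"
    "</steps></task></speak>"
)
add_marks_and_pauses(doc)
```

An application that receives the mark events could use the name prefix (here "title" or "step") to decide which user interface element to highlight, or which non-speech sound to play for sonification. For SAPI output the corresponding elements would be bookmark and silence.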

2.3. To do list

Some additional work needs to be done to map certain visual elements to an effective auditory equivalent. These include:

 

  • Highlighted text: Use the SAPI <emph> tag (the SSML equivalent is <emphasis>).

  • Lists: Create introductory text to state how many items are in a list (e.g. “List of 5 elements”) and to indicate the final element (e.g. “and finally, ”).

  • Tables ….
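
One possible shape for the list handling above is sketched here in Python. It is only an assumption about how the to-do item might be implemented: the announce_list function is hypothetical, the wording of the spoken phrases is taken from the examples in the list item, and DITA ul/li markup is assumed as input.

```python
import xml.etree.ElementTree as ET


def announce_list(list_element, item_tag="li"):
    """Return spoken-text strings for a list: a count, then each item."""
    items = [
        "".join(li.itertext()).strip()
        for li in list_element.findall(item_tag)
    ]
    if not items:
        return []
    spoken = [f"List of {len(items)} elements."]
    for position, text in enumerate(items, start=1):
        # Flag the final item, but only when there is more than one.
        is_last = position == len(items) and len(items) > 1
        spoken.append(f"and finally, {text}" if is_last else text)
    return spoken


ul = ET.fromstring("<ul><li>Oil</li><li>Filter</li><li>Wrench</li></ul>")
phrases = announce_list(ul)
```

The returned phrases could then be wrapped in whatever pause and mark elements the transform already emits for other structural elements.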

 

3. Comments?

Any comments/feedback/etc?

aidankehoe at gmail dot com