SSML

SSML, the Speech Synthesis Markup Language, gives content creators a way to improve the default synthetic speech rendering of their publications at the markup level. Liberal use of SSML ensures that anyone listening to your work via TTS playback hears the prose as intended, rather than as their rendering engine's best guess.

The phoneme element from SSML has been implemented in EPUB 3 as a pair of attributes for defining pronunciations at the markup level:

The ssml:alphabet attribute is used to set the default phonetic alphabet.
The ssml:ph attribute is used to define the pronunciation for any element with text content or for which a phonetic pronunciation can be associated (e.g., an empty element whose voicing is derived from an attached attribute).

(Support for the full SSML specification is not available in EPUB 3.)
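
To illustrate the empty-element case mentioned above, here is a minimal sketch of an image whose spoken text would otherwise be taken from its alt attribute. The image path and alt text are hypothetical, the markup assumes that the ssml namespace has been declared and that x-sampa is the in-scope alphabet, and actual reading system behavior may vary.

```xml
<!-- Hypothetical image: the alt text "bass" is ambiguous on its own,
     so ssml:ph supplies the intended pronunciation (the instrument) -->
<img src="images/bass-guitar.png" alt="bass" ssml:ph="beIs"/>
```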

Unlike PLS lexicons, SSML provides fine-grained control over pronunciation at the markup level. SSML can be used to override a default pronunciation for heteronyms, to correctly pronounce complex word and number forms, etc.

To use the SSML attributes, you must first declare the SSML namespace. The declaration is typically made once per document on the root html element. (See Example 1.)

A default alphabet is also typically defined once on the root html element, as it is rare to need to switch phonetic alphabets within any single document. Adding the ssml:alphabet attribute to the root ensures that all instances of the ssml:ph attribute have an in-scope alphabet defined. Defining a pronunciation in an ssml:ph attribute without an in-scope alphabet is an error and will result in rendering errors. (See Example 1.)

When an ssml:ph attribute is encountered, its value is passed to the text-to-speech (TTS) engine in place of the element's content, providing the lowest-level override. Pronunciations defined in SSML attributes also take precedence over PLS lexicon entries, ensuring that heteronyms and other exceptions to the rule can be properly handled.

Note that the value of the ssml:ph attribute entirely replaces the content of the element that it is attached to, including all descendant elements. The attribute should not be attached to a p element to define the pronunciation of one word in the paragraph, for example, because only that one word will then be read in place of the entire paragraph. The use of span elements is recommended when no markup exists on the word(s) that need a pronunciation attached.
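
As a rough illustration of this scoping rule, the sketch below contrasts a misplaced attribute with a correctly scoped one. The sentence is made up for the illustration, and the markup assumes that the ssml namespace is declared and that x-sampa is the in-scope alphabet.

```xml
<!-- Avoid: the attribute value replaces the whole paragraph,
     so only the single word "bass" would be voiced -->
<p ssml:ph="beIs">She plays bass in a band.</p>

<!-- Prefer: a span limits the replacement to the one word -->
<p>She plays <span ssml:ph="beIs">bass</span> in a band.</p>
```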

The SSML attributes are not valid on SVG or MathML content, but are valid on any XHTML content that can be embedded in those grammars.

Examples

Example 1 — Declaring the SSML namespace and phonetic alphabet on the document root

<html …
   xmlns:ssml="http://www.w3.org/2001/10/synthesis"
   ssml:alphabet="x-sampa">
   …
</html>

Example 2 — Declaring a phonetic alphabet and pronunciation at the word level

<p>
   … farther <span 
   ssml:alphabet="ipa" ssml:ph="nɔrθ">N.</span>
   another elevation begins …
</p>

(Note that single letters are a poor choice to define in a lexicon because they could be initials, directions or other forms of contractions depending on context.)

Example 3 — Defining different pronunciations for heteronyms

<p>
   The guitarist was playing a
   <span ssml:ph="beIs">bass</span> that was shaped
   like a <span ssml:ph="b&amp;s">bass</span>.
</p>

Example 4 — Defining a pronunciation when a default alphabet has already been set

<p>
   The guitarist was playing a bass that was shaped
   like a <span ssml:ph="b&amp;s">bass</span>.
</p>

A PLS lexicon would include the following entry to define the default pronunciation:

<lexeme>
   <grapheme>bass</grapheme>
   <phoneme>beIs</phoneme>
</lexeme>

Example 5 — Defining different pronunciations based on context

<p>
   You'll be an
   <span ssml:ph="Ekstr@ lArdZ">XL</span>
   by the end of Super Bowl 
   <span ssml:ph="'fOrti">XL</span>
   at the rate you're eating.
</p>

Compliance References and Standards

EPUB 3 — SSML Attributes

Additional Resources

Frequently Asked Questions

Are the SSML attributes supported at this time?

At the time of writing, no reading systems have appeared that support the new SSML enhancements in EPUB 3. Please send a report if the situation changes and this page has not been updated.

Should I use IPA or X-SAMPA or something else to write my pronunciations?

Although IPA is arguably the most widely recognized phonetic alphabet, it is not fully supported by every existing synthetic speech engine. Some engines support only their own alphabets, for example. IPA is also less developer-friendly than X-SAMPA because it uses Unicode characters that most keyboard layouts cannot input without modification, whereas X-SAMPA is ASCII-based. For now, your internal workflow should be a determining factor. The ultimate answer will depend on which engines are employed in reading systems.

Note that it is possible to translate one alphabet representation to the other, so work in either alphabet shouldn't ever be lost if there does turn out to be a clear winner and loser.
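
As a small illustration of that equivalence, the pronunciation used for "N." in Example 2 can be written in either alphabet. The X-SAMPA string below is this sketch's own transliteration of the IPA value (ɔ written as O, θ written as T) and is not taken from the examples above.

```xml
<!-- IPA, as in Example 2 -->
<span ssml:alphabet="ipa" ssml:ph="nɔrθ">N.</span>

<!-- Assumed X-SAMPA equivalent of the same pronunciation -->
<span ssml:alphabet="x-sampa" ssml:ph="nOrT">N.</span>
```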

Should I use PLS lexicons or SSML?

Both technologies were included in EPUB 3 to complement each other, not to force a choice between them. PLS lexicons allow you to define a word once and have the TTS engine do the work of replacing it each time it occurs in the prose. SSML, on the other hand, provides fine-grained control that is not possible in a lexicon, at the price of having to tag each instance of a term that has to be replaced.

It is possible to use SSML exclusively, but it is costly in terms of production time and can excessively bloat the size of your content files depending on how many unique terms have to be handled and how often they occur.