SSML — the Speech Synthesis Markup Language — provides a way for content creators to enhance the default synthetic speech rendering of their publications at the markup level. The liberal use of SSML ensures that anyone listening to your work via TTS playback hears the prose as intended, not based on the best guess of their rendering engine.
The phoneme element from SSML has been implemented in EPUB 3 as a pair of attributes for defining pronunciations at the markup level:
The ssml:alphabet attribute is used to set the default phonetic alphabet.
The ssml:ph attribute is used to define the pronunciation for any element with text content or for which a phonetic pronunciation can be associated (e.g., an empty element whose voicing is derived from an attached attribute).
(Support for the full SSML specification is not available in EPUB 3.)
Unlike PLS lexicons, SSML provides fine-grained control over pronunciation at the markup level. SSML can be used to override a default pronunciation for heteronyms, to correctly pronounce complex word and number forms, etc.
To use the SSML attributes, you must first declare the SSML namespace. The declaration is typically made once per document on the root html element. (See Example 1.)
A default alphabet is also typically defined once on the root html element, as it is rare to need to switch phonetic alphabets within a single document. Adding the ssml:alphabet attribute to the root ensures that all instances of the ssml:ph attribute have an in-scope alphabet defined. Defining a pronunciation in an ssml:ph attribute without an in-scope alphabet is an error and will result in rendering errors. (See Example 1.)
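As a minimal sketch of the two declarations described above, a root element might look like the following. The choice of X-SAMPA as the default alphabet, the title, and the body text are assumptions made for illustration; the namespace URI is the one defined for SSML.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- The ssml prefix is bound once on the root element. The default
     alphabet (x-sampa, an illustrative choice) is then in scope for
     every ssml:ph attribute in the document. -->
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:ssml="http://www.w3.org/2001/10/synthesis"
      ssml:alphabet="x-sampa"
      xml:lang="en" lang="en">
  <head>
    <title>Illustrative chapter</title>
  </head>
  <body>
    <p>Chapter text goes here.</p>
  </body>
</html>
```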
When an ssml:ph attribute is encountered, its value is passed to the text-to-speech (TTS) engine in place of the element's content, providing the lowest-level override. Pronunciations defined in SSML attributes also take precedence over PLS lexicon entries, ensuring that heteronyms and other exceptions to the rule can be properly handled.
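The heteronym case can be sketched as follows. The sentences are invented for illustration, and the X-SAMPA transcriptions are approximate General American renderings rather than authoritative ones.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:ssml="http://www.w3.org/2001/10/synthesis"
      ssml:alphabet="x-sampa"
      xml:lang="en" lang="en">
  <head>
    <title>Heteronym sketch</title>
  </head>
  <body>
    <!-- "bow" as in bending at the waist (approximately baU) -->
    <p>The actors took a <span ssml:ph="baU">bow</span>.</p>
    <!-- "bow" as in a knot of ribbon (approximately boU) -->
    <p>She tied the ribbon in a <span ssml:ph="boU">bow</span>.</p>
  </body>
</html>
```

Because each instance carries its own pronunciation, the same spelling can be voiced differently depending on context, which a single lexicon entry for the word could not express.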
Note that the value of the ssml:ph attribute entirely replaces the content of the element that it is attached to, including all descendant elements. For example, the attribute should not be attached to a p element to define the pronunciation of one word contained in the paragraph, as only that one word will be read in place of the entire paragraph. The use of span elements is recommended when no markup exists on the word(s) that need a pronunciation attached.
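The placement issue described above might look like this in practice. The sentence and the approximate X-SAMPA value for the metal "lead" are illustrative assumptions.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:ssml="http://www.w3.org/2001/10/synthesis"
      ssml:alphabet="x-sampa"
      xml:lang="en" lang="en">
  <head>
    <title>Attribute placement sketch</title>
  </head>
  <body>
    <!-- Problematic: the attribute value replaces the entire paragraph,
         so only the single word would be voiced. -->
    <p ssml:ph="lEd">The pipes were made of lead.</p>

    <!-- Preferred: a span limits the override to the one word. -->
    <p>The pipes were made of <span ssml:ph="lEd">lead</span>.</p>
  </body>
</html>
```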
The SSML attributes are not valid on SVG or MathML content, but are valid on any XHTML content that can be embedded in those grammars.
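One way to read this rule is sketched below, using XHTML embedded in SVG through foreignObject. The dimensions and text are invented for illustration. Repeating ssml:alphabet on the embedded p element is a cautious assumption rather than a stated requirement, made so that the sketch does not rely on the alphabet being inherited through the SVG elements.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:ssml="http://www.w3.org/2001/10/synthesis"
      ssml:alphabet="x-sampa"
      xml:lang="en" lang="en">
  <head>
    <title>Embedded XHTML sketch</title>
  </head>
  <body>
    <!-- No SSML attributes appear on the svg or foreignObject elements. -->
    <svg xmlns="http://www.w3.org/2000/svg" width="200" height="60">
      <foreignObject x="0" y="0" width="200" height="60">
        <!-- XHTML content embedded in the SVG carries the attributes. -->
        <p xmlns="http://www.w3.org/1999/xhtml" ssml:alphabet="x-sampa">
          Tie the ribbon in a <span ssml:ph="boU">bow</span>.
        </p>
      </foreignObject>
    </svg>
  </body>
</html>
```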
At the time of writing, no reading systems have appeared that support the new SSML enhancements in EPUB 3. Please send a report if the situation changes and this page has not been updated.
Although IPA is arguably the most widely recognized phonetic alphabet, that does not mean that it has full support even in existing synthetic speech engines. Some engines support only their own alphabets, for example. IPA is also less developer-friendly than X-SAMPA because it uses Unicode characters that require modifying most keyboard layouts to input, whereas X-SAMPA is ASCII-based. Internal workflows should be a determining factor at this time. The ultimate answer will depend on what engines are employed in reading systems.
Note that it is possible to translate one alphabet representation to the other, so work in either alphabet should not be lost if there does turn out to be a clear winner and loser.
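To illustrate the correspondence between the two alphabets, the sketch below gives the same approximate pronunciation once in IPA and once in X-SAMPA. It assumes that ssml:alphabet may be set on an individual element to override the document default, which the earlier remark about switching alphabets within a document suggests but does not spell out. The alphabet names "ipa" and "x-sampa" and the transcriptions are likewise illustrative.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:ssml="http://www.w3.org/2001/10/synthesis"
      ssml:alphabet="x-sampa"
      xml:lang="en" lang="en">
  <head>
    <title>Alphabet comparison sketch</title>
  </head>
  <body>
    <!-- IPA: needs the non-ASCII character U+028A for the second vowel. -->
    <p>Take a <span ssml:alphabet="ipa" ssml:ph="baʊ">bow</span>.</p>

    <!-- X-SAMPA: the same sound is written with the ASCII letter U. -->
    <p>Take a <span ssml:ph="baU">bow</span>.</p>
  </body>
</html>
```

Because the mapping between symbols is mechanical (here, ʊ in IPA corresponds to U in X-SAMPA), a transcription written in one alphabet can be converted to the other.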
PLS lexicons and SSML were both included in EPUB 3 not to force a choice between them, but because the two technologies are meant to complement each other. PLS lexicons allow you to define a word once and have the TTS engine do the work of replacing it each time it occurs in the prose. SSML, on the other hand, provides the fine-grained control that is not possible in a lexicon, at the price of having to tag each instance of a term that has to be replaced.
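For contrast with per-instance SSML tagging, a define-once lexicon entry might be sketched as follows. The word chosen, its approximate X-SAMPA transcription, and the use of "x-sampa" as the alphabet name are assumptions for illustration. The element names and namespace follow the W3C Pronunciation Lexicon Specification.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- A single lexeme covers every occurrence of the word in the prose,
     with no markup needed at each instance. -->
<lexicon version="1.0"
         xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
         alphabet="x-sampa"
         xml:lang="en">
  <lexeme>
    <grapheme>EPUB</grapheme>
    <phoneme>i:pVb</phoneme>
  </lexeme>
</lexicon>
```

A term handled this way needs no markup in the content documents. An ssml:ph attribute is then only required where a specific instance has to deviate from the lexicon entry.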
It is possible to use SSML exclusively, but it is costly in terms of production time and can excessively bloat the size of your content files depending on how many unique terms have to be handled and how often they occur.