Convert PowerPoint Speaker Notes to Audio Using Amazon Polly

This week I had a chance to work on an interesting request from another department in our company. They wanted to generate voiceover audio for a number of training PowerPoint decks so that they could make them into training videos. The script/content for this audio was embedded in the speaker notes of the slide deck.

Since I have been prototyping with Amazon Polly for several other solutions, I suggested making use of Polly rather than having someone record their voice.

While this could easily have been done by getting someone (not me!) to copy and paste the notes into the Amazon Polly console and manually download the audio, I wanted to create something a little less manual.

I decided to write a simple script in Python to perform this task. There are three main activities for each file:

  1. Extract the speaker notes for each slide.
  2. Insert some tags into the text to add some pauses during the audio rendering.
  3. Render each slide’s notes to audio via Amazon Polly and save to disk.

Extract speaker notes from slides

To get the speaker notes out of the .PPTX file, I used a library called python-pptx. See the python-pptx documentation for instructions on how to install and use the library.

The function below takes a string containing the path to the PowerPoint file to be processed, and returns a list of (slide number, speaker notes text) tuples, where the slide number counts from zero.

from pptx import Presentation

def getNotes(file):
    # Use the Presentation() function to
    # create a Presentation object for the
    # specified PPTX file.
    ppt = Presentation(file)

    notes = []

    # Iterate over the slides in the presentation
    for page, slide in enumerate(ppt.slides):
        # Extract the speaker notes for the given slide
        textNote = slide.notes_slide.notes_text_frame.text
        # Add some SSML tags to the text
        textNote = addTags(textNote)
        # Store the (zero-based) slide number with the tagged text
        notes.append((page, textNote))

    return notes
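
One thing the function above does not handle is a slide with no speaker notes. As a hedged sketch (not part of the original script), the variant below skips such slides using python-pptx's has_notes_slide property. The names collectNotes and getNotesSafe are hypothetical, and the SSML tagging step is left out to keep the example short.

```python
def collectNotes(slides):
    """Return (slide index, notes text) tuples, skipping slides without notes."""
    notes = []
    for page, slide in enumerate(slides):
        # has_notes_slide is False when the slide has no notes page;
        # reading slide.notes_slide in that case would create an empty one.
        if not slide.has_notes_slide:
            continue
        frame = slide.notes_slide.notes_text_frame
        # notes_text_frame can be None if the notes page has no text placeholder
        if frame is None or not frame.text.strip():
            continue
        notes.append((page, frame.text))
    return notes


def getNotesSafe(file):
    # Imported here so collectNotes can be tried without python-pptx installed
    from pptx import Presentation
    return collectNotes(Presentation(file).slides)
```

Skipping empty slides also avoids sending an empty request to Polly later on; the slide index is kept so the output file names still line up with the deck.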

Add some tags for pauses

In order to reduce some of the manual cleanup required to produce “nice” audio, I found I had to add some SSML tags to the extracted notes (the SSML tags supported by Amazon Polly are listed in the Amazon Polly documentation).

The tags I am adding are:

  • <speak>: outermost element around SSML content
  • <prosody>: used (in this case) to control the “speed” of the rendered audio
  • <break/>: inserts a short pause in the rendered audio. I insert these after commas and colons to make the rendered audio sound a little more natural.
  • <s/>: inserts a slightly longer pause, like the one after a period. I insert these at each line break in the notes.

I also do a little bit of cleanup here to get rid of some “ugly” characters that came out of the PowerPoint notes (like “\x0b”).

Here is the function which accomplishes this (it is very basic string manipulation):

def addTags(textNote):
    # Add <speak> tag and speed control
    textNote = '<speak><prosody rate="medium">' + textNote + '</prosody></speak>'
    # Get rid of some character codes
    textNote = textNote.replace("\x0b", "")
    # Replace "EYE" with "E.Y.E."
    textNote = textNote.replace("EYE", "E.Y.E.")
    # Replace "\n" with "<s/>"
    textNote = textNote.replace("\n", "<s/>") 
    # Add pauses after commas
    textNote = textNote.replace(",", ",<break/>")
    # Add pauses after colons
    textNote = textNote.replace(":", ":<break/>")
    return textNote
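
Because the notes end up inside an SSML document, characters such as & or < in the speaker notes could make the markup invalid. A minimal sketch, assuming you want to escape them before the tags are added; the helper name escapeForSsml is hypothetical, and it relies on the standard library's xml.sax.saxutils.escape.

```python
from xml.sax.saxutils import escape

def escapeForSsml(textNote):
    # Escape &, < and > so they are read as text rather than markup
    return escape(textNote)

print(escapeForSsml("R&D costs < 5%"))  # prints: R&amp;D costs &lt; 5%
```

The escaping has to happen first, for example addTags(escapeForSsml(textNote)), otherwise the SSML tags themselves would be escaped too.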

Render audio using Amazon Polly

To access the Amazon Polly service from Python, I use the boto3 library. See the boto3 documentation for how to install it, as well as the basic instructions for using it to access your AWS account.

First you create a client to access the AWS service (in this case Polly). From there you use the synthesize_speech method of the client to render the text to audio. There are a number of parameters you can pass in to control the rendering, including:

  • the speech engine to use (standard or neural)
  • the language
  • the voice to use (Amazon Polly supports a large number of different voices)
  • the desired output format
  • the text to be rendered.

See the boto3 documentation and the Amazon Polly documentation to see the available parameters and the supported input values.
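
To see which voices are available for a given engine and language, the Polly client's describe_voices call can be used. The sketch below is illustrative only: listVoiceIds is a hypothetical helper, it takes an already-created client as an argument, and it ignores pagination (the NextToken field) for brevity.

```python
def listVoiceIds(client, engine="neural", language_code="en-US"):
    """Return the sorted voice IDs Polly reports for an engine and language."""
    response = client.describe_voices(Engine=engine, LanguageCode=language_code)
    return sorted(voice["Id"] for voice in response["Voices"])

# Example usage (needs boto3 and configured AWS credentials):
#   import boto3
#   print(listVoiceIds(boto3.client("polly")))
```

Any ID returned this way should be usable as the voice parameter of synthesize_speech for the same engine and language.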

The response object from the synthesize_speech method contains an audio stream, which I read from and then write out to disk.

import boto3

def renderAudio(file_root, slide_number, input_text):
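    # Instantiate a boto3 client for Amazon Polly.
    # The secret key below is a placeholder; the region is taken from
    # your AWS configuration.
    client = boto3.Session(aws_access_key_id='********************',
                           aws_secret_access_key='********************').client('polly')
    # Render the text to audio
    response = client.synthesize_speech(
        Engine = 'neural',
        LanguageCode = 'en-US',
        VoiceId = 'Joanna',    # assumption: any voice that supports this engine and language
        OutputFormat = 'mp3',  # matches the .mp3 file name used below
        TextType = 'ssml',
        Text = input_text)
    # Save it to disk
    with open(file_root + '_slide_' + str(slide_number) + '_audio.mp3', 'wb') as file:
        file.write(response['AudioStream'].read())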
Putting it all together

That’s about it. Given these functions, all you have to do now is something like this (note that this will write the output audio to the same folder as your PPTX file):

input_file = "<path to your PPTX file>"
file_root = input_file[:-5]  # strip the ".pptx" extension
notes = getNotes(input_file)
print("notes from " + file_root + "\n")
for note in notes:
    renderAudio(file_root, note[0]+1, note[1])
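
The slice input_file[:-5] assumes the file name ends in a five-character extension such as ".pptx". A slightly more forgiving sketch uses os.path.splitext from the standard library; the helper name outputRoot and the example path are hypothetical.

```python
import os

def outputRoot(input_file):
    # Drop the extension, whatever its length or case
    return os.path.splitext(input_file)[0]

print(outputRoot("training/deck.PPTX"))  # prints: training/deck
```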

NOTE: I have not added any validation or error handling to this code, so copy/paste at your own risk!