This week I had a chance to work on an interesting request from another department in our company. They wanted to generate voiceover audio for a number of training PowerPoint decks so that they could make them into training videos. The script/content for this audio was embedded in the speaker notes of the slide deck.
Since I have been prototyping with Amazon Polly for several other solutions, I suggested using Polly rather than having someone record their voice.
While this could easily have been done by getting someone (not me!) to copy and paste the notes into the Amazon Polly console and manually download the audio, I wanted to create something a little less manual.
I decided to write a simple script in Python to perform this task. There are three main activities for each file:
- Extract the speaker notes for each slide.
- Insert some tags into the text to add some pauses during the audio rendering.
- Render each slide’s notes to audio via Amazon Polly and save to disk.
Extract speaker notes from slides
The function below takes a string containing the path to the PowerPoint file to be processed, and returns a list of tuples containing (zero-based slide index, speaker notes text).
```python
from pptx import Presentation

def getNotes(file):
    # Use the Presentation() function to
    # create a Presentation object for the
    # specified PPTX file.
    ppt = Presentation(file)
    notes = []
    # Iterate over the slides in the presentation
    for page, slide in enumerate(ppt.slides):
        # Extract the speaker notes for the given slide
        textNote = slide.notes_slide.notes_text_frame.text
        # Add some SSML tags to the text
        textNote = addTags(textNote)
        notes.append((page, textNote))
    return notes
```
Add some tags for pauses
To reduce some of the manual cleanup required to produce “nice” audio, I found I had to add some SSML tags to the extracted notes (the SSML tags supported by Amazon Polly are listed in the Amazon Polly documentation).
The tags I am adding are:
- <speak>: outermost element around SSML content
- <prosody>: used (in this case) to control the “speed” of the rendered audio
- <break/>: inserts a short pause in the rendered audio. I insert these after commas to make the rendered audio sound a little more natural.
- <s/>: inserts a slightly longer pause. I insert these in place of line breaks in the notes.
I also do a little bit of cleanup here to get rid of some “ugly” characters that came out of the PowerPoint notes (like “\x0b”).
Here is the function which accomplishes this (it is very basic string manipulation):
```python
def addTags(textNote):
    # Add <speak> tag and speed control
    textNote = '<speak><prosody rate="medium">' + textNote + '</prosody></speak>'
    # Get rid of some character codes
    textNote = textNote.replace("\x0b", "")
    # Replace "EYE" with "E.Y.E."
    textNote = textNote.replace("EYE", "E.Y.E.")
    # Replace "\n" with "<s/>"
    textNote = textNote.replace("\n", "<s/>")
    # Add pauses after commas
    textNote = textNote.replace(",", ",<break/>")
    # Add pauses after colons
    textNote = textNote.replace(":", ":<break/>")
    return textNote
```
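To make the effect of these replacements concrete, here is a small standalone sketch that applies the same substitutions to a made-up note. The function name `add_tags_escaped` and the XML-escaping step are assumptions, not part of the script above: SSML is XML, so a stray `&` or `<` in the speaker notes would leave the markup malformed, and escaping before adding the tags is one way to guard against that.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def add_tags_escaped(text_note):
    """Hypothetical variant of addTags that escapes XML characters first."""
    # Assumption: notes may contain &, < or >, which must be escaped in SSML.
    body = escape(text_note)
    # Same clean-up and pause rules as addTags (the "EYE" rule is left out).
    body = body.replace("\x0b", "")
    body = body.replace("\n", "<s/>")
    body = body.replace(",", ",<break/>")
    body = body.replace(":", ":<break/>")
    # Wrap last, so the replacements never touch the wrapper tags.
    return '<speak><prosody rate="medium">' + body + '</prosody></speak>'


sample = "Q&A: first, open the file.\nThen save it."
ssml = add_tags_escaped(sample)
# ET.fromstring raises ParseError if the result is not well-formed XML.
ET.fromstring(ssml)
print(ssml)
```

Well-formed XML is only a first check; whether Polly accepts a given tag still depends on the voice and engine in use.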
Render audio using Amazon Polly
To access the Amazon Polly service from Python, I use the boto3 library. The boto3 documentation explains how to install it and gives the basic instructions for using it to access your AWS account.
First you create a client to access the AWS service (in this case Polly). From there you use the synthesize_speech method of the client to render the text to audio. There are a number of parameters you can pass in to control the rendering, including:
- the speech engine to use (standard or neural)
- the language
- the voice to use (Amazon Polly supports a large number of different voices)
- the desired output format
- the text to be rendered.
The response object from the synthesize_speech method contains an audio stream, which I read from and then write out to disk.
```python
import boto3

def renderAudio(file_root, slide_number, input_text):
    # Instantiate a boto3 client for Amazon Polly
    client = boto3.Session(
        aws_access_key_id='********************',
        aws_secret_access_key='****************************************',
        region_name='ca-central-1').client('polly')
    # Render the text to audio
    response = client.synthesize_speech(
        Engine='neural',
        LanguageCode='en-US',
        TextType='ssml',
        VoiceId='Matthew',
        OutputFormat='mp3',
        Text=input_text)
    # Save it to disk
    with open(file_root + '_slide_' + str(slide_number) + '_audio.mp3', 'wb') as file:
        file.write(response['AudioStream'].read())
```
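The keys are hard-coded above for brevity. As a hedged alternative, the sketch below takes the Polly client as a parameter, so credentials can come from boto3's default lookup (environment variables or the shared AWS config files) instead of the source code. The names `render_audio_to` and `FakePollyClient` are hypothetical; the fake client only stands in for Polly so the example runs without an AWS account.

```python
import io
import os
import tempfile


def render_audio_to(client, out_path, ssml_text):
    """Render SSML with the given Polly-like client and save it as MP3."""
    response = client.synthesize_speech(
        Engine="neural",
        LanguageCode="en-US",
        TextType="ssml",
        VoiceId="Matthew",
        OutputFormat="mp3",
        Text=ssml_text,
    )
    # The with-block closes the file even if the write fails.
    with open(out_path, "wb") as audio_file:
        audio_file.write(response["AudioStream"].read())
    return out_path


class FakePollyClient:
    """Hypothetical stand-in for the real client, for a dry run only."""

    def synthesize_speech(self, **kwargs):
        # Returns the same response shape the script relies on.
        return {"AudioStream": io.BytesIO(b"fake-mp3-bytes")}


out_dir = tempfile.mkdtemp()
saved = render_audio_to(
    FakePollyClient(),
    os.path.join(out_dir, "deck_slide_1_audio.mp3"),
    "<speak>Hello</speak>",
)
print(saved)
```

With real credentials configured, the client would be created with `boto3.client('polly', region_name='ca-central-1')` and passed in place of `FakePollyClient()`.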
Putting it all together
That’s about it. Given these functions, all you have to do now is something like this (note that this will write the output audio to the same folder as your PPTX file):
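```python
input_file = "<path to your PPTX file>"
# Strip the ".pptx" extension
file_root = input_file[:-5]
notes = getNotes(input_file)
print("notes from " + file_root + "\n")
for note in notes:
    # note is (zero-based slide index, SSML text)
    renderAudio(file_root, note[0] + 1, note[1])
```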
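The `input_file[:-5]` slice assumes the path ends in exactly `.pptx`. A slightly more forgiving sketch, using only the standard library, derives the output name from the path instead. The helper name `audio_path_for` and the sample path are made up for illustration.

```python
from pathlib import Path


def audio_path_for(input_file, slide_number):
    """Build '<deck>_slide_<n>_audio.mp3' next to the input file."""
    deck = Path(input_file)
    # with_suffix("") drops the final extension, whatever its length.
    file_root = deck.with_suffix("")
    return file_root.parent / f"{file_root.name}_slide_{slide_number}_audio.mp3"


# enumerate() numbers slides from 0, so add 1 for human-friendly names.
for page in range(2):
    print(audio_path_for("training/intro.pptx", page + 1))
```

The returned `Path` can be passed straight to `open()`, so it slots into the save step without other changes.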
NOTE: I have not added any validation or error handling to this code, so copy/paste at your own risk!