Convert PowerPoint Speaker Notes to Audio Using Amazon Polly

This week I had a chance to work on an interesting request from another department in our company. They wanted to generate voiceover audio for a number of training PowerPoint decks so that they could make them into training videos. The script/content for this audio was embedded in the speaker notes of the slide deck.

Since I have been prototyping with Amazon Polly for several other solutions, I suggested making use of Polly rather than having someone record their voice.

While this could easily have been done by getting someone (not me!) to copy and paste the notes into the Amazon Polly console and manually download the audio, I wanted to create something a little less manual.

I decided to write a simple script in Python to perform this task. There are three main activities for each file:

  1. Extract the speaker notes for each slide.
  2. Insert some tags into the text to add some pauses during the audio rendering.
  3. Render each slide’s notes to audio via Amazon Polly and save to disk.

Extract speaker notes from slides

To get the speaker notes out of the .PPTX file, I used a library called python-pptx. See the python-pptx documentation for instructions on how to install and use the library.

The function below takes a string containing the path to the PowerPoint file to be processed, and returns a list of tuples containing (zero-based slide index, speaker notes text with SSML tags added).

from pptx import Presentation

def getNotes(file):
    # Use the Presentation() function to 
    # create a Presentation object for the
    # specified PPTX file.
    ppt=Presentation(file)

    notes = []

    # Iterate over the slides in the presentation
    for page, slide in enumerate(ppt.slides):
        # Extract the speaker notes for the given slide
        textNote = slide.notes_slide.notes_text_frame.text
        
        # Add some SSML tags to the text
        textNote = addTags(textNote)

        notes.append((page,textNote)) 
        
    return notes
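
One thing to watch for: in python-pptx, reading slide.notes_slide on a slide that has no notes page creates one, and notes_text_frame can be None when the notes page has no text placeholder. The sketch below is a more defensive variant, not a replacement for the function above. The name extract_notes is hypothetical, it leaves out the addTags step, and it accepts any iterable of slide objects (for example Presentation(file).slides).

```python
def extract_notes(slides):
    """Return a list of (zero-based slide index, notes text) tuples.

    `slides` is any iterable of python-pptx slide objects, for example
    Presentation(path).slides. Slides without notes yield an empty string.
    """
    notes = []
    for index, slide in enumerate(slides):
        text = ""
        # has_notes_slide avoids creating an empty notes page as a side effect
        if slide.has_notes_slide:
            frame = slide.notes_slide.notes_text_frame
            # notes_text_frame is None when there is no notes placeholder
            if frame is not None:
                text = frame.text
        notes.append((index, text))
    return notes
```

Slides that come back with an empty string can then be skipped before calling Polly, since there is nothing to render for them.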

Add some tags for pauses

To reduce the manual clean-up required to produce “nice” audio, I found I had to add some SSML tags to the extracted notes (the SSML tags supported by Amazon Polly are listed in the Amazon Polly documentation).

The tags I am adding are:

  • <speak>: outermost element around SSML content
  • <prosody>: used (in this case) to control the “speed” of the rendered audio
  • <break/>: inserts a short pause in the rendered audio. I insert these after commas to make the rendered audio sound a little more natural.
  • <s/>: inserts a slightly longer pause. I substitute it for each line break in the notes.

I also do a little bit of cleanup here to get rid of some “ugly” characters that came out of the PowerPoint notes (like “\x0b”).

Here is the function which accomplishes this (it is very basic string manipulation):

def addTags(textNote):
    # Add <speak> tag and speed control
    textNote = '<speak><prosody rate="medium">' + textNote + '</prosody></speak>'
    # Get rid of some character codes
    textNote = textNote.replace("\x0b", "")
    # Replace "EYE" with "E.Y.E."
    textNote = textNote.replace("EYE", "E.Y.E.")
    # Replace "\n" with "<s/>"
    textNote = textNote.replace("\n", "<s/>") 
    # Add pauses after commas
    textNote = textNote.replace(",", ",<break/>")
    # Add pauses after colons
    textNote = textNote.replace(":", ":<break/>")
    
    return textNote
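
SSML is XML, so notes containing characters such as & or < would produce markup that is not well formed. The sketch below is one possible variant that escapes the raw text first and only then inserts the pause tags. The name build_ssml and its rate parameter are hypothetical, and the “EYE” replacement specific to my decks is left out.

```python
from xml.sax.saxutils import escape

def build_ssml(text, rate="medium"):
    """Escape the raw notes, then add the pause tags and the SSML wrapper."""
    # Drop the vertical-tab characters found in the extracted notes
    text = text.replace("\x0b", "")
    # Escape &, < and > so they cannot break the SSML document
    text = escape(text)
    # Insert the pause tags only after escaping, so they stay as real tags
    text = text.replace("\n", "<s/>")
    text = text.replace(",", ",<break/>")
    text = text.replace(":", ":<break/>")
    return '<speak><prosody rate="' + rate + '">' + text + "</prosody></speak>"
```

Escaping before tagging matters: doing it the other way around would also escape the inserted tags.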

Render audio using Amazon Polly

To access the Amazon Polly service from Python, I use the boto3 library. See the boto3 documentation for how to install it, as well as the basic instructions for using it to access your AWS account.

First you create a client to access the AWS service (in this case Polly). From there you use the synthesize_speech method of the client to render the text to audio. There are a number of parameters you can pass in to control the rendering, including:

  • the speech engine to use (standard or neural)
  • the language
  • the voice to use (Amazon Polly supports a large number of different voices)
  • the desired output format
  • the text to be rendered.

See the boto3 documentation and the Amazon Polly documentation for the available parameters and the supported input values.

The response object from the synthesize_speech method contains an audio stream, which I read from and then write out to disk.

import boto3

def renderAudio(file_root, slide_number, input_text):
    # Instantiate a boto3 client for Amazon Polly
    client = boto3.Session(aws_access_key_id='********************',
                           aws_secret_access_key='****************************************',
                           region_name='ca-central-1').client('polly')
    
    # Render the text to audio
    response = client.synthesize_speech(
        Engine = 'neural',
        LanguageCode = 'en-US',
        TextType = 'ssml',
        VoiceId='Matthew', 
        OutputFormat='mp3', 
        Text = input_text)
    
    # Save it to disk
    file = open(file_root + '_slide_' + str(slide_number) + '_audio.mp3', 'wb')
    file.write(response['AudioStream'].read())
    file.close()
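
The function above puts the access keys directly in the source. boto3 can also pick up credentials from environment variables or the shared AWS configuration files, which keeps them out of the script. The sketch below assumes that setup and takes an already-created Polly client as an argument, so the client is created once rather than once per slide. The name render_audio_to_file and its parameters are hypothetical; the synthesize_speech arguments are the same ones used above.

```python
from contextlib import closing

def render_audio_to_file(client, ssml_text, out_path, voice_id="Matthew"):
    """Render SSML with a Polly client and write the MP3 bytes to out_path."""
    response = client.synthesize_speech(
        Engine="neural",
        LanguageCode="en-US",
        TextType="ssml",
        VoiceId=voice_id,
        OutputFormat="mp3",
        Text=ssml_text,
    )
    # Close the audio stream once its bytes have been read
    with closing(response["AudioStream"]) as stream:
        audio = stream.read()
    # The with-block closes the file even if the write fails
    with open(out_path, "wb") as out_file:
        out_file.write(audio)
    return out_path

# Example (assumes boto3 is installed and credentials are configured
# outside the script, e.g. in environment variables or ~/.aws):
#   import boto3
#   polly = boto3.client("polly", region_name="ca-central-1")
#   render_audio_to_file(polly, "<speak>Hello</speak>", "deck_slide_1_audio.mp3")
```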

Putting it all together

That’s about it. Given these functions, all you have to do now is something like this (note that this will try to write the output audio to the same folder where it finds your PPTX file):

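input_file = "<path to your PPTX file>"  # replace with the real path

# Strip the ".pptx" extension (5 characters) to get the output file root
file_root = input_file[:-5]
notes = getNotes(input_file)
print("notes from " + file_root + "\n")
for page, text in notes:
    # getNotes returns zero-based indexes; file names use 1-based slide numbers
    renderAudio(file_root, page + 1, text)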

NOTE: I have not added any validation or error handling to this code, so copy/paste at your own risk!
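
As a starting point for that missing validation, here is a minimal sketch that checks the input path before any audio is rendered and builds the output file name with pathlib rather than string slicing. The helper name audio_path_for is hypothetical, and the naming scheme simply mirrors the one used in renderAudio.

```python
from pathlib import Path

def audio_path_for(pptx_path, slide_number):
    """Validate the input path and build the output MP3 path for one slide."""
    source = Path(pptx_path)
    if source.suffix.lower() != ".pptx":
        raise ValueError("expected a .pptx file, got: " + str(source))
    if not source.is_file():
        raise FileNotFoundError("no such file: " + str(source))
    # Same naming scheme as renderAudio: <deck>_slide_<n>_audio.mp3
    return source.with_name(source.stem + "_slide_" + str(slide_number) + "_audio.mp3")
```

Failing early like this avoids making paid Polly calls for a path that was mistyped.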

Be thankful you cannot understand their pain

This may not be the most elegant or coherent thing I have ever written, but it is 4 in the morning, and I cannot sleep because this bothers me so much. 

Over the past week, we have all heard much about the graves of some 215 children found at the site of the Kamloops Indian Residential School. While I do not think this comes as a surprise to anyone who has been paying any attention, I think that facing the reality of this tragedy and knowing that it is likely just the tip of the iceberg should be a source of immense pain and outrage not just for Indigenous communities, but for each and every one of us. 

I would like to be able to say that I understand or even imagine what survivors and affected communities are feeling but in truth I cannot. I cannot even begin to comprehend. 

I have a five-year-old granddaughter, with whom I have been very close. Unfortunately, for reasons I will not get into, I have not been allowed to see her for the past 6 months (and do not honestly know if I will ever see her again) and this has been extremely difficult for me.

But I know where she is. I know she is safe. I know she is taken care of. And I know she is with people who love her.

I look at her face, and I cannot imagine knowing that she has been taken away. Knowing she is alone and afraid. Not knowing where she is, or who is caring for her, or even if they are caring for her. Not knowing when or even if I will ever see her again. Knowing or suspecting that she is being abused. Knowing that her very identity is being stripped from her. 

Every one of these children was someone’s child, someone’s grandchild, and some community’s future. Every single one, and thousands more. This is not abstract. This is real, and it is horrendous.

This breaks my heart. From what I have read and heard in the media it breaks everyone’s hearts. 

But that is not enough – not by a long shot. Where is the outrage?

I applaud the Indigenous community’s focus on Truth and Reconciliation, and greatly respect their strength and wisdom in following that path.

But for the rest of us, where is the outrage at the things done in our name? Where is the absolute outrage that our government, the Government of Canada, not just allowed this to happen but actively participated? That the government elected by Canadians, that represents Canadians, was complicit in these atrocities?

We as a species and as a society can and must be better than this!

I would like to end with 3 calls to action:

  1. Listen. Listen mindfully to the stories of survivors, and to the communities. A few minutes of mindful listening can contribute greatly to understanding and healing.
  2. I ask that everyone who reads this take the time today to look at the faces of your children, of your grandchildren, and be damn grateful that you cannot comprehend the pain of these children, these parents, these grandparents, and these communities.
  3. And I ask, how are we and our government(s) going to make this right?

5 Steps to Faster Mobile Web App Development

New Brunswick start-up Agora Mobile has developed a revolutionary platform for the visual development of mobile web applications.

As we move closer to launch, we are beginning a private beta targeting developers (and other forward-thinking sorts). To kick off this beta, we are beginning a series of webinars which introduce the platform and concepts. The first webinar is this Thursday (June 26).

Register for the webinar at http://developers.vizwik.com – and as a bonus you will become part of the private beta!

EARTH University: Learning for a clean future

http://www.dw.de/learning-for-a-clean-future/a-16408200

Interesting and very inspirational article/video about programs at EARTH University in Costa Rica (and in Costa Rica in general), both teaching and implementing environmentally sustainable practices.

Really makes one wonder why countries like Canada cannot do the same – it is almost like our government doesn’t give a shit.

Leap Motion

This is too cool! And at the advertised price point, it would definitely be a game changer in NUI development. I do not agree that it replaces a mouse and keyboard, but I do not think in terms of “replacement”. It provides another mode of interaction, along with mouse, keyboard, touch and voice, all of which can augment one another to provide an optimal user experience.

I want one!

Windows 8 Adoption: My Predictions

With Windows 8 rumoured to go RTM near mid-year, and released before year end, I thought I would hazard a few predictions about its acceptance/adoption:

  1. Apple users will hate it. Why? Because it is not from Apple, and nothing cool can come from anyone but Apple.
  2. Linux users will hate it. Why? Because it is from Microsoft, and Microsoft is the root of all that is evil in the universe. Oh, and it has a GUI.
  3. Android users will hate it. Again, because it comes from Microsoft.
  4. Many Microsoft fans will love it, but will be afraid to admit it in front of their “cool” Apple and Android friends.
  5. Microsoft Marketing will fail. I hope this is not the case, but the last half dozen years or so lead me to believe that Microsoft cannot communicate with consumers (except Xbox consumers, and gamers are a little different anyway).
  6. Other than on a tablet or other touch device, no one will upgrade to Windows 8 until they absolutely have to (unless I am wrong and Microsoft marketing hits it out of the park).

I don’t think these are particularly high risk predictions!

P.S. – I personally really like Windows 8 and the Metro UI (not crazy about the HTML5 + JavaScript development model, though).

Welcome to The Continuum – Part Two

Earlier today, I began to explain The Continuum as an experiment in Social Brainstorming. But that is only half the story (actually, a third, but we will deal with that later).

Beyond this, The Continuum is meant as a demonstration of a Seamless User Experience.

The Continuum grew out of a very simple exercise in which I was brainstorming a new (for me) subject area. While reading about this topic, I was recording (short) thoughts on PostIt notes, and putting them randomly all over the whiteboards in my office. I was doing this in the hope that patterns would eventually emerge – patterns I would not otherwise see.

While I was doing this, someone came into my office, and over the course of our discussions, the question arose as to why I was not using some computer-based tool to do this (I am, after all, a nerd). The reality is, unfortunately, that no tools exist which would allow me to do this without the technology getting in the way. Any computer-based tool tends to make assumptions about how you work, or worse yet force a pattern of work on you. Or you spend more time playing with the tool than you do capturing ideas. This cognitive friction in software means that I tend to lose ideas while trying to capture them, or at least lose the flow of ideas.

It should all be as simple as scribbling on a PostIt note, and slapping it on a whiteboard.

But it isn’t.

We now live in a world dominated by mobile devices. That said, there are still a few (hundred million) PCs in use. Even more, there are now many large format displays offering rich multi-touch experiences, as well as other modes of interaction including gestures and voice recognition.

The question then arises “What constitutes a great user experience in this new world of multi-modal interactions?” This is often described in terms of a Natural User Interface (NUI), which is unfortunately defined somewhat circularly as an interface which feels natural (ok, not quite that obviously, but nearly).

While this is a question I have been pondering for some time, I do not have an answer, or at the very least not the answer (if I did, I would be a lot richer and more famous than I am!).

One aspect of the new user experience that is key to The Continuum experiment is that the user experience should be seamless across all (or at least most) devices. Note that this does not mean that all devices should deliver all of the functionality of the solution. What it does mean is that the solution should exist on all devices, presenting those aspects of the functionality which are appropriate to the device format. Let’s call this Device Appropriateness.

In addition, the user interface should be as transparent as possible. As much as possible, the user should interact directly with content, rather than interacting with content through some artificial UI constructs. Buttons, menus, icons – these are all artificial UI constructs. In a perfect world the UI completely disappears.

Device Appropriateness.

Cognitive Transparency.

This is The Continuum.