Let’s study the behavior of Alexa. Imagine the user says “Alexa, what is the weather?” Which components of Alexa’s technical architecture are involved, and how do they interact? The following diagram gives an overview of the components involved, the processing steps, and the transformations. Let’s follow the request step by step!

Figure: Alexa Processing Pipeline

Step by Step through the Alexa Processing Pipeline. Compare the numbered steps in the picture:

(1) Audio is transformed into a digital audio stream: The Echo device records audio, digitizes it, encodes it, and sends it as a digital audio stream to the Alexa processing infrastructure in the cloud.

(2) The digital audio stream is transformed into text: Alexa speech recognition processes the digital audio stream, recognizes the speech, and turns it into (annotated) text.

(3) The recognized text is used to detect the Skill: Alexa detects the relevant Skill and activates it.

(4) The recognized input text and detected Skill are used to detect the intent: Alexa uses the intent recognition model of the active Skill to extract the intent and slots (parameters) from the text. The intent recognition model is built from the Skill Interface definition.

(5) Intent and slots are transformed into a response text: Alexa calls the Skill Service associated with the Skill and passes the name of the recognized intent, optional slots, and context information (such as the logged-in user). The Skill Service calculates a response to the user’s intent, given the context and the input parameters. The Skill Service may also cause side effects and trigger actions, such as switching on a home automation device. To answer the user, the Skill Service generates a textual response and returns it as an (SSML-annotated) string.

(6) The response text is transformed into a digital audio stream: The Alexa text-to-speech engine renders an audio stream based on the (SSML-annotated) response text. In effect, Alexa reads the text aloud. Alexa sends the audio stream to the user’s Echo device.

(7) The digital audio stream is transformed into audio: The Echo device plays back the digital audio stream to the user.
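To make steps (4) and (5) more concrete, here is a minimal Python sketch. It is not how Alexa works internally: the real intent recognition is a trained model, whereas `recognize_intent` below is a toy pattern matcher. The names `INTERACTION_MODEL`, `GetWeatherIntent`, `recognize_intent`, `build_request`, and `handle_request`, the `city` slot, and the hard-coded forecast are all assumptions made up for illustration. The field names in the request and response dictionaries are modeled on the JSON that Alexa exchanges with a Skill Service, but treat them as an approximation and check the official reference before relying on them.

```python
import re

# Hypothetical, heavily simplified stand-in for a Skill Interface definition:
# each intent lists sample utterances, with slots written as {slot_name}.
INTERACTION_MODEL = {
    "GetWeatherIntent": [
        "what is the weather",
        "what is the weather in {city}",
    ],
}


def recognize_intent(text):
    """Step 4 (toy version): map recognized text to an intent name and slots."""
    normalized = re.sub(r"[^a-z0-9 ]", "", text.lower()).strip()
    # Assumption: the wake word is dropped before intent matching.
    normalized = re.sub(r"^alexa\s+", "", normalized)
    for intent_name, samples in INTERACTION_MODEL.items():
        for sample in samples:
            # Turn "weather in {city}" into a regex with a named group.
            pattern = re.sub(r"\{(\w+)\}", r"(?P<\1>.+)", sample)
            match = re.fullmatch(pattern, normalized)
            if match:
                return intent_name, match.groupdict()
    return None, {}


def build_request(intent_name, slots, user_id):
    """Package intent, slots, and context roughly like an Alexa IntentRequest."""
    return {
        "version": "1.0",
        "session": {"user": {"userId": user_id}},
        "request": {
            "type": "IntentRequest",
            "intent": {
                "name": intent_name,
                "slots": {k: {"name": k, "value": v} for k, v in slots.items()},
            },
        },
    }


def handle_request(event):
    """Step 5: the Skill Service computes an SSML-annotated response text."""
    intent = event["request"]["intent"]
    if intent["name"] == "GetWeatherIntent":
        city = intent["slots"].get("city", {}).get("value", "your area")
        # Placeholder forecast; a real service would query a weather API here.
        ssml = f"<speak>The weather in {city} is sunny.</speak>"
    else:
        ssml = "<speak>Sorry, I did not understand that.</speak>"
    return {
        "version": "1.0",
        "response": {
            "outputSpeech": {"type": "SSML", "ssml": ssml},
            "shouldEndSession": True,
        },
    }


# Usage: run the recognized text through steps 4 and 5.
intent_name, slots = recognize_intent("Alexa, what is the weather?")
response = handle_request(build_request(intent_name, slots, "hypothetical-user-id"))
print(response["response"]["outputSpeech"]["ssml"])
```

The SSML string in the returned dictionary is what the text-to-speech engine would render in step (6). Keeping `handle_request` as a plain function over dictionaries is a deliberate simplification; in practice you would more likely build on the Alexa Skills Kit SDK.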

This is an excerpt from my new book “Making Money with Alexa Skills – A Developer’s Guide”. In this book, I describe not only how to develop, but also how to monetize Alexa Skills. Account linking is one of the possibilities for personalizing a Skill and making it unique; more practical approaches for personalizing Skills are described in the book.


Also published on Medium.


Matthias Biehl

As API strategist, Matthias helps clients discover their opportunities for innovation with APIs & ecosystems and turn them into actionable digital strategies. Based on his experience in leading large-scale API initiatives in both business and technology roles, he shares best practices and provides both strategic and practical guidance. He has stayed a techie at heart and at some point, got a Ph.D. Matthias publishes a blog at api-university.com, is the author of several books on APIs, and regularly speaks at technology conferences.