Get More Listeners with Audio Processing for Smart Speakers

Smart Speaker Adoption

The Smart Audio Report from NPR and Edison Research counts 43 million Americans who own at least one smart speaker. Early adopters of smart speakers say these devices are now their primary way of listening to audio. And while smart speakers offer myriad audio services, the devices themselves are becoming a significant replacement for AM/FM radios in the kitchen or bedroom. Recent research also shows that radio's most loyal listeners, the so-called “P1s,” are using smart speakers in significant numbers and for significant periods of time. There's no doubt that radio broadcasters need to be available on these devices, and they need to sound good, perhaps better than competing services.

Great Audio Attracts Listeners

Thanks to thoughtfully applied audio processing, radio has traditionally sounded “better” than the original recordings of the popular songs that are a mainstay of broadcast programming. Yes, there are many exceptions, and over-processing or a poor audio path can drive listeners away. Now, though, we have a better chance than ever to get it right by presenting consistently great-quality audio streams to smart speaker listeners.

Audio Levels with Smart Speakers

When considering audio processing for streaming to smart speakers, the top-level goals are the same as for any other processing application:

Process—or “condition”—the audio to best suit the transmission medium.
Process the audio for the delivery speaker(s) and the predominant listening environment.

To reach smart speakers, our transmission medium is bitrate-reduced audio streamed over the public Internet. Typical streaming bitrates most likely fall between 32 kbps and 256 kbps. The most popular audio encoding algorithms are MP3 and various members of the AAC family, such as AAC-LC, HE-AAC, and HE-AAC v2. Decoders for these algorithms are now included in virtually all web browsers and device operating systems, so there's almost never anything extra to download to receive these coded streams.

MP3, AAC, and some other coding algorithms share a common trait: they use one or more psychoacoustic models to decide how best to reduce the bitrate required to deliver audio from end to end. It's this psychoacoustic aspect of modern audio encoders that demands we pay attention to the audio we ask them to encode at a lower bitrate.

Solid Audio Advice

Much has already been written on the topic of encoding for streaming. “Streaming University,” a complete video course on audio streaming technologies, also includes a chapter titled “Audio Processing for Encoding.”

Cornelius Gould, Chief Algorithm Developer at Omnia Audio, advises using audio processing that is specifically designed for subsequent audio encoding. “For streaming, we're looking for a flexible and smart AGC section, followed by a well-behaved multi-band compression stage. The final stage of audio processing for streaming should be an intelligent look-ahead limiter. Don't ever use an audio clipper when feeding to a streaming encoder!”

Gould continues, “When listening for good audio processing, if you get tired of listening, then you're overdoing it. As broadcasters, we're accustomed to a maxed-out sound. Streaming audio doesn't have to be that way. Give the audio ‘lift’ and low-level support, plus some multi-band energy, but don't lay into the look-ahead limiters. Just let them hit the loudest peaks. If you're into the limiters all the time, you'll produce a fatiguing long-term sound. We'd like for people to listen all day!”
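
A short Python sketch can illustrate the difference Gould draws between a look-ahead limiter and a clipper. A hard clipper flattens every sample above the ceiling, which distorts the waveform. A look-ahead limiter instead sees each peak coming, lowers the gain just enough and just in time, and then releases slowly. The function names (`lookahead_limit`, `hard_clip`, `limiter_activity`), the ceiling, the look-ahead length, and the release rate are hypothetical illustration values. They are not Omnia settings or Omnia algorithms.

```python
def lookahead_limit(samples, ceiling=0.5, lookahead=5, release=0.01):
    """Scale samples so no output peak exceeds `ceiling`.

    Simplified look-ahead limiter: for every peak within `lookahead`
    samples, the gain ramps down linearly so it reaches the required value
    exactly when that peak arrives. Afterward the gain recovers toward
    unity at a rate set by `release` (fraction per sample).
    Returns (output_samples, per_sample_gains).
    """
    n = len(samples)
    # Gain each sample would need on its own to stay at or below the ceiling.
    required = [min(1.0, ceiling / abs(s)) if s else 1.0 for s in samples]
    out, gains = [], []
    gain = 1.0
    for i in range(n):
        # Look ahead: lowest ramped gain demanded by any peak in the window.
        target = 1.0
        for k in range(min(lookahead + 1, n - i)):
            r = required[i + k]
            target = min(target, r + (1.0 - r) * k / (lookahead + 1))
        # Release: drift back toward 1.0, never above the look-ahead target.
        gain = min(target, gain + release * (1.0 - gain))
        gains.append(gain)
        out.append(samples[i] * gain)
    return out, gains


def hard_clip(samples, ceiling=0.5):
    """Flatten anything beyond the ceiling (the approach to avoid for streams)."""
    return [max(-ceiling, min(ceiling, s)) for s in samples]


def limiter_activity(gains, threshold_db=0.1):
    """Fraction of samples with more than `threshold_db` of gain reduction."""
    limit = 10 ** (-threshold_db / 20)
    return sum(g < limit for g in gains) / len(gains)


# A quiet signal with one loud transient.
signal = [0.1] * 20 + [1.0] + [0.1] * 20
limited, gains = lookahead_limit(signal)
clipped = hard_clip(signal)
print(max(abs(s) for s in limited), max(abs(s) for s in clipped))
print(f"limiter active on {limiter_activity(gains):.0%} of samples")
```

Both outputs stay under the ceiling. The limiter scales the whole waveform around the peak, while the clipper squares off its top. `limiter_activity` gives a rough number for how often the limiter is working. On real program material, a value near 1.0 would correspond to being “into the limiters all the time.”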

Leif Claesson, designer of the Omnia.7 and Omnia.9 audio processors, agrees and adds, “There’s a misconception that we have to compress more when the listener is hearing a small speaker. What we do want is a consistent volume in the long-term, but we really want to let short-term dynamics get through. This is different from AM or FM broadcast where we have to use clipping to be competitive. With streaming we can use the early stages of our processors to lift low-level audio, but don’t make constant use of the look-ahead limiter. Just touch it on higher audio peaks.”
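
Claesson's distinction between long-term consistency and short-term dynamics comes down largely to time constants. The sketch below is a deliberately simplified AGC. It assumes the audio has already been measured as one level (in dBFS) per block. A heavily smoothed level estimate drives the gain, so a quiet passage is lifted toward the target while a single loud block passes through with nearly its original contrast. The function name, target level, smoothing coefficient, and gain limit are hypothetical.

```python
def slow_agc_gains(block_levels_db, target_db=-18.0, smoothing=0.05, max_gain_db=12.0):
    """Per-block AGC gain in dB, driven by a one-pole average of block levels.

    A small `smoothing` coefficient makes the AGC follow the long-term level
    while largely ignoring brief loud or soft moments, so short-term
    dynamics survive. Gain is clamped to +/- max_gain_db.
    """
    smoothed = block_levels_db[0]
    gains = []
    for level in block_levels_db:
        smoothed += smoothing * (level - smoothed)  # long-term level estimate
        gains.append(max(-max_gain_db, min(max_gain_db, target_db - smoothed)))
    return gains


# A quiet passage with one brief loud hit, levels in dBFS per block.
levels = [-28.0] * 100 + [-10.0] + [-28.0] * 10
gains = slow_agc_gains(levels)
out = [lvl + g for lvl, g in zip(levels, gains)]
print(f"quiet passage lifted by {gains[99]:.1f} dB")
print(f"hit stands {out[100] - out[99]:.1f} dB above its neighbors")
```

With a smoothing coefficient of 0.05, the single loud block barely moves the gain, so it still stands roughly 17 dB above the surrounding quiet blocks. A much faster AGC (a coefficient near 1.0) would flatten that contrast almost entirely.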

“So much music is mastered inconsistently, and most modern tracks are way over-processed,” Claesson continues. “That’s why our Undo and Perfect Declipper are so popular among engineers. The Undo processor returns some pleasant dynamics to over-processed audio, resulting in less listener fatigue. Perfect Declipper restores clipped audio peaks, including the original audio harmonics that were also clipped in the track’s mastering process. When we get the audio clean again, prior to our own, tailored audio processing, the result on audio streams is dramatically better.”
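
Restoring clipped peaks starts with finding them. The Python sketch below only detects flat-topped runs of samples pinned at full scale. It reconstructs nothing, and it is not how Omnia's Perfect Declipper works. It assumes float samples normalized to ±1.0, and the threshold and minimum run length are hypothetical. A track that was clipped and then lowered in level will have flat tops below full scale, so the threshold may need adjusting for such material.

```python
def find_clipped_runs(samples, threshold=0.999, min_run=3):
    """Return (start, length) for each run of >= min_run consecutive samples
    at or beyond +/- threshold with the same sign (flat-topped peaks)."""
    runs = []
    run_sign, start = 0, 0
    # A trailing 0.0 sentinel closes any run that reaches the end of the data.
    for i, s in enumerate(list(samples) + [0.0]):
        sign = 1 if s >= threshold else -1 if s <= -threshold else 0
        if sign != 0 and sign == run_sign:
            continue  # still inside the same flat top
        if run_sign != 0 and i - start >= min_run:
            runs.append((start, i - start))
        run_sign, start = sign, i
    return runs


audio = [0.0, 0.5, 1.0, 1.0, 1.0, 0.5, -1.0, -1.0, -1.0, -1.0, 0.0]
print(find_clipped_runs(audio))
```

Many such runs in a track suggest it was clipped in mastering and is a candidate for declipping or for replacement with a cleaner copy.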

“One important scenario is listening at a background audio level. Any pumping or multi-band inconsistency is especially annoying to hear at low listening levels. This is another reason why Undo and Declipping not only clean up over-processed source audio but also let our processor provide a pleasing, long-term listening experience,” Claesson adds.

Using an Omnia.7 or Omnia.9 audio processor, or the Omnia.9 processing available in the Z/IPStream 9X/2 or the Z/IPStream R/2, we can see history graphs of loudness at the processor input and output. Below, the unprocessed audio over a 15-minute period is on the left, and the same period's processed output is on the right. The soft ballad at the left of each graph was processed to a pleasing overall loudness, but the brief excursions between the soft and loud parts of this dramatic song are still visible. The two songs that follow were rhythmic. They were brought up consistently to a good level while the vocals and rhythm kept their dynamic quality.

[Figure: Audio monitor loudness history graphs, processor input (left) and output (right)]
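
The loudness history graphs plot level over time at the processor input and output. One rough way to get comparable numbers from raw samples is windowed RMS level in dBFS, sketched below in Python. This is a simplified stand-in and not the measurement Omnia processors display. A standards-based loudness meter such as ITU-R BS.1770 adds K-weighting and gating, which this sketch omits. The function name and window length are hypothetical.

```python
import math


def level_history_dbfs(samples, window, floor_db=-120.0):
    """RMS level in dBFS for each complete, non-overlapping window.

    No frequency weighting or gating: just the mean-square level of
    samples normalized to full scale (+/- 1.0).
    """
    history = []
    for start in range(0, len(samples) - window + 1, window):
        block = samples[start:start + window]
        mean_square = sum(s * s for s in block) / window
        history.append(10 * math.log10(mean_square) if mean_square > 0 else floor_db)
    return history


# One window of a full-scale sine (10 whole cycles), then one of silence.
sine = [math.sin(2 * math.pi * n / 100) for n in range(1000)]
history = level_history_dbfs(sine + [0.0] * 1000, window=1000)
print([round(h, 2) for h in history])
```

You can compute the history for the input and output captures of the same period and compare how widely each one swings. That comparison gives a rough sense of how much the processing evened out the long-term level.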

One of the benefits of adjusting audio processing for smart speakers is the relatively small number of smart speaker models available at this time. Amazon offers about five different models, including the Echo Dot, Echo 2, and Echo Show. Google offers three basic models: the Google Home, Google Home Mini, and the new Google Home Max. Apple offers only the HomePod. Amazon and Google together claim the vast majority of the smart speaker market, about 89 percent worldwide. This relatively small number of models in common use suggests that most listeners will experience similar results from a given audio processing setup. In other words, some consistency in user experience is likely. At the very least, it is more likely than on consumer PC speakers, whose quality ranges from barely acceptable to downright awful.

Gould advises listening to your audio processing on the most typical smart speaker models. These include both the original Amazon Echo (1st generation) and its successor, the Amazon Echo (2nd generation), as well as the Echo Dot. In Google's stable, the Google Home Mini is the most widely deployed. The biggest difference with the smaller models (Dot and Mini) is their lack of bass response. Their larger stablemates, the Amazon Echo and Google Home, have larger speakers and bass ports.

Mark Manolio, a senior member of the Telos Alliance Support Team, frequently helps broadcasters get their desired sound from their Omnia audio processors. His advice is similar to that of Cornelius Gould and Leif Claesson. “Begin with clean audio from your automation playout system; this is where great sound starts,” Manolio advises. “Check your audio library to make sure all your songs are recorded at similar levels. Many automation systems have built-in normalization utilities. Make sure none are distorted or otherwise compromised.”
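
Once each track's RMS and peak levels are known, Manolio's level check reduces to simple arithmetic. The Python sketch below suggests a gain change for each track that sits outside a tolerance around a house target. It never lets that gain push the track's peak past a ceiling, in keeping with the no-clipping rule. The -16 dBFS target, 1 dB tolerance, -1 dBFS peak ceiling, function name, and track names are hypothetical illustration values. They are not Telos recommendations or the behavior of any automation system's normalization utility.

```python
def plan_normalization(tracks, target_db=-16.0, tolerance_db=1.0, peak_ceiling_db=-1.0):
    """Suggest a gain change (dB) for tracks whose level is off target.

    `tracks` maps a track name to (rms_db, peak_db), both in dBFS.
    Returns {name: (gain_db, reaches_target)} for tracks outside the
    tolerance. The gain is capped so the peak stays at or below
    `peak_ceiling_db`; reaches_target is False when that cap kicks in.
    """
    plan = {}
    for name, (rms_db, peak_db) in tracks.items():
        wanted = target_db - rms_db           # gain that would hit the target
        if abs(wanted) <= tolerance_db:
            continue                          # already close enough
        headroom = peak_ceiling_db - peak_db  # most gain the peaks allow
        gain = min(wanted, headroom)
        plan[name] = (round(gain, 2), wanted <= headroom)
    return plan


library = {
    "ballad": (-16.5, -2.0),
    "dance_mix": (-20.0, -6.0),
    "old_rip": (-22.0, -4.0),
    "hot_master": (-12.0, -0.1),
}
plan = plan_normalization(library)
print(plan)
```

A track that cannot reach the target without exceeding the ceiling (`reaches_target` is False) is a candidate for a cleaner copy rather than more gain.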

Indeed, some files in station audio libraries have been traced to Napster or FileDonkey downloads from a dozen years ago. Manolio says, “Get clean copies of your songs. This day and age, all song files should be in a linear format and properly normalized. This will positively impact both your on-air sound and your streaming.”
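
One quick check for the “linear format” advice is whether a file is uncompressed PCM WAV. The Python sketch below uses the standard-library `wave` module, which reads only integer PCM WAV files. Two caveats apply. First, a WAV file can still have been decoded from a lossy download, so passing this check does not prove a clean source. Second, 32-bit float WAV files are also linear, but `wave` rejects them, so the check is stricter than necessary. The function name is hypothetical.

```python
import os
import tempfile
import wave


def linear_pcm_info(path):
    """Return (channels, sample_rate_hz, bits_per_sample) if `path` is an
    integer PCM WAV file that Python's wave module can read, else None."""
    try:
        with wave.open(path, "rb") as w:
            return w.getnchannels(), w.getframerate(), 8 * w.getsampwidth()
    except (wave.Error, EOFError):
        return None  # not RIFF/WAVE, not integer PCM, or truncated


with tempfile.TemporaryDirectory() as tmp:
    good_path = os.path.join(tmp, "song.wav")
    with wave.open(good_path, "wb") as w:
        w.setnchannels(2)
        w.setsampwidth(2)  # 16-bit samples
        w.setframerate(44100)
        w.writeframes(b"\x00\x00" * 2 * 100)  # 100 silent stereo frames
    bad_path = os.path.join(tmp, "song.mp3")
    with open(bad_path, "wb") as f:
        f.write(b"ID3" + bytes(32))  # starts like an MP3 with an ID3 tag
    good_info = linear_pcm_info(good_path)
    bad_info = linear_pcm_info(bad_path)

print(good_info, bad_info)
```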

For the streaming audio processing, Manolio suggests, “Even though smart speakers are the hot item right now, and probably will be for a long time, we want our streams to sound good on PCs and smartphones, too. The same audio processing hygiene applies for all devices. Start clean, stay clean, process for long-term consistency, and let those peaks come through on your streaming audio. No crazy bass, and don't crank the highs too much. Just a very consistent sound is our goal here. Of course, use the amazing features built into Omnia streaming processors: input conditioning and Undo on the Omnia.7 and Omnia.9, and the Perfect Declipper available in the Omnia.11 processor. All the Omnia processors use intelligent look-ahead limiting for HD Radio and stream processing, for really nice loudness without any audio clipping. The new Omnia VOLT with HD Studio Pro software is another excellent choice for a dedicated streaming audio processor. Even our basic 3-band Omnia A/X-style processing in the Z/IPStream encoders is just terrific for both music and voice on your streams.”

Leif Claesson notes that processing for streaming gives us more freedom than over-the-air. “With streaming we don’t have to process for ultimate loudness, and we don’t have to deal with FM pre-emphasis. Both of these are good for the final audio we hear on our smart speakers.”
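
Claesson's point about FM pre-emphasis is easy to quantify. FM transmission boosts treble with a first-order pre-emphasis curve, so an FM processor must hold back high-frequency energy to avoid over-modulation. The time constant is 75 µs in the United States and 50 µs in Europe and many other regions. A stream encoder applies no such curve. The Python sketch below computes the boost. The function name is hypothetical.

```python
import math


def preemphasis_boost_db(freq_hz, tau_s=75e-6):
    """High-frequency boost of a first-order FM pre-emphasis network, in dB.

    |H(f)| = sqrt(1 + (2*pi*f*tau)^2), so the boost in dB is
    10*log10(1 + (2*pi*f*tau)^2).
    """
    x = 2 * math.pi * freq_hz * tau_s
    return 10 * math.log10(1 + x * x)


for f in (1_000, 5_000, 10_000, 15_000):
    print(f"{f:>6} Hz: +{preemphasis_boost_db(f):.1f} dB (75 us)")
```

At 15 kHz the 75 µs curve adds about 17 dB of boost. That is why FM processing has to control high frequencies so tightly, and why streaming processing is free of that constraint.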

Claesson sums up the goals for your streaming audio sound: “If your stream isn't sounding better than your over-the-air transmission, you're doing something wrong. Try processing a bit less, but use the great tools that Omnia processors provide. Condition and clean up the audio, providing long-term consistency and loudness while giving the audio a very dynamic feel moment by moment.”

Takeaway Points

Process—or “condition”—the audio to best suit the transmission medium.

Use Undo and/or Declipping functions, if available, to restore dynamics to over-processed source material.
Process your audio for "lift," long-term loudness, and spectral consistency.
Do not use any audio clipping.
Use the final look-ahead limiting, but don’t “lean into it.”
For lower bitrate encoding—under 64 kbps for MP3 and less than 48 kbps for AAC family algorithms—reduce the high frequencies, or at least reduce their density. Certainly don’t boost or over-process high frequencies.
Keep the long-term AGC level up, but allow the audio to “bounce” in the multi-band processors.
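
The bitrate guidance in the list above can be captured as a simple rule, paired here with one gentle way to reduce high frequencies. The codec name strings, the function names, and the choice of a one-pole low-pass filter are assumptions for illustration. The guidance does not specify a filter type or cutoff, and a processor's own high-frequency controls may do this job better. A low-pass filter only lowers high-frequency level; reducing high-frequency density is a separate processing adjustment.

```python
import math

MP3 = "mp3"
AAC_FAMILY = {"aac-lc", "he-aac", "he-aac v2"}


def should_reduce_highs(codec, kbps):
    """Takeaway rule: reduce highs for MP3 under 64 kbps or AAC-family under 48 kbps."""
    codec = codec.lower()
    if codec == MP3:
        return kbps < 64
    if codec in AAC_FAMILY:
        return kbps < 48
    raise ValueError(f"no rule for codec {codec!r}")


def one_pole_lowpass(samples, cutoff_hz, sample_rate_hz):
    """Gentle low-pass (roughly 6 dB per octave above cutoff):
    y[n] = (1 - a) * x[n] + a * y[n-1], with a = exp(-2*pi*fc/fs)."""
    a = math.exp(-2 * math.pi * cutoff_hz / sample_rate_hz)
    out, y = [], 0.0
    for x in samples:
        y = (1 - a) * x + a * y
        out.append(y)
    return out


print(should_reduce_highs("HE-AAC", 32), should_reduce_highs("MP3", 128))
```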

Process the audio for the delivery speaker(s) and the predominant listening environment.

Keep long-term loudness up using a smart AGC for a consistent volume in typical listening environments.
In the short term, allow audio peaks to punch through for a sense of dynamic range and liveliness, while staying clear of any fatiguing character.
Don’t over-boost the lows or the highs. Smart speaker designers have already contoured the frequency response to be as pleasing and effective as possible within their devices.
Listen to your stream(s) on several popular smart speakers to make sure you're not over-processing the instantaneous dynamics and not over-boosting bass or treble, which can cause the audio processing in some smart speakers to reduce your stream's volume.

Kirk A. Harnack is a 40-year veteran of radio and audio engineering practice. His contract engineering business and his part-ownership of more than 20 radio stations, beginning in 1993, have led and informed his journey toward his current position at the Telos Alliance. Kirk regularly facilitates large broadcast projects by connecting equipment capabilities, innovative implementations, and broadcasters' operational requirements.

Telos Alliance has led the audio industry's innovation in Broadcast Audio, Digital Mixing & Mastering, Audio Processors & Compression, Broadcast Mixing Consoles, Audio Interfaces, AoIP & VoIP for over three decades. The Telos Alliance family of products includes Telos® Systems, Omnia® Audio, Axia® Audio, Linear Acoustic®, 25-Seven® Systems, Minnetonka™ Audio, and Jünger Audio. These products cover all ranges of audio applications for radio and television, from Telos Infinity IP Intercom Systems, the Jünger Audio AIXpressor Audio Processor, Omnia 11 Radio Processors, and Axia Networked Quasar Broadcast Mixing Consoles to Linear Acoustic AMS Audio Quality Loudness Monitoring and the 25-Seven TVC-15 Watermark Analyzer & Monitor. Telos Alliance offers audio solutions for any and every radio, television, live events, podcast, and live streaming studio. With Telos Alliance, “Broadcast Without Limits.”