Preston So is a product leader at Oracle with a background steeped in web and print design, content strategy and creation, and development. He’s an editor at A List Apart, a columnist for CMSWire, and a contributor to Smashing Magazine.
He recently released his third book, Voice Content and Usability, in which he marries his linguistics degree with his multidisciplinary career arc to shine a long-overdue light on the intricacies of creating and delivering spoken, nonvisual content.
“Linguistics has always been one of those fields that occupies a very strange area in the consciousness of technology,” So said. “Today, though, I feel like linguistics is finally having a bit of a heyday when it comes to voice technology and conversational technology.”
We spoke about his latest book, including what inspired it, how he sees voice content impacting the future of accessibility, and his advice for designing content in a socially conscious way.
How do you define voice content in your book?
Voice content, simply put, is content delivered through the medium of voice, but I think a lot of nuances in that definition bear digging into.
One of those is the fact that not all voice interfaces are necessarily alike when it comes to how they deliver content. Some voice interfaces have a visual equivalent, like the Amazon Echo Show, which has a screen and the potential for captioning.
My book focuses on the subset of voice interfaces that we call pure voice interfaces, which have no visual component whatsoever. Every delivery of this content is mediated entirely through voice—through oral and verbal means.
What got you into writing a book on this topic?
This book was a really great example of being able to marry my strong interest in the dynamics of language and the way that language operates—especially when it comes to figuring out ways to drive algorithms and natural language processing—with my interest in how humans actually speak and use language.
I always like to quote [content strategy expert] Erika Hall in her book, Conversational Design. She wrote, “Conversation is not a new interface. It’s the oldest interface.” An interesting aspect of not only voice content but also interacting with interfaces through speech is that for all intents and purposes, a lot of the interfaces that we have today—like computer keyboards, mice, and video game controllers—are artificially learned behaviors. We don’t grow up having this innate sense of how to use a computer mouse or how to scroll on a screen.
I had the opportunity to work on AskGeorgiaGov, which was the first-ever voice interface for residents of the state of Georgia. That project was both a source and an outlet for me to work on my interest in voice and language. The main motivation for this book is to share what we learned over the course of the project and the development of these new fields of voice content strategy and voice content design in ways that are in sync with (but also very much in opposition to) the ways that content design and content strategy work today.
Your book mentions that one of the key differences between voice content and traditional written content is the importance of voice content being contextless. Can you expand on that?
I started out as a web and print designer, and one of the things I find interesting about the transition from print media to the web—the transition from broadsheets to browsers—is this interesting innovation that happened with the web. It’s not so often now that you have to flip through pages and look at content in a flat, static way. Web content today is rooted in the motifs, tropes, and fixtures of the web. Links, nav bars, calls to action, breadcrumbs, site maps—all of these things that those of us who work with content always have to be aware of—are unavailable in voice. You don’t have a visual context in voice or the ability to hit “back.” You don’t have the ability to look at a site map and situate where you are in relation to particular content when you’re in a pure-voice setting.
What we have to do nowadays is not so much sift through our content using the typical visual trappings of the web, but rather move toward a way of interacting with content where navigation becomes more of an active negotiation. Because the web has so overwhelmingly been the way that we interact with content these days, especially when it comes to our roles in the technology industry, I think we’ve given less attention to some of the things that are coming over the horizon—these voice and immersive experiences.
I recently wrote an article for A List Apart called “Immersive Content Strategy” about augmented and virtual reality content and how that completely reinvents these ideas of legibility and discoverability.
Was the importance of creating contextless content a daunting realization as you were working on the AskGeorgiaGov project?
For AskGeorgiaGov, we learned quickly that even though the Georgia.gov website is so well semantically defined and structured, a lot of our content destined for voice interfaces had to move away from this notion of what I call macro-content and long-form content into what [technology writer] Anil Dash calls microcontent—these contextless blobs or chunks of content that don’t really have much of a relation to each other.
Everyone wants to talk about omnichannel content today and how important it is to have a single source of truth for content that you can display anywhere, but what do you do about that nuance between spoken content and written content, and macro-content and microcontent? It’s an interesting set of debates that I’m excited to have with a lot more folks as this book comes out.
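As a rough illustration of that split between renditions—a hypothetical sketch, not drawn from the book or from AskGeorgiaGov, with every name in it invented—a single-source-of-truth content chunk might carry both a written rendition and a contextless spoken one, with each channel picking the rendition it can actually deliver:

```typescript
// Hypothetical sketch of a single-source-of-truth content chunk
// with channel-specific renditions. All names are illustrative.
interface ContentChunk {
  id: string;
  topic: string;   // e.g., "renewing a driver's license"
  written: string; // long-form copy; may lean on links and navigation
  spoken: string;  // contextless microcontent that stands alone when read aloud
}

// Each delivery channel picks the rendition it can actually use.
function render(chunk: ContentChunk, channel: "web" | "voice"): string {
  // Pure voice interfaces have no visual context to fall back on,
  // so the spoken rendition must make sense entirely on its own.
  return channel === "voice" ? chunk.spoken : chunk.written;
}

const licenseRenewal: ContentChunk = {
  id: "dmv-license-renewal",
  topic: "renewing a driver's license",
  written:
    "To renew your driver's license, follow the steps below and use the linked renewal form.",
  spoken:
    "You can renew your driver's license online, by mail, or in person at a customer service center.",
};

console.log(render(licenseRenewal, "voice"));
```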
How is collaboration on voice content different from what you’ve experienced with other more traditional content projects?
Because these voice interfaces are so newfangled, a lot of us don’t have experience with them as designers or developers or even product managers. It’s really difficult to do anything that’s content-first in voice content without having an interdisciplinary focus across engineering functions, content functions, and design functions because a lot of these decisions are so interdependent and interrelated.
The other side of the coin is that web content is fundamentally a written realm. We have to take this amazing corpus of written content we’ve amassed over the years and now somehow let it speak for itself in a really elegant way. We have to reorient our minds to think about how content sounds and not how content reads.
How do you see voice content impacting accessibility beyond current screen readers?
As a linguist and somebody who looks at the ways language influences and impacts our worldviews, one of the things I think is really exciting about voice interfaces is the accessibility potential and the assistive potential they have.
In the book, I quote Chris Maury, who’s a voice interface designer and an accessibility advocate. Drawing on his experience as a person who is blind, he describes what screen readers do that’s so painful for people who can’t navigate a lot of these structures of the web in the ways we expect them to. He talks about how it doesn’t make any sense for websites to be translated directly by screen readers in this really obtuse way that forces people to listen through every link and every image. Plus you have to wait for that “skip to main content” utterance. All of these things are quite annoying for those who navigate websites using screen readers.
A lot of people are seeing the potential of voice interfaces that aren’t tied to any visual component or any website, and AskGeorgiaGov is one of those examples. That’s one of the reasons why the State of Georgia came to us and said, “We really need to make some kind of alternative to the website. It delivers content in a nice way, but we really want to serve some of the residents in our state in a different capacity.” Georgia has always been at the forefront of web accessibility, and they’re also very much at the forefront of accessibility with these alternative conduits for content.
When it comes to voice content, I think the biggest promise it holds is for people to be able to use voice assistants in an experience that’s much more in tune with how we really want to efficiently access the content we need, as opposed to navigating a labyrinth of commands, utterances, and announcements in our website screen readers, which can be really challenging and inefficient for people to deal with.
But voice interfaces aren’t a be-all, end-all solution for accessibility. I think one of the important things to note about pure voice interfaces with no visual equivalent is that deaf people, for example, can’t use them. And that’s one of the things that I think becomes a really interesting challenge.
Just like web content, voice content can lock out a certain demographic of folks who really need to have access to the content. I’m interested to hear more about some of the emerging ideas around multimodal accessibility, where you have these tightly curated approaches for people who have very distinct needs. Because, as we know, no two people’s requirements for content are the same.
What opportunities do you see for content designers and strategists to make a positive impact in voice content, particularly when it comes to representation or inclusion?
I think one of the biggest challenges we face in society in general is this concentration of privilege within a lot of the corporations and demographics that have already been well served by technology since the very beginning, and not so much the underserved communities and folks who need to have the same access to content.
[Content designers] can certainly work harder to advocate for and fight for [others] through the work that we do, and I think a lot of that has to do with encoding some of those [advocacy principles] into the content we put together and deliver.
One thing I advise for people who are just getting into voice content strategy and voice content design is to think about how you can encode [those principles] and advocate in the ways that are within your control, because a lot of the things we do with voice aren’t necessarily within our control. Are there ways [you can advocate] through the content, the user experience, and the interface text—the ways in which you direct people to certain types of content, or help people find content that caters directly to those specific people’s needs and, more importantly, the way they hear themselves speak?
I think one of the biggest challenges we face is reaching this notion, which I discuss in Chapter 6, of what [design innovator and Fjord cofounder] Mark Curtis calls the conversational singularity: the point where people will be able to have a conversation with a voice interface that’s indistinguishable from having a conversation with a fellow human being. My question in response to that is, “For whom?” For whom is that an indistinguishable conversation?
With existing voice interfaces, why are we biasing ourselves toward a sterile, clinical approach where we only hear General American English spoken by what is ostensibly a cisgender white woman, as opposed to somebody who might be code-switching between African-American Vernacular English and other modes of speech—queer and straight-passing modes of speech, English, and Spanish?
Are there ways to introduce and interpolate some of these usages and ways of speaking that reflect the richness of the demographic and the population you’re trying to serve that you can really represent within the voice interface itself? That can be in the dialogues or in some of the ways in which you kick off the conversation to establish that initial rapport.
I would love to hear, for example, a chatbot or a voice interface one day that’s able to speak to bilingual people in New York City who speak both English and Spanish or English and Bengali and do those same toggles that you hear at home and on the streets. That’s when we’ll actually reach that conversational singularity and a really nice outcome for everybody in the picture.
The idea of being able to have a conversation with a machine that’s as fluid as a natural conversation feels unattainable.
It’s a bit reductive, too. The way you and I have conversations with our friends is very different from the ways in which I think a lot of people who are excited about the conversational singularity might have conversations.
It’s really important we don’t lose sight of that, because I think a lot of the mistakes we see around the technology industry and disinformation—for example, with some of these social media experiments—really reflect the danger of a monoculture in how we deliver and represent content.
The more we can reflect and mirror the richness of our world in these voice interfaces, the better off we all are because it establishes trust. It helps people understand that they’re speaking to someone who actually sounds a bit like them. We want to design things that help people see themselves in those interfaces too. It really is one of those things that actually, in my opinion, gives people a voice of their own, as opposed to relying on somebody else’s voice.