The rise of natural language AI has inspired bold predictions about the end of traditional user interfaces (UIs). Menus, buttons, and screens, some argue, will vanish—replaced by conversation. On the surface, it sounds ideal: no more learning custom layouts, just seamless dialogue.

But this vision ignores how people actually think, explore, and communicate.

Assumption 1: Conversational Interaction Is Always Practical

The first flawed assumption is that people can always speak openly with AI. In reality, context and privacy matter. Imagine looking for a date on Hinge on a crowded bus: are you really going to say aloud, “Show me matches who share my taste in trashy TV”? Public spaces demand discretion and subtlety.

Visual interfaces provide this discretion. A glance, tap, or swipe communicates intent instantly, without broadcasting your personal life. Until cultural norms change dramatically, voice alone will remain impractical in many situations.

Assumption 2: People Always Know What They Want

Natural language works well for precise tasks: “Play my morning playlist” or “Book a flight to New York.” But much of digital life isn’t about commands—it’s about discovery.

Take Spotify: even in the streaming era, artists and platforms still design covers, animations, and visuals. Why? Because visuals drive exploration. They spark emotion, shape identity, and help listeners recognize music they didn’t know they wanted. Discovery isn’t purely functional; it’s experiential.

Assumption 3: Voice Is Superior to Visuals

Humans are wired for visuals. We process images far faster than text or speech, and visuals carry cultural and emotional depth that words alone cannot match. Our reliance on imagery predates written language itself: cave paintings, gestures, symbols.

Think about your last team workshop: people drew diagrams, clustered stickies, mapped journeys. If spoken language alone were enough, none of that would be necessary. But we don’t rely on monomodal communication with each other—so why would we expect it to work better with AI?

This is why, even now, visuals anchor digital experiences. A voice description like “moody, atmospheric band with surreal themes” will never resonate as strongly as album art deliberately designed to express that personality.

Where Natural Language Shines

None of this diminishes the power of conversation. Natural language excels where:

  • Speed matters (hands-free commands).
  • Accessibility is essential (for people with disabilities).
  • Flexibility is needed (complex queries, troubleshooting).

The real promise lies in merging modalities. Imagine asking, “Show me relaxing music for a rainy day,” and the AI responding not only with words but with a rich visual gallery of playlists and artwork.
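To make that idea concrete, here is a minimal sketch, in TypeScript, of what such a multimodal response might look like: a conversational reply paired with visual items the UI can render as a gallery. Every name here, from MultimodalResponse to GalleryItem, is an illustrative assumption rather than any platform’s actual API.

    // A hypothetical shape for a multimodal assistant response:
    // conversational text plus visual items a UI can render as a gallery.
    interface GalleryItem {
      title: string;     // e.g., a playlist or album name
      imageUrl: string;  // cover art the user can scan at a glance
      uri: string;       // deep link the UI opens on tap
    }

    interface MultimodalResponse {
      spokenReply: string;     // what the assistant says or shows as text
      gallery: GalleryItem[];  // what the UI shows alongside the words
    }

    // Illustrative only: present the reply in both modalities.
    function render(response: MultimodalResponse): void {
      console.log(response.spokenReply);
      for (const item of response.gallery) {
        console.log(`[${item.title}] ${item.imageUrl} -> ${item.uri}`);
      }
    }

The design point is that neither field replaces the other: the words carry the conversational thread, while the gallery carries the emotional, at-a-glance cues that drive discovery.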

As a UX design leader, I’ve seen that no single mode of interaction fully satisfies people. Real innovation comes from blending strengths—conversation for speed and flexibility, visuals for emotion and exploration. The best interfaces mirror human behavior: multimodal by nature.

It’s ironic that in striving to create artificial intelligence, we talk about reducing interaction to a single mode. Humans have always been multimodal—combining speech, gesture, and imagery. If AI is meant to reflect us, then its interfaces should embrace that diversity, not strip it away.

Conclusion: Augmentation, Not Replacement

The future isn’t about replacing UIs—it’s about designing conversations and visuals that work together seamlessly. The most meaningful digital experiences with AI will always be multimodal—because people are.