Bridging the Web's Structure Gap: The Journey from HTML to Semantic Data

The Web’s Original Design: Human-Readable, Not Machine-Readable

Since the mid-1990s, the web has primarily served as a platform for sharing documents meant for human eyes. These documents are built with HTML, a language that offers only basic structural cues—like indicating a paragraph or emphasizing a word. Add a dash of CSS to style those paragraphs with tiny gray sans-serif text, and you might look trendy—until older readers struggle to decipher your design and click away. That’s the extent of “structure” on most of the web today.

Source: www.joelonsoftware.com

Consider a common scenario: you mention a book on a webpage. You might put the title in bold and write: Goodnight Moon by Margaret Wise Brown, illustrated by Clement Hurd, Harper & Brothers, 1947, ISBN 0-06-443017-0. To a human, that’s clear. But a naive computer program scanning the page wouldn’t recognize the reference as a specific book. All it sees is a run of bold text followed by plain text, with nothing to mark which part is the author, the illustrator, or the ISBN. The lack of machine-readable structure limits what computers can do with the information.
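To make the limitation concrete, here is a minimal sketch of such a naive program, written with Python's standard html.parser module. The HTML string is an assumption about how the book mention might be marked up, and the class and function names are made up for this illustration.

```python
from html.parser import HTMLParser

# Hypothetical markup for the book mention; only the title is wrapped in <b>.
PAGE = (
    "<p><b>Goodnight Moon</b> by Margaret Wise Brown, "
    "illustrated by Clement Hurd, Harper &amp; Brothers, 1947, "
    "ISBN 0-06-443017-0.</p>"
)


class BoldTextCollector(HTMLParser):
    """Collects the text found inside <b> elements and nothing else."""

    def __init__(self):
        super().__init__()
        self._depth = 0      # how many <b> elements we are currently inside
        self.bold_runs = []  # text found inside <b>...</b>

    def handle_starttag(self, tag, attrs):
        if tag == "b":
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "b" and self._depth > 0:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth > 0:
            self.bold_runs.append(data)


def bold_text(html):
    """Return every run of bold text in the given HTML string."""
    collector = BoldTextCollector()
    collector.feed(html)
    collector.close()
    return collector.bold_runs


print(bold_text(PAGE))  # ['Goodnight Moon']
```

The program can report that some text is bold, but the markup gives it no way to tell that the text is a book title, or that the words after it name an author and an ISBN.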

The Semantic Web Vision: Making Data Understandable to Machines

As early as 1999, Tim Berners-Lee, the inventor of the web, foresaw a solution. In his book Weaving the Web, he wrote: “I have a dream for the Web [in which computers] become capable of analyzing all the data on the Web – the content, links, and transactions between people and computers. A ‘Semantic Web’, which makes this possible, has yet to emerge, but when it does… the day-to-day mechanisms of trade, bureaucracy and our daily lives will be handled by machines talking to machines.”

This vision, known as the Semantic Web, would require adding extra layers of markup to web pages—markup that explicitly tells computers what each piece of data means. For instance, rather than just bold text for a book title, you could annotate it with details like author, publication year, and ISBN, all in a format that software can parse and connect to other data sources.

Schema.org and Standard Vocabularies

To make the Semantic Web practical, initiatives like Schema.org were created. Schema.org provides a shared vocabulary that webmasters can use to describe common entities—books, events, people, recipes, and more. The idea is simple: look up how Schema.org defines a “Book,” then use that definition to add structured data to your HTML.

Several formats exist to embed this markup, including RDFa, Microdata, and JSON-LD (a lightweight JSON-based format). These allow you to say, “Hey, this bold text is actually a book with an author named Margaret Wise Brown.” When done correctly, search engines and other applications can understand the content’s meaning and display rich snippets or power intelligent agents.
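As an illustration, the Goodnight Moon reference could be described in JSON-LD with Schema.org's Book type roughly as follows. This is a sketch rather than a complete or authoritative record: the property names (name, author, illustrator, publisher, datePublished, isbn) come from the Schema.org vocabulary, while the helper name to_jsonld_script and the choice of which properties to include are assumptions made for the example.

```python
import json

# Schema.org "Book" description of the example. The property names are
# Schema.org's; which ones to include is a choice made for this sketch.
book = {
    "@context": "https://schema.org",
    "@type": "Book",
    "name": "Goodnight Moon",
    "author": {"@type": "Person", "name": "Margaret Wise Brown"},
    "illustrator": {"@type": "Person", "name": "Clement Hurd"},
    "publisher": {"@type": "Organization", "name": "Harper & Brothers"},
    "datePublished": "1947",
    "isbn": "0-06-443017-0",
}


def to_jsonld_script(data):
    """Wrap a JSON-LD document in the <script> element used to embed it in HTML."""
    payload = json.dumps(data, indent=2)
    # Escape "</" so the payload cannot close the script element early.
    payload = payload.replace("</", "<\\/")
    return f'<script type="application/ld+json">\n{payload}\n</script>'


snippet = to_jsonld_script(book)
print(snippet)
```

The resulting script element can sit alongside the human-readable sentence on the page. Readers still see the bold title, while a crawler can read the author, illustrator, and ISBN without guessing from the prose.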

Why Semantic Markup Hasn’t Taken Off (Yet)

Despite the clear benefits, adopting semantic markup remains uncommon. The primary obstacle? It’s hard. After you’ve crafted a beautiful, human-readable blog post, the last thing you want to do is dive into schema documentation, learn RDF or JSON-LD syntax, and manually tag every piece of data. It feels like homework.

Unless a computer is already parsing your site (like a search engine crawler), the effort often seems wasted. Most web creators give up, leaving the vast majority of web pages without structured data. As a result, the Semantic Web vision remains largely unrealized, even two decades after Berners-Lee’s prophecy.

But the potential rewards are enormous. If we could make the web’s information readily accessible to machines—including AI systems and traditional programs—we could unlock new levels of automation, knowledge discovery, and seamless interaction. Human progress depends on getting more information into formats that are easy for both people and computers to use.

A Modern Approach: The Block Protocol

Recognizing that people will only add semantic markup if doing so is easy, recent projects aim to lower the barrier. One such initiative is the Block Protocol, a framework that reimagines how web content is structured. Instead of expecting creators to manually annotate HTML, the Block Protocol provides a system of reusable, self-describing “blocks” that carry their own schema. When you embed a book block, for example, it automatically includes all the necessary markup—ISBN, author, cover image, and more—in a machine-readable format.

This approach integrates structured data directly into the content creation workflow. You don’t need to learn RDF or JSON-LD; you simply use a block that knows how to describe itself. The protocol also aims for interoperability, so blocks from different sources can communicate and share data. Work on the Block Protocol is ongoing, with adoption growing among developers and content platforms.

How Block Protocol Simplifies Structured Data

Inherent semantics: Each block defines its own structured data, so every piece of content is automatically computer-readable.
No extra work: Creators focus on writing, not markup. The block handles the schema.
Interoperable: Blocks can exchange data, enabling richer applications without manual integration.
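The idea of a self-describing block can be sketched in a few lines. The code below is not the Block Protocol's actual API or specification; BookBlock, to_structured_data, render, and authors_of are hypothetical names invented here, and Schema.org's Book type is used only as a convenient example schema.

```python
import json
from dataclasses import dataclass
from html import escape


@dataclass
class BookBlock:
    """Hypothetical block: holds the data and knows how to describe itself."""

    title: str
    author: str
    isbn: str

    def to_structured_data(self):
        # Machine-readable description, here using Schema.org's Book type.
        return {
            "@context": "https://schema.org",
            "@type": "Book",
            "name": self.title,
            "author": {"@type": "Person", "name": self.author},
            "isbn": self.isbn,
        }

    def render(self):
        # Human-readable HTML and the embedded JSON-LD are produced together,
        # so the creator never writes the markup by hand.
        visible = f"<p><b>{escape(self.title)}</b> by {escape(self.author)}</p>"
        payload = json.dumps(self.to_structured_data()).replace("</", "<\\/")
        return f'{visible}\n<script type="application/ld+json">{payload}</script>'


def authors_of(blocks):
    """A consumer reads the structured data without parsing any prose."""
    return [b.to_structured_data()["author"]["name"] for b in blocks]


block = BookBlock("Goodnight Moon", "Margaret Wise Brown", "0-06-443017-0")
print(block.render())
print(authors_of([block]))  # ['Margaret Wise Brown']
```

The design point is that the creator fills in three fields and the block emits both views. Because the structured view has a predictable shape, another program (here authors_of) can use it without any manual integration.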

Looking Ahead: A Smarter Web for Everyone

The journey from plain HTML to a fully semantic web has been slow, but it’s gaining momentum. The Block Protocol represents a practical step toward making Berners-Lee’s dream a reality—not by demanding more from content creators, but by making structure a natural part of how we build for the web. As more sites adopt these blocks, we’ll see a web where computers truly understand the information they process, leading to smarter search results, automated assistants, and new forms of digital collaboration.

The technology exists. The vision is clear. Now, it’s about making it effortless enough that everyone participates. And with initiatives like the Block Protocol, that future is closer than ever.
