.html

HyperText Markup Language

HTML defines the structure of web pages using markup tags that browsers render visually. First described in 1991, later standardized by the IETF and W3C, and now maintained by WHATWG as a living standard, HTML no longer has a version number: the spec updates continuously as browsers ship new features.

Text · W3C · 1991

Not convertible

Conversion not yet available. HTML to PDF rendering requires a browser engine — a feature planned for a future update.

Common questions

What is an HTML file and what is it used for?

An HTML file is a plain-text document containing HyperText Markup Language tags that define the structure of a web page. Browsers read these tags and render the visual page you see — headings, paragraphs, images, links, and forms. Every website on the internet is built on HTML. The file is editable in any text editor and viewable in any web browser.

Why is it called index.html?

Web servers return a default document when a URL points to a directory. The convention of naming that file index.html originated with the NCSA HTTPd server in 1993, the first widely deployed web server. Apache, Nginx, Vercel, and virtually every hosting platform still follow this convention today.

What is the difference between .html and .htm?

They are functionally identical. The .htm extension dates back to MS-DOS and early Windows, which enforced a three-character extension limit (8.3 filenames). When long filenames became standard with Windows 95, .html took over as the preferred extension. Web servers return the same text/html MIME type for both.

Is HTML a programming language?

No. HTML is a markup language that describes document structure and content using tags. It cannot perform calculations, execute logic, or control program flow. Programming languages like JavaScript, Python, and Java handle computation. HTML provides the structural skeleton that programming languages and CSS bring to life.

How do I view the HTML source of a web page?

Right-click anywhere on the page and select View Page Source, or press Ctrl+U on Windows and Linux or Cmd+Option+U on macOS. For the live DOM including JavaScript-generated content, press F12 to open browser developer tools and inspect the Elements panel.

What is the difference between HTML and HTML5?

HTML5 was a W3C Recommendation published in 2014 that added semantic elements, native audio and video, and canvas drawing. Since 2019, the sole specification is the WHATWG HTML Living Standard, which has no version number and is updated continuously. The term HTML5 now informally means modern HTML.

Can I open an HTML file without a browser?

Yes. HTML files are plain text, so any text editor opens them directly: VS Code, Notepad++, Sublime Text, vim, or even Windows Notepad. You will see the raw markup tags instead of the rendered page. Terminal commands such as cat or less can also display the HTML source.

How do I create an HTML file?

Open any text editor, type the basic structure (a DOCTYPE declaration followed by the html, head, and body tags), then save the file with a .html extension. Open the saved file in a browser to see the rendered result. No special software or compiler is required.

What makes .HTML special

18 tags

The entire web started with 18 HTML elements

Berners-Lee's 1991 HTML description at CERN defined just 18 tags — headings, paragraphs, lists, and anchors. The current Living Standard defines over 110 elements, but every page still uses those original building blocks.

No version number

The current HTML spec has no version

Since the 2019 W3C/WHATWG agreement, HTML is maintained as a Living Standard with continuous updates and no version number. 'HTML5' is a frozen 2014 snapshot. When developers say HTML5, they usually mean modern HTML.

View source

Any page's HTML source is one right-click away

Berners-Lee designed the web as an open system. Right-click View Page Source in any browser reveals the raw HTML. This transparency taught generations of developers to code by reading real-world pages.

index.html

The default homepage name dates to 1993

The NCSA HTTPd server in 1993 established the convention of serving index.html when a URL points to a directory. Apache, Nginx, Vercel, and virtually every hosting platform still follow this 30-year-old convention.

The language of the web

Beneath every website, email newsletter, and web application lies an HTML document. HyperText Markup Language is not a programming language — it cannot perform calculations, loop over data, or make decisions. It is a markup language: a system of tags that describes what content is and how it relates to other content. A browser reads those tags and constructs a visual page. This distinction matters because it defines what HTML can and cannot do on its own, and why CSS and JavaScript exist as separate layers.


Origin: Berners-Lee and the 18 tags

Tim Berners-Lee wrote the first HTML description at CERN in 1991 as part of his proposal for a "WorldWideWeb" information management system. That initial document defined roughly 18 elements — headings (h1 through h6), paragraphs, lists, anchors (hyperlinks), and a handful of text-level tags. There was no formal specification, no standards body, and no version number. The first web browser and the first web server both ran on a NeXT workstation in Berners-Lee's office. HTML was designed to be simple enough that a physicist could mark up a research paper without specialized tools — a design philosophy that persists in the language's tolerance for malformed markup.

Document structure: DOCTYPE, head, body

A valid modern HTML document follows a required structure. The document begins with a DOCTYPE declaration — in current HTML, simply <!DOCTYPE html> — which instructs the browser to use standards mode rather than quirks mode. The root element is <html>, typically carrying a lang attribute for accessibility and SEO. Inside <html>, two children exist: <head> and <body>.

The <head> element contains metadata invisible to users: the document title (shown in browser tabs and search results), character encoding declaration (meta charset="utf-8"), references to external CSS and JavaScript files, Open Graph tags for social media previews, and favicon links. The <body> element contains everything the user sees and interacts with: text, images, forms, tables, and embedded media.

This head/body split is fundamental to how the web works. Search engine crawlers read <head> metadata to understand what a page is about before parsing the visible content. Browsers begin rendering <body> content as it arrives, even before the full document has downloaded — a behavior called progressive rendering that makes HTML uniquely suited for networked delivery.
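
The structure described above can be checked mechanically. The following is a minimal sketch, assuming Python's standard html.parser module; the StructureChecker class, the check_structure helper, and the SAMPLE document are hypothetical names introduced for illustration, not part of any specification or validator.

```python
from html.parser import HTMLParser


class StructureChecker(HTMLParser):
    """Records which structural pieces of an HTML document were seen."""

    def __init__(self):
        super().__init__()
        self.has_doctype = False
        self.lang = None
        self.charset = None
        self.seen = set()        # names of start tags encountered
        self.title = ""
        self._in_title = False

    def handle_decl(self, decl):
        # Called for <!DOCTYPE ...>; decl is the text between "<!" and ">".
        if decl.lower().startswith("doctype html"):
            self.has_doctype = True

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        self.seen.add(tag)
        if tag == "html":
            self.lang = attrs.get("lang")
        elif tag == "meta" and "charset" in attrs:
            self.charset = attrs["charset"]
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


# Hypothetical sample document following the DOCTYPE / head / body layout.
SAMPLE = """<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="utf-8">
  <title>Hello</title>
</head>
<body>
  <h1>Hello, world</h1>
</body>
</html>"""


def check_structure(markup):
    """Return a small report on the structural pieces found in markup."""
    checker = StructureChecker()
    checker.feed(markup)
    checker.close()
    return {
        "doctype": checker.has_doctype,
        "lang": checker.lang,
        "charset": checker.charset,
        "title": checker.title,
        "has_head": "head" in checker.seen,
        "has_body": "body" in checker.seen,
    }


report = check_structure(SAMPLE)
print(report)
```

This only reports what the markup contains. It is not a validator; a tool such as the Nu Html Checker is the better choice for checking conformance.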

Encoding: why UTF-8 won

The WHATWG specification strongly recommends UTF-8 for all HTML documents. Early HTML files were typically encoded in ISO-8859-1 (Latin-1) or Windows-1252, which cover Western European characters but cannot represent Arabic, Chinese, Japanese, Korean, Cyrillic, or most other scripts. UTF-8 encodes the full Unicode range in a variable-width format compatible with ASCII.

Encoding is declared via a meta element in the head: <meta charset="utf-8">. A charset parameter in the HTTP Content-Type header, when present, takes precedence over this declaration; if neither is present, browsers fall back to heuristic detection. Mismatched encoding (a file saved as UTF-8 but declared as ISO-8859-1, or vice versa) produces mojibake: garbled characters where accented letters, Arabic script, or CJK ideographs should appear. This remains one of the most common HTML debugging issues.
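
The mismatch can be reproduced in a few lines. This is a small sketch using only Python's built-in codecs; the decode_as helper is a hypothetical name, and it simplifies browser behavior (browsers treat the iso-8859-1 label as windows-1252, which gives the same result for the bytes used here).

```python
def decode_as(data: bytes, declared: str) -> str:
    """Decode bytes as if the declared encoding were trusted."""
    # errors="replace" substitutes U+FFFD for undecodable bytes,
    # similar to the replacement character browsers display.
    return data.decode(declared, errors="replace")


original = "café"
saved_utf8 = original.encode("utf-8")         # b'caf\xc3\xa9'
saved_latin1 = original.encode("iso-8859-1")  # b'caf\xe9'

# Declared encoding matches the saved encoding: text survives.
good = decode_as(saved_utf8, "utf-8")

# Saved as UTF-8 but declared as ISO-8859-1: each byte of the
# two-byte sequence for "é" becomes its own character.
mojibake = decode_as(saved_utf8, "iso-8859-1")

# Saved as ISO-8859-1 but declared as UTF-8: the lone 0xE9 byte
# is not valid UTF-8 and is replaced.
broken = decode_as(saved_latin1, "utf-8")

print(good)      # café
print(mojibake)  # cafÃ©
print(broken)    # caf\ufffd (rendered as the replacement character)
```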

Semantic HTML: meaning over appearance

The HTML5 era introduced semantic elements that describe their content's purpose rather than its appearance. Before HTML5, developers used generic <div> elements with CSS classes for everything. The semantic set includes <header>, <footer>, <main>, <nav>, <aside>, <article>, <section>, <figure>, <figcaption>, and <time>.

Semantic markup serves two audiences. Screen readers use element types to build a navigable page outline: a <nav> element is announced as navigation, an <article> as a self-contained composition. Search engines use heading hierarchy (h1 through h6) and semantic containers to understand content structure, which can affect how a page is indexed and presented in results. Google's documentation recommends semantic HTML for SEO.
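
To illustrate how software can derive an outline from semantic elements and headings, here is a rough sketch built on Python's standard html.parser. The OutlineBuilder class and the PAGE sample are hypothetical, and the element set is a simplification: real assistive technology applies more detailed rules (for example, a header element only acts as a page-level landmark in some contexts).

```python
from html.parser import HTMLParser

# Simplified sets chosen for this sketch.
CONTAINERS = {"header", "nav", "main", "aside", "article", "section", "footer"}
HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class OutlineBuilder(HTMLParser):
    """Collects semantic containers and heading text in document order."""

    def __init__(self):
        super().__init__()
        self.outline = []      # list of (kind, tag, text) tuples
        self._heading = None   # heading tag currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag in CONTAINERS:
            self.outline.append(("container", tag, ""))
        elif tag in HEADINGS:
            self._heading = tag
            self._text = []

    def handle_data(self, data):
        if self._heading:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == self._heading:
            text = "".join(self._text).strip()
            self.outline.append(("heading", tag, text))
            self._heading = None


PAGE = """
<header><h1>Site title</h1></header>
<nav><a href="/">Home</a></nav>
<main><article><h2>Post <em>one</em></h2><p>Body text.</p></article></main>
<footer>Contact</footer>
"""

builder = OutlineBuilder()
builder.feed(PAGE)
builder.close()

for kind, tag, text in builder.outline:
    print(kind, tag, text)
```

A page built only from generic div elements would produce an empty outline here, which is the practical cost of non-semantic markup.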

The WHATWG Living Standard: no more version numbers

The HTML specification has had a turbulent governance history. From 1995 to 1999, the IETF and then W3C published numbered versions: HTML 2.0, 3.2, 4.0, and 4.01. In 2000, the W3C pivoted to XHTML — an XML-strict serialization of HTML that required well-formed documents and broke backward compatibility. When the W3C began work on XHTML 2.0, which would have been incompatible with existing web content, browser vendors objected.

In 2004, Apple, Mozilla, and Opera formed the Web Hypertext Application Technology Working Group (WHATWG) to develop a practical evolution of HTML 4. Their work became HTML5, which the W3C eventually adopted and published as a Recommendation in October 2014. But by 2017, the W3C's last numbered snapshot (HTML 5.2) was already diverging from the WHATWG's continuously updated specification.

On 28 May 2019, the W3C and WHATWG signed a memorandum of understanding: the WHATWG HTML Living Standard would be the single authoritative HTML specification going forward. W3C would no longer publish independent HTML versions. This means the term "HTML5" is technically a historical artifact — a frozen snapshot from 2014. The current specification has no version number and is updated continuously. When developers say "HTML5," they usually mean "modern HTML as defined by the Living Standard," but the distinction matters for standards compliance and specification references.

The index.html convention

Web servers serve a default document when a URL points to a directory rather than a specific file. The nearly universal convention is to look for a file named index.html. This practice originated with the NCSA HTTPd server in 1993, the first widely deployed web server software. When a user visits https://example.com/, the server returns the contents of /index.html from the document root.

The convention persists across Apache, Nginx, Cloudflare Pages, Netlify, Vercel, GitHub Pages, and virtually every hosting platform. Alternative default filenames (index.htm, default.html, default.asp) exist in some server configurations, but index.html remains dominant. Understanding this convention is essential for static site deployment and explains why the homepage of most websites lives in a file named index.html.
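
The lookup a server performs for a directory URL can be sketched as follows. This is an illustrative approximation, not the logic of any particular server: the resolve function and the DEFAULT_DOCUMENTS order are assumptions, and real servers make the list configurable.

```python
from pathlib import Path
import tempfile

# Hypothetical lookup order; real servers make this configurable.
DEFAULT_DOCUMENTS = ("index.html", "index.htm")


def resolve(document_root, url_path):
    """Map a URL path to a file under document_root, or None if absent."""
    root = Path(document_root).resolve()
    target = (root / url_path.lstrip("/")).resolve()
    # Refuse paths that escape the document root (e.g. "/../").
    if target != root and root not in target.parents:
        return None
    if target.is_dir():
        # Directory request: try each default document in order.
        for name in DEFAULT_DOCUMENTS:
            candidate = target / name
            if candidate.is_file():
                return candidate
        return None
    return target if target.is_file() else None


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "index.html").write_text("<!DOCTYPE html><title>Home</title>",
                                     encoding="utf-8")
    (root / "docs").mkdir()
    (root / "docs" / "index.htm").write_text("<!DOCTYPE html><title>Docs</title>",
                                             encoding="utf-8")

    home = resolve(root, "/")
    docs = resolve(root, "/docs/")
    missing = resolve(root, "/missing/")
    escaped = resolve(root, "/../")
    results = (home.name, docs.name, missing, escaped)

print(results)  # ('index.html', 'index.htm', None, None)
```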

View-source: the web's open classroom

One of HTML's most distinctive cultural features is transparency. Any browser can display the raw HTML source of any web page — right-click, "View Page Source," or type view-source: before a URL. This openness was intentional. Berners-Lee designed the web as an open system where anyone could learn by reading existing pages and copying patterns. Entire generations of web developers learned HTML not from textbooks but from viewing the source of sites they admired. Browser developer tools (F12) extend this further, showing the live DOM, CSS rules, network requests, and JavaScript execution in real time.

Security: XSS and the trust boundary

HTML is the primary attack surface for web security vulnerabilities. Cross-Site Scripting (XSS) — ranked in the OWASP Top 10 — occurs when an attacker injects malicious HTML or JavaScript into a page viewed by other users. The injection point is typically unsanitized user input rendered into innerHTML, href attributes, or event handlers. A successful XSS attack can steal session cookies, redirect users to phishing pages, or modify page content.

Defenses include Content Security Policy (CSP) headers that restrict which scripts can execute, input sanitization libraries (DOMPurify, Bleach), and the HttpOnly cookie flag that prevents JavaScript from accessing authentication tokens. The <iframe> element introduces additional risks: clickjacking attacks overlay transparent frames on legitimate UI elements to trick users into unintended actions. The X-Frame-Options header and frame-ancestors CSP directive mitigate this vector.
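
As a small illustration of these defenses, the sketch below uses output escaping from Python's standard html module. Note that this is escaping, not sanitization: libraries such as DOMPurify or Bleach keep safe markup, whereas escaping neutralizes all markup. The render_comment helper and the header values are hypothetical examples, and escaping alone is only appropriate for element content and quoted attribute values, not for href attributes or event handlers.

```python
import html


def render_comment(user_input: str) -> str:
    """Escape user-supplied text before placing it in element content."""
    escaped = html.escape(user_input, quote=True)
    return '<p class="comment">' + escaped + "</p>"


payload = '<script>alert("xss")</script>'
safe = render_comment(payload)
print(safe)
# <p class="comment">&lt;script&gt;alert(&quot;xss&quot;)&lt;/script&gt;</p>

# Example response headers (values are illustrative assumptions):
# - CSP limits script sources and forbids framing by other origins.
# - HttpOnly keeps the session cookie out of reach of JavaScript.
SECURITY_HEADERS = [
    ("Content-Security-Policy",
     "default-src 'self'; script-src 'self'; frame-ancestors 'none'"),
    ("Set-Cookie",
     "session=opaque-token; HttpOnly; Secure; SameSite=Lax"),
]
```

Escaping at the point of output and restricting scripts through CSP are complementary: the first prevents injected markup from being parsed, the second limits the damage if something slips through.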

The .html versus .htm distinction

Both extensions are functionally identical. The three-character .htm variant exists because MS-DOS and Windows 3.x enforced an 8.3 filename limit — eight characters for the name, three for the extension. Files could not have a four-character extension like .html. When Windows 95 introduced long filename support, .html became the standard, but .htm persists in legacy systems, older Microsoft toolchains, and some server configurations. Web servers return the same text/html MIME type for both extensions.
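
The shared MIME type can be observed with Python's standard mimetypes module, which maps filenames to media types in a way comparable to a server's extension table. This is only an illustration of the mapping; the filenames are hypothetical and a given server's configuration may differ.

```python
import mimetypes

# Both extensions are expected to map to the same media type.
types = {name: mimetypes.guess_type(name)[0]
         for name in ("page.html", "page.htm")}

for name, media_type in types.items():
    print(name, media_type)
```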

.HTML compared to alternative formats

Formats	Criterion	Comparison	Winner
.HTML vs .XML	Parser error handling	HTML uses a lenient error-recovery parser that renders malformed markup gracefully: unclosed tags, missing attributes, and improper nesting are auto-corrected. XML requires strict well-formedness; a single error produces a fatal parse failure with no rendered output.	HTML wins
.HTML vs .PDF	Editability and reflow	HTML is plain text editable in any text editor, and content reflows automatically to fit any screen width. PDF is a fixed-layout binary format requiring specialized tools to modify, with no native responsive reflow capability.	HTML wins
.HTML vs .MARKDOWN	Expressiveness	HTML supports forms, tables with merged cells, embedded media, interactive elements, ARIA accessibility attributes, and arbitrary nesting. Markdown covers headings, lists, links, and basic formatting but requires inline HTML for anything beyond its limited syntax.	HTML wins
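
The difference in parser error handling between HTML and XML can be seen with Python's standard library. This is a sketch under one caveat: html.parser is a tolerant tokenizer rather than the full error-recovery tree builder browsers implement, so it shows that malformed input is accepted, not how a browser repairs it. The TagCollector class and MALFORMED sample are hypothetical.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Unclosed <li>, <p>, and <b> elements: common in hand-written HTML.
MALFORMED = "<ul><li>one<li>two</ul><p>unclosed <b>bold"


class TagCollector(HTMLParser):
    """Collects start tags to show the input was processed."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


# The HTML parser processes the whole input without raising.
collector = TagCollector()
collector.feed(MALFORMED)
collector.close()
html_tags = collector.tags

# The XML parser stops at the first well-formedness error.
try:
    ET.fromstring(MALFORMED)
    xml_error = None
except ET.ParseError as exc:
    xml_error = str(exc)

print(html_tags)
print(xml_error)
```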

Technical reference


MIME Type: text/html
Developer: World Wide Web Consortium (W3C) / WHATWG
Year Introduced: 1991
Open Standard: Yes (WHATWG HTML Living Standard)

Binary Structure

HTML is a plain-text format with no binary structure, no magic bytes, and no fixed file header. Documents typically begin with the ASCII string `<!DOCTYPE html>` followed by the `<html>` root element. UTF-8 is the recommended encoding; a UTF-8 BOM (bytes EF BB BF) may precede the DOCTYPE but is discouraged by the WHATWG specification. Browsers identify HTML primarily from the HTTP Content-Type header and fall back to content sniffing, using the algorithm defined in the WHATWG MIME Sniffing Standard, when the type is missing or unreliable. PRONOM identifies HTML generically by the presence of `<html>` or `<body>` tags within the first bytes. Line endings (CR, LF, or CRLF) are normalized by parsers and have no semantic effect. Elements form a tree structure (the DOM) when parsed, with `<html>` as the root, `<head>` and `<body>` as its two children, and all visible content nested inside `<body>`.
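
Because there are no magic bytes, identification relies on heuristics over the first bytes of a file. The sketch below is a deliberately simplified heuristic, not the WHATWG MIME Sniffing algorithm or the PRONOM signature; the looks_like_html function, its 512-byte window, and the list of markers are assumptions made for illustration.

```python
UTF8_BOM = b"\xef\xbb\xbf"

# Markers checked at the start of the file (illustrative subset).
HTML_MARKERS = (b"<!doctype html", b"<html", b"<head", b"<body")


def looks_like_html(data: bytes, window: int = 512) -> bool:
    """Guess whether data is HTML from its leading bytes."""
    head = data[:window]
    # Skip an optional UTF-8 byte order mark.
    if head.startswith(UTF8_BOM):
        head = head[len(UTF8_BOM):]
    # Ignore leading whitespace and compare case-insensitively.
    head = head.lstrip(b" \t\r\n\x0c").lower()
    return head.startswith(HTML_MARKERS)


samples = {
    "bom_doctype": b"\xef\xbb\xbf<!DOCTYPE html><html></html>",
    "uppercase": b"  \r\n<HTML><BODY>hi</BODY></HTML>",
    "pdf": b"%PDF-1.7\n",
    "json": b'{"a": 1}',
}
verdicts = {name: looks_like_html(data) for name, data in samples.items()}
print(verdicts)
```

A heuristic like this misses HTML fragments that start with other elements, which is one reason servers are expected to send an explicit Content-Type header.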

1991: Tim Berners-Lee publishes the first HTML description at CERN, defining roughly 18 elements for the World Wide Web
1993: NCSA HTTPd web server establishes the index.html convention; Mosaic browser brings HTML to mainstream audiences
1995: HTML 2.0 published as RFC 1866, the first formal specification, adding forms and image support
1999: HTML 4.01 (W3C Recommendation) introduces CSS separation, accessibility attributes, and the scripting framework
2004: Apple, Mozilla, and Opera form WHATWG after rejecting W3C's XHTML 2.0 direction, beginning work on HTML5
2008: W3C adopts the WHATWG HTML5 draft, adding canvas, video, audio, and semantic elements to the specification
2014: HTML5 published as a W3C Recommendation on 28 October, formalizing its semantic elements and multimedia capabilities
2019: W3C and WHATWG sign agreement on 28 May: the HTML Living Standard becomes the sole authoritative specification

Validate HTML against the Living Standard

html5validator --root ./public --also-check-css --log INFO

Runs the Nu Html Checker (vnu.jar) against all HTML files in the ./public directory. The --also-check-css flag additionally checks CSS files found under that directory. Returns a non-zero exit code on errors, which makes it suitable for CI/CD pipelines.

Clean and reformat malformed HTML with Tidy

tidy -q -m -utf8 --wrap 0 --indent auto page.html

HTML Tidy repairs structural errors (unclosed tags, improper nesting), converts encoding to UTF-8, and reformats with consistent indentation. The -m flag modifies the file in place; -q suppresses informational messages.

Convert HTML to Markdown with Pandoc

pandoc -f html -t gfm --wrap=none -o output.md input.html

Extracts structured content from an HTML file and converts it to GitHub Flavored Markdown. The --wrap=none flag prevents Pandoc from inserting hard line breaks, preserving paragraph flow for documentation workflows.

Fetch and inspect HTTP headers for an HTML page

curl -sI https://example.com | head -15

Retrieves only the HTTP response headers from a URL, showing Content-Type (text/html), encoding, caching directives, and security headers (CSP, X-Frame-Options) without downloading the page body.

Minify HTML for production deployment

npx html-minifier-terser --collapse-whitespace --remove-comments --minify-css true --minify-js true -o output.html input.html

Removes comments, collapses whitespace, and minifies inline CSS and JavaScript in a single pass. Reduces file size for production deployment without altering the rendered output.


Risk level: HIGH

Attack Vectors

Cross-Site Scripting (XSS)
Iframe clickjacking
HTML injection / phishing forms
Script injection via event handlers

Mitigation: FileDex processes HTML files entirely within the local browser environment — no file uploads to servers, no external resource loading, and no script execution from analyzed files. Content Security Policy headers restrict inline script execution.

VS Code tool

Microsoft's open-source editor with built-in HTML IntelliSense, Emmet abbreviation expansion, and Live Server extension for real-time preview

HTML Tidy tool

CLI tool for cleaning, repairing, and reformatting malformed HTML documents, maintained by the HTACG community

Pandoc tool

Universal document converter supporting HTML to Markdown, DOCX, PDF, LaTeX, and dozens of other format conversions

htmlparser2 library

Fast, forgiving HTML and XML parser for Node.js with streaming support and DOM/SAX event interfaces

Beautiful Soup library

Python library for parsing HTML and extracting data from web pages, widely used for web scraping and data extraction workflows

DOMPurify library

XSS sanitizer for HTML, MathML, and SVG that removes dangerous content while preserving safe markup, used by major web applications

Nu Html Checker (vnu.jar) tool

Official W3C/WHATWG HTML validator used by html5validator CLI, checks documents against the HTML Living Standard

WHATWG HTML Living Standard spec

The single authoritative HTML specification, continuously maintained by WHATWG since the 2019 W3C agreement, with no version numbers

Specification WHATWG HTML Living Standard
Specification W3C HTML5.2 Recommendation (frozen snapshot, December 2017)
Registry IANA Media Type: text/html (RFC 2854)
Registry Library of Congress Format Description — HyperText Markup Language Format Family (fdd000475)
History W3C/WHATWG Memorandum of Understanding (28 May 2019)
History HTML — Wikipedia