.html

HyperText Markup Language

HTML defines web page structure using markup tags that browsers render visually. FileDex provides local HTML analysis and format reference directly in your browser — no file uploads, no server processing.



Markup Language | Text Format | UTF-8 | W3C / WHATWG | 1991

By FileDex

Not convertible

Markup format. Conversion is not applicable.

Common questions

Can I open an HTML file without a web browser?

Yes. HTML files are plain text, so any text editor (VS Code, Notepad++, Sublime Text) opens them directly. You will see the raw markup tags instead of the rendered page. Terminal tools like `cat` or `less` also display HTML source.

What is the difference between .html and .htm file extensions?

They are functionally identical. The .htm extension dates back to MS-DOS and Windows 3.1, which enforced a three-character extension limit. Modern systems treat both extensions the same — servers return `text/html` for either one.

Is HTML a programming language?

No. HTML is a markup language — it describes document structure and content but cannot perform logic, loops, or calculations. Programming languages like JavaScript add interactivity and computation to HTML pages.

How do I check if my HTML file is valid?

Use the W3C Markup Validation Service at validator.w3.org, or run html5validator from the command line. These tools check for unclosed tags, missing required attributes, and deprecated elements against the HTML Living Standard.

What makes .HTML special

What is an HTML file?

HTML (HyperText Markup Language) is the standard markup language for documents displayed in web browsers. It defines the structure and content of web pages using elements (tags) like headings, paragraphs, links, images, and forms. HTML was invented by Tim Berners-Lee in 1991 and is now maintained as a Living Standard by WHATWG, meaning it evolves continuously rather than in numbered releases.


Every page on the web is ultimately an HTML document. When you visit a URL, your browser receives an HTML file and renders it visually. CSS controls appearance, and JavaScript adds behavior — but HTML is the foundation that makes a document a webpage.

How to open HTML files

Any web browser (Chrome, Firefox, Edge, Safari) — Double-click to render as a web page
VS Code (Windows, macOS, Linux) — Code editing with live preview via extensions
Notepad++ (Windows) — Syntax-highlighted editing
Sublime Text (Windows, macOS, Linux) — Fast code editor

Technical specifications

Property	Value
Current Version	HTML5 (Living Standard)
Encoding	UTF-8 (recommended)
Type	Markup language
Standard	WHATWG Living Standard
MIME type	`text/html`
Related	CSS (styling), JavaScript (behavior)

Common use cases

Web pages: Every website is built on HTML
Email templates: HTML-formatted emails with rich formatting
Documentation: Technical docs, help files, and manuals
Web applications: Single-page applications (SPAs) use a single HTML shell
Progressive Web Apps (PWAs): Installable apps built on web technologies

HTML document structure

A minimal valid HTML5 document:

<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8">
  <meta name="viewport" content="width=device-width, initial-scale=1.0">
  <title>Page Title</title>
</head>
<body>
  <h1>Hello, World</h1>
  <p>This is a paragraph.</p>
</body>
</html>

The <!DOCTYPE html> declaration tells the browser to use standards mode. The <head> contains metadata not shown to users (title, character set, stylesheets), and <body> contains the visible content.
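The parts described above can be picked out programmatically. The following is a minimal sketch using Python's standard `html.parser`; the class name `SkeletonInspector` and the helper `inspect_skeleton` are hypothetical, and the parser is a simple tokenizer rather than a full browser engine.

```python
from html.parser import HTMLParser


class SkeletonInspector(HTMLParser):
    """Records the doctype, language, charset, and head/body sections."""

    def __init__(self):
        super().__init__()
        self.doctype = None
        self.lang = None
        self.charset = None
        self.sections = []

    def handle_decl(self, decl):
        # Called for <!DOCTYPE ...>; decl is the text between "<!" and ">".
        self.doctype = decl

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "html":
            self.lang = attrs.get("lang")
        elif tag == "meta" and "charset" in attrs:
            self.charset = attrs["charset"]
        elif tag in ("head", "body"):
            self.sections.append(tag)


def inspect_skeleton(markup: str) -> SkeletonInspector:
    parser = SkeletonInspector()
    parser.feed(markup)
    parser.close()
    return parser


SAMPLE = """<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8">
  <title>Page Title</title>
</head>
<body>
  <h1>Hello, World</h1>
</body>
</html>"""

result = inspect_skeleton(SAMPLE)
print(result.doctype, result.lang, result.charset, result.sections)
```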

Semantic HTML

HTML5 introduced semantic elements that describe their content's meaning to both browsers and search engines:

<header>, <footer>, <main>, <nav>, <aside> — Page structure
<article>, <section> — Content grouping
<figure>, <figcaption> — Images with captions
<time datetime="2024-01-15"> — Machine-readable dates

Semantic HTML improves accessibility (screen readers understand the page structure) and SEO (Google better understands content hierarchy).

HTML and SEO

Search engines read HTML directly. Key elements that affect ranking and how a page appears in results:

<title> — Shown in search result titles
<meta name="description"> — Search snippet text
<h1>–<h6> heading hierarchy — Signals content structure
alt attributes on images — Enables image indexing
<link rel="canonical"> — Prevents duplicate content penalties
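To see which of these signals a page actually exposes, a small script can read them from the markup. This is a rough sketch with Python's standard `html.parser`; the names `SeoSignals` and `extract_seo_signals` and the example.com URL are made up for illustration, and the script says nothing about how any search engine weighs the values.

```python
from html.parser import HTMLParser


class SeoSignals(HTMLParser):
    """Collects the title, meta description, and canonical URL."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.description = None
        self.canonical = None
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and (attrs.get("name") or "").lower() == "description":
            self.description = attrs.get("content")
        elif tag == "link" and "canonical" in (attrs.get("rel") or "").lower().split():
            self.canonical = attrs.get("href")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_seo_signals(markup: str) -> dict:
    parser = SeoSignals()
    parser.feed(markup)
    parser.close()
    return {
        "title": parser.title.strip(),
        "description": parser.description,
        "canonical": parser.canonical,
    }


PAGE = """<head>
  <title>Page Title</title>
  <meta name="description" content="A short summary.">
  <link rel="canonical" href="https://example.com/page">
</head>"""

print(extract_seo_signals(PAGE))
```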

Accessibility

Well-written HTML is inherently accessible. Use alt text on all images, <label> elements for form inputs, logical heading order (h1 before h2), and ARIA attributes (role, aria-label) for custom interactive components. HTML that passes WCAG 2.1 AA guidelines works better for all users, including those using screen readers or keyboard-only navigation.
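Two of the checks mentioned above (alt text and heading order) are easy to automate. The sketch below uses Python's standard `html.parser`; `A11yLint` and `lint_accessibility` are hypothetical names, and the script covers only these two rules, so it is no substitute for a WCAG audit.

```python
from html.parser import HTMLParser


class A11yLint(HTMLParser):
    """Flags <img> elements without alt and skipped heading levels."""

    def __init__(self):
        super().__init__()
        self.issues = []
        self._last_level = 0

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "img" and "alt" not in attrs:
            # alt="" is allowed (decorative image); a missing attribute is not.
            self.issues.append(f"img without alt: {attrs.get('src', '?')}")
        elif len(tag) == 2 and tag[0] == "h" and tag[1] in "123456":
            level = int(tag[1])
            if level > self._last_level + 1:
                self.issues.append(f"heading level skipped before <h{level}>")
            self._last_level = level


def lint_accessibility(markup: str) -> list:
    parser = A11yLint()
    parser.feed(markup)
    parser.close()
    return parser.issues


SNIPPET = '<h1>Title</h1><img src="a.png"><h3>Sub</h3><img src="b.png" alt="">'
for issue in lint_accessibility(SNIPPET):
    print(issue)
```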

.HTML compared to alternatives

.HTML compared to alternative formats
Formats	Criterion	Details	Winner
.HTML vs .XHTML	Parser strictness	HTML uses a lenient error-recovery parser that renders malformed markup. XHTML requires strict XML well-formedness; a single unclosed tag causes a fatal parse error. HTML's tolerance made it the practical winner for web authoring.	HTML wins
.HTML vs .MARKDOWN	Authoring speed	Markdown syntax is faster to write for text-heavy content (headings, lists, links) but cannot express interactive elements, forms, or complex layouts that HTML handles natively.	MARKDOWN wins
.HTML vs .PDF	Editability	HTML source is plain text editable in any text editor. PDF is a binary format requiring specialized tools to modify content, making HTML the better choice for living documents.	HTML wins
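The parser-strictness row can be demonstrated with Python's standard library. In this sketch, `html.parser` stands in for a tolerant HTML reader and `xml.etree.ElementTree` for a strict XML one; the helper names are hypothetical, and `html.parser` is only a tokenizer, not a full implementation of browser error recovery.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# The <p> is never closed before </div>, so this is not well-formed XML.
MALFORMED = "<div><p>Unclosed paragraph</div>"


class TagLogger(HTMLParser):
    """Records start and end tags without complaining about nesting."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


def html_events(markup: str) -> list:
    logger = TagLogger()
    logger.feed(markup)
    logger.close()
    return logger.events


def xml_is_well_formed(markup: str) -> bool:
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


print(html_events(MALFORMED))         # tokenized without raising
print(xml_is_well_formed(MALFORMED))  # the XML parser rejects it
```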

Technical reference


MIME Type: text/html
Developer: World Wide Web Consortium (W3C) / WHATWG
Year Introduced: 1991
Open Standard: Yes

Binary Structure

HTML is a plain-text format, normally encoded in UTF-8 (recommended by the spec, though legacy pages may use ISO-8859-1 or Windows-1252). Files have no binary magic bytes. The document typically begins with `<!DOCTYPE html>` followed by the `<html>` root element. A UTF-8 BOM (EF BB BF) is permitted and browsers handle it, but it is often discouraged in practice because it can cause trouble elsewhere: PHP emits it as output before any headers are sent, and shell scripts that concatenate HTML files leave stray BOM bytes in the middle of the result. Line endings are normalized by parsers: CR, LF, and CRLF are all treated as a single line break.
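The byte-level details above can be handled in a few lines when reading a file by hand. This sketch assumes the input really is UTF-8; the function name `load_html_bytes` is hypothetical, and legacy encodings would need detection that is not shown here.

```python
import codecs


def load_html_bytes(raw: bytes) -> str:
    """Decode HTML bytes as UTF-8, dropping a leading BOM and normalizing newlines."""
    if raw.startswith(codecs.BOM_UTF8):  # the three bytes EF BB BF
        raw = raw[len(codecs.BOM_UTF8):]
    text = raw.decode("utf-8")
    # Treat CRLF and lone CR as LF, mirroring how parsers see line breaks.
    return text.replace("\r\n", "\n").replace("\r", "\n")


sample = b"\xef\xbb\xbf<!DOCTYPE html>\r\n<p>hi</p>\r"
print(repr(load_html_bytes(sample)))
```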

1991: Tim Berners-Lee publishes the first HTML document at CERN, defining 18 elements
1995: HTML 2.0 published as RFC 1866, the first formal specification, with forms support
1997: HTML 3.2 (W3C Recommendation) adds tables, applets, and text flow around images
1999: HTML 4.01 introduces CSS separation, accessibility attributes, and a scripting framework
2000: XHTML 1.0 reformulates HTML 4 as strict XML, requiring well-formed documents
2008: W3C publishes the first HTML5 working draft, based on WHATWG's work, introducing canvas, video, audio, and semantic elements
2014: W3C publishes HTML5 as a Recommendation
2019: W3C and WHATWG agree on a single HTML Living Standard maintained by WHATWG

Validate HTML syntax with html5validator

html5validator --root ./public --also-check-css

Runs the Nu Html Checker against all HTML files in the ./public directory. The --also-check-css flag additionally checks CSS files found there. Returns a non-zero exit code on validation errors, making it suitable for CI pipelines.

Convert HTML to PDF with wkhtmltopdf

wkhtmltopdf --enable-local-file-access --page-size A4 input.html output.pdf

Renders the HTML file using a WebKit engine and outputs a paginated PDF. --enable-local-file-access permits loading local CSS and image assets. --page-size sets the output to A4 dimensions.

Minify HTML with html-minifier-terser

npx html-minifier-terser --collapse-whitespace --remove-comments --minify-css true --minify-js true -o output.html input.html

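Removes comments, collapses whitespace, and minifies inline CSS and JS in one pass. Reduces file size for production deployment, normally without altering rendered output, although collapsing whitespace can change the spacing between inline elements in some layouts.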

HTML → PDF (render, variable fidelity): Converting HTML to PDF produces a portable, print-ready snapshot of a web page. PDF output preserves layout fidelity for archival, legal documentation, or offline distribution where a browser is unavailable.

HTML → MARKDOWN (export, lossy): Markdown is the standard format for documentation repositories, README files, and static site generators. Extracting structured content from HTML into Markdown strips presentation markup and retains semantic text.

HTML → PLAIN TEXT (export, lossy): Stripping all tags yields raw text content for indexing, NLP processing, or accessibility-focused text extraction where markup is unnecessary overhead.
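The plain-text export is the simplest of the three to sketch. The version below uses Python's standard `html.parser`; `TextExtractor` and `html_to_text` are hypothetical names, and the set of block-level tags is a small assumption chosen for the example, not a complete list.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Keeps text nodes, skips <script>/<style>, and spaces out block elements."""

    SKIPPED = {"script", "style"}
    BLOCK_TAGS = {"p", "div", "li", "br", "tr", "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()  # character references are decoded by default
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1
        elif tag in self.BLOCK_TAGS:
            self.chunks.append(" ")

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1
        elif tag in self.BLOCK_TAGS:
            self.chunks.append(" ")

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def html_to_text(markup: str) -> str:
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    # Collapse runs of whitespace left behind by removed markup.
    return " ".join("".join(parser.chunks).split())


DOC = "<h1>Hello, <em>World</em></h1><style>p{color:red}</style><p>Fish &amp; chips</p>"
print(html_to_text(DOC))
```

The conversion is lossy by design: headings, emphasis, and links all collapse into the same flat string.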

Risk level: HIGH

Attack Vectors

XSS (Cross-Site Scripting): malicious JavaScript injected via unsanitized user input into innerHTML, href, or event handler attributes
Script injection: inline <script> tags or javascript: URIs execute arbitrary code when the page loads
Iframe clickjacking: transparent iframes overlaid on legitimate UI elements trick users into clicking hidden actions
Form phishing: fake login forms embedded in HTML mimic trusted sites to harvest credentials
CSS data exfiltration: attribute selectors and @font-face requests can leak sensitive data character-by-character

Mitigation: FileDex processes HTML files locally in the browser with no external resource loading, no script execution, and no network requests. Content Security Policy headers block inline scripts and frame embedding.
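For authors generating HTML themselves, the first defense against the injection vectors listed above is escaping untrusted text before it reaches the markup. The sketch below uses Python's standard `html.escape`; the function name `render_comment` is hypothetical. Escaping of this kind covers text content and quoted attribute values only. It does not make `javascript:` URIs or event handler attributes safe, which need validation or removal instead.

```python
from html import escape


def render_comment(user_text: str) -> str:
    """Return a paragraph with user-supplied text escaped for an HTML text context."""
    # quote=True also escapes " and ', so the result is safe inside quoted attributes.
    return f"<p>{escape(user_text, quote=True)}</p>"


print(render_comment('<script>alert("x")</script>'))
```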

html5validator (tool): CLI wrapper around the Nu Html Checker for validating HTML5 documents
Prettier (tool): Opinionated code formatter supporting HTML, CSS, JS, and more
htmlparser2 (library): Fast and forgiving HTML/XML parser for Node.js with streaming support
Beautiful Soup (library): Python library for parsing HTML and extracting data from web pages
Cheerio (library): jQuery-like HTML manipulation library for Node.js server-side processing
WHATWG HTML Living Standard (spec): The single authoritative specification for the HTML language