.HTML HyperText Markup Language
.html

HTML defines web page structure using markup tags that browsers render visually. FileDex provides local HTML analysis and format reference directly in your browser — no file uploads, no server processing.

Markup Language, Text Format, UTF-8, W3C / WHATWG, 1991
By FileDex
Not convertible

Markup format. Conversion is not applicable.

Common questions

Can I open an HTML file without a web browser?

Yes. HTML files are plain text, so any text editor (VS Code, Notepad++, Sublime Text) opens them directly. You will see the raw markup tags instead of the rendered page. Terminal tools like `cat` or `less` also display HTML source.

What is the difference between .html and .htm file extensions?

They are functionally identical. The .htm extension dates back to MS-DOS and Windows 3.1, which enforced a three-character extension limit. Modern systems treat both extensions the same — servers return `text/html` for either one.
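
As a small illustration of the point above, the sketch below asks Python's standard `mimetypes` table what media type it associates with each extension. The helper name `media_type` and the file names are hypothetical, and a real web server consults its own configuration rather than this table.

```python
import mimetypes
from typing import Optional


def media_type(filename: str) -> Optional[str]:
    """Return the MIME type guessed from the file extension, or None."""
    guessed, _encoding = mimetypes.guess_type(filename)
    return guessed


# Hypothetical file names; only the extension matters for the lookup.
for name in ("index.html", "index.htm"):
    print(name, "->", media_type(name))
```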

Is HTML a programming language?

No. HTML is a markup language — it describes document structure and content but cannot perform logic, loops, or calculations. Programming languages like JavaScript add interactivity and computation to HTML pages.

How do I check if my HTML file is valid?

Use the W3C Markup Validation Service at validator.w3.org, or run html5validator from the command line. These tools check for unclosed tags, missing required attributes, and deprecated elements against the HTML Living Standard.

What makes .HTML special

What is an HTML file?

HTML (HyperText Markup Language) is the standard markup language for documents displayed in web browsers. It defines the structure and content of web pages using elements (tags) like headings, paragraphs, links, images, and forms. HTML was invented by Tim Berners-Lee in 1991 and is now maintained as a Living Standard by WHATWG, meaning it evolves continuously rather than in numbered releases.

Every page on the web is ultimately an HTML document. When you visit a URL, your browser receives an HTML file and renders it visually. CSS controls appearance, and JavaScript adds behavior — but HTML is the foundation that makes a document a webpage.

How to open HTML files

  • Any web browser (Chrome, Firefox, Edge, Safari) — Double-click to render as a web page
  • VS Code (Windows, macOS, Linux) — Code editing with live preview via extensions
  • Notepad++ (Windows) — Syntax-highlighted editing
  • Sublime Text (Windows, macOS, Linux) — Fast code editor

Technical specifications

Property         Value
Current Version  HTML5 (Living Standard)
Encoding         UTF-8 (recommended)
Type             Markup language
Standard         WHATWG Living Standard
MIME type        text/html
Related          CSS (styling), JavaScript (behavior)

Common use cases

  • Web pages: Every website is built on HTML
  • Email templates: HTML-formatted emails with rich formatting
  • Documentation: Technical docs, help files, and manuals
  • Web applications: Single-page applications (SPAs) use a single HTML shell
  • Progressive Web Apps (PWAs): Installable apps built on web technologies

HTML document structure

A minimal valid HTML5 document:

<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8">
  <meta name="viewport" content="width=device-width, initial-scale=1.0">
  <title>Page Title</title>
</head>
<body>
  <h1>Hello, World</h1>
  <p>This is a paragraph.</p>
</body>
</html>

The <!DOCTYPE html> declaration tells the browser to render in standards mode rather than quirks mode. The <head> contains metadata that is not rendered in the page body (title, character set, stylesheets), and <body> contains the visible content.
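
To make the three parts concrete, here is a minimal sketch that reads the doctype, the declared character set, and the title out of the document above using Python's standard `html.parser`. The class name `HeadSummary` and the helper `summarize` are hypothetical, and `html.parser` is a simple tokenizer, not the full parsing algorithm a browser runs.

```python
from html.parser import HTMLParser

SAMPLE = """<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8">
  <title>Page Title</title>
</head>
<body>
  <h1>Hello, World</h1>
</body>
</html>"""


class HeadSummary(HTMLParser):
    """Collect the doctype, declared charset, and title of a document."""

    def __init__(self):
        super().__init__()
        self.doctype = None
        self.charset = None
        self.title = ""
        self._in_title = False

    def handle_decl(self, decl):
        # Called for <!DOCTYPE ...>; decl is the text between "<!" and ">".
        self.doctype = decl

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            charset = dict(attrs).get("charset")
            if charset:
                self.charset = charset
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def summarize(markup: str) -> HeadSummary:
    """Parse markup and return the populated summary object."""
    parser = HeadSummary()
    parser.feed(markup)
    parser.close()
    return parser


summary = summarize(SAMPLE)
print(summary.doctype, summary.charset, summary.title)
```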

Semantic HTML

HTML5 introduced semantic elements that describe their content's meaning to both browsers and search engines:

  • <header>, <footer>, <main>, <nav>, <aside> — Page structure
  • <article>, <section> — Content grouping
  • <figure>, <figcaption> — Images with captions
  • <time datetime="2024-01-15"> — Machine-readable dates

Semantic HTML improves accessibility (screen readers understand the page structure) and SEO (Google better understands content hierarchy).
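
The sketch below shows one way a program could pick up the same signals: it lists the semantic elements found in a fragment and collects the machine-readable dates from <time> elements. The class name `SemanticOutline`, the helper `outline`, and the sample fragment are hypothetical; real screen readers and crawlers use far richer models.

```python
from html.parser import HTMLParser

# Semantic elements named in the list above.
SEMANTIC_TAGS = {
    "header", "footer", "main", "nav", "aside",
    "article", "section", "figure", "figcaption", "time",
}


class SemanticOutline(HTMLParser):
    """Record semantic elements in document order plus <time> dates."""

    def __init__(self):
        super().__init__()
        self.found = []   # tag names, in the order they open
        self.dates = []   # values of datetime attributes on <time>

    def handle_starttag(self, tag, attrs):
        if tag in SEMANTIC_TAGS:
            self.found.append(tag)
        if tag == "time":
            value = dict(attrs).get("datetime")
            if value:
                self.dates.append(value)


def outline(markup: str) -> SemanticOutline:
    """Parse markup and return the populated outline object."""
    parser = SemanticOutline()
    parser.feed(markup)
    parser.close()
    return parser


# Hypothetical fragment used only for illustration.
FRAGMENT = (
    "<main><article><header><h1>Release notes</h1></header>"
    '<p>Published <time datetime="2024-01-15">15 January</time>.</p>'
    "</article></main>"
)
result = outline(FRAGMENT)
print(result.found, result.dates)
```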

HTML and SEO

Search engines read HTML directly. Key elements that affect how a page is indexed and displayed:

  • <title>: Shown in search result titles
  • <meta name="description">: Search snippet text
  • <h1> through <h6> heading hierarchy: Signals content structure
  • alt attributes on images: Enables image indexing
  • <link rel="canonical">: Identifies the preferred URL when the same content is reachable at several addresses
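
A rough sketch of how a crawler-like script might read these elements is shown below. The class name `SeoTags`, the helper `read_seo_tags`, and the example URL are assumptions made for illustration; this is not how any particular search engine works.

```python
from html.parser import HTMLParser

HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class SeoTags(HTMLParser):
    """Collect the description, canonical URL, and heading sequence."""

    def __init__(self):
        super().__init__()
        self.description = None
        self.canonical = None
        self.headings = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "meta" and (a.get("name") or "").lower() == "description":
            self.description = a.get("content")
        elif tag == "link" and "canonical" in (a.get("rel") or "").lower().split():
            # rel is a space-separated token list, so split before matching.
            self.canonical = a.get("href")
        elif tag in HEADING_TAGS:
            self.headings.append(tag)


def read_seo_tags(markup: str) -> SeoTags:
    """Parse markup and return the populated SeoTags object."""
    parser = SeoTags()
    parser.feed(markup)
    parser.close()
    return parser


# Hypothetical page; example.com is a placeholder domain.
PAGE = (
    '<head><meta name="description" content="A short summary.">'
    '<link rel="canonical" href="https://example.com/page"></head>'
    "<body><h1>Title</h1><h2>Part one</h2><h2>Part two</h2></body>"
)
tags = read_seo_tags(PAGE)
print(tags.description, tags.canonical, tags.headings)
```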

Accessibility

Well-written HTML is inherently accessible. Use alt text on all images, <label> elements for form inputs, logical heading order (h1 before h2), and ARIA attributes (role, aria-label) for custom interactive components. HTML that passes WCAG 2.1 AA guidelines works better for all users, including those using screen readers or keyboard-only navigation.
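
Two of the checks above are easy to automate. The following sketch flags images that have no alt attribute and heading levels that skip a step (for example an h3 directly after an h1). The class name `A11yAudit` and the helper `audit` are hypothetical, and passing these two checks is only a small part of WCAG conformance.

```python
from html.parser import HTMLParser


class A11yAudit(HTMLParser):
    """Flag images without alt text and skipped heading levels."""

    def __init__(self):
        super().__init__()
        self.images_missing_alt = []  # src values of offending images
        self.heading_skips = []       # (previous_level, new_level) pairs
        self._last_level = 0

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "img":
            # An empty alt="" is valid for decorative images, so only a
            # completely absent attribute is reported.
            if "alt" not in a:
                self.images_missing_alt.append(a.get("src"))
        elif len(tag) == 2 and tag[0] == "h" and tag[1] in "123456":
            level = int(tag[1])
            if level > self._last_level + 1:
                self.heading_skips.append((self._last_level, level))
            self._last_level = level


def audit(markup: str) -> A11yAudit:
    """Parse markup and return the populated audit object."""
    parser = A11yAudit()
    parser.feed(markup)
    parser.close()
    return parser


# Hypothetical fragment with one missing alt and one skipped level.
report = audit('<h1>Guide</h1><h3>Details</h3><img src="a.png"><img src="b.png" alt="">')
print(report.images_missing_alt, report.heading_skips)
```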

.HTML compared to alternatives

.HTML vs .XHTML
Criterion: Parser strictness
HTML uses a lenient error-recovery parser that renders malformed markup. XHTML requires strict XML well-formedness: a single unclosed tag causes a fatal parse error. HTML's tolerance made it the practical winner for web authoring.
Winner: HTML

.HTML vs .MARKDOWN
Criterion: Authoring speed
Markdown syntax is faster to write for text-heavy content (headings, lists, links) but cannot express interactive elements, forms, or complex layouts that HTML handles natively.
Winner: MARKDOWN

.HTML vs .PDF
Criterion: Editability
HTML source is plain text editable in any text editor. PDF is a binary format requiring specialized tools to modify content, making HTML the better choice for living documents.
Winner: HTML
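
The parser-strictness difference in the HTML vs XHTML comparison can be observed with two standard-library parsers. In this sketch `html.parser` stands in for lenient HTML handling and `xml.etree.ElementTree` for strict XML handling; both are stand-ins chosen for illustration, not the parsers a browser uses, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# The <p> element is never closed.
MALFORMED = "<div><p>Unclosed paragraph</div>"


class TextCollector(HTMLParser):
    """Gather text content while ignoring tag balance."""

    def __init__(self):
        super().__init__()
        self.text = []

    def handle_data(self, data):
        self.text.append(data)


def text_via_html_parser(markup: str) -> str:
    """Lenient path: tokenizes the markup without raising on imbalance."""
    collector = TextCollector()
    collector.feed(markup)
    collector.close()
    return "".join(collector.text)


def is_well_formed_xml(markup: str) -> bool:
    """Strict path: any well-formedness error is fatal."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


print(text_via_html_parser(MALFORMED))
print(is_well_formed_xml(MALFORMED))
```

The lenient path still yields the text, while the strict path rejects the whole input, which mirrors the trade-off described in the comparison.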

Technical reference

MIME Type: text/html
Developer: World Wide Web Consortium (W3C) / WHATWG
Year Introduced: 1991
Open Standard: Yes

Binary Structure

HTML is a plain-text format encoded in UTF-8 (recommended by the spec, though legacy pages may use ISO-8859-1 or Windows-1252). Files have no binary magic bytes. The document typically begins with `<!DOCTYPE html>` followed by the `<html>` root element. A UTF-8 BOM (EF BB BF) is permitted but often omitted in practice: browsers handle it, but it can break PHP files that must send headers before any output and shell scripts that concatenate HTML. Line endings are normalized by parsers: CR, LF, and CRLF are all treated as a single line break.
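
The byte-level details above can be checked with a few lines of code. This is a minimal sketch that assumes the input is UTF-8; the function name `describe_bytes` and the returned keys are hypothetical, and the doctype test is a simple prefix match, not a full sniffing algorithm.

```python
import codecs


def describe_bytes(raw: bytes) -> dict:
    """Report BOM presence, doctype prefix, and normalized line breaks."""
    has_bom = raw.startswith(codecs.BOM_UTF8)  # EF BB BF
    # "utf-8-sig" decodes UTF-8 and drops a leading BOM if one is present.
    text = raw.decode("utf-8-sig")
    # Treat CRLF and lone CR as a single LF, as HTML parsers do.
    normalized = text.replace("\r\n", "\n").replace("\r", "\n")
    return {
        "has_bom": has_bom,
        "starts_with_doctype": normalized.lstrip().lower().startswith("<!doctype html"),
        "line_breaks": normalized.count("\n"),
    }


# Mixed line endings and a BOM, purely for demonstration.
sample = b"\xef\xbb\xbf<!DOCTYPE html>\r\n<html>\r</html>\n"
print(describe_bytes(sample))
```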

  • 1991: Tim Berners-Lee publishes the first HTML document at CERN, defining 18 elements
  • 1995: HTML 2.0 published as RFC 1866, the first formal specification with forms support
  • 1997: HTML 3.2 (W3C Recommendation) adds tables, applets, and text flow around images
  • 1999: HTML 4.01 introduces CSS separation, accessibility attributes, and scripting framework
  • 2000: XHTML 1.0 reformulates HTML 4 as strict XML, requiring well-formed documents
  • 2008: W3C publishes the first HTML5 working draft, based on WHATWG's work, introducing canvas, video, audio, and semantic elements
  • 2014: W3C publishes HTML5 as a Recommendation
  • 2019: W3C and WHATWG agree on a single HTML Living Standard maintained by WHATWG

Validate HTML syntax with html5validator
html5validator --root ./public --also-check-css

Runs the Nu Html Checker against all HTML files in the ./public directory. The --also-check-css flag validates embedded CSS. Returns non-zero exit code on validation errors, making it suitable for CI pipelines.

Convert HTML to PDF with wkhtmltopdf
wkhtmltopdf --enable-local-file-access --page-size A4 input.html output.pdf

Renders the HTML file using a WebKit engine and outputs a paginated PDF. --enable-local-file-access permits loading local CSS and image assets. --page-size sets the output to A4 dimensions.

Minify HTML with html-minifier-terser
npx html-minifier-terser --collapse-whitespace --remove-comments --minify-css true --minify-js true -o output.html input.html

Removes comments, collapses whitespace, and minifies inline CSS and JS in one pass. Reduces file size for production deployment, usually without altering rendered output (whitespace collapsing can change layout in edge cases, so compare the result visually).

  • HTML to PDF (render, variable): Converting HTML to PDF produces a portable, print-ready snapshot of a web page. PDF output preserves layout fidelity for archival, legal documentation, or offline distribution where a browser is unavailable.
  • HTML to MARKDOWN (export, lossy): Markdown is the standard format for documentation repositories, README files, and static site generators. Extracting structured content from HTML into Markdown strips presentation markup and retains semantic text.
  • HTML to PLAIN TEXT (export, lossy): Stripping all tags yields raw text content for indexing, NLP processing, or accessibility-focused text extraction where markup is unnecessary overhead.
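
The plain-text export described above can be sketched with the standard library alone. The version below drops tags, skips the contents of <script> and <style>, and collapses whitespace; the names `TextExtractor` and `html_to_text` and the set of block tags are assumptions for illustration, and production extractors handle many more cases.

```python
from html.parser import HTMLParser

# Elements whose end should separate words in the output (a partial list).
BLOCK_TAGS = {"p", "div", "li", "h1", "h2", "h3", "h4", "h5", "h6", "tr"}
SKIPPED_TAGS = {"script", "style"}


class TextExtractor(HTMLParser):
    """Keep text nodes, ignoring markup and script/style contents."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1
        elif tag == "br":
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth:
            self._skip_depth -= 1
        elif tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def html_to_text(markup: str) -> str:
    """Return the text content with runs of whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    return " ".join("".join(parser.parts).split())


print(html_to_text("<h1>Hello, World</h1><style>p{color:red}</style><p>This is a <b>paragraph</b>.</p>"))
```

The conversion is lossy by design: headings, emphasis, and links all flatten into the same undifferentiated text.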
Security risk: HIGH

Attack Vectors

  • XSS (Cross-Site Scripting): malicious JavaScript injected via unsanitized user input into innerHTML, href, or event handler attributes
  • Script injection: inline <script> tags or javascript: URIs execute arbitrary code when the page loads
  • Iframe clickjacking: transparent iframes overlaid on legitimate UI elements trick users into clicking hidden actions
  • Form phishing: fake login forms embedded in HTML mimic trusted sites to harvest credentials
  • CSS data exfiltration: attribute selectors and @font-face requests can leak sensitive data character-by-character

Mitigation: FileDex processes HTML files locally in the browser with no external resource loading, no script execution, and no network requests. Content Security Policy headers block inline scripts and frame embedding.
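
For the XSS vector listed above, the usual first line of defense is escaping untrusted text before it is placed into markup. The sketch below uses Python's standard `html.escape`; the function name `render_comment` and the `comment` class are hypothetical. Escaping text content does not by itself make `href` values or event handler attributes safe, so those need separate validation.

```python
import html


def render_comment(user_text: str) -> str:
    """Wrap untrusted text in a paragraph with HTML metacharacters escaped."""
    # quote=True also escapes double and single quotes, which matters if the
    # value is ever placed inside an attribute.
    safe = html.escape(user_text, quote=True)
    return '<p class="comment">' + safe + "</p>"


# A typical injection attempt is rendered as inert text.
print(render_comment("<script>alert(1)</script>"))
```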

  • html5validator (tool): CLI wrapper around the Nu Html Checker for validating HTML5 documents
  • Prettier (tool): Opinionated code formatter supporting HTML, CSS, JS, and more
  • htmlparser2 (library): Fast and forgiving HTML/XML parser for Node.js with streaming support
  • Beautiful Soup (library): Python library for parsing HTML and extracting data from web pages
  • Cheerio (library): jQuery-like HTML manipulation library for Node.js server-side processing
  • WHATWG HTML Living Standard (spec): The single authoritative specification for the HTML language