HyperText Markup Language
HTML defines the structure of web pages using markup tags that browsers render visually. Standardized first through the IETF and then the W3C in the 1990s, and now maintained by WHATWG as a living standard, HTML no longer has a version number: the spec updates continuously as browsers ship new features.
Conversion not yet available. HTML to PDF rendering requires a browser engine — a feature planned for a future update.
Common questions
What is an HTML file and what is it used for?
An HTML file is a plain-text document containing HyperText Markup Language tags that define the structure of a web page. Browsers read these tags and render the visual page you see — headings, paragraphs, images, links, and forms. Every website on the internet is built on HTML. The file is editable in any text editor and viewable in any web browser.
Why is it called index.html?
Web servers return a default document when a URL points to a directory. The convention of naming that file index.html originated with the NCSA HTTPd server in 1993, the first widely deployed web server. Apache, Nginx, Vercel, and virtually every hosting platform still follow this convention today.
What is the difference between .html and .htm?
They are functionally identical. The .htm extension dates back to MS-DOS and early Windows, which enforced a three-character extension limit (8.3 filenames). When long filenames became standard with Windows 95, .html took over as the preferred extension. Web servers return the same text/html MIME type for both.
Is HTML a programming language?
No. HTML is a markup language that describes document structure and content using tags. It cannot perform calculations, execute logic, or control program flow. Programming languages like JavaScript, Python, and Java handle computation. HTML provides the structural skeleton that programming languages and CSS bring to life.
How do I view the HTML source of a web page?
Right-click anywhere on the page and select View Page Source, or press Ctrl+U on Windows and Linux or Cmd+Option+U on macOS. For the live DOM including JavaScript-generated content, press F12 to open browser developer tools and inspect the Elements panel.
What is the difference between HTML and HTML5?
HTML5 was a W3C Recommendation published in 2014 that added semantic elements, native audio and video, and canvas drawing. Since 2019, the sole specification is the WHATWG HTML Living Standard, which has no version number and is updated continuously. The term HTML5 now informally means modern HTML.
Can I open an HTML file without a browser?
Yes. HTML files are plain text, so any text editor opens them directly — VS Code, Notepad++, Sublime Text, vim, or even Windows Notepad. You will see the raw markup tags instead of the rendered page. Terminal commands can also display HTML source.
How do I create an HTML file?
Open any text editor, type the basic structure starting with the DOCTYPE declaration, html, head, and body tags, then save the file with a .html extension. Open the saved file in a browser to see the rendered result. No special software or compiler is required.
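The same steps can also be scripted. The following is a minimal sketch in Python, assuming a hypothetical helper named write_page and an illustrative file name; the markup in the string is what you would otherwise type by hand in a text editor.

```python
from pathlib import Path

# Minimal document: DOCTYPE first, then <html> containing <head> and <body>.
MINIMAL_PAGE = """<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="utf-8">
  <title>Hello</title>
</head>
<body>
  <h1>Hello, world</h1>
  <p>This page was written as plain text.</p>
</body>
</html>
"""


def write_page(path, markup=MINIMAL_PAGE):
    """Save the markup as UTF-8 text and return the resulting Path.

    write_page is a hypothetical helper name used only for this example.
    """
    target = Path(path)
    target.write_text(markup, encoding="utf-8")
    return target
```

Calling write_page("hello.html") and opening the resulting file in a browser should show the heading and the paragraph.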
What makes .HTML special
The language of the web
Beneath every website, email newsletter, and web application lies an HTML document. HyperText Markup Language is not a programming language — it cannot perform calculations, loop over data, or make decisions. It is a markup language: a system of tags that describes what content is and how it relates to other content. A browser reads those tags and constructs a visual page. This distinction matters because it defines what HTML can and cannot do on its own, and why CSS and JavaScript exist as separate layers.
Origin: Berners-Lee and the 18 tags
Tim Berners-Lee wrote the first HTML description at CERN in 1991 as part of his proposal for a "WorldWideWeb" information management system. That initial document defined roughly 18 elements — headings (h1 through h6), paragraphs, lists, anchors (hyperlinks), and a handful of text-level tags. There was no formal specification, no standards body, and no version number. The first web browser and the first web server both ran on a NeXT workstation in Berners-Lee's office. HTML was designed to be simple enough that a physicist could mark up a research paper without specialized tools — a design philosophy that persists in the language's tolerance for malformed markup.
Document structure: DOCTYPE, head, body
A valid modern HTML document follows a required structure. The document begins with a DOCTYPE declaration — in current HTML, simply <!DOCTYPE html> — which instructs the browser to use standards mode rather than quirks mode. The root element is <html>, typically carrying a lang attribute for accessibility and SEO. Inside <html>, two children exist: <head> and <body>.
The <head> element contains metadata invisible to users: the document title (shown in browser tabs and search results), character encoding declaration (meta charset="utf-8"), references to external CSS and JavaScript files, Open Graph tags for social media previews, and favicon links. The <body> element contains everything the user sees and interacts with: text, images, forms, tables, and embedded media.
This head/body split is fundamental to how the web works. Search engine crawlers read <head> metadata to understand what a page is about before parsing the visible content. Browsers begin rendering <body> content as it arrives, even before the full document has downloaded — a behavior called progressive rendering that makes HTML uniquely suited for networked delivery.
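To make the head/body split concrete, here is a small sketch that reads head metadata the way a simple crawler might. It uses Python's standard html.parser module; the class name HeadReader and the function read_head are assumptions made for this example, and a real crawler does considerably more.

```python
from html.parser import HTMLParser


class HeadReader(HTMLParser):
    """Collect the <title> text and the declared charset from a document."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""
        self.charset = None

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True
        elif tag == "meta":
            # attrs is a list of (name, value) pairs with lowercased names.
            self.charset = dict(attrs).get("charset", self.charset)

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


def read_head(markup):
    """Return (title, charset) found in the markup."""
    reader = HeadReader()
    reader.feed(markup)
    reader.close()
    return reader.title.strip(), reader.charset
```

Text inside the body is ignored here on purpose: only the title and the charset declaration are collected.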
Encoding: why UTF-8 won
The WHATWG specification strongly recommends UTF-8 for all HTML documents. Early HTML files were typically encoded in ISO-8859-1 (Latin-1) or Windows-1252, which cover Western European characters but cannot represent Arabic, Chinese, Japanese, Korean, Cyrillic, or most other scripts. UTF-8 encodes the full Unicode range in a variable-width format compatible with ASCII.
Encoding is declared via a meta element in the head: <meta charset="utf-8">. A charset given in the HTTP Content-Type header takes precedence over the meta element; if neither is present, browsers fall back to heuristic detection. Mismatched encoding (a file saved as UTF-8 but declared as ISO-8859-1, or vice versa) produces mojibake: garbled characters where accented letters, Arabic script, or CJK ideographs should appear. This remains one of the most common HTML debugging issues.
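The mismatch is easy to reproduce outside a browser. This sketch encodes a string one way and decodes it another; the function name misdecode is an assumption for illustration, and a browser's real decoding pipeline is more involved.

```python
def misdecode(text, actual="utf-8", declared="iso-8859-1"):
    """Encode text as `actual`, then decode the bytes as `declared`."""
    return text.encode(actual).decode(declared, errors="replace")


# Saved as UTF-8 but read as ISO-8859-1: each accented letter becomes two characters.
print(misdecode("café"))  # cafÃ©

# Saved as ISO-8859-1 but read as UTF-8: the lone byte is invalid and is replaced.
print(misdecode("café", actual="iso-8859-1", declared="utf-8"))  # caf\ufffd (replacement character)
```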
Semantic HTML: meaning over appearance
The HTML5 era introduced semantic elements that describe their content's purpose rather than its appearance. Before HTML5, developers used generic <div> elements with CSS classes for everything. The semantic set includes <header>, <footer>, <main>, <nav>, <aside>, <article>, <section>, <figure>, <figcaption>, and <time>.
Semantic markup serves two audiences. Screen readers use element types to build a navigable page outline: a <nav> element is announced as navigation, an <article> as a self-contained composition. Search engines use heading hierarchy (h1 through h6) and semantic containers to understand content structure, which can help them interpret and index a page. Google's documentation recommends semantic HTML for SEO.
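As a rough illustration of how a tool might derive an outline from semantic markup, the sketch below records headings and semantic containers in document order. The class name OutlineReader is hypothetical, and this is far simpler than what a screen reader or search engine actually does.

```python
from html.parser import HTMLParser

HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}
CONTAINERS = {"header", "footer", "main", "nav", "aside", "article", "section"}


class OutlineReader(HTMLParser):
    """Record heading levels with their text, plus semantic containers seen."""

    def __init__(self):
        super().__init__()
        self.outline = []      # list of (level, heading text)
        self.containers = []   # semantic container tags in document order
        self._level = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag in HEADINGS:
            self._level = int(tag[1])
            self._text = []
        elif tag in CONTAINERS:
            self.containers.append(tag)

    def handle_data(self, data):
        if self._level is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag in HEADINGS and self._level is not None:
            self.outline.append((self._level, "".join(self._text).strip()))
            self._level = None
```

Feeding a page through OutlineReader and inspecting outline shows quickly whether heading levels skip or repeat in ways that would confuse assistive technology.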
The WHATWG Living Standard: no more version numbers
The HTML specification has had a turbulent governance history. From 1995 to 1999, the IETF and then W3C published numbered versions: HTML 2.0, 3.2, 4.0, and 4.01. In 2000, the W3C pivoted to XHTML — an XML-strict serialization of HTML that required well-formed documents and broke backward compatibility. When the W3C began work on XHTML 2.0, which would have been incompatible with existing web content, browser vendors objected.
In 2004, Apple, Mozilla, and Opera formed the Web Hypertext Application Technology Working Group (WHATWG) to develop a practical evolution of HTML 4. Their work became HTML5, which the W3C eventually adopted and published as a Recommendation in October 2014. But by 2017, the W3C's last numbered snapshot (HTML 5.2) was already diverging from the WHATWG's continuously updated specification.
On 28 May 2019, the W3C and WHATWG signed a memorandum of understanding: the WHATWG HTML Living Standard would be the single authoritative HTML specification going forward. W3C would no longer publish independent HTML versions. This means the term "HTML5" is technically a historical artifact — a frozen snapshot from 2014. The current specification has no version number and is updated continuously. When developers say "HTML5," they usually mean "modern HTML as defined by the Living Standard," but the distinction matters for standards compliance and specification references.
The index.html convention
Web servers serve a default document when a URL points to a directory rather than a specific file. The nearly universal convention is to look for a file named index.html. This practice originated with the NCSA HTTPd server in 1993, the first widely deployed web server software. When a user visits https://example.com/, the server returns the contents of /index.html from the document root.
The convention persists across Apache, Nginx, Cloudflare Pages, Netlify, Vercel, GitHub Pages, and virtually every hosting platform. Alternative default filenames (index.htm, default.html, default.asp) exist in some server configurations, but index.html remains dominant. Understanding this convention is essential for static site deployment and explains why the homepage of most websites lives in a file named index.html.
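The lookup can be sketched in a few lines. The function below is a simplified, assumed model of what a static file server does, not the logic of any particular server; the name resolve_request is hypothetical, and the sketch deliberately omits path-traversal checks that a real server needs.

```python
from pathlib import Path


def resolve_request(doc_root, url_path, index_name="index.html"):
    """Map a URL path to a file under doc_root.

    If the path points at a directory, fall back to the index file inside it.
    Simplified: no protection against '..' segments, no 404 handling.
    """
    candidate = Path(doc_root) / url_path.lstrip("/")
    if candidate.is_dir():
        candidate = candidate / index_name
    return candidate
```

With a document root containing index.html and docs/index.html, a request for / resolves to index.html and a request for /docs/ resolves to docs/index.html, while /about.html is returned unchanged.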
View-source: the web's open classroom
One of HTML's most distinctive cultural features is transparency. Any browser can display the raw HTML source of any web page — right-click, "View Page Source," or type view-source: before a URL. This openness was intentional. Berners-Lee designed the web as an open system where anyone could learn by reading existing pages and copying patterns. Entire generations of web developers learned HTML not from textbooks but from viewing the source of sites they admired. Browser developer tools (F12) extend this further, showing the live DOM, CSS rules, network requests, and JavaScript execution in real time.
Security: XSS and the trust boundary
HTML is a major attack surface for web security vulnerabilities. Cross-Site Scripting (XSS), a long-standing entry in the OWASP Top 10 (grouped under Injection since the 2021 edition), occurs when an attacker injects malicious HTML or JavaScript into a page viewed by other users. The injection point is typically unsanitized user input rendered into innerHTML, href attributes, or event handlers. A successful XSS attack can steal session cookies, redirect users to phishing pages, or modify page content.
Defenses include Content Security Policy (CSP) headers that restrict which scripts can execute, input sanitization libraries (DOMPurify, Bleach), and the HttpOnly cookie flag that prevents JavaScript from accessing authentication tokens. The <iframe> element introduces additional risks: clickjacking attacks overlay transparent frames on legitimate UI elements to trick users into unintended actions. The X-Frame-Options header and frame-ancestors CSP directive mitigate this vector.
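A small sketch of these defenses on the server side follows. Note that it shows output escaping with Python's html.escape, a simpler technique than the sanitization libraries named above, and that the function name, the header values, and the cookie string are illustrative assumptions rather than a recommended policy.

```python
from html import escape


def render_comment(user_text):
    """Return a paragraph with user text escaped for an HTML text context.

    escape() replaces &, <, >, and quotes with character references, so
    injected tags are displayed as text instead of being parsed as markup.
    """
    return f"<p>{escape(user_text, quote=True)}</p>"


# Hypothetical response headers, shown only to illustrate the mechanisms.
SECURITY_HEADERS = {
    # Restrict script sources and forbid framing by other sites.
    "Content-Security-Policy": "script-src 'self'; frame-ancestors 'none'",
    # Older header with a similar anti-framing effect.
    "X-Frame-Options": "DENY",
    # HttpOnly keeps the cookie out of reach of page JavaScript.
    "Set-Cookie": "session=example-token; HttpOnly",
}

print(render_comment("<script>alert(1)</script>"))
# <p>&lt;script&gt;alert(1)&lt;/script&gt;</p>
```

Escaping is only correct for the context it targets (here, element text); values placed into URLs, attributes, or scripts need context-specific handling.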
The .html versus .htm distinction
Both extensions are functionally identical. The three-character .htm variant exists because MS-DOS and Windows 3.x enforced an 8.3 filename limit — eight characters for the name, three for the extension. Files could not have a four-character extension like .html. When Windows 95 introduced long filename support, .html became the standard, but .htm persists in legacy systems, older Microsoft toolchains, and some server configurations. Web servers return the same text/html MIME type for both extensions.
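One way to check the mapping locally is Python's mimetypes module, which uses an extension table similar in spirit to what web servers consult. This is a stand-in for a server's configuration, not a statement about any specific server; the wrapper name media_type is an assumption.

```python
import mimetypes


def media_type(filename):
    """Return the media type guessed from the file extension, or None."""
    mime, _encoding = mimetypes.guess_type(filename)
    return mime


print(media_type("index.html"))
print(media_type("index.htm"))
```

Both calls are expected to report text/html on a standard installation.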
.HTML compared to alternatives
| Formats | Criteria | Winner |
|---|---|---|
| .HTML vs .XML | Parser error handling: HTML uses a lenient error-recovery parser that renders malformed markup gracefully; unclosed tags, missing attributes, and improper nesting are auto-corrected. XML requires strict well-formedness; a single error produces a fatal parse failure with no rendered output. | HTML wins |
| .HTML vs .PDF | Editability and reflow: HTML is plain text editable in any text editor, and content reflows automatically to fit any screen width. PDF is a fixed-layout binary format requiring specialized tools to modify, with no native responsive reflow capability. | HTML wins |
| .HTML vs .MARKDOWN | Expressiveness: HTML supports forms, tables with merged cells, embedded media, interactive elements, ARIA accessibility attributes, and arbitrary nesting. Markdown covers headings, lists, links, and basic formatting but requires inline HTML for anything beyond its limited syntax. | HTML wins |
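The difference in error handling from the first row can be observed with Python's standard library. This is a sketch under stated assumptions: html.parser is a tolerant tokenizer and does not implement the browser's full error-recovery tree construction, and the helper names are hypothetical.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Unclosed <li> and <p> elements: common in real-world HTML, invalid as XML.
MALFORMED = "<ul><li>one<li>two</ul><p>unclosed paragraph"


class TagLog(HTMLParser):
    """Record every start tag the tolerant parser encounters."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def html_start_tags(markup):
    log = TagLog()
    log.feed(markup)
    log.close()
    return log.tags


def xml_parses(markup):
    """Return True if the markup is well-formed XML, False otherwise."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


print(html_start_tags(MALFORMED))  # ['ul', 'li', 'li', 'p']
print(xml_parses(MALFORMED))       # False
```

The HTML parser keeps going and reports all four start tags, while the XML parser rejects the same input outright.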
Technical reference
- MIME Type: text/html
- Developer: World Wide Web Consortium (W3C) / WHATWG
- Year Introduced: 1993
- Open Standard: Yes
Binary Structure
HTML is a plain-text format with no binary structure, no magic bytes, and no fixed file header. Documents typically begin with the ASCII string `<!DOCTYPE html>` followed by the `<html>` root element. UTF-8 is the recommended encoding; a UTF-8 BOM (bytes EF BB BF) may precede the DOCTYPE but is discouraged by the WHATWG specification. Browsers identify HTML through content sniffing — the WHATWG MIME Sniffing Standard defines the algorithm — while PRONOM identifies HTML generically by the presence of `<html>` or `<body>` tags within the first bytes. Line endings (CR, LF, or CRLF) are normalized by parsers and have no semantic effect. Elements form a tree structure (the DOM) when parsed, with `<html>` as the root, `<head>` and `<body>` as its two children, and all visible content nested inside `<body>`.
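Because there are no magic bytes, detection has to look at the text itself. The sketch below is a deliberately simplified heuristic, not the WHATWG MIME Sniffing algorithm and not the PRONOM signature; the function name looks_like_html and the 512-byte window are assumptions made for illustration.

```python
UTF8_BOM = b"\xef\xbb\xbf"


def looks_like_html(data, window=512):
    """Guess whether a byte string is HTML by inspecting its first bytes.

    Simplified heuristic: skip an optional UTF-8 BOM and leading whitespace,
    then look for a DOCTYPE, <html, or <body marker (case-insensitive).
    """
    head = data[:window]
    if head.startswith(UTF8_BOM):
        head = head[len(UTF8_BOM):]
    head = head.lstrip().lower()
    if head.startswith(b"<!doctype html"):
        return True
    return b"<html" in head or b"<body" in head
```

A check like this produces false positives on any text that happens to mention these tags early, which is one reason real sniffing algorithms are more elaborate.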
Attack Vectors
- Cross-Site Scripting (XSS)
- Iframe clickjacking
- HTML injection / phishing forms
- Script injection via event handlers
Mitigation: FileDex processes HTML files entirely within the local browser environment — no file uploads to servers, no external resource loading, and no script execution from analyzed files. Content Security Policy headers restrict inline script execution.
- Specification: WHATWG HTML Living Standard
- Specification: W3C HTML 5.2 Recommendation (frozen snapshot, December 2017)
- Registry: IANA Media Type text/html (RFC 2854)
- Registry: Library of Congress Format Description, HyperText Markup Language Format Family (fdd000475)
- History: W3C/WHATWG Memorandum of Understanding (28 May 2019)
- History: HTML (Wikipedia)