.XLSX Microsoft Excel Spreadsheet (Open XML)

Every .xlsx is secretly a ZIP archive of XML defined by ECMA-376 — rename one to .zip and your file explorer unzips it. Inside: a Shared Strings Table that stores each repeated word once, a 1900 leap-year bug Excel has never fixed, and cells that can execute commands when opened.

Document structure
PK header
[Content_Types].xml
xl/workbook.xml
sharedStrings.xml

XLSX conversion is not yet available in FileDex. For now, use the CLI commands in the Developer Door to convert between spreadsheet formats with LibreOffice or openpyxl.

Common questions

What is an XLSX file?

Every XLSX you open is a compressed archive. Rename one to .zip and your file explorer will show the XML pieces inside — worksheets, cells, formulas, and formatting. Excel has used this format by default since 2007, replacing the older .xls format. It is now an international standard.

What's the difference between XLSX and XLS?

XLS was the legacy binary format used before Excel 2007, closed and undocumented until 2008. It was limited to 65,536 rows and 256 columns. XLSX is the open successor, supporting 1,048,576 rows and 16,384 columns. XLSX files are usually smaller because the XML is ZIP-compressed and repeated text is stored once. XLSX cannot contain VBA macros; macro-enabled workbooks use .xlsm instead.

How do I open an XLSX file without Microsoft Excel?

LibreOffice Calc opens XLSX natively on Windows, macOS, and Linux — free and open source. Google Sheets imports XLSX in any browser. Apple Numbers reads XLSX on Mac and iOS. To browse the raw contents without a spreadsheet app, rename the file to .zip and extract it — you will see workbook.xml, sharedStrings.xml, and the worksheet files inside.

What's the maximum number of rows in an Excel spreadsheet?

XLSX supports 1,048,576 rows and 16,384 columns (A through XFD) per worksheet — exactly 2²⁰ rows and 2¹⁴ columns. These are Excel implementation limits, not limits defined in the ECMA-376 specification. The previous XLS format allowed only 65,536 rows and 256 columns. The jump to over a million rows came with Excel 2007, when XLSX became the default format.

What's the difference between XLSX and CSV?

CSV stores data as plain text with no types. Opening a CSV in Excel triggers guessing that destroys leading zeros, converts gene symbols to dates, and shifts long account numbers to scientific notation. XLSX declares types directly — text stays text, numbers stay numbers. Use XLSX when integrity matters; CSV when the receiver handles types.

Why does Excel change my data when I import a CSV?

CSV has no types — every cell is plain text. When Excel imports a CSV, it guesses: anything numeric-looking becomes a number, anything date-looking becomes a date. Leading zeros disappear, long account numbers shift to scientific notation, gene symbols like MARCH1 become dates. Format columns as text before opening, or build the XLSX directly so Excel never guesses.

What makes .XLSX special

40-year-old bug
February 29, 1900 still exists in Excel
Excel stores dates as numbers counting up from January 1, 1900. Number 60 lands on a date that never existed — 1900 was not a leap year. The bug came from Lotus 1-2-3 in 1983, and every Excel version has kept it.
Hidden deduplication
Every text value stored exactly once
A column with 'London' in 10,000 cells stores the word one time inside the file and points to it 10,000 times. This is what lets XLSX keep text and numbers separate — CSV cannot.
Excel renamed human genes
Dozens of gene symbols renamed in 2020 — because Excel kept turning them into dates
Gene names like MARCH1 and SEPT1 looked like dates to Excel. Every time researchers opened their data in a spreadsheet, Excel auto-changed them. So in 2020, the naming committee renamed the genes: MARCH1 became MARCHF1, SEPT1 became SEPTIN1. Microsoft didn't fix Excel. The committee changed the genes instead.
Contested world standard
3,522 complaints filed against the XLSX standard
When Microsoft pushed XLSX through the international standards process in 2008, countries around the world filed 3,522 formal complaints — the most any technology standard has ever received. XLSX became a standard anyway.

Rename any .xlsx to .zip and unzip it. Inside you will find an XML file called sharedStrings.xml that stores every unique word in the workbook exactly once — no matter how many cells reference it. The same format carries a date bug from 1983 that no Excel version has fixed, and a grid that stops exactly at row 1,048,576. This is what a modern spreadsheet looks like from the inside.

10,000 Cells of "London". Barely Any Extra Bytes. How?

Open an XLSX file containing 10,000 rows of "London" and the file is barely larger than one containing the word once. The Shared Strings Table is what makes this work.

Every unique text value in a workbook is stored exactly once in xl/sharedStrings.xml and referenced by a zero-based index. A cell containing text uses <c r="A1" t="s"><v>0</v></c> — the t="s" attribute says "type is string, value is SST index 0." Numbers live differently: <c r="B1"><v>42</v></c> has no type attribute, and the value is stored inline as a literal number.
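The lookup described above can be sketched in a few lines of Python. This is a minimal illustration, not a full reader: the two XML snippets are hypothetical miniature versions of xl/sharedStrings.xml and a worksheet part, and the helper name read_cells is an assumption made for this example. Rich-text entries and the other cell types (booleans, errors, inline strings) are left out.

```python
import xml.etree.ElementTree as ET

NS = {"m": "http://schemas.openxmlformats.org/spreadsheetml/2006/main"}

# Hypothetical miniature parts, shaped like xl/sharedStrings.xml and a worksheet.
SST_XML = """<sst xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main" count="3" uniqueCount="2">
  <si><t>London</t></si>
  <si><t>Paris</t></si>
</sst>"""

SHEET_XML = """<worksheet xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main">
  <sheetData>
    <row r="1">
      <c r="A1" t="s"><v>0</v></c>
      <c r="B1"><v>42</v></c>
    </row>
    <row r="2">
      <c r="A2" t="s"><v>0</v></c>
      <c r="B2" t="s"><v>1</v></c>
    </row>
  </sheetData>
</worksheet>"""


def read_cells(sheet_xml: str, sst_xml: str) -> dict:
    """Map cell references to values, resolving t="s" cells through the SST."""
    # Only plain <si><t> entries are handled; rich-text runs are out of scope here.
    strings = [
        si.findtext("m:t", default="", namespaces=NS)
        for si in ET.fromstring(sst_xml).findall("m:si", NS)
    ]
    cells = {}
    for c in ET.fromstring(sheet_xml).iterfind(".//m:c", NS):
        raw = c.findtext("m:v", namespaces=NS)
        if raw is None:
            continue  # empty cell
        cell_type = c.get("t")
        if cell_type == "s":
            cells[c.get("r")] = strings[int(raw)]  # zero-based SST index
        elif cell_type is None or cell_type == "n":
            cells[c.get("r")] = float(raw)  # inline literal number
        else:
            cells[c.get("r")] = raw  # other types left untouched in this sketch
    return cells


cells = read_cells(SHEET_XML, SST_XML)
print(cells)
```

Both A1 and A2 point at index 0, so "London" appears once in the string table however many cells use it.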

This duality is the difference between XLSX and CSV. CSV has no types — every cell is plain text, and spreadsheet applications have to guess whether 00123 is a leading-zero string or the number 123, whether 1/5 is a date or a fraction. XLSX removes the guessing. A cell's type is declared, its value is either stored inline or referenced by index, and the Shared Strings Table deduplicates text on the way in. The file-size payoff is incidental. The design payoff is that the format knows what each cell is before anyone reads it.

That is how XLSX keeps text and numbers apart. Dates are where the system gives up — they live as numbers, and they carry a bug older than XLSX itself.

Excel Remembers February 29, 1900. That Day Never Existed. Why?

February 29, 1900 does not exist. Under the Gregorian calendar, century years are leap years only when divisible by 400 — 1900 was not. But Excel stores date serial number 60 as February 29, 1900, and every date from March 1, 1900 onward is off by one from the true count.

This was not a bug that slipped through. Lotus 1-2-3 had it in 1983. When Microsoft built Excel to read Lotus files, the leap-year error was deliberately preserved for compatibility, and every Excel version since has kept it. A spreadsheet saved in Lotus on an IBM PC in 1984 and opened in Excel today produces the same date arithmetic — because Excel never fixed the ghost leap day.

Dates are numbers with a costume. Serial 1 is January 1, 1900. Serial 44297 is April 11, 2021. The number is stored in the cell; the formatting tells the renderer to display it as a date. Remove the format and the number reappears. Apply a date format to a cell holding the number 60 and it displays February 29, 1900: the ghost leap day, waiting. An optional 1904 date system, originally shipped on Mac, starts from January 1, 1904 and skips the bug. It is declared in workbook.xml via <workbookPr date1904="1"/>.
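A small Python sketch shows how a reader might turn a serial number into a real calendar date while stepping around the phantom day. The function name serial_to_date is hypothetical, and the 1904 branch assumes that serial 0 maps to January 1, 1904. Time-of-day fractions are ignored.

```python
from datetime import date, timedelta


def serial_to_date(serial: int, date1904: bool = False) -> date:
    """Convert a whole-number Excel date serial to a real calendar date."""
    if date1904:
        # Assumption: in the 1904 system, serial 0 is January 1, 1904.
        return date(1904, 1, 1) + timedelta(days=serial)
    if serial < 1:
        raise ValueError("the 1900 system starts at serial 1")
    if serial == 60:
        # Excel shows February 29, 1900 here; no such date exists.
        raise ValueError("serial 60 is the nonexistent February 29, 1900")
    # Before the phantom day, serial 1 is 1900-01-01. After it, every
    # serial runs one ahead of the true day count, so the epoch shifts back.
    epoch = date(1899, 12, 31) if serial < 60 else date(1899, 12, 30)
    return epoch + timedelta(days=serial)


print(serial_to_date(1))       # 1900-01-01
print(serial_to_date(59))      # 1900-02-28
print(serial_to_date(61))      # 1900-03-01
print(serial_to_date(43831))   # 2020-01-01
```

Serials 59 and 61 land on consecutive real days, which is the whole bug: serial 60 occupies a slot the calendar never had.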

Dates inherited one constraint from 1983. The grid around them carries another — a hard wall you probably have never reached.

The Grid Ends at Row 1,048,576. Why That Number?

Your spreadsheet stops at row 1,048,576. Column XFD is the wall. Beyond that, nothing exists — the format does not address cells outside the grid.

The limits are 2²⁰ rows × 2¹⁴ columns, introduced in Excel 2007 when XLSX replaced the binary XLS format. XLS topped out at 65,536 rows and 256 columns (column IV). The jump was 16× rows and 64× columns. These limits are not defined in ECMA-376 — they are an Excel implementation constraint that became the de facto standard. The format itself is unbounded.
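The column letters are a bijective base-26 numbering, which makes the two walls easy to check. The sketch below uses hypothetical helper names (column_number, column_letters) and only illustrates the arithmetic.

```python
def column_number(letters: str) -> int:
    """Convert column letters such as 'XFD' to a 1-based column number."""
    n = 0
    for ch in letters.upper():
        if not "A" <= ch <= "Z":
            raise ValueError(f"not a column letter: {ch!r}")
        n = n * 26 + (ord(ch) - ord("A") + 1)
    return n


def column_letters(n: int) -> str:
    """Convert a 1-based column number back to letters."""
    if n < 1:
        raise ValueError("column numbers start at 1")
    letters = ""
    while n > 0:
        n, rem = divmod(n - 1, 26)
        letters = chr(ord("A") + rem) + letters
    return letters


print(column_number("IV"), 2**8)      # 256 256
print(column_number("XFD"), 2**14)    # 16384 16384
print(column_letters(16384))          # XFD
print(2**20)                          # 1048576
```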

Formulas live inside <f> elements, stored as plain text: <c r="D2"><f>SUM(B2:B100)</f><v>1500</v></c>. The <v> element caches the last calculated value. When an application opens the workbook, it renders the cached value immediately without recalculating, then recalculates only on edit or refresh. This split matters: a read-only tool does not need a calculation engine. LibreOffice Calc, openpyxl, and a JavaScript SheetJS parser can all display the results of a million-row sheet without evaluating a single formula.
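To see why a read-only tool needs no calculation engine, here is a minimal Python sketch that pulls both the formula text and the cached value out of a row. The row XML is a hypothetical fragment modeled on the example above, and formulas_and_cached is an illustrative name.

```python
import xml.etree.ElementTree as ET

NS = {"m": "http://schemas.openxmlformats.org/spreadsheetml/2006/main"}

# Hypothetical worksheet fragment: one plain number, one formula cell.
ROW_XML = """<row xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main" r="2">
  <c r="B2"><v>1500</v></c>
  <c r="D2"><f>SUM(B2:B100)</f><v>1500</v></c>
</row>"""


def formulas_and_cached(row_xml: str) -> dict:
    """Return {cell reference: (formula text or None, cached value text)}."""
    result = {}
    for c in ET.fromstring(row_xml).iterfind("m:c", NS):
        formula = c.findtext("m:f", namespaces=NS)   # None when the cell has no <f>
        cached = c.findtext("m:v", namespaces=NS)    # last stored value, as text
        result[c.get("r")] = (formula, cached)
    return result


print(formulas_and_cached(ROW_XML))
```

A viewer can display the second element of each pair and never touch the first.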

Shared formulas compress repetition. When a column has the same formula with shifting references — =A2*B2 in row 2, =A3*B3 in row 3, and so on — Excel stores one copy with a ref attribute indicating the range it covers. The format knows copy-drag columns are common and optimizes for them at the XML level.
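A reader that wants the per-row formula back has to re-shift the references itself. The Python sketch below is a deliberately narrow illustration: it handles only a single-column ref, shifts rows but not columns, and uses a simple regular expression that does not understand sheet names or quoted strings. The names shift_rows and expand_shared are hypothetical.

```python
import re

# A1-style reference with optional $ markers. The guards keep function
# names such as LOG10( from being mistaken for a cell reference.
CELL_REF = re.compile(
    r"(?<![A-Za-z0-9_])(\$?)([A-Z]{1,3})(\$?)(\d+)(?![A-Za-z0-9_(])"
)


def shift_rows(formula: str, row_offset: int) -> str:
    """Shift every relative row number in the formula by row_offset."""
    def repl(m: re.Match) -> str:
        col_abs, col, row_abs, row = m.groups()
        if row_abs:
            return m.group(0)  # $-anchored rows do not move
        return f"{col_abs}{col}{int(row) + row_offset}"
    return CELL_REF.sub(repl, formula)


def expand_shared(master_formula: str, ref: str) -> dict:
    """Expand a shared formula over a single-column ref such as 'C2:C5'."""
    m = re.fullmatch(r"([A-Z]+)(\d+):([A-Z]+)(\d+)", ref)
    if m is None or m.group(1) != m.group(3):
        raise ValueError("this sketch supports single-column refs only")
    col, first, last = m.group(1), int(m.group(2)), int(m.group(4))
    return {
        f"{col}{row}": shift_rows(master_formula, row - first)
        for row in range(first, last + 1)
    }


print(expand_shared("A2*B2", "C2:C5"))
```

In the file, only the first cell of the range carries the formula text and the ref attribute; the rest carry a pointer to it, which is why the expansion step falls to the reader.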

A million rows and sixteen thousand columns is room for almost any dataset on a single worksheet. But what happens to the data on the way in is where most spreadsheet stories actually break.

Excel Knows Text From Numbers. So Why Did It Rename Human Genes?

In August 2020, the HUGO Gene Nomenclature Committee renamed dozens of human gene symbols. MARCH1 became MARCHF1. SEPT1 became SEPTIN1. DEC1 became DELEC1. HGNC's fix was to rename the genes. Microsoft did not patch Excel. Open genenames.org and search for MARCHF1 — the record still lists MARCH1 as a previous symbol.

The problem had existed for two decades. Gene symbols like MARCH1 and SEPT1 were being auto-converted to dates ("1-Mar", "1-Sep") every time a researcher imported a CSV of gene expression data into Excel. Excel's auto-date detection runs before a human sees the cell, and once the conversion happens the original symbol is gone — there is no undo for a file that was saved. The errors were pervasive enough across published genomics papers that HGNC updated its nomenclature rules to make date collisions structurally impossible.

The root cause connects directly to the Shared Strings Table. XLSX protects types once data is inside the file: a cell marked t="s" stays text forever. But CSV has no types. When Excel imports a CSV, it guesses. The guessing is what destroys the data. The same pattern strips leading zeros from phone numbers and ZIP codes (00210 becomes 210), shifts long account identifiers to scientific notation, and can silently alter anything that looks numeric but is meant as a string.

The defense is to format the column as text before the CSV is opened, or to build the XLSX directly and never let Excel guess.
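Building the XLSX directly can be done with nothing but the Python standard library. The sketch below is one possible approach, not a reference writer: it stores every CSV field as an inline string (t="inlineStr") instead of using the Shared Strings Table, writes a single sheet with no styles, and has not been checked against every spreadsheet application. The function name csv_to_text_xlsx is hypothetical.

```python
import csv
import io
import zipfile
from xml.sax.saxutils import escape

DECL = '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
MAIN = "http://schemas.openxmlformats.org/spreadsheetml/2006/main"
PKG_REL = "http://schemas.openxmlformats.org/package/2006/relationships"
DOC_REL = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"
CT = "application/vnd.openxmlformats-officedocument.spreadsheetml"

# Fixed package parts for a one-sheet workbook.
PARTS = {
    "[Content_Types].xml": (
        '<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">'
        '<Default Extension="rels" '
        'ContentType="application/vnd.openxmlformats-package.relationships+xml"/>'
        '<Default Extension="xml" ContentType="application/xml"/>'
        f'<Override PartName="/xl/workbook.xml" ContentType="{CT}.sheet.main+xml"/>'
        f'<Override PartName="/xl/worksheets/sheet1.xml" ContentType="{CT}.worksheet+xml"/>'
        "</Types>"
    ),
    "_rels/.rels": (
        f'<Relationships xmlns="{PKG_REL}">'
        f'<Relationship Id="rId1" Type="{DOC_REL}/officeDocument" Target="xl/workbook.xml"/>'
        "</Relationships>"
    ),
    "xl/workbook.xml": (
        f'<workbook xmlns="{MAIN}" xmlns:r="{DOC_REL}">'
        '<sheets><sheet name="Sheet1" sheetId="1" r:id="rId1"/></sheets>'
        "</workbook>"
    ),
    "xl/_rels/workbook.xml.rels": (
        f'<Relationships xmlns="{PKG_REL}">'
        f'<Relationship Id="rId1" Type="{DOC_REL}/worksheet" Target="worksheets/sheet1.xml"/>'
        "</Relationships>"
    ),
}


def column_letters(n: int) -> str:
    """1 -> A, 26 -> Z, 27 -> AA."""
    letters = ""
    while n > 0:
        n, rem = divmod(n - 1, 26)
        letters = chr(ord("A") + rem) + letters
    return letters


def csv_to_text_xlsx(csv_text: str) -> bytes:
    """Write every CSV field as an explicit text cell, so nothing is guessed."""
    rows = []
    for r, row in enumerate(csv.reader(io.StringIO(csv_text)), start=1):
        cells = "".join(
            f'<c r="{column_letters(c)}{r}" t="inlineStr"><is><t>{escape(v)}</t></is></c>'
            for c, v in enumerate(row, start=1)
        )
        rows.append(f'<row r="{r}">{cells}</row>')
    sheet = f'<worksheet xmlns="{MAIN}"><sheetData>{"".join(rows)}</sheetData></worksheet>'

    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as z:
        for name, body in PARTS.items():
            z.writestr(name, DECL + body)
        z.writestr("xl/worksheets/sheet1.xml", DECL + sheet)
    return buf.getvalue()


xlsx_bytes = csv_to_text_xlsx("zip,gene\n00210,MARCH1\n")
print(len(xlsx_bytes), "bytes")
```

Because each value is declared as text inside the file, 00210 and MARCH1 reach the spreadsheet application already typed.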

.XLSX compared to alternatives

Formats: .XLSX vs .CSV
Criteria: Data type preservation
XLSX declares each cell's type through cell type attributes and the Shared Strings Table, so text stays text and numbers stay numbers. CSV treats everything as plain text: opening a CSV in Excel can silently destroy leading zeros, convert gene names to dates, and shift long numbers to scientific notation.
Winner: XLSX

Formats: .XLSX vs .CSV
Criteria: Multiple sheets
XLSX supports multiple named worksheets, charts, and pivot tables in a single file. CSV contains exactly one flat table with no sheet concept, no formulas, and no formatting.
Winner: XLSX

Formats: .XLSX vs .XLS
Criteria: Maximum dimensions
XLSX supports 1,048,576 rows × 16,384 columns. The legacy XLS format is limited to 65,536 rows × 256 columns, so XLSX offers 16× the rows and 64× the columns, which makes it viable for large datasets.
Winner: XLSX

Formats: .XLSX vs .XLS
Criteria: Inspectability
XLSX is a ZIP of XML files: unzip and read with any text editor. XLS was a proprietary binary format whose internals were closed and undocumented until 2008.
Winner: XLSX

Technical reference

MIME Type
application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
Magic Bytes
50 4B 03 04 ZIP archive containing xl/ directory with worksheets.
Developer
Microsoft / Ecma International
Year Introduced
2007
Open Standard
Yes
00000000  50 4B 03 04  PK..

Binary Structure

An XLSX file is a ZIP archive (magic bytes 50 4B 03 04, 'PK') following the Open Packaging Conventions. The ZIP contains a root [Content_Types].xml declaring MIME types for all parts, a _rels/.rels entry point, and an xl/ directory with workbook.xml (sheet list and named ranges), worksheets/sheet1.xml through sheetN.xml (cell data in row-column XML), sharedStrings.xml (deduplicated text values referenced by index), styles.xml (number formats, fonts, fills, borders), and theme/theme1.xml (color and font schemes). To distinguish XLSX from DOCX or PPTX (all share the PK magic bytes), check [Content_Types].xml for the SpreadsheetML content type. Typical compression ratio is 50-80% — XML text compresses well under ZIP deflate.
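The content-type check can be sketched in Python with the standard zipfile module. The helper name looks_like_xlsx is hypothetical, and the test is intentionally shallow: it only looks for the SpreadsheetML workbook content type string inside [Content_Types].xml and does not validate the rest of the package.

```python
import io
import zipfile

# Content type of the main workbook part in a (non-macro) XLSX package.
SHEET_MAIN = (
    "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet.main+xml"
)


def looks_like_xlsx(data: bytes) -> bool:
    """Return True when the bytes are a ZIP whose content types name SpreadsheetML."""
    if not data.startswith(b"PK\x03\x04"):
        return False  # not a ZIP local file header
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as z:
            content_types = z.read("[Content_Types].xml").decode("utf-8", "replace")
    except (zipfile.BadZipFile, KeyError):
        return False  # damaged archive, or a ZIP that is not an OPC package
    return SHEET_MAIN in content_types


def _make_package(content_type: str) -> bytes:
    """Build a tiny stand-in package for demonstration only."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as z:
        z.writestr(
            "[Content_Types].xml",
            f'<Types><Override PartName="/main.xml" ContentType="{content_type}"/></Types>',
        )
    return buf.getvalue()


fake_xlsx = _make_package(SHEET_MAIN)
fake_docx = _make_package(
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml"
)
print(looks_like_xlsx(fake_xlsx), looks_like_xlsx(fake_docx))
```

To check a real file, pass the result of open(path, "rb").read().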

Offset  Length   Field            Example      Description
0x00    4 bytes  ZIP Signature    50 4B 03 04  PK local file header, shared by all OOXML formats (XLSX, DOCX, PPTX)
0x04    2 bytes  Version needed   14 00        Minimum ZIP version to extract (2.0)
0x1A    2 bytes  Filename length  13 00        Length of first entry name, typically '[Content_Types].xml' (19 bytes)
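The three fields in the table can be read straight from the first 30 bytes of the file. This Python sketch unpacks the fixed-size ZIP local file header with struct; the function name read_first_header is hypothetical, and the demo archive is built in memory rather than taken from a real workbook.

```python
import io
import struct
import zipfile

# Fixed 30-byte local file header, little-endian: signature, version needed,
# flags, method, mod time, mod date, CRC-32, compressed size,
# uncompressed size, filename length, extra field length.
LOCAL_HEADER = struct.Struct("<4sHHHHHIIIHH")


def read_first_header(data: bytes):
    """Return (version needed, filename length, filename) of the first entry."""
    fields = LOCAL_HEADER.unpack_from(data, 0)
    signature, version_needed = fields[0], fields[1]
    name_len = fields[9]  # the 2 bytes at offset 0x1A
    if signature != b"PK\x03\x04":
        raise ValueError("not a ZIP local file header")
    start = LOCAL_HEADER.size  # the filename follows the fixed header
    name = data[start:start + name_len].decode("utf-8", "replace")
    return version_needed, name_len, name


# Demo archive whose first entry uses the typical XLSX first part name.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr("[Content_Types].xml", "<Types/>")

version_needed, name_len, name = read_first_header(buf.getvalue())
print(version_needed, name_len, name)
```

A version value of 20 corresponds to the bytes 14 00 in the table, read as ZIP version 2.0.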
1979: VisiCalc creates the spreadsheet concept on the Apple II
1983: Lotus 1-2-3 dominates IBM PC and introduces the February 29, 1900 date bug
1985: Microsoft Excel 1.0 ships for Macintosh
2006: ECMA-376 (Office Open XML) approved; SpreadsheetML defined in Part 1 §18
2007: Excel 2007 makes XLSX the default format; row limit jumps to 1,048,576
2008: ISO/IEC 29500 approved after contentious fast-track ballot with 3,522 technical comments
2016: ECMA-376 5th edition published, the current version of the Office Open XML standard, with ISO/IEC 29500-1:2016 for Part 1
Inspect XLSX ZIP structure
unzip -l spreadsheet.xlsx

Lists all XML parts inside the XLSX ZIP archive. Reveals the OPC structure: [Content_Types].xml, workbook, worksheets, shared strings, and styles.

Read cells with Python openpyxl
python3 -c "from openpyxl import load_workbook; wb = load_workbook('data.xlsx'); ws = wb.active; [print(row) for row in ws.iter_rows(values_only=True)]"

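Loads XLSX and prints all cell values row by row. openpyxl reads the Shared Strings Table and resolves string indexes to their text. Formula cells print as formula text by default; pass data_only=True to load_workbook to get the cached results instead.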

Convert XLSX to CSV via LibreOffice
libreoffice --headless --convert-to csv spreadsheet.xlsx

Headless LibreOffice exports the first sheet as CSV. Formulas are resolved to their cached values. No GUI required — runs on servers.

Security risk: MEDIUM

Attack Vectors

  • Formula injection (CSV injection)
  • XML External Entity (XXE)
  • Macro injection via XLSM variant
  • Embedded OLE/ActiveX objects

Mitigation: Open XLSX in trusted spreadsheet applications only. For server-side processing, disable external entity resolution to prevent XXE attacks. When building XLSX from user-submitted data, sanitize any cell starting with =, +, -, or @ to block formula injection. Treat unknown XLSX attachments like any executable. FileDex does not parse XLSX — this reference page is static, no file upload.
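The sanitizing step can be sketched as a small Python helper. Prefixing a risky value with an apostrophe is one common approach and is an assumption here, not something the mitigation text prescribes; the name neutralize is hypothetical. Note that the rule is blunt: legitimate values such as -5 are prefixed too.

```python
# Leading characters that can make a spreadsheet treat a value as a formula.
FORMULA_TRIGGERS = ("=", "+", "-", "@")


def neutralize(value: str) -> str:
    """Prefix user-supplied text so it cannot be read as a formula."""
    if value.startswith(FORMULA_TRIGGERS):
        # Assumption: a leading apostrophe forces the cell to plain text.
        return "'" + value
    return value


for raw in ["=SUM(A1:A9)", "@handle", "-5", "London", ""]:
    print(repr(raw), "->", repr(neutralize(raw)))
```

Apply it to every user-submitted field before the value is written into a cell, and write the result as a text cell, not a formula.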

Microsoft Excel
The original creator of XLSX and reference implementation for Office Open XML, with full SpreadsheetML support including pivot tables, conditional formatting, and the macro-enabled .xlsm variant
Google Sheets
Browser-based spreadsheet editor with XLSX import/export and real-time collaboration, the most-used consumer spreadsheet app outside Excel
openpyxl library
Python library for reading and writing XLSX files without Excel dependency
Apache POI library
Java library for OOXML spreadsheet processing used in enterprise data pipelines
SheetJS library
JavaScript library for parsing and generating XLSX in browsers and Node.js
LibreOffice Calc
Free open-source spreadsheet editor with strong XLSX compatibility
xlsx2csv tool
Python CLI tool for converting XLSX sheets to CSV for data pipeline ingestion