Comma-Separated Values
CSV stores rows of data as plain-text lines with delimiter-separated fields — but the delimiter is not always a comma, the encoding is never declared, and opening a CSV in Excel can silently destroy your data by stripping leading zeros and converting text to dates.
Conversion not yet available. CSV to Excel transformation requires spreadsheet rendering — a feature planned for a future update.
Common questions
What is a CSV file and how does it store data?
A CSV file is a plain-text document that stores tabular data with one record per line and fields separated by a delimiter, usually a comma. It has no formatting, no formulas, and no data types — every value is stored as raw text. The format dates back to 1972 and is supported by every spreadsheet application, database, and programming language.
Why does Excel change my CSV data?
Excel automatically guesses data types when opening CSV files. It strips leading zeros from codes and ZIP numbers, converts large values to scientific notation, and turns text resembling dates into date values. To prevent this, import via Data, Get and Transform, and set all columns to Text before loading.
What is the difference between CSV and TSV?
CSV uses commas to separate fields while TSV uses tab characters. Tabs rarely appear in data values, so TSV avoids the quoting complexity and the locale conflicts that plague CSV files in European countries where commas serve as decimal separators instead of field delimiters.
How do I open a large CSV file that crashes Excel?
Excel has a row limit of roughly one million rows. For larger files, use Python with the pandas library, which processes CSV limited only by available memory. LibreOffice Calc handles slightly more rows. Command-line tools like csvkit work on files of any size.
How do I fix encoding problems in a CSV file?
Encoding issues appear as garbled characters, especially for Arabic, accented, or Asian text. Determine the current encoding, then re-save as UTF-8. Most text editors offer a Save As with encoding option. LibreOffice Calc import dialog lets you preview different encodings before committing.
Is there a formal standard for the CSV format?
RFC 4180, published in 2005, describes common CSV conventions but is classified as Informational rather than a formal standard. It arrived over 30 years after CSV was already ubiquitous and leaves encoding, null values, and delimiter choice undefined. The W3C published supplementary CSVW metadata recommendations in 2015.
Can a CSV file contain multiple sheets or tabs?
No. A CSV file contains a single flat table with no support for multiple sheets, formatting, formulas, or charts. Spreadsheet formats like XLSX and ODS support multiple sheets. To store related tables in CSV, save each sheet as a separate file.
What makes .CSV special
Tabular data has been stored as delimiter-separated text since the earliest days of electronic computing, but the format we call CSV has no single birthday, no inventor, and no authoritative specification. That paradox — a format used by billions of files worldwide with no formal definition — explains both CSV's universal adoption and its maddening interoperability problems.
Continue reading — full technical deep dive
IBM origins and the missing standard
In 1972, IBM's Fortran compiler implementations introduced list-directed input/output, a feature that used commas to separate data values in sequential records. This was not designed as a file format — it was a programming language feature for reading and writing data conveniently. But the resulting comma-delimited files proved so useful for data exchange that the pattern spread to CP/M systems in the late 1970s and to MS-DOS database exports (dBASE, Lotus 1-2-3) through the 1980s.
By the mid-1990s, CSV was everywhere — used by every spreadsheet, database export tool, and data pipeline — yet no standards body had ever defined it. The format existed purely as convention, with each implementation making its own decisions about delimiters, quoting, encoding, and line endings.
RFC 4180: thirty years too late
In October 2005, Yakov Shafranovich published RFC 4180, titled "Common Format and MIME Type for Comma-Separated Values (CSV) Files." The RFC is unusual in two respects. First, it is classified as "Informational" rather than "Standards Track" — it describes existing practice rather than prescribing a standard. Second, it arrived more than 30 years after CSV was already ubiquitous, by which point countless incompatible implementations existed.
RFC 4180 defines seven rules: records are separated by CRLF line breaks; an optional header row may appear first; each record contains the same number of fields; fields are separated by commas; fields containing commas, double quotes, or line breaks must be quoted; double quotes inside quoted fields are escaped by doubling them; and spaces are considered part of the field value.
These rules are sensible but incomplete. The RFC says nothing about character encoding, maximum field length, how to handle null values, or what happens when field counts differ between rows. It mentions only the comma as a delimiter. The result is a specification that covers the common case but leaves every edge case undefined.
Delimiter chaos
The name "Comma-Separated Values" is misleading. In countries where the comma serves as the decimal separator — Germany, France, Italy, the Netherlands, Brazil, and much of Europe — using a comma as the field delimiter would make numeric data ambiguous. Is "1,234" a number (one thousand two hundred thirty-four in English notation) or two fields ("1" and "234")?
Microsoft Excel resolves this by using the operating system's list separator setting, which is configured in Windows Regional Settings. In a French Windows installation, the list separator is a semicolon, so Excel's "Save as CSV" produces semicolon-delimited files with a .csv extension. A German user receiving this file in an English-locale system opens it in Excel and sees the entire row crammed into a single column, because English Excel expects commas.
Tab-separated values (TSV) sidestep the delimiter problem entirely by using the tab character, which never appears in normal text data. TSV files sometimes use the .tsv extension but are also saved as .csv, adding another layer of confusion. The pipe character (|) appears in enterprise data exports for similar reasons.
The encoding nightmare
CSV has no mechanism to declare its character encoding. There is no header, no metadata block, no byte-order mark requirement. A CSV file could be UTF-8, Windows-1252, ISO-8859-1, Shift-JIS, GB2312, or any other encoding, and no information within the file itself reveals which one.
RFC 4180 does not address encoding at all. The MIME type registration allows an optional charset parameter (text/csv; charset=utf-8), but this only applies when the file is served over HTTP — it provides no help for files on disk or attached to emails.
In practice, Excel on Western Windows systems produces Windows-1252 encoded CSV. Excel on Japanese systems may produce Shift-JIS. When saving as "CSV UTF-8 (Comma delimited)," Excel prepends a UTF-8 byte-order mark (EF BB BF) — the only reliable encoding signal, but one that some parsers treat as data rather than metadata, inserting a visible garbage character at the start of the first field.
Quoting rules and edge cases
RFC 4180's quoting rules handle the common cases: fields containing commas, double quotes, or line breaks must be enclosed in double quotes, and literal double quotes inside quoted fields are escaped by doubling them. A field containing the text He said "hello" becomes "He said ""hello""" in the CSV.
Edge cases abound. What about a field that contains only whitespace? RFC 4180 says spaces are significant — they are part of the field value, not ignorable padding. But many CSV producers trim whitespace silently, and many consumers strip it on import. What about an empty field? Two adjacent commas (,,) represent an empty field, distinct from a quoted empty string (""). Whether parsers distinguish between these varies by implementation.
The most dangerous edge case is the embedded newline. A quoted field can span multiple lines — this is a single field containing a line break. Parsers that process CSV line-by-line without quote-state tracking will split this into two records, corrupting every subsequent row in the file.
Excel destroys data: the number one CSV pain point
Microsoft Excel is the world's most common CSV viewer, and it silently corrupts data every time it opens a CSV file. Excel scans the first rows of each column, infers a data type, and applies that type to the entire column — with no user warning and no undo path.
Leading zeros are stripped: ZIP code 01234 becomes 1234, product code 007 becomes 7, phone number 0501234567 becomes 501234567. Large numbers convert to scientific notation: a 16-digit credit card reference 1234567890123456 becomes 1.23457E+15. Text strings that resemble dates are converted: "MAR1" becomes March 1, "1-2" becomes January 2, gene names like "SEPT7" become September 7 — a problem so widespread in genomics that the Human Gene Nomenclature Committee renamed 27 human genes in 2020 specifically to avoid Excel corruption.
Simply opening a CSV file in Excel and saving it — without making any manual changes — can irreversibly corrupt the data. The safe approach is to use Data > Get & Transform > From Text/CSV, which presents an import dialog where column types can be set to Text before loading.
W3C CSV on the Web
In 2015, the W3C published a set of recommendations collectively known as CSVW (CSV on the Web). These define a JSON-LD metadata file that accompanies a CSV file and declares column names, data types, encoding, delimiter, null values, and transformation rules. The metadata file enables machines to parse CSV unambiguously — solving every problem RFC 4180 left open.
Adoption has been limited. Government open data portals in the UK and parts of Europe use CSVW metadata, but the mainstream data science and business analytics ecosystems have not embraced it. Most CSV files continue to ship without metadata, relying on human knowledge and heuristic detection.
TSV: the tab-delimited sibling
Tab-Separated Values (TSV) use the tab character as the delimiter and have their own IANA registration: text/tab-separated-values (assigned in 1993, predating CSV's text/csv by 12 years). TSV avoids the comma-as-decimal-separator problem entirely and eliminates the need for quoting in most cases, since tabs rarely appear in data values.
TSV is the standard format for bioinformatics data exchange, linguistics corpora, and database bulk import/export operations. Despite its advantages, TSV never displaced CSV in the spreadsheet ecosystem, largely because Excel defaults to comma separation and the .csv extension is more widely recognized.
Why CSV persists
JSON, XML, Parquet, and Avro all provide richer structure, explicit typing, and unambiguous parsing. Yet CSV remains the most common data interchange format. The reasons are simplicity and universality: a CSV file can be created with echo, read by any text editor, imported by every database, and processed by every programming language's standard library. No special tooling, no schema definition, no compilation step. For quick data exchange between humans who can visually inspect the result, nothing beats a spreadsheet exported as CSV.
.CSV compared to alternatives
| Formats | Criteria | Winner |
|---|---|---|
| .CSV vs .EXCEL (XLSX) | Data type preservation XLSX stores explicit cell types — a cell marked as Text preserves leading zeros and prevents date conversion. CSV has no type system; every value is an untyped string, and consuming applications must guess types. | XLSX wins |
| .CSV vs .EXCEL (XLSX) | Portability and simplicity CSV is plain text readable by any editor, language, or database. XLSX is a ZIP archive of XML files requiring a library to parse. CSV files are typically smaller for text-only tabular data. | CSV wins |
| .CSV vs .TSV | Delimiter ambiguity TSV uses tabs, which rarely appear in data values, eliminating the need for quoting and avoiding the comma-vs-decimal-separator conflict that plagues CSV in European locales. | TSV wins |
| .CSV vs .JSON | Structure and typing JSON supports nested objects, arrays, explicit null, booleans, and numbers as distinct types. CSV is flat (rows and columns only) with no native type system — everything is an implicit string. | JSON wins |
| .CSV vs .JSON | Human readability for tabular data Tabular data in CSV aligns naturally in columns visible in any text editor. The same data in JSON repeats key names for every row and adds structural characters (braces, brackets, quotes), often tripling file size. | CSV wins |
Technical reference
- MIME Type
text/csv- Developer
- IBM (earliest usage)
- Year Introduced
- 1972
- Open Standard
- Yes — View specification
Binary Structure
CSV is a plain-text format with no binary structure, no magic bytes, and no fixed file header. Records appear one per line, terminated by CRLF (per RFC 4180) or LF (universally accepted). Fields within each record are separated by a delimiter — comma, semicolon, tab, or pipe depending on locale and tool. Fields containing the delimiter character, double quotes, or line breaks must be enclosed in double quotes. Double quotes inside quoted fields are escaped by doubling them. An optional header row may appear as the first line, declaring column names. There is no encoding declaration — files may be UTF-8, Windows-1252, Shift-JIS, or any other encoding with no in-file signal. Excel may prepend a UTF-8 BOM (bytes EF BB BF) when saving as 'CSV UTF-8,' which is the only reliable encoding indicator but is treated as data by some parsers. PRONOM assigns the x-fmt/ prefix to CSV specifically because it has no binary signature — identification relies entirely on content heuristics.
Conversion not yet available. CSV to Excel transformation requires spreadsheet rendering — a feature planned for a future update.
Attack Vectors
- CSV formula injection
- Excel type coercion data corruption
Mitigation: FileDex does not execute CSV formulas or apply type coercion. This is a reference page providing format documentation and CLI tool guidance. When processing CSV in spreadsheet applications, prefix formula-triggering characters with a single quote or tab to prevent execution.
- Specification RFC 4180 — Common Format and MIME Type for Comma-Separated Values (CSV) Files
- Specification W3C CSV on the Web: A Primer (CSVW Recommendations, 2015)
- Registry CSV, Comma Separated Values (RFC 4180) — Library of Congress Format Description (fdd000323)
- Registry Comma Separated Values (x-fmt/18) — The National Archives PRONOM Registry
- Registry IANA Media Type: text/csv
- History Comma-separated values — Wikipedia