.CSV Comma-Separated Values
.csv

Comma-Separated Values

CSV stores rows of data as plain-text lines with delimiter-separated fields — but the delimiter is not always a comma, the encoding is never declared, and opening a CSV in Excel can silently destroy your data by stripping leading zeros and converting text to dates.

Data layout
Header schema
Records structured data
TextRFC 41801972
Not convertible

Conversion not yet available. CSV to Excel transformation requires spreadsheet rendering — a feature planned for a future update.

Common questions

What is a CSV file and how does it store data?

A CSV file is a plain-text document that stores tabular data with one record per line and fields separated by a delimiter, usually a comma. It has no formatting, no formulas, and no data types — every value is stored as raw text. The format dates back to 1972 and is supported by every spreadsheet application, database, and programming language.

Why does Excel change my CSV data?

Excel automatically guesses data types when opening CSV files. It strips leading zeros from codes and ZIP numbers, converts large values to scientific notation, and turns text resembling dates into date values. To prevent this, import via Data, Get and Transform, and set all columns to Text before loading.

What is the difference between CSV and TSV?

CSV uses commas to separate fields while TSV uses tab characters. Tabs rarely appear in data values, so TSV avoids the quoting complexity and the locale conflicts that plague CSV files in European countries where commas serve as decimal separators instead of field delimiters.

How do I open a large CSV file that crashes Excel?

Excel has a row limit of roughly one million rows. For larger files, use Python with the pandas library, which processes CSV limited only by available memory. LibreOffice Calc handles slightly more rows. Command-line tools like csvkit work on files of any size.

How do I fix encoding problems in a CSV file?

Encoding issues appear as garbled characters, especially for Arabic, accented, or Asian text. Determine the current encoding, then re-save as UTF-8. Most text editors offer a Save As with encoding option. LibreOffice Calc import dialog lets you preview different encodings before committing.

Is there a formal standard for the CSV format?

RFC 4180, published in 2005, describes common CSV conventions but is classified as Informational rather than a formal standard. It arrived over 30 years after CSV was already ubiquitous and leaves encoding, null values, and delimiter choice undefined. The W3C published supplementary CSVW metadata recommendations in 2015.

Can a CSV file contain multiple sheets or tabs?

No. A CSV file contains a single flat table with no support for multiple sheets, formatting, formulas, or charts. Spreadsheet formats like XLSX and ODS support multiple sheets. To store related tables in CSV, save each sheet as a separate file.

What makes .CSV special

30 years late
The RFC arrived 30+ years after CSV was everywhere
RFC 4180 was published in October 2005 as an Informational RFC — not a standard. By then, billions of CSV files existed with incompatible dialects. The RFC describes existing practice rather than defining it.
Excel destroys data
Opening CSV in Excel can corrupt it silently
Excel strips leading zeros from ZIP codes, converts IBANs to scientific notation, and turns gene names like SEPT7 into dates. In 2020, scientists renamed 27 human genes to stop Excel from corrupting them.
Delimiter chaos
CSV is not always comma-separated
In France, Germany, and most of Europe, the comma is the decimal separator (3,14 not 3.14). Excel in those locales uses semicolons in CSV files. A 'CSV' from French Excel uses no commas at all.
Universal format
Every language and database reads CSV natively
Python, R, Java, Go, Rust, JavaScript, SQL databases, and every spreadsheet application can read and write CSV. No other tabular format matches this breadth of native support across platforms.

Tabular data has been stored as delimiter-separated text since the earliest days of electronic computing, but the format we call CSV has no single birthday, no inventor, and no authoritative specification. That paradox — a format used by billions of files worldwide with no formal definition — explains both CSV's universal adoption and its maddening interoperability problems.

Continue reading — full technical deep dive

IBM origins and the missing standard

In 1972, IBM's Fortran compiler implementations introduced list-directed input/output, a feature that used commas to separate data values in sequential records. This was not designed as a file format — it was a programming language feature for reading and writing data conveniently. But the resulting comma-delimited files proved so useful for data exchange that the pattern spread to CP/M systems in the late 1970s and to MS-DOS database exports (dBASE, Lotus 1-2-3) through the 1980s.

By the mid-1990s, CSV was everywhere — used by every spreadsheet, database export tool, and data pipeline — yet no standards body had ever defined it. The format existed purely as convention, with each implementation making its own decisions about delimiters, quoting, encoding, and line endings.

RFC 4180: thirty years too late

In October 2005, Yakov Shafranovich published RFC 4180, titled "Common Format and MIME Type for Comma-Separated Values (CSV) Files." The RFC is unusual in two respects. First, it is classified as "Informational" rather than "Standards Track" — it describes existing practice rather than prescribing a standard. Second, it arrived more than 30 years after CSV was already ubiquitous, by which point countless incompatible implementations existed.

RFC 4180 defines seven rules: records are separated by CRLF line breaks; an optional header row may appear first; each record contains the same number of fields; fields are separated by commas; fields containing commas, double quotes, or line breaks must be quoted; double quotes inside quoted fields are escaped by doubling them; and spaces are considered part of the field value.

These rules are sensible but incomplete. The RFC says nothing about character encoding, maximum field length, how to handle null values, or what happens when field counts differ between rows. It mentions only the comma as a delimiter. The result is a specification that covers the common case but leaves every edge case undefined.

Delimiter chaos

The name "Comma-Separated Values" is misleading. In countries where the comma serves as the decimal separator — Germany, France, Italy, the Netherlands, Brazil, and much of Europe — using a comma as the field delimiter would make numeric data ambiguous. Is "1,234" a number (one thousand two hundred thirty-four in English notation) or two fields ("1" and "234")?

Microsoft Excel resolves this by using the operating system's list separator setting, which is configured in Windows Regional Settings. In a French Windows installation, the list separator is a semicolon, so Excel's "Save as CSV" produces semicolon-delimited files with a .csv extension. A German user receiving this file in an English-locale system opens it in Excel and sees the entire row crammed into a single column, because English Excel expects commas.

Tab-separated values (TSV) sidestep the delimiter problem entirely by using the tab character, which never appears in normal text data. TSV files sometimes use the .tsv extension but are also saved as .csv, adding another layer of confusion. The pipe character (|) appears in enterprise data exports for similar reasons.

The encoding nightmare

CSV has no mechanism to declare its character encoding. There is no header, no metadata block, no byte-order mark requirement. A CSV file could be UTF-8, Windows-1252, ISO-8859-1, Shift-JIS, GB2312, or any other encoding, and no information within the file itself reveals which one.

RFC 4180 does not address encoding at all. The MIME type registration allows an optional charset parameter (text/csv; charset=utf-8), but this only applies when the file is served over HTTP — it provides no help for files on disk or attached to emails.

In practice, Excel on Western Windows systems produces Windows-1252 encoded CSV. Excel on Japanese systems may produce Shift-JIS. When saving as "CSV UTF-8 (Comma delimited)," Excel prepends a UTF-8 byte-order mark (EF BB BF) — the only reliable encoding signal, but one that some parsers treat as data rather than metadata, inserting a visible garbage character at the start of the first field.

Quoting rules and edge cases

RFC 4180's quoting rules handle the common cases: fields containing commas, double quotes, or line breaks must be enclosed in double quotes, and literal double quotes inside quoted fields are escaped by doubling them. A field containing the text He said "hello" becomes "He said ""hello""" in the CSV.

Edge cases abound. What about a field that contains only whitespace? RFC 4180 says spaces are significant — they are part of the field value, not ignorable padding. But many CSV producers trim whitespace silently, and many consumers strip it on import. What about an empty field? Two adjacent commas (,,) represent an empty field, distinct from a quoted empty string (""). Whether parsers distinguish between these varies by implementation.

The most dangerous edge case is the embedded newline. A quoted field can span multiple lines — this is a single field containing a line break. Parsers that process CSV line-by-line without quote-state tracking will split this into two records, corrupting every subsequent row in the file.

Excel destroys data: the number one CSV pain point

Microsoft Excel is the world's most common CSV viewer, and it silently corrupts data every time it opens a CSV file. Excel scans the first rows of each column, infers a data type, and applies that type to the entire column — with no user warning and no undo path.

Leading zeros are stripped: ZIP code 01234 becomes 1234, product code 007 becomes 7, phone number 0501234567 becomes 501234567. Large numbers convert to scientific notation: a 16-digit credit card reference 1234567890123456 becomes 1.23457E+15. Text strings that resemble dates are converted: "MAR1" becomes March 1, "1-2" becomes January 2, gene names like "SEPT7" become September 7 — a problem so widespread in genomics that the Human Gene Nomenclature Committee renamed 27 human genes in 2020 specifically to avoid Excel corruption.

Simply opening a CSV file in Excel and saving it — without making any manual changes — can irreversibly corrupt the data. The safe approach is to use Data > Get & Transform > From Text/CSV, which presents an import dialog where column types can be set to Text before loading.

W3C CSV on the Web

In 2015, the W3C published a set of recommendations collectively known as CSVW (CSV on the Web). These define a JSON-LD metadata file that accompanies a CSV file and declares column names, data types, encoding, delimiter, null values, and transformation rules. The metadata file enables machines to parse CSV unambiguously — solving every problem RFC 4180 left open.

Adoption has been limited. Government open data portals in the UK and parts of Europe use CSVW metadata, but the mainstream data science and business analytics ecosystems have not embraced it. Most CSV files continue to ship without metadata, relying on human knowledge and heuristic detection.

TSV: the tab-delimited sibling

Tab-Separated Values (TSV) use the tab character as the delimiter and have their own IANA registration: text/tab-separated-values (assigned in 1993, predating CSV's text/csv by 12 years). TSV avoids the comma-as-decimal-separator problem entirely and eliminates the need for quoting in most cases, since tabs rarely appear in data values.

TSV is the standard format for bioinformatics data exchange, linguistics corpora, and database bulk import/export operations. Despite its advantages, TSV never displaced CSV in the spreadsheet ecosystem, largely because Excel defaults to comma separation and the .csv extension is more widely recognized.

Why CSV persists

JSON, XML, Parquet, and Avro all provide richer structure, explicit typing, and unambiguous parsing. Yet CSV remains the most common data interchange format. The reasons are simplicity and universality: a CSV file can be created with echo, read by any text editor, imported by every database, and processed by every programming language's standard library. No special tooling, no schema definition, no compilation step. For quick data exchange between humans who can visually inspect the result, nothing beats a spreadsheet exported as CSV.

.CSV compared to alternatives

.CSV compared to alternative formats
Formats Criteria Winner
.CSV vs .EXCEL (XLSX)
Data type preservation
XLSX stores explicit cell types — a cell marked as Text preserves leading zeros and prevents date conversion. CSV has no type system; every value is an untyped string, and consuming applications must guess types.
XLSX wins
.CSV vs .EXCEL (XLSX)
Portability and simplicity
CSV is plain text readable by any editor, language, or database. XLSX is a ZIP archive of XML files requiring a library to parse. CSV files are typically smaller for text-only tabular data.
CSV wins
.CSV vs .TSV
Delimiter ambiguity
TSV uses tabs, which rarely appear in data values, eliminating the need for quoting and avoiding the comma-vs-decimal-separator conflict that plagues CSV in European locales.
TSV wins
.CSV vs .JSON
Structure and typing
JSON supports nested objects, arrays, explicit null, booleans, and numbers as distinct types. CSV is flat (rows and columns only) with no native type system — everything is an implicit string.
JSON wins
.CSV vs .JSON
Human readability for tabular data
Tabular data in CSV aligns naturally in columns visible in any text editor. The same data in JSON repeats key names for every row and adds structural characters (braces, brackets, quotes), often tripling file size.
CSV wins

Technical reference

MIME Type
text/csv
Developer
IBM (earliest usage)
Year Introduced
1972
Open Standard
Yes — View specification

Binary Structure

CSV is a plain-text format with no binary structure, no magic bytes, and no fixed file header. Records appear one per line, terminated by CRLF (per RFC 4180) or LF (universally accepted). Fields within each record are separated by a delimiter — comma, semicolon, tab, or pipe depending on locale and tool. Fields containing the delimiter character, double quotes, or line breaks must be enclosed in double quotes. Double quotes inside quoted fields are escaped by doubling them. An optional header row may appear as the first line, declaring column names. There is no encoding declaration — files may be UTF-8, Windows-1252, Shift-JIS, or any other encoding with no in-file signal. Excel may prepend a UTF-8 BOM (bytes EF BB BF) when saving as 'CSV UTF-8,' which is the only reliable encoding indicator but is treated as data by some parsers. PRONOM assigns the x-fmt/ prefix to CSV specifically because it has no binary signature — identification relies entirely on content heuristics.

1972IBM Fortran compilers introduce list-directed I/O using commas to delimit data fields, establishing the pattern that becomes CSV1987CSV becomes the standard export format for MS-DOS databases (dBASE, Lotus 1-2-3), spreading to millions of business users1993IANA registers text/tab-separated-values for TSV — tab-delimited CSV's sibling gets an official MIME type 12 years before CSV itself2005RFC 4180 published as an Informational RFC, describing existing CSV conventions over 30 years after the format was already ubiquitous2008pandas released for Python, making pd.read_csv() the dominant interface for loading CSV data in data science workflows2015W3C publishes CSV on the Web (CSVW) recommendations, defining JSON-LD metadata for column types, encoding, and transformations2020Human Gene Nomenclature Committee renames 27 human genes to avoid Excel auto-converting gene symbols like SEPT7 and MARCH1 to dates
View CSV as a formatted table in the terminal other
csvlook input.csv

Part of the csvkit suite (pip install csvkit). Renders CSV data as an aligned ASCII table with column headers, handling quoted fields and embedded commas correctly.

Run SQL queries directly on a CSV file other
csvsql --query "SELECT name, age FROM input WHERE age > 30" input.csv

csvsql (part of csvkit) creates an in-memory SQLite database from the CSV, executes the SQL query, and outputs results as CSV. No database server required.

Convert CSV to JSON with column headers as keys other
python3 -c "import csv,json,sys; r=csv.DictReader(open('input.csv')); print(json.dumps(list(r),indent=2))"

Python's csv.DictReader uses the first row as dictionary keys. Each data row becomes a JSON object with column names as keys and all values as strings — no type coercion, no data corruption.

Count rows and show column statistics other
csvstat input.csv

csvstat (part of csvkit) reports row count, column count, data types, min/max values, mean, median, and unique counts for every column. Useful for profiling unfamiliar datasets.

Filter rows matching a pattern other
csvgrep -c "City" -m "Riyadh" input.csv

csvgrep filters CSV rows where the City column matches 'Riyadh' exactly. The -c flag specifies the column by name, and -m specifies the match string. Output is valid CSV with the header preserved.

Conversion not yet available. CSV to Excel transformation requires spreadsheet rendering — a feature planned for a future update.

LOW

Attack Vectors

  • CSV formula injection
  • Excel type coercion data corruption

Mitigation: FileDex does not execute CSV formulas or apply type coercion. This is a reference page providing format documentation and CLI tool guidance. When processing CSV in spreadsheet applications, prefix formula-triggering characters with a single quote or tab to prevent execution.

csvkit tool
Command-line suite for CSV processing: csvsql (SQL queries), csvstat (statistics), csvlook (table view), csvgrep (filtering), csvjoin (joins)
pandas library
Python data analysis library with pd.read_csv() supporting chunked reading, encoding detection, and dtype specification for safe CSV import
Command-line tool for CSV, TSV, and JSON data processing with SQL-like verbs operating on streaming data without loading entire files into memory
xsv tool
High-performance Rust CSV toolkit for indexing, slicing, searching, joining, and computing statistics on large CSV files
Open-source spreadsheet with a CSV import dialog that lets users specify delimiter, encoding, column types, and text qualifier before loading
Google Sheets service
Cloud spreadsheet that imports CSV with automatic delimiter detection and preserves text formatting better than Excel for most cases