Tabular diff specification
Version 0.8; 27 May 2014; Paul Fitzpatrick.
Summary
Assume we have two tables, called LOCAL
and REMOTE
. One way to
compare them is to describe the changes needed to modify LOCAL
to
match REMOTE
. For the tabular diff format, we can express the
following kinds of changes:
- Inserted or deleted rows.
- Inserted, deleted, or renamed columns.
- Modified cell values.
If the order of the rows or columns of the the table are meaningful, then we can also express:
- Reordered rows or columns.
Changes in formatting and systematic transformation of data (such as capitalization) are not expressible.
General structure
The diff contains rows and columns from the tables being compared. As in regular text diffs, there is flexibility in what data is given and what is omitted.
- A column or row that is common to the tables being compared should appear at most once.
- Any column or row containing a modified cell should be included in the diff, and the modified cell should be represented using the procedure in Expressing a modified cell value.
- Columns or rows that are present in one table and not in the other should be included in the diff.
- Columns or rows that are unchanged and unneeded for context may be omitted, at the diff creator’s discretion.
- Omitted blocks of rows or columns should be marked with a row/column full of “…” cells.
In addition, the diff contains the following special rows and columns:
- The action column. This is always present, and is the first
column in the diff if columns are ordered. If columns are not
ordered, it is the column named
__hilite_diff__
. - A header row with column names. This row can be recognized since
it will have the tag
@@
in the action column. - A schema row that is needed when the column structure differs
between tables. This row can be recognized since it will have the
tag
!
in the action column.
Here’s an example diff, where the tables being compared share the same three columns:
action column | data from compared tables | |||
header row | @@ | bridge | designer | length |
---|---|---|---|---|
Brooklyn | J. A. Roebling | 1595 | ||
+++ | Manhattan | G. Lindenthal | 1470 | |
-> | Williamsburg | D. Duck->L. L. Buck | 1600 | |
Queensborough | Palmer & Hornbostel | 1182 | ||
omitted rows | ... | ... | ... | ... |
George Washington | O. H. Ammann | 3500 | ||
--- | Spamspan | S. Spamington | 10000 |
The colors do not make up part of this specification, they are just
syntax highlighting. The meaning of the various tags will become clear
in later sections, for now we are just concerned with the structure of
the diff. Here’s an example where there is a difference in columns:
LOCAL
has a length column that is removed in REMOTE
, REMOTE
has
an opened column that wasn’t present in LOCAL
, and the designer
column in LOCAL
is renamed as lead designer in REMOTE
:
action column | data from compared tables | ||||
schema row | ! | +++ | (designer) | --- | |
header row | @@ | bridge | opened | lead designer | length |
---|---|---|---|---|---|
+ | Brooklyn | 1883 | J. A. Roebling | 1595 | |
+ | Manhattan | 1909 | G. Lindenthal | 1470 | |
+ | Williamsburg | 1903 | L. L. Buck | 1600 | |
+ | Queensborough | 1909 | Palmer & Hornbostel | 1182 | |
+ | Triborough | 1936 | O. H. Ammann | 1380,383 | |
+ | Bronx Whitestone | 1939 | O. H. Ammann | 2300 | |
+ | Throgs Neck | 1961 | O. H. Ammann | 1800 | |
+ | George Washington | 1931 | O. H. Ammann | 3500 |
We see that a schema row is added above the header row to represent the changes in columns. With this general anatomy of a diff in mind, let’s look at the details of how to interpret it.
NOTE: If writing a rule to “sniff” a file to see if it is a tabular
diff, the @@
tag is a handy tell-tale. But watch out for that schema
row! Also, to allow for future evolution of this format, please try to
be robust to a few extra rows or columns appearing before the @@
.
Expressing inserted and deleted columns
An inserted column is expressed simply by including that column in the
diff, and placing +++
in the schema row above the corresponding column
name in the header row. Similarly, a deleted column is expressed by
including that column in the diff, and placing ---
in the schema row
above the corresponding column name. As a special case, a renamed column
is represented by simply placing its old name in parentheses in the
schema row.
In our earlier example, LOCAL
has the columns bridge, designer,
and length, while REMOTE
has the columns bridge, opened, and
lead designer (designer renamed). So opened is inserted and
length is deleted:
action column | data from compared tables | ||||
schema row | ! | +++ | (designer) | --- | |
header row | @@ | bridge | opened | lead designer | length |
---|---|---|---|---|---|
+ | Brooklyn | 1883 | J. A. Roebling | 1595 | |
+ | Manhattan | 1909 | G. Lindenthal | 1470 | |
+ | Williamsburg | 1903 | L. L. Buck | 1600 |
If we are dealing with a data store where columns are unordered, then likewise column order in the diff is irrelevant. Otherwise, the inserted and deleted rows should be placed in their appropriate order.
Any rows in the diff that are present only the LOCAL
table will leave
inserted columns blank. Similarly, any rows in the diff that are present
only in the REMOTE
table will leave deleted columns blank. Rows that
are present in both tables will have values in all cells.
Expressing inserted and deleted rows
An inserted row is expressed simply by placing +++
in the action
column, and placing cell values in the appropriate columns. If there are
columns in the diff that are in LOCAL
but not in REMOTE
, these are
left blank. Likewise, a deleted row is expressed by placing ---
in the
action column, and its cell values in the appropriate columns. If there
are columns in the diff that are in REMOTE
but not in LOCAL
, these
are left blank. For example, suppose in REMOTE
there is a row about a
New Bridge that wasn’t in LOCAL
, and a row about a bridge called
Spamspan has been dropped. Here is what the inserted and deleted rows
would look like, lined up with the header row for reference:
action column | |||||
schema row | ! | +++ | (designer) | --- | |
header row | @@ | bridge | opened | lead designer | length |
---|---|---|---|---|---|
inserted row | +++ | New Bridge | 2050 | Chimp N Zee | |
deleted row | --- | Spamspan | S. Spamington | 10000 |
If the diff is on a database table where rows have no ordering, then we can just stick these rows together and we have our diff:
action column | |||||
schema row | ! | +++ | (designer) | --- | |
header row | @@ | bridge | opened | lead designer | length |
---|---|---|---|---|---|
inserted row | +++ | New Bridge | 2050 | Chimp N Zee | |
deleted row | --- | Spamspan | S. Spamington | 10000 |
If the diff is on a spreadsheet table or CSV file, we’d generally want to respect row ordering. In this case, we can add context rows around insertions so we know where to put them. Less importantly, since they are going away anyway, we can do the same for deletions:
action column | |||||
schema row | ! | +++ | (designer) | --- | |
header row | @@ | bridge | opened | lead designer | length |
---|---|---|---|---|---|
omitted rows | ... | ... | ... | ... | ... |
context row | + | Williamsburg | 1903 | L. L. Buck | 1600 |
inserted row | +++ | New Bridge | 2050 | Chimp N Zee | |
context row | + | Queensborough | 1909 | Palmer & Hornbostel | 1182 |
omitted rows | ... | ... | ... | ... | ... |
context row | + | George Washington | 1931 | O. H. Ammann | 3500 |
deleted row | --- | Spamspan | S. Spamington | 10000 |
The action column for a context row may contain a blank, or :
, or +
.
The :
tag signifies the context row was moved (and its location is now
as in the REMOTE
table). The +
signifies that there are cells added
on that row.
Expressing a modified cell value
If a row contains a cell whose value is different in the compared
tables, then that row should be shown in the diff, with a tag in the
action column that ends in ->
. Then, the modified cell will be
represented by converting the LOCAL
and REMOTE
values to text (we
have yet to say how) and using the action tag as a separator. So for
example here we change the last cell in a row from “Green” to “Blue”:
-> | Gnome | Home and Garden | Green->Blue |
The tag must be preceded with as many extra -
characters as are needed
to avoid collision with any character sequence on that row. So here is
another row with exactly one cell changed:
--> | Console | Toddlers -> Teenagers | White-->Pale |
When encoding a cell change as a string, we lose information about the type of the cell value. One distinction that may be important to retain is the difference between a NULL or empty cell, and the empty string. The tabular diff uses the following encoding:
- A NULL value, if available, represents itself.
- The encoded string
NULL
represents a NULL value. - The encoded string
_NULL
represents the string “NULL”. - The encoded string
__NULL
represents the string “_NULL”. - …
The goal is that the diff can be safely converted to and from CSV by existing tools without changing its meaning. To that end:
- For matching (e.g. on context lines) blank cells in the diff (either the NULL value or an empty string) should be treated as ambiguous, and match either of NULL or an empty string if an exact match is not available.
- When using a diff as a patch, and inserting new cells, a blank
cell in the diff (either the NULL value or an empty string) should
be treated as ambiguous, and the “right” thing done given the
column type. If either value could be inserted, then the blank
string should be inserted (since the encoded string
NULL
is available to specifically identify the NULL value).
Note that if the diff is being expressed in a table that allows nested structure (e.g. a JSON representation), a list representation for modified cells might be used to avoid this issue. There is no specification for that at this time.
Expressing a moved row
This can be ignored for tables for which row order is meaningless, e.g. in typical relational databases.
A row that have been moved in a table for which row
order is meaningful is marked with a :
tag
in the action column and placed in the order it appears in the REMOTE
table.
To avoid burdening human readers with too much arcana, tags are not
combined when multiple kinds of actions apply to a row or column. So for
example, a context row that was moved and had a cell added will not
be tagged as :+
or +:
or such-like, but rather by :
. Cell
addition can be determined from the schema row. These weak tags are
included as aids for highlighting to express the most significant
thing to know about a row.
Expressing a moved column
This can be ignored for tables for which column order is meaningless.
A column that have been moved in a table for which column
order is meaningful is marked with a :
tag
in the schema row and placed in the order it appears in the REMOTE
table.
If a diff that contains a :
tag is used to patch a table
for which column order is not meaningful, that tag should simply be
ignored.
Reference: action column tags
Symbol | Meaning |
---|---|
@@ |
The header row, giving column names. |
! |
The schema row, given column differences. |
+++ |
An inserted row (present in REMOTE , not present in LOCAL ). |
--- |
A deleted row (present in LOCAL , not present in REMOTE ). |
-> |
A row with at least one cell modified cell. --> , ---> , ----> etc. have the same meaning. |
Blank | A blank string or NULL marks a row common to LOCAL and REMOTE , given for context. |
... |
Declares that rows common to LOCAL and REMOTE are being skipped. |
+ |
A row with at least one added cell. |
: |
A reordered row. |
Reference: Schema row tags
Symbol | Meaning |
---|---|
+++ |
An inserted column (present in REMOTE , not present in LOCAL ). |
--- |
A deleted column (present in LOCAL , not present in REMOTE ). |
(<NAME>) |
A renamed column (the name in LOCAL is given in parenthesis, and the name in REMOTE will be in the header row). |
Blank | A blank string or NULL marks a column common to LOCAL and REMOTE , given for context. |
... |
Declares that columns common to LOCAL and REMOTE are being skipped. |
: |
A reordered column. |