Daff - Data Diff

Tabular diff specification

Version 0.8; 27 May 2014; Paul Fitzpatrick.

Summary

Assume we have two tables, called LOCAL and REMOTE. One way to compare them is to describe the changes needed to modify LOCAL to match REMOTE. For the tabular diff format, we can express the following kinds of changes:

  • Inserted or deleted rows.
  • Inserted, deleted, or renamed columns.
  • Modified cell values.

If the order of the rows or columns of the the table are meaningful, then we can also express:

  • Reordered rows or columns.

Changes in formatting and systematic transformation of data (such as capitalization) are not expressible.

General structure

The diff contains rows and columns from the tables being compared. As in regular text diffs, there is flexibility in what data is given and what is omitted.

  • A column or row that is common to the tables being compared should appear at most once.
  • Any column or row containing a modified cell should be included in the diff, and the modified cell should be represented using the procedure in Expressing a modified cell value.
  • Columns or rows that are present in one table and not in the other should be included in the diff.
  • Columns or rows that are unchanged and unneeded for context may be omitted, at the diff creator’s discretion.
  • Omitted blocks of rows or columns should be marked with a row/column full of “…” cells.

In addition, the diff contains the following special rows and columns:

  • The action column. This is always present, and is the first column in the diff if columns are ordered. If columns are not ordered, it is the column named __hilite_diff__.
  • A header row with column names. This row can be recognized since it will have the tag @@ in the action column.
  • A schema row that is needed when the column structure differs between tables. This row can be recognized since it will have the tag ! in the action column.

Here’s an example diff, where the tables being compared share the same three columns:

action
column
data from compared tables
header row@@bridgedesignerlength
BrooklynJ. A. Roebling1595
+++ManhattanG. Lindenthal1470
->WilliamsburgD. Duck->L. L. Buck1600
QueensboroughPalmer & Hornbostel1182
omitted rows............
George WashingtonO. H. Ammann3500
---SpamspanS. Spamington10000

The colors do not make up part of this specification, they are just syntax highlighting. The meaning of the various tags will become clear in later sections, for now we are just concerned with the structure of the diff. Here’s an example where there is a difference in columns: LOCAL has a length column that is removed in REMOTE, REMOTE has an opened column that wasn’t present in LOCAL, and the designer column in LOCAL is renamed as lead designer in REMOTE:

action
column
data from compared tables
schema row!+++(designer)---
header row@@bridgeopenedlead designerlength
+Brooklyn1883J. A. Roebling1595
+Manhattan1909G. Lindenthal1470
+Williamsburg1903L. L. Buck1600
+Queensborough1909Palmer & Hornbostel1182
+Triborough1936O. H. Ammann1380,383
+Bronx Whitestone1939O. H. Ammann2300
+Throgs Neck1961O. H. Ammann1800
+George Washington1931O. H. Ammann3500

We see that a schema row is added above the header row to represent the changes in columns. With this general anatomy of a diff in mind, let’s look at the details of how to interpret it.

NOTE: If writing a rule to “sniff” a file to see if it is a tabular diff, the @@ tag is a handy tell-tale. But watch out for that schema row! Also, to allow for future evolution of this format, please try to be robust to a few extra rows or columns appearing before the @@.

Expressing inserted and deleted columns

An inserted column is expressed simply by including that column in the diff, and placing +++ in the schema row above the corresponding column name in the header row. Similarly, a deleted column is expressed by including that column in the diff, and placing --- in the schema row above the corresponding column name. As a special case, a renamed column is represented by simply placing its old name in parentheses in the schema row.

In our earlier example, LOCAL has the columns bridge, designer, and length, while REMOTE has the columns bridge, opened, and lead designer (designer renamed). So opened is inserted and length is deleted:

action
column
data from compared tables
schema row!+++(designer)---
header row@@bridgeopenedlead designerlength
+Brooklyn1883J. A. Roebling1595
+Manhattan1909G. Lindenthal1470
+Williamsburg1903L. L. Buck1600

If we are dealing with a data store where columns are unordered, then likewise column order in the diff is irrelevant. Otherwise, the inserted and deleted rows should be placed in their appropriate order.

Any rows in the diff that are present only the LOCAL table will leave inserted columns blank. Similarly, any rows in the diff that are present only in the REMOTE table will leave deleted columns blank. Rows that are present in both tables will have values in all cells.

Expressing inserted and deleted rows

An inserted row is expressed simply by placing +++ in the action column, and placing cell values in the appropriate columns. If there are columns in the diff that are in LOCAL but not in REMOTE, these are left blank. Likewise, a deleted row is expressed by placing --- in the action column, and its cell values in the appropriate columns. If there are columns in the diff that are in REMOTE but not in LOCAL, these are left blank. For example, suppose in REMOTE there is a row about a New Bridge that wasn’t in LOCAL, and a row about a bridge called Spamspan has been dropped. Here is what the inserted and deleted rows would look like, lined up with the header row for reference:

action
column
schema row!+++(designer)---
header row@@bridgeopenedlead designerlength
 
inserted row+++New Bridge2050Chimp N Zee   
 
deleted row---SpamspanS. Spamington10000

If the diff is on a database table where rows have no ordering, then we can just stick these rows together and we have our diff:

action
column
schema row!+++(designer)---
header row@@bridgeopenedlead designerlength
inserted row+++New Bridge2050Chimp N Zee   
deleted row---SpamspanS. Spamington10000

If the diff is on a spreadsheet table or CSV file, we’d generally want to respect row ordering. In this case, we can add context rows around insertions so we know where to put them. Less importantly, since they are going away anyway, we can do the same for deletions:

action
column
schema row!+++(designer)---
header row@@bridgeopenedlead designerlength
omitted rows...............
context row+Williamsburg1903L. L. Buck1600
inserted row+++New Bridge2050Chimp N Zee   
context row+Queensborough1909Palmer & Hornbostel1182
omitted rows...............
context row+George Washington1931O. H. Ammann3500
deleted row---SpamspanS. Spamington10000

The action column for a context row may contain a blank, or :, or +. The : tag signifies the context row was moved (and its location is now as in the REMOTE table). The + signifies that there are cells added on that row.

Expressing a modified cell value

If a row contains a cell whose value is different in the compared tables, then that row should be shown in the diff, with a tag in the action column that ends in ->. Then, the modified cell will be represented by converting the LOCAL and REMOTE values to text (we have yet to say how) and using the action tag as a separator. So for example here we change the last cell in a row from “Green” to “Blue”:

->GnomeHome and GardenGreen->Blue

The tag must be preceded with as many extra - characters as are needed to avoid collision with any character sequence on that row. So here is another row with exactly one cell changed:

-->ConsoleToddlers -> TeenagersWhite-->Pale

When encoding a cell change as a string, we lose information about the type of the cell value. One distinction that may be important to retain is the difference between a NULL or empty cell, and the empty string. The tabular diff uses the following encoding:

  • A NULL value, if available, represents itself.
  • The encoded string NULL represents a NULL value.
  • The encoded string _NULL represents the string “NULL”.
  • The encoded string __NULL represents the string “_NULL”.

The goal is that the diff can be safely converted to and from CSV by existing tools without changing its meaning. To that end:

  • For matching (e.g. on context lines) blank cells in the diff (either the NULL value or an empty string) should be treated as ambiguous, and match either of NULL or an empty string if an exact match is not available.
  • When using a diff as a patch, and inserting new cells, a blank cell in the diff (either the NULL value or an empty string) should be treated as ambiguous, and the “right” thing done given the column type. If either value could be inserted, then the blank string should be inserted (since the encoded string NULL is available to specifically identify the NULL value).

Note that if the diff is being expressed in a table that allows nested structure (e.g. a JSON representation), a list representation for modified cells might be used to avoid this issue. There is no specification for that at this time.

Expressing a moved row

This can be ignored for tables for which row order is meaningless, e.g. in typical relational databases.

A row that have been moved in a table for which row order is meaningful is marked with a : tag in the action column and placed in the order it appears in the REMOTE table.

To avoid burdening human readers with too much arcana, tags are not combined when multiple kinds of actions apply to a row or column. So for example, a context row that was moved and had a cell added will not be tagged as :+ or +: or such-like, but rather by :. Cell addition can be determined from the schema row. These weak tags are included as aids for highlighting to express the most significant thing to know about a row.

Expressing a moved column

This can be ignored for tables for which column order is meaningless.

A column that have been moved in a table for which column order is meaningful is marked with a : tag in the schema row and placed in the order it appears in the REMOTE table.

If a diff that contains a : tag is used to patch a table for which column order is not meaningful, that tag should simply be ignored.

Reference: action column tags

Symbol Meaning
@@ The header row, giving column names.
! The schema row, given column differences.
+++ An inserted row (present in REMOTE, not present in LOCAL).
--- A deleted row (present in LOCAL, not present in REMOTE).
-> A row with at least one cell modified cell. -->, --->, ----> etc. have the same meaning.
Blank A blank string or NULL marks a row common to LOCAL and REMOTE, given for context.
... Declares that rows common to LOCAL and REMOTE are being skipped.
+ A row with at least one added cell.
: A reordered row.

Reference: Schema row tags

Symbol Meaning
+++ An inserted column (present in REMOTE, not present in LOCAL).
--- A deleted column (present in LOCAL, not present in REMOTE).
(<NAME>) A renamed column (the name in LOCAL is given in parenthesis, and the name in REMOTE will be in the header row).
Blank A blank string or NULL marks a column common to LOCAL and REMOTE, given for context.
... Declares that columns common to LOCAL and REMOTE are being skipped.
: A reordered column.