24 January 2017

CSV

PSV

So last month an idea surfaced from @j4mie for an alternative data format: poop separated values (PSV). Here’s the complete spec.

PSV Spec

At first, I laughed. Then, maybe I cried a little – 2016 went in a kind of 💩y direction at the end there. Finally I started thinking. I realized that PSV is a brilliant idea, and here’s why.

What we’re talking about here is the :poop: emoji, standardized as U+1F4A9 PILE OF POO in Unicode 6.0. Look at that code point for a minute: U+1F4A9. That’s a big number. That’s outside the Basic Multilingual Plane, my friends, and into the Astral Planes of the Unicode standard. You have to have your act together to deal with this. For example, Mathias Bynens covers all the muddles JavaScript gets into with poop in JavaScript has a Unicode problem. This is not the 128-odd ASCII characters your grandmother grew up with.
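To make "astral" concrete, here is a small Python sketch (my own illustration, not part of the PSV spec) that looks at how U+1F4A9 is encoded. The variable names are just for this example.

```python
# Inspect how U+1F4A9 PILE OF POO is represented in common encodings.
poop = "\U0001F4A9"

code_point = ord(poop)
utf8_bytes = poop.encode("utf-8")
utf16_bytes = poop.encode("utf-16-be")  # big-endian, no byte order mark

# Anything above 0xFFFF lives outside the Basic Multilingual Plane.
is_astral = code_point > 0xFFFF

# UTF-16 uses 2-byte code units, so an astral symbol needs a surrogate pair.
utf16_code_units = len(utf16_bytes) // 2

print(hex(code_point))    # 0x1f4a9
print(is_astral)          # True
print(utf8_bytes.hex())   # f09f92a9 (4 bytes)
print(utf16_bytes.hex())  # d83ddca9 (2 code units)
```

The two UTF-16 code units are why languages that index strings by code unit, JavaScript among them, report a length of 2 for a single poop.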

So isn’t that an argument not to use an astral symbol as a separator? Let’s reflect on the failings of CSV. Jesse Donat has a great list in a piece called Falsehoods-Programmers-Believe-About-CSVs, mirroring similar lists about names, geography and so on. The first 8 falsehoods are all encoding related. With CSV, if your data is a table of numbers, you don’t really have to think about encoding at all. That’s nice right up until the moment a non-ASCII character sneaks in there and it all goes pear-shaped. But if we lead with a mandatory poop symbol, from an astral plane no less, no-one is going to be able to punt on the encoding issue. You have to get it right up front.

The logical conclusion of this idea would be to borrow a string like Iñtërnâtiônàlizætiøn☃💩 from unit tests and use that as the delimiter. But 💩 alone gets us a good way there, and looks cleaner. Err.

Here’s an example of PSV in action. I’m editing a PSV file called checklist.psv that lists my current goals in life. I’m using emacs to edit the file, and git to view differences against a previous version.

PSV and daff

I’ve configured git here to use daff to view tabular differences cleanly. I’m doing this on my phone because phones currently excel at showing emoji – the same thing on a laptop also works fine but the poop is less cheerful looking.

One danger with PSV is that people could get sloppy about quoting rules. With CSV, you’ve a good chance of seeing a comma in your data, so you deal with quoting sooner rather than later to disambiguate that. The business data I’ve seen in CSV form has never had 💩 in it, so I could imagine someone skimping on quoting. One solution for that is for more of us to put poop in our names and transactions, Little Bobby Tables style (xkcd, @mopman’s company).
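Here is a rough Python sketch of what skimping on quoting would cost. It assumes PSV follows CSV-style double-quote quoting, which is my assumption rather than something the spec is quoted as saying here, and it leans on the standard csv module with 💩 as the delimiter. The helper names write_psv and read_psv are made up for this example.

```python
import csv
import io

POOP = "\U0001F4A9"  # the PSV delimiter


def write_psv(rows):
    """Serialize rows to PSV text, quoting fields that contain the delimiter."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter=POOP, lineterminator="\n")
    writer.writerows(rows)
    return buf.getvalue()


def read_psv(text):
    """Parse PSV text back into a list of rows, honoring quotes."""
    return list(csv.reader(io.StringIO(text), delimiter=POOP))


rows = [["goal", "status"], ["avoid " + POOP, "ongoing"]]
text = write_psv(rows)

# A quote-aware reader recovers the original two columns.
round_tripped = read_psv(text)

# A naive split on the delimiter breaks the second row into three pieces.
naive = text.splitlines()[1].split(POOP)

print(round_tripped == rows)  # True
print(len(naive))             # 3
```

The naive split is the sloppy path: it works on every file until the day someone's data contains the delimiter.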

There aren’t a lot of programs supporting PSV yet. So far as I know, daff is the first. The purpose of daff is making tabular diffs and helping with version control of data, but until format converters crop up you can use it to convert to and from PSV as follows:

pip install daff           # or npm install daff -g
daff copy foo.csv foo.psv  # convert csv -> psv
daff copy foo.psv foo.csv  # convert psv -> csv

Or you can write into the text boxes at the start of this post :-).

@fitzyfitzyfitzy