Sara recently noticed a problem with a comma-separated value (CSV) formatted dataset generated by the AAVSO data download web form:
In particular, commas were present in fields such as commentcodes, comments, and comparison star fields. Additional commas confuse software attempting to load such a dataset (such as VStar) or people viewing it via software such as Excel.
This problem was also previously captured in a VStar SourceForge bug tracker:
http://sourceforge.net/tracker/?func=detail&aid=2952563&group_id=263306…
Part of the summary says:
For CSV AAVSO download format files, if the comment field (or any other text field) contains commas, this will cause VStar (or any other program) to misinterpret subsequent fields. In VStar this sometimes manifests as invalid observations with more than the max number of expected fields.
One way to handle this is to wrap each field in quotes. This is an option provided by spreadsheet apps like Excel.
As noted later in that tracker, in the past one solution has been to detect and eradicate commas in AAVSO Download Format, replacing them with spaces.
Will has suggested a more standard solution (also noted in the tracker) used by Excel: wrap fields containing commas in quotes. For example:
1,2,foo, bar,4
becomes:
1,2,"foo, bar",4
When a tool sees a field starting with a double-quote it must treat everything up until the closing double-quote as a single field. In this example, the fields would be:
1
2
foo, bar
4
It doesn't stop with embedded commas though. If a field contains quotes, these must also be delimited (or "escaped") by quotes. For example:
1,2,"foo", bar,4
becomes:
1,2,"""foo"", bar",4
In this example, the fields would be:
1
2
"foo", bar
4
Other issues are whether to:
- quote leading and trailing whitespace in fields (if intended to be significant);
- permit a field to be split across a line (by quoting);
- extend quoting to Tab Separated and Space Separated download formats, also permitted by the download web form;
- have an option to always quote every field;
- make quoting optional for backward compatibility for a finite amount of time before deprecating it.
Googling CSV will tell you more than you ever wanted to know about CSV. Here's a couple of pages that summarise the issue:
- http://en.wikipedia.org/wiki/Comma-separated_values
- http://supercsv.sourceforge.net/csvSpecification.html
There are "standards" documents of sorts (see the first link above) that are not always as helpful as might be hoped.
Whatever changes to the AAVSO Download CSV format are implemented server-side, any software that is dependent upon that format may need to be modified, depending upon how comprehensive its CSV parsing is and exactly what changes are made, and tested.
This post is intended to permit open discussion of the proposed changes and when an approach is decided upon by AAVSO staff, this can be made known here along with some idea of timelines so that tool developers can have some warning of the need to update their tools, if necessary.
VStar validates fields and tries to be as accommodating as possible before putting an observation into the "invalid" category, but if fields are incorrectly identified due to embedded commas etc, there's only so much it can do. When the time comes to modify VStar, I will consider not extending my existing hand-crafted implementation, but instead making use of an open source library (e.g. http://opencsv.sourceforge.net/) to provide a drop-in replacement for the small amount of code that retrieves fields from each row.
Thanks for listening.
Regards,
David Benn (BDJB)
One approach is (as I think Will and Aaron have considered) to make use of the CSV generation features/libraries of the programming language in which the data download tool is written (PHP or Perl).
Then any tool (such as VStar) that will consume the output of that CSV generation must become compliant with that output.
The main question re: this approach is:
Regards,
David (BDJB)
Another comment about this. A checkbox could be added to http://www.aavso.org/data-download asking whether to "Quote Columns?" or "Quote Fields?" or "Quote Values?" for CSV, TSV.
This would:
David