Identifier column in data sets from 2008 to 2010

(Victor Nițu) #1

We have a problem – we need to find a way to identify contracts from 2008-2010 sets. The id column is massively repeated in the CSV files. Anything else that could be extracted from sources and might help us with this?

Attached the complete output of counts (how many times each id is replicated, only for > 1 occurence)

$ cat 2008-2010.csv | awk -F',' '{print $1;}' | sort | uniq -c | sort -n | grep -v ' 1 '
Output: output.txt (2.2 MB)

(Victor Nițu) #2

Probably worth mentioning that 2008-2010.csv is a product of concatenating the 20{08,09,10} source files :smile:

Also some sample output:

$ cat 2008-2010.csv | awk -F',' '{print $1;}' | sort | uniq -c | sort -n | grep -v ' 1 ' | tail      
    430 130620-2009
    447 45506-2010
    519 31819-2010
    543 131143-2010
    543 192311-2010
    610 213631-2009
    620 51572-2010
    660 275353-2008
    743 40063-2009
    755 346493-2010


@victor talked w Georigiana about this yesterday, I will have a look at this as soon as I can.
It may as well be a problem of data cleanliness: I have found a contract that despite of being in the /en/ xml path contained a contract in all the languages … making it absurdly huge

(Victor Nițu) #4

Oh, that would explain a lot. Do you have any post-export checking method you currently use? I mean, I don’t know how to check if that’s the case or there’s some other unforeseen problem as well.

Random useful fact: today I learned cut can be considerably faster than awk for very basic operations.


no, I have absolutely nothing in place. Probably should. Live and learn.

Another random useful fact: cut isn’t nearly as awesome as csvkit

(Victor Nițu) #6

Whoa :frowning: And all those horrible awk one liners… useless… when I could’ve done it the easy way!