Hacker News | hermitcrab's comments

Yes. I really enjoyed it. Visually spectacular, pretty faithful to the book (as far as I remember the book) and Ryan Gosling was very good in it. They were even fairly faithful to the laws of physics (relative to other Hollywood movies, anyway).

Found it excruciating to watch. As in boring. And there are a few times where the movie seems ready to make its landing and finish, but something happens to extend it. I just couldn't get myself to care. I did enjoy the recent Sam Raimi movie, though.

The trackpad on my 2.5 year old MacBook Air stopped working. Apple wanted over £400 to fix it. Thankfully I found a local guy who did it for a fraction of that. Screw Apple.

>we don’t have a chunk of a full-time data analyst’s salary to spend on it

I found the errors in a few minutes with a $198 tool.


There's an obvious incentive for petrol stations to 'accidentally' set their price too low, so they get to the top of the table on services like yours. So they probably need to do more than add warnings.

Why did the title of this post get moderated from:

"Stop Publishing Garbage Data, It’s Embarrassing"

To the rather lamer:

"Twice this week, I have come across embarassingly bad data"

?


OP here. Ouch indeed. I did actually get it proofread. But that was missed. I can't fire my proofreader, as we are married. ;0)

Now fixed.


Ha ha. That was quick. Well done :)

Not fixed at this hour

You might need to do a refresh.

So you expect the 1000s of people trying to use the fuel price data to each individually clean and validate it, rather than the supplier doing it?

One of those people can republish their cleaned and validated version and the 999 others can compare it to the original to decide whether they agree with the way it was cleaned or not.

What...?

Hard disagree on that. They just need a basic smell test before they put it out.

>Clean data is expensive--as in, it takes real human labor to obtain clean data.

Yes, data can contain subtle errors that are expensive and difficult to find. But the 2nd error in the article was so obvious that a bright 10 year old would probably have spotted it.


Agreed--and maybe they should have fixed it.

But sometimes the "provenance" of the data is important. I want to know whether I'm getting data straight from some source (even with errors) rather than having some intermediary make fixes that I don't know about.

For example, in the case where maybe they flipped the latitude and longitude, I don't want them to just automatically "fix" the data (especially not without disclosing that).

What they need to do is verify the outliers with the original gas station and fix the data from the source. But that's much more expensive.


Exactly. This is a big problem with "open data". A lot goes into cleaning it up to make it publishable, which often includes removing data so that the public "doesn't get confused". Now I have to spend months and months fighting FOIA fights to get the original raw, messy data because someone, somewhere had opinions on what "clean data" is. I'll pass -- give me the raw, messy data.

I do not disagree with that, but I am not sure what "raw data" means in cases like the ones the article talks about. The 1.700.000 is no more or less raw than 1.700,000. Most probably somebody messed up some decimals somewhere, or somebody imported a CSV into Excel and it misinterpreted the numbers due to different locale settings. Similar to swapped longitude/latitude. That sounds different to me than, say, noisy temperature data from sensors. Rather, it seems more like issues that arose at the point of merging datasets together, which is already far from the data being raw.
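The locale theory is easy to demonstrate: the same string yields two very different numbers depending on which character is treated as the decimal separator. A minimal Python sketch of the ambiguity (the value and the helper function are illustrative, not taken from the article's actual data):

```python
# Illustrates how one number string parses differently under two
# decimal-separator conventions. Purely a sketch of the ambiguity.

def parse_decimal(s: str, decimal_sep: str, thousands_sep: str) -> float:
    """Parse a number string under an explicit separator convention."""
    cleaned = s.replace(thousands_sep, "").replace(decimal_sep, ".")
    return float(cleaned)

value = "1.700,000"

# European-style reading: '.' groups thousands, ',' marks the decimal.
eu = parse_decimal(value, decimal_sep=",", thousands_sep=".")  # 1700.0

# UK/US-style reading: ',' groups thousands, '.' marks the decimal.
uk = parse_decimal(value, decimal_sep=".", thousands_sep=",")  # 1.7

print(eu, uk)
```

A million-fold disagreement from one settings difference, which is roughly the scale of error the article describes.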

The issue imo is that a person closer to the point where the data was collected or merged is probably better equipped to understand what may be wrong with it than a random person looking into that dataset. So I do not think it is unreasonable to have people in organisations take a second look at the datasets they publish.


When I say "raw", what I'm referring to is the preservation of the data's chain of custody. If I'm looking at the data with an intent to sue the respective government agency, then I have strong legal reasons to make sure that the data isn't modified. If I start from open data for example, the gov agency will have their data person sign an affidavit making this very clear and I will lose my case basically immediately.

  The issue imo is that a person closer to the point the data was collected or merged is probably better equipped with understanding of what may be wrong with it
You'd think so, but like most systems, these are often inherited or not well thought out, so the understanding is external and we can't assume expertise within.

Or just omit the rows that are obviously wrong (and document the fact).

> omit the rows that are obviously wrong

This can skew the dataset and lead to misinterpreted results, if which rows are wrong is not completely random.

Eg if all data from a specific location (or year etc) comes wrong, then this kind of cleaning would just completely exclude this location, which depending on the context may or may not be a problem. Or if values come wrong above a specific threshold. Or any other way that the errors are not in some way randomly distributed.

Removing data is never a neutral choice, and it should always be taken into consideration (which data is removed).


>Removing data is never a neutral choice, and it should always be taken into consideration (which data is removed).

Absolutely. If you have obviously wrong data your choices are generally:

1. Leave the bad data in.

2. Leave the bad data in and flag it as suspect.

3. Omit the bad data.

4. Correct the bad data.

Which is the best choice depends on context and requires judgement. But I find it hard to imagine any situation where option 1 is the right choice.
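Option 2 is cheap to implement: keep every row as-is and annotate the suspect ones. A rough sketch in Python, assuming rows are plain dicts with a hypothetical price_gbp field and an assumed plausible price range:

```python
# Flag suspect rows instead of deleting them, so consumers keep the raw
# values plus the publisher's judgement. Field names and the sane-price
# bounds are assumptions for illustration.

SANE_PRICE_RANGE = (0.50, 5.00)  # assumed plausible pounds/litre bounds

def flag_suspect(rows):
    """Return a copy of rows with 'suspect' and 'suspect_reason' fields."""
    flagged = []
    for row in rows:
        out = dict(row)
        price = row.get("price_gbp")
        if price is None or not (SANE_PRICE_RANGE[0] <= price <= SANE_PRICE_RANGE[1]):
            out["suspect"] = True
            out["suspect_reason"] = f"price {price!r} outside plausible range"
        else:
            out["suspect"] = False
            out["suspect_reason"] = ""
        flagged.append(out)
    return flagged

rows = [{"station": "A", "price_gbp": 1.45},
        {"station": "B", "price_gbp": 1700000.0}]
print(flag_suspect(rows))
```

Nothing is lost: a downstream user who distrusts the flags can ignore them and get the original data back.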

Obviously the best solution is to do basic validation as the data is entered, so that people can't add a location in the Indian Ocean to a UK dataset. It seems rather negligent that they didn't do this.
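The entry-time check described above can be as crude as a bounding box. A sketch (the UK bounds below are approximate assumptions, and the swapped-coordinates check is one extra comparison):

```python
# Rough entry-time validation: reject coordinates that cannot plausibly
# be in the UK. The bounding box is approximate and purely illustrative.

UK_LAT = (49.8, 60.9)   # approx southernmost to northernmost UK latitude
UK_LON = (-8.7, 1.8)    # approx westernmost to easternmost UK longitude

def in_uk_bounds(lat: float, lon: float) -> bool:
    return UK_LAT[0] <= lat <= UK_LAT[1] and UK_LON[0] <= lon <= UK_LON[1]

def validate_location(lat: float, lon: float) -> str:
    if in_uk_bounds(lat, lon):
        return "ok"
    if in_uk_bounds(lon, lat):
        return "rejected: looks like swapped lat/long"
    return "rejected: outside UK bounds"

print(validate_location(51.5, -0.1))   # central London: accepted
print(validate_location(-0.1, 51.5))   # same point, swapped: flagged
```

It won't catch subtle errors, but it is exactly the kind of basic smell test that would have stopped the Indian Ocean row at the door.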


Like I said in a different post, there are legal reasons for why you would want the original data. Deleting the data from the dataset is negligent.

If you want something to blame, blame the system that allowed the data to be bad in the first place. You're pointing your finger at the wrong people and it's unreasonable of you to call them negligent.


"obviously wrong" is a never ending rabbit hole and you'll never, ever be satisfied because there will always be something "obviously wrong" with the data.

Messy data is a signal. You're wrong to omit signal.


100%. There is even signal in the pattern of errors. If you remove some errors but not others, you lose signal.

Deleting the row loses some information, such as the existence of that gas station.

A better solution is to add a field to indicate that "the row looks funny to the person who published the data". Which, I guess is useful to someone?

But deleting data or changing data is effectively corrupting source data, and now I can't trust it.


I love that they need a truck to transport 92 anti-protons.
