
    I normally dislike working with survey data since there 
    is a high possibility of selection bias among 
    the respondents.[..] For this reason, I will 
    show confidence intervals whenever possible to
    reflect the proportionate uncertainty for 
    groupings with insufficient data [..]
That... is not how statistics works(?). Confidence intervals quantify the sampling error that comes with small sample sizes, but they do nothing for systematic errors such as those introduced by selection bias.
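To make the point concrete, here is a rough simulation sketch. Everything in it is made up for illustration: the population of 1-10 self-ratings, the assumption that people with higher ratings are proportionally more likely to respond, and the helper names `mean_ci` and `biased_survey`. It is not the article's data or method.

```python
import math
import random
import statistics

def mean_ci(sample, z=1.96):
    """Normal-approximation 95% confidence interval for the mean."""
    m = statistics.fmean(sample)
    half = z * statistics.stdev(sample) / math.sqrt(len(sample))
    return m - half, m + half

def biased_survey(population, n, rng):
    """Draw n respondents with selection bias.

    Assumption for illustration: the chance of responding is
    proportional to the respondent's own rating.
    """
    return rng.choices(population, weights=population, k=n)

rng = random.Random(0)
population = [rng.randint(1, 10) for _ in range(100_000)]  # hypothetical ratings
true_mean = statistics.fmean(population)

for n in (100, 1_000, 10_000):
    low, high = mean_ci(biased_survey(population, n, rng))
    print(f"n={n:>6}: CI=({low:.2f}, {high:.2f}), true mean={true_mean:.2f}")
```

More respondents make the interval narrower, but it narrows around the biased value, not around the true mean. The interval only describes sampling noise.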

    [continued from above] and to also account for 
    possibility that a minority of respondents may 
    be dishonest and nudge their programming ability 
    a few points higher than the truth.
A surprising number of assumptions went into this sentence. I'd question the assertions that:

- people are "dishonest" (my intuition would point to a subconscious bias more than actual dishonesty)

- it's a minority (the second chart shows that >50% of respondents with one year of experience or less rate themselves as better than average)

- subconscious biases only work in one direction

... and once again I have no idea how confidence intervals can help. A wide interval may indicate noisy measurement. It may equally indicate high variability in the actual data.
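A small sketch of why the interval alone cannot tell those two cases apart. The numbers (a true value of 5, a standard deviation of 2, 200 respondents) and the helper `ci_half_width` are assumptions picked for illustration.

```python
import math
import random
import statistics

def ci_half_width(sample, z=1.96):
    """Half-width of a normal-approximation 95% CI for the mean."""
    return z * statistics.stdev(sample) / math.sqrt(len(sample))

rng = random.Random(0)
n = 200

# Case A (assumed): everyone's true value is 5, but the instrument is noisy.
noisy_measurement = [5 + rng.gauss(0, 2) for _ in range(n)]

# Case B (assumed): the instrument is exact, but true values genuinely vary.
real_variability = [rng.gauss(5, 2) for _ in range(n)]

print(f"noisy instrument: +/- {ci_half_width(noisy_measurement):.3f}")
print(f"real variability: +/- {ci_half_width(real_variability):.3f}")
```

Both cases produce intervals of about the same width, so the width by itself says nothing about whether respondents misreported or simply differ from each other.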

    Also, keep in mind that these groupings alone 
    do not imply a causal relationship between 
    the two variables.
... someone paid attention in his middle-school statistics club...

    Employing traditional regression analysis to 
    build a model for predicting programming ability
    would be tricky: does having more experience 
    cause programming skill to improve, or does having
    strong innate technical skill cause developers to
    remain in the industry and grow?
... but failed statistics 101. A regression analysis doesn't care about causality. If "Mac users are more likely to be college-educated" it doesn't matter that "buying a Mac may not actually make you smarter". I can still make the prediction "a given Mac user is more likely to have a college degree".
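Here is a toy version of the Mac example, as a sketch only. The data is simulated: a hypothetical latent "income" variable drives both Mac ownership and having a degree, and by construction owning a Mac has no effect on education. The regression is a plain least-squares fit written out by hand.

```python
import random
import statistics

rng = random.Random(0)

def simulate_person(rng):
    """Hypothetical person: income drives both outcomes independently."""
    income = rng.random()               # latent confounder in [0, 1)
    has_mac = rng.random() < income     # higher income -> more likely a Mac
    has_degree = rng.random() < income  # higher income -> more likely a degree
    return int(has_mac), int(has_degree)

people = [simulate_person(rng) for _ in range(20_000)]
x = [mac for mac, _ in people]
y = [degree for _, degree in people]

# Ordinary least squares for y = intercept + slope * x.
mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
slope = (sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
         / sum((a - mean_x) ** 2 for a in x))
intercept = mean_y - slope * mean_x

print(f"predicted P(degree | no Mac) = {intercept:.2f}")
print(f"predicted P(degree | Mac)    = {intercept + slope:.2f}")
```

Under these assumptions the fitted probabilities come out near 1/3 and 2/3. The prediction is useful even though the model contains no causal link from Mac to degree. What the fit cannot tell you is what would happen if you handed someone a Mac.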



Your condescending remarks ("...someone paid attention in his middle-school statistics club") detract from what is otherwise a very useful comment.


OP of the article here.

Microaggressions aside, these are fair counterpoints. I spent far less time editing the body of the post than optimizing the visualizations/Jupyter Notebook (especially in this particular post). I have taken more care with the posts I have written since.


If you are looking for feedback, here is another suggestion for future posts: I found the use of violin plots for discrete data to be confusing. To be honest, I still am not sure how to interpret the unlabeled Y axis. I think a histogram would have been easier for me (and others) to interpret.

But suggestions aside, I found your article to be interesting. Thank you for it.


I second the criticism of using violin plots here. A violin plot is a kernel density plot. It is designed to show a distribution at a scale much larger than variations in the data. Its raison d'etre is to smooth out and aggregate these small-scale variations.

But in this survey data, the values in the distribution are spaced far apart. The discretization is so large that the violin plots show meaningless and weirdly inconsistent curves between x-values which actually have data. A bar plot would be much clearer.
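A quick sketch of the effect, using a hand-rolled Gaussian kernel density estimate on made-up integer ratings. The bandwidth of 0.3 and the sample itself are arbitrary assumptions, and the plotting library used in the article may pick its bandwidth differently.

```python
import math
import random
from collections import Counter

def gaussian_kde(data, bandwidth):
    """Return a function that evaluates a Gaussian kernel density estimate."""
    norm = 1.0 / (len(data) * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - v) / bandwidth) ** 2) for v in data)

    return density

rng = random.Random(0)
ratings = [rng.randint(1, 10) for _ in range(500)]  # made-up survey answers

counts = Counter(ratings)                       # what a bar plot would show
density = gaussian_kde(ratings, bandwidth=0.3)  # what a violin plot would show

for x in (5, 5.5, 6):
    print(f"x={x}: responses={counts.get(x, 0)}, KDE density={density(x):.3f}")
```

Nobody answered 5.5, yet the density there is nonzero, and its exact shape depends on the bandwidth and not on the data. The counts have no such ambiguity.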

OP, I think you have done a fine job of styling your plots tastefully. But I recommend taking another look at the visual language you have chosen to communicate the data.


Agreed. In retrospect, the violin plot was an experiment that did not work out.


I learned that violin plots exist, so your experiment was worth it to me :)


I liked it :)



