Hacker News | drew's comments

The problem is DS is really 2-3 different disciplines under one nebulous title. What you're describing is folks who are prototyping and productionizing models. That's definitely in short supply, but random STEM PhDs are in no way competitive for those roles unless they're coming from CS programs + have work experience in production engineering.

But that's by no means all of the DS field. There are lots of DS jobs where you're collecting and interpreting and communicating about complex data sets. An engineering mindset is occasionally helpful, but a bias towards building versus towards analyzing and writing can just as often be counter-productive. Not all problems are solved by systems; lots of problems are solved by better understanding the problem and then letting other specialists build the right solution.

The bootcamps have contributed to the problem by focusing so much on building things. The idea that you can go from an econ undergrad to being a self-sufficient member of a production ML team in 6-12 weeks is nuts. What's less nuts (and what I wish programs like Insight focused on) is taking people from having data skills in one domain and with one set of tools (e.g. longitudinal medical record data, stored in CSVs and handled in Stata) to another set of tools (billions of rows of event-based product data stored in a data warehouse, processed in R or Python). But instead the bootcamps behave like the missing skillset is the ability to make a predictive random forest model on some arbitrary data set and build an AWS web app around it. THAT job market definitely doesn't exist and is completely over-saturated.

But people who are smart communicators about data, can manipulate and make sense of massive data sets, can ask incisive questions about their data, and can use data to convince people of a complex argument are always going to have job opportunities, even if they're not production grade engineers. If that sounds like you, I'm hiring - hit me up on Twitter: @drewwww.


"Not all problems are solved by systems; lots of problems are solved by better understanding the problem and then letting other specialists build the right solution."

Agreed wholeheartedly. Reminds me of another quote from http://www.john-foreman.com/blog/surviving-data-science-at-t... :

""" You know what can keep up with a rapidly changing business?

Solid summary analysis of data. Especially when conducted by an analyst who's paying attention, can identify what's happening in the business, and can communicate their analysis in that chaotic context.

Boring, I know. But if you're a nomad living out of a yurt, you dig a hole, not a sewer system. """


I was part of the Data Science team in my previous company. We mainly built models for production, but we were also responsible for generating both daily and ad hoc reports. We tried to hire someone to take over the reporting part, but we found out that even that requires engineering skills. This role ended up being even more difficult to hire for because it's hard to find someone who has the engineering skills but wants to work only on reporting. Maybe if we had had a dedicated data engineering team the story would be different.


That is me. A decent developer who gets data and enjoys reporting, particularly the variety of hard problems that crop up.

I am happily serving a niche market with my own company and I suspect part of the difficulty in finding the skillset is that we can just start our own thing when we find a domain that we like.


Yeah, in my org we have a dedicated data engineering team that owns the pipeline and production data systems plus an analytics team that owns reporting and making data available to non-technical data consumers. That leaves complex research work for data scientists who are (in theory, anyway) building on top of stable data infrastructure and good data sources and free from ad hoc reporting work.


In France, and other Francophone countries, Statistics and Information Analysis is an engineering discipline.


In principle, Persona is great. Not storing passwords is awesome, a non-FB/Google/Twitter identity option is important.

I would encourage you, though, to look carefully at your login completion metrics. I implemented Persona on my site (http://www.sixquestions.co) to have a pure email option and although users clearly prefer it, about 35% complete the Persona login flow successfully. That's 10 points lower than our next-worst performer (Twitter), and half the rate of our best performer (Facebook). For all the concerns people have with authorizing Facebook/Twitter access, that is (in my view) offset by the alien-ness of Persona's login flow. We've heard from lots of users that logging in with Persona is unusual and they thought they were doing something wrong because they'd never seen anything like that.

So, as much as I believe in Persona, I'm about to deploy a change that removes it entirely. It adds a lot of surface area to our testing and future development, but if it means we lose fewer users in their signup flow, it will be worth it.


Here's an example: I just failed to login to Zonino myself.

I enter the Gmail address that I use for registrations and other junk. I get the message: "Accounts don't match. You are currently signed into Google as [my normal Gmail address]. ... Force Google logout?" Forget that. I'm not interested in logging out of Gmail. Logging out of #1, into #2, out of #2 and back into #1 is more work than simple registration. I expect that I'm not the only person with this problem. I hope a solution can be found, because it would be really helpful.


Gah. We still need to switch from OpenID to OAuth for our GMail bridge; OpenID doesn't allow us to tell Google what address we're trying to authenticate. Sorry!


Need any help with that/is there a ticket?



Normally, when you're logged into multiple Google accounts, the Google bridge in Persona lets you pick which Google account you want to use. The error you're seeing seems like some sort of bug/UX issue (?) in the Persona -> Google flow: if you're already signed into multiple Google accounts, but not the extra one you're trying to use, things don't work as smoothly.


Interesting. Persona likely needs work when it comes to multiple Gmail accounts when using the Account Bridging.


They're working on it: https://github.com/mozilla/persona-yahoo-bridge/issues/178 You should post your findings there


That's for Yahoo, not Gmail.


Do you get this problem if you use a different browser for the other gmail account?


No, and that is what I would usually do since I understand how these things work. (IE does come in handy at times.) The point that I am trying to make is that normal users who don't know these tricks can run into this barrier.


I don't think normal users have multiple Google accounts, though.


Eh, that depends.

All it takes is a personal Gmail account plus working for a company that uses Google Apps.


Agreed. Although Persona's technical basis and privacy protections are second to none, the UX is nothing to write home about. It still feels too much like OpenID, and we know what happened to OpenID. Facebook and Twitter can get away with cross-site redirects because they're well known and people trust them. Persona doesn't have that benefit, so it can't get away with the same cumbersome UX. It needs to do better, much better. The market is unfair. Deal with it.

If you're in the business of implementing an alternative login system, you should also seriously think about what kind of UX you're competing against. Your ultimate competitor isn't Facebook or Twitter. It's the good old email-and-password login system that everyone is used to. You enter your email address, select a password, and you're in, without ever leaving the signup page! It's even easier if you use a password manager like LastPass. That's what you're competing against, and if your UX has any more steps or redirects than that, you're probably doomed.


The paradox here is that people are more familiar with the appearance of home-grown-style login systems and are more willing to follow through on those than the novel Persona flow, even though the security characteristics of Persona are stronger. It's a chicken and egg problem, and until someone really big takes the plunge and gets everyone comfortable with this style, anyone implementing it is going to be somewhat of a cost to signups.


I think you've hit the nail on the proverbial head here.


If the bridge supports the 3-4 major email providers, it effectively becomes "log in with your email address" (it already supports Gmail), and A LOT of the friction goes away.


OpenID never really made it, not just because of its bad UX design but also because it never got major players to push it to the public. Google or Facebook would much rather have you use their service as login credentials, as it makes more monetary sense to do so than to hand it over to some non-profit foundation like Mozilla or OpenID. Data = money in this world and everyone wants more money.


Google, Yahoo and AOL all support(ed) OpenID login using their site as an IDP.

Google's FriendConnect was built on it.

That's a fair bit of "push".


There are 2 issues with Persona.

1) Users don't already have a Persona account set up. They're used to hitting "login with FB/Google" instead. They don't know that Persona is better privacy-wise. So for many, it's just friction.

2) Persona login sometimes appears slightly slower.


There's an OpenID bridge [1] to make it easier for GMail users to sign in using Persona if they're currently logged into GMail/Yahoo! [2]. I haven't used it but the end goal of Persona is that 3rd party email providers can be their own Persona identity providers.

We definitely need something like Persona but I share your concerns WRT friction. Chicken meets egg.

[1] http://identity.mozilla.com/post/56526022621/what-is-an-iden...

[2] http://identity.mozilla.com/post/57712756801/persona-makes-s...


Your data (especially the fact that users clearly prefer it) tells me that they're clicking it out of curiosity, to see how they can log in with their email.

Unless by "clearly prefer it" you don't mean the initial button click, but the final login?


You can see how we communicate it on our site if you're curious. Basically it's a modal dialog with four options:

* Facebook
* Google
* Twitter
* Email

We don't use the persona messaging, and I think people's expectation when they click the 'email' button is that it's going to just be a normal email flow. We don't call it Persona or Browser ID or use any of their assets or messaging, because we didn't think anyone would click on it if we did.

But yes, we see a small preference for a button labeled 'email' versus facebook, and a medium preference for either over twitter.


I did try it (and signed in) to your site. I'll admit, I knew I was going there for Persona, and "sign in via email" got me curious to click on it, even though I already knew what it was.


Why not keep it as a (perhaps less prominent) alternative?


Just tried it on your site. It went easily enough. I entered my Gmail address in the Persona form, then it had me pick which Google account to use (strange that it wouldn't just choose the one for the Gmail address I entered), then it said I was signed in.


The gmail case is special, actually. For a few domains (gmail, yahoo, not sure which others) it will fall back to a flow that's more like OAuth. But for unknown domains it sends you an email with a link, and then requires you to create a new account (with a new password) that is persona-specific.
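
As a rough sketch of the branching described above (not Persona's actual implementation), the choice between the two flows can be modelled as a lookup on the email domain. The `BRIDGED_DOMAINS` set, the function name, and the returned labels are all hypothetical, and the sketch ignores the case where a provider acts as its own identity provider.

```python
# Hypothetical list of domains handled through an OAuth-like bridge.
# The comment above only names gmail and yahoo; the real list may differ.
BRIDGED_DOMAINS = {"gmail.com", "yahoo.com"}


def login_flow_for(email):
    """Return which (hypothetical) flow a given address would go through."""
    _, separator, domain = email.rpartition("@")
    if not separator or not domain:
        raise ValueError(f"not an email address: {email!r}")
    if domain.lower() in BRIDGED_DOMAINS:
        # Known provider: hand off to the provider's own sign-in page.
        return "bridge"
    # Unknown provider: send a verification link and create a
    # Persona-specific account with its own password.
    return "email_verification"


print(login_flow_for("someone@gmail.com"))
print(login_flow_for("someone@example.org"))
```

The second branch is the one with the extra steps, which is presumably where most of the drop-off happens.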


I agree. Though Facebook tends to track users, people are so used to seeing the Facebook login button that they feel comfortable with it. Persona, though really good, feels different and makes me a bit uncomfortable as an end user.


How do you know users clearly prefer it?


We track clicks on each of the four login methods, and compare it with successful sign in events with each of them. So we know completion rates for each type, plus which types are preferred by users. Nothing fancy, just google analytics events + checking the users table.
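
A minimal sketch of that calculation, assuming the clicks and successful sign-ins have already been exported as plain lists of method names. The counts below are made up for illustration (chosen only to land near the rates mentioned earlier in the thread) and the function name is hypothetical; this is not the site's actual code.

```python
from collections import Counter


def completion_rates(clicks, successes):
    """Return {method: successful sign-ins / button clicks} per login method."""
    click_counts = Counter(clicks)
    success_counts = Counter(successes)
    return {
        method: success_counts[method] / count
        for method, count in click_counts.items()
    }


# Made-up event logs: one entry per button click and per completed sign-in.
clicks = ["email"] * 100 + ["facebook"] * 90 + ["twitter"] * 40
successes = ["email"] * 35 + ["facebook"] * 63 + ["twitter"] * 18

rates = completion_rates(clicks, successes)
for method, rate in sorted(rates.items(), key=lambda item: item[1]):
    print(f"{method}: {rate:.0%} completed")
```

The click counts alone show which button people prefer; dividing successes by clicks shows how many of them actually make it through.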


Very cool! Excited to see what comes next.

Is there anything people looking forward to these features should know to plan? Like, I was about to start doing some load balancing stuff soon. Is that a waste of my time and I should just work on the core logic and figure load balancing is just going to be made easier in the near future and I shouldn't waste my time with it? Or should I roll my own option for now and then just figure at some point I should be able to transition to something that performs better in the future?


This is obviously not super precise, but the CPU pegs (load average 1.0, 100% usage) once we get above ~9k messages/second. I don't know how much overhead there is on socket in terms of bandwidth, but some back of the envelope math makes it seem like we're well below network bandwidth on a 100 Mbit interface. My instinct (per rauchg's comment above) is that if those kinds of improvements can be made, then we're not yet butting up against a bandwidth issue.
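
The back of the envelope math goes roughly like this. The 200-byte on-the-wire message size is an assumption for illustration (the real per-message overhead wasn't measured); the link speed and message rate are the figures quoted above.

```python
LINK_BITS_PER_SECOND = 100_000_000  # 100 Mbit/s interface
MESSAGES_PER_SECOND = 9_000         # rough rate at which the CPU pegs


def link_utilization(bytes_per_message,
                     messages_per_second=MESSAGES_PER_SECOND,
                     link_bps=LINK_BITS_PER_SECOND):
    """Fraction of the link used at a given message size and rate."""
    return bytes_per_message * 8 * messages_per_second / link_bps


def break_even_bytes(messages_per_second=MESSAGES_PER_SECOND,
                     link_bps=LINK_BITS_PER_SECOND):
    """Message size (in bytes) at which the link itself would saturate."""
    return link_bps / 8 / messages_per_second


ASSUMED_MESSAGE_BYTES = 200  # assumption, including framing overhead

print(f"utilization at {ASSUMED_MESSAGE_BYTES} B/msg: "
      f"{link_utilization(ASSUMED_MESSAGE_BYTES):.1%}")
print(f"link saturates at about {break_even_bytes():.0f} B/msg")
```

So unless each message costs well over a kilobyte on the wire, the CPU runs out long before the network does.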

As for comparing to other systems, I think we're certainly at the same order of magnitude, but some rough tests a friend of mine ran made it seem like a java server/client pair could send about 4 times as many packets per second. But that's a somewhat unfair comparison - socket.io offers a lot more than simply raw packet transmission and that additional abstraction is going to come at a performance cost. But I think being on the same order of magnitude as a really low level approach like that is a good place to be at this level of maturity of the technology.


When I say it compares in performance to other systems I mean other high level more-than-a-socket systems like XMPP servers, which I spent a lot of years benchmarking and optimizing. Of course to-the-metal binary packet systems will be much faster as you're doing the buffer management and choosing your copies wisely.

It would be super-awesome to see a profile of node.js when you run this test to see where the CPU is going. Is it something low level like the string parsing, buffer copying, or socket writing, or maybe something higher level like some inefficient algorithm somewhere? As I read below that socket.io is focusing on performance now, I guess we'll know soon. :)


I'd love to know too, but I haven't had much luck getting profiling to work super well with node. I think I mostly just don't know what I'm doing and someone who did could answer this pretty quickly, but I haven't been particularly successful at figuring out exactly how that CPU time is being spent.


(disclosure: one of the authors)

We're definitely not interested in business success in any sense. In some ways, anti-success is more interesting to us as researchers because academics tend to study communities that are successful, popular, and self-sustaining (like Wikipedia, Twitter, Facebook, etc). That introduces a bunch of funny biases in our understanding of how identity and persistence work in online communities: we've only really studied sites that have pseudonyms or real names and have searchable archives.

To the extent that we do care about "success", it is useful to be able to say "4chan has TONS OF POSTS" because it's a way of saying that the site does, in some sense, work for quite a few people. If it was just a quirky design but wasn't tremendously high traffic (I think the comparison we make in the paper is that /b/ alone has 16 times more daily posts than the 8 biggest news groups on Usenet), it would be a lot harder to get something like this published. People would be more inclined to brush it off and say "well, it's the internet - you can get a few thousand people to use anything." But that kind of traffic is hard to deny. Plus 4chan's role in the web's collective unconscious makes for a great story, too. So bottom line, what researchers choose to study is very strategic, but the metrics that matter definitely aren't about money, but there are lots of other factors.


The thing is, cp includes self shots of people under 18. There is TONS of that on /b/, of both genders. It's awfully hard to tell 19 from 17 when you're taking a topless picture of yourself in the mirror.


Nobody, mods included, cares about the borderline cases where the subject may or may not be 18. They care about the pictures of intercourse with 9 year olds.


Meh, I don't really consider all the camwhores on /b/ and the new /soc/ to be cp really. It's when they start to have really small or no tits and larger head to body ratio sizes - 12ish and under - that you're in cp territory. This is true cp, and what the real btards on places like 7chan indulge in. The current race on /b/ is full of the last waves of summerfag and other general newfag cancer. There was a time a lot of actual cp was posted on /b/, but that times' long gone.


Unfortunately, this isn't just a question of interpretation. Legally, there's no differentiation between those two categories. One is more ethically problematic than the other for sure, but you're just as screwed having either on your computer. You can see this most clearly in sexting cases against high school students. You can still get on the sex offender registry for sending a picture of yourself to your boyfriend or girlfriend and there's tons of stuff like that on /b/.

