
I think you're missing two fatal problems with this "publish all raw data and code" mindset. I don't think the desire for commercialization is high on the list of fatal problems preventing people from publishing data+software.

1) How do you handle research in domains where the data is about people, so that releasing it harms their privacy? Healthcare, web activity, finances. Sure, you can try to anonymize it, but anonymization is imperfect, and even fully anonymized data can be joined to other data sources to re-identify people; k-anonymity only works in a closed ecosystem. If we live in a world where search engine companies don't publish their research because of this constraint, that seems worse than the current system.
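
To make the join risk concrete, here's a toy sketch (pandas; all datasets and column names are hypothetical) of how an "anonymized" release can be re-identified by joining on quasi-identifiers:

    # toy linkage attack: join an "anonymized" release to public
    # auxiliary data on quasi-identifiers (all data here is made up)
    import pandas as pd

    released = pd.DataFrame({
        "zip": ["02139", "02139", "94105"],
        "birth_year": [1985, 1990, 1985],
        "sex": ["F", "M", "F"],
        "diagnosis": ["asthma", "diabetes", "depression"],
    })  # direct identifiers stripped, quasi-identifiers remain

    voter_rolls = pd.DataFrame({
        "name": ["Alice", "Bob", "Carol"],
        "zip": ["02139", "02139", "94105"],
        "birth_year": [1985, 1990, 1985],
        "sex": ["F", "M", "F"],
    })  # publicly available auxiliary dataset

    # the join re-attaches names to the "anonymous" records
    reidentified = released.merge(voter_rolls, on=["zip", "birth_year", "sex"])
    print(reidentified[["name", "diagnosis"]])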

2) How does one define "re-runnable processing"? Software rots, dependencies disappear, operating systems become incompatible with software, permission models change. Does every researcher now need a docker expert to publish? Who verifies that something is re-runnable, and how are they paid for it?


> 1) How do you handle research in domains where the data is about people, so that releasing it harms their privacy?

that's an interesting problem that i have not thought about.

i think maybe this is not a technical problem, but more an ethical one. under the open data approach, if you want to study humans you probably would need to get express informed consent that indicates that their data will be public and that it could be linked back to them.

> 2) How does one define "re-runnable processing"? Software rots, dependencies disappear, operating systems become incompatible with software, permission models change. Does every researcher now need a docker expert to publish? Who verifies that something is re-runnable, and how are they paid for it?

one defines it by building a specialized system for the purpose of reproducible research computing. i would envision this as a sort of distributed abstract virtual machine and source code packaging standard where the entire environment that was used to process the data is packaged and shipped with the paper. the success of this system would depend on the designers getting it right such that researchers _wouldn't_ have to worry about weird systems-level kludges like docker. as it would behave as a hermetically sealed virtual machine (or cluster of virtual machines), there would be no concerns about bitrot unless one needed to make changes or build a new image based on an existing one.

the good news is that most data processing and simulation code is pretty well suited to this sort of paradigm. often it just does cpu/gpu computations and file i/o. internet connectivity or outside dependencies are pretty much out of scope.
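
to sketch what i mean concretely (the manifest format and names here are hypothetical, just illustrating the idea), the packaging standard could pin every input to the computation by content hash, so "re-runnable" becomes mechanically checkable:

    # minimal sketch of a content-addressed environment manifest
    # (hypothetical format and hashes; the point is that every input
    # to the computation is pinned, so reproduction is verifiable)
    import hashlib

    manifest = {
        "vm_image": "sha256:9f86d081...",   # hermetic base image
        "code": "sha256:2c26b46b...",       # analysis source tree archive
        "data": "sha256:fcde2b2e...",       # raw dataset archive
        "entrypoint": "run_analysis.sh",    # no network, file i/o only
    }

    def artifact_hash(path):
        # hash a local artifact so it can be checked against the manifest
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return "sha256:" + h.hexdigest()

    def verify(path, expected):
        # reproduction fails loudly instead of bitrotting silently
        actual = artifact_hash(path)
        if actual != expected:
            raise ValueError(f"{path}: got {actual}, expected {expected}")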

i don't think it's hard... there just hasn't been the will or financial backing to build this out right and therefore it does not exist.


> i think maybe this is not a technical problem, but more an ethical one. under the open data approach, if you want to study humans you probably would need to get express informed consent that indicates that their data will be public and that it could be linked back to them.

As someone who wants science to advance, I want highly trusted researchers to be able to do studies that involve my private, personal data, that I would not consent to being public and linked back to me.

It is highly important to me that we allow these studies to not use open data.

A great example of this is the US College Scorecard, which uses very private tax returns to measure how much college degrees and majors contribute to income (not the only value of a college education, but certainly an important one):

https://collegescorecard.ed.gov/

Only a high degree of trust allowed results based on extremely private information to be published, and I think that makes for a better world. I am pro-open data, but research on non-open data should absolutely exist.

For instance, should all research about mental health for transgender people be abolished? Anything on that subject is not going to be open, or at least those who would be open to their data being public are probably a non-representative subset.


> get express informed consent that indicates that their data will be public and that it could be linked back to them.

10~20 years ago I could see it. Nowadays it's a tough ask that would severely limit the number of people participating. This could also steer away most minority groups, which would make the research not only limited but also misleading (we'd still draw conclusions from it, and decide policies accordingly, even though it came from grossly biased participant pools).

Aside from just the public aspect of having one's data in the open, there are also second/third-order discoveries that would follow from it (e.g. knowing someone's cooking habits could be enough to deduce overall health status, potentially chronic illness, ethnicity/religion, relationship status, etc.).


It does exist. It's called GNU Guix.


> 2) How does one define "re-runnable processing"? Software rots, dependencies disappear, operating systems become incompatible with software, permission models change. Does every researcher now need a docker expert to publish? Who verifies that something is re-runnable, and how are they paid for it?

This is always a problem even with some of the most open scientific code.

Requiring that the code be published, and perhaps having a peer reviewer run it just once (with a bit of support) to ensure that the submitters aren't completely bullshitting, before the paper gets approved for publication, might be a good start.


From my experience in the digital health sector, concern for privacy is always the reason given for not sharing anything valuable and/or useful to others. But it's just a convenient way of hiding the 'desire for commercialisation'.


this is also true, and it also runs within science itself. if someone spends two years collecting some data that is very hard to collect and it has a few papers worth of insights within it, they're going to want to keep that data private until they can get those papers out themselves lest someone else come along, download their data and scoop them before they have a chance to see the fruits of their hard labor.

while it's not great for science at large, i don't blame them either.


It's solvable if publishing the dataset counts as a paper, and citations of the dataset (which should be required) count as citations for e.g. tenure.

For example, ImageNet is a very expensive and difficult dataset to produce, and it has resulted in revolutionary advances in machine learning. People build models on it, cite their results as evidence their models are good, and cite the paper.


This is an interesting idea. Although I am afraid that publishing a dataset, even a good one, will not be considered "real science" by our (broken) institutions.


You have a valid point here. It's probably utopian, but to me the only reasonable answer to this is to acknowledge that science is a collective process. Of course, this goes against the (stupid) idea that science is made by a few extremely deserving geniuses...


It seems like it's just that the goals are different. Computer science researchers could also make the opposite case, "The underwhelming impact of software development on producing state-of-the-art computer science discoveries."


I think your comment is pretty spot on. Just three things I'd add: 1) the advisor is not intentionally downgrading your letter to "just good"; it's that they're obligated to write a better-than-"just good" letter for someone who has been more productive, both in terms of research and in being a leader in the community (these two things go hand in hand). Writing the same letter for two students regardless of what they accomplish would be unfair.

And 2) sometimes getting more funding is simply not possible, as in the advisor has basically reached the limit of what they can do. There's a limit to the number of proposals that one can submit and the number of calls that fit their research agenda. So what I think you're missing is that if an advisor has less funding, there's going to be more pressure to finish sooner and less freedom to explore ideas beyond what's written in a previous grant proposal.

3) I've never heard of a tenured professor who is that concerned about their publication rate. In fact, most of them don't even update their CVs or websites with the last few years of papers. It's always the student who is trying to get more papers.


I'd like to see a thread in a forum for dentists where they offer clever suggestions for how to change software development. Clearly, all these dentists have used plenty of software before.

They would have seemingly great ideas like "programmers should get 10% of their pay docked each time I encounter a bug in a program" or "the real solution is to hire someone to test the program from start to finish before releasing it".

I bet programmers on Hacker News would be livid upon hearing these suggestions, but they seem to have no problem announcing their clever solutions for other disciplines (not excluding myself).


Terrible analogy. What authority might a team of QA testers have on restructuring the methods of software development? Perhaps not as much as a software developer, but certainly enough to grope at, and possibly conceive of, new and useful ideas. A student is a QA tester for a classroom, not a dentist using a program; they were involved in the entire process and saw basically every mechanism, with varying levels of ignorance about motivations.

Comparing a student to an end-product consumer of software development is an embarrassingly long stretch - teaching is an art, but it's not that complicated nor disconnected from its patrons.


Students are users of a class, as much as a dentist is a user of a dental software application. Neither were involved in the making of the class or software. Both were delivered the experience as designed by the course staff (instructor and/or TAs), or software team. Both only see the final product after months or years of development, done by specialists trained for many years even before that.

The error you're making is stating that "students are involved in the entire process", which is laughable. Many classes have gone through years of iteration, and even new courses take many months to develop before students set foot in the classroom, not to mention the years of experience and education needed to get the instructor to the point where they can even make a class in several months.


The only way you can refute this is by being obtuse. QA testers and dentists using software are not the same, not at all - that was the premise of my whole point, and you countered it by pretending I didn't say it.

I assume you are a teacher, so what's truly laughable is that I must explain this to you: teaching students is at least a factor more intimate than delivering a finished piece of software or other commodity to a user - it's also a factor less complicated to deliver an MVP. If we are talking textbooks, then your objections apply - I would hope teaching in your mind occupies a distinct space from textbooks.

Surely many students displeased with instruction will be ignorant of some nuances and limitations of structuring a class, but the gap between good prescription and naive wishes is not half a career's worth of experience when it comes to teaching. I'm sorry to burst your bubble: teaching is not rocket science, it is not software engineering - the prestige a teacher earns comes from either their qualifications in an advanced field or their effectiveness in imparting knowledge (or perhaps their proclivity to allow cheating, depending on how you measure).

There are surely many teachers instructing arithmetic who understand teaching better than doctorate professors. Are you going to tell me that a piece of writing outlining common pitfalls, an earnest reader, and a creative mind really get you no further than a fresh, apathetic track-following graduate when it comes to innovation? Get real.

If teaching is the only thing that makes you good at teaching then I guess we should ignore all sharing of information about it, even from those who are good at it. Well, or we could just ignore your sentiments about who should be allowed to innovate in public.


Simply because using absolute points would lead to debate over every single decimal of a point. A student with 89.95% would be determined to fight their way to 90.05%, and then continue to fight for 90.15%, so it only makes the problem worse.


That's not my experience. There were no A-F grade boundaries in any of the schools I've ever attended. Students only ever cared about the one boundary that still remained: minimum passing grade. I saw exactly one student trying to get a teacher to round up a 99% grade to 100% just so he could say he had a perfect grade, nobody else cared about decimals unless they were in danger of failing a class.

Boundaries are the problem. Standardized testing for post-graduate positions in my country would give students bonus points based on their overall academic performance. Full bonus points for grades > 90%, half points for grades > 80%... Of course students wanted to optimize that number as much as possible and I don't really blame them.
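
To make the incentive concrete, the bonus was a step function, roughly like this (exact thresholds and point values are illustrative, from memory):

    # illustrative step function for the bonus scheme described above
    # (thresholds and point values are not from a primary source)
    def bonus_points(grade_pct: float, full_bonus: float = 10.0) -> float:
        if grade_pct > 90:
            return full_bonus        # full bonus points
        if grade_pct > 80:
            return full_bonus / 2    # half points
        return 0.0

    # a student at 89.9% gains more from arguing +0.2% than from months
    # of extra studying, so of course they optimize for the boundary
    print(bonus_points(89.9), bonus_points(90.1))  # 5.0 10.0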


This didn't work when I tried it. For how many students, or how many times, have you implemented this policy? First, it doesn't make sense when the regrade is purely objective (like points being added up wrong, or the grader not seeing something the student wrote). And if you say it doesn't apply to straightforward grading mistakes, then you get emails asking whether something counts as a grading mistake or carries the risk of lowering a grade.

And when I tried this policy, I got students who wrote in my course evals something like "the professor intentionally tries to scare students away from asking for regrades by threatening to lower their grade even more." And what about when you are still asked for a regrade? In my experience that still happened, maybe a third to half as often as without the policy, and in those cases you end up doing way, way more work per request, so the level of effort actually increases.


He seems to have good intentions, but does not seem to have knowledge of IRB processes, which may make this situation worse.

Specifically, he confuses "does not constitute human subjects research" with "exemption", which is a pretty big difference, and anyone who works with human subjects should know this.

From his Twitter thread, "Update: They are now saying they have an exemption. They have not made any forms available or explained the lack of informed consent."

Exemptions are protocols that have been reviewed and deemed exempt based on one of 8 very specific criteria. Studies deemed not to constitute human subjects research are returned by the IRB and not considered reviewed.

Given that the authors actually said "...to the Princeton University Institutional Review Board, which determined that our study does not constitute human subjects research" this is clearly NOT an exemption, and informed consent is not a consideration as far as the IRB is concerned.


This is actually the most technically correct answer on this page. Everyone is going by their own opinions about definitions of what constitutes human subjects research, rather than starting from the primary sources. IRB guidelines are dictated by the federal government's "Common Rule", a common standard adopted by all institutions that receive federal funding.

"about whom" is a key criteria from the federal government to determine whether something fits the definition of human subjects research. Here's a quote from HHS:

"The phrase ‘about whom’ is important. A human subject is the person that the information is about, not necessarily the person providing the information. In the case of biospecimens, the human subject is the person from whom the specimen was taken."

https://www.hhs.gov/ohrp/sites/default/files/OHRP-HHS-Learni...

Reading that, it's clear that the Princeton study does not fit the definition of human subjects research. The complainants may be able to sue the university for damages, but not because the study was improperly classified as human subjects research.


The bit you've quoted is intended to clarify that "about whom" means the subject is the patient, even if the researcher gets the information indirectly through the patient's doctor. Earlier in the document you linked, it states:

> If for the purpose of a research study [...] An investigator [...] interacts with a living individual, [...] Then The research likely involves human subjects.

What's up for debate here is whether this research qualifies for one of the specific exemptions in the regulation. The general definition in the regulation is broad enough to include all interaction with living humans that produces information used for the study, and is only narrowed by later enumerated exemptions.


Not at all,

1) this is clearly not an exempt study, which is a category of its own that the IRB reviews and makes a judgment on. The authors would immediately have been able to point to the protocol number of the exempt study if it were exempt. Rather, it's not considered human subjects research, as the authors clearly state in their FAQ.

2) it seems like you're thrown off by the example, because if you had ended your sentence at "The bit you've quoted is intended to clarify that "about whom" means the subject is the patient" then we would be in agreement, and it'd be more obvious that the subject is, in fact, the website's policies/procedures. Here's an excerpt from guidance on the written text of the Common Rule:

"“About whom” – a human subject research project requires the data received from the living individual to be about the person."

https://hso.research.uiowa.edu/defining-human-subjects


> this is clearly not an exempt study, which is a category of its own that the IRB reviews and makes a judgment on. The authors would immediately have been able to point to the protocol number of the exempt study if it were exempt. Rather, it's not considered human subjects research, as the authors clearly state in their FAQ.

Please don't use such circular logic. We're debating whether the research properly qualifies as human subject research; we're not debating about what the IRB actually decided on that question, because they may have gotten it wrong.

> then we would be in agreement, and it'd be more obvious that the subject is, in fact, the website's policies/procedures.

The policy itself is certainly the intended subject of the research. But the methods they've chosen mean they are also collecting and analyzing information about the responses of real live humans to their interactions and interventions, and that qualifies this as human subject research irrespective of the naive intentions of the researchers. Having a non-human subject does not preclude also having a human subject.


> Please don't use such circular logic. We're debating whether the research properly qualifies as human subject research; we're not debating about what the IRB actually decided on that question, because they may have gotten it wrong.

Yes that's what we're debating. But you used the word "exemption" which has a specific technical meaning in human subjects research, and I'm saying that it's not an exemption. There are 8 tests for exemption, and I'm pointing out that this is not an IRB exemption.

> The policy itself is certainly the intended subject of the research. But the methods they've chosen mean they are also collecting and analyzing information about the responses of real live humans to their interactions and interventions, and that qualifies this as human subject research irrespective of the naive intentions of the researchers. Having a non-human subject does not preclude also having a human subject.

Do you have a source for this interpretation? It sounds like this is your interpretation, but not the federal one. Following your interpretation, surveys of companies (e.g. emailing contact@company.com to ask how many employees they have) would fall under the definition of human subjects.

Thanks for the continued conversation, but I think this is my last comment. Nothing personal, but this is a bit exhausting. It seems like you're debating two other people on this forum about this exact definition, and you might consider that maybe you're just wrong about your interpretation?

Here's one final source, if it helps provide closure:

To meet the definition of human subjects, you must ask “about whom” questions. Questions about your respondents' attitudes, opinions, preferences, behavior, experiences, or characteristics, are all considered “about whom” questions. Questions about an organization, a policy, or a process are “about what” questions.

https://campusirb.duke.edu/resources/guides/defining-researc...


> Do you have a source for this interpretation? It sounds like this is your interpretation, but not the federal one. Following your interpretation, surveys of companies (e.g. emailing contact@company.com to ask how many employees they have) would fall under the definition of human subjects.

Sure. Click through the NIH's Decision Tool [1], and you'll find that collecting information only through surveys or interviews leads to the tool saying "Your study is most likely considered exempt from the human subject's regulations, category 2 (Exemption 2)." That particular exemption requires that the research qualify under at least one of three further criteria. (I'll also note that for someone who complained about people not referring to primary sources, you seem to be citing more .edu sources than .gov sources.)

Furthermore, this particular research unquestionably went beyond mere surveys and interviews. Legal threats under false pretenses are way outside those bounds. So even if a mere survey about how many employees a company has doesn't qualify as human research (which I'm willing to concede), that doesn't help settle the question about this research.

[1] https://grants.nih.gov/policy/humansubjects/hs-decision.htm


Information about the people contacted was collected and analyzed; it's a study fundamentally about how they react to this email, not (just) about a third party. In the case of websites run by a single individual (such as OP), there is no third party at all, but in all cases information about the first party was being collected and analyzed.

To be a bit more concrete, here is one example of such an analysis (admittedly, I'm not sure comments on Twitter count): https://twitter.com/RossTeixeira/status/1471249559879929861


Splitting hairs like this may be a useful endeavor in a court of law (or at least an ethics committee), but it sidetracks the "real" question: should a university fund research that essentially sends phishing emails en masse and entraps people into admitting they breached the law, incurring legal costs, and/or causing panic among the recipients?


This comes up every time there's an article about academic publishing. Yes, peer reviewers do the reviewing, but it's the journal that provides the long-term infrastructure and coordination.

AirBnB's content is generated by users, but AirBnB itself requires software development, legal, customer support, HR, program managers, quality control, etc. Same with publishers.

Note that this journal now has a publishing fee for authors to cover these costs, rather than a fee for the reader as before. The 2022 fee for each author is $1,705 according to the FAQ. So moving to open access is not about removing the costs (which many people on Hacker News seem to assume), but about changing who pays for them.


The coordination and most of the infrastructure (except for archival) are actually also provided by us peer reviewers... We set up the conferences' websites, we find the committees and the reviewers, we distribute the work, we do the reviews, we organize the meetings, we create the instructions for formatting, editing, publishing, etc.

Yet some institution is there milking us and our institutions so that we can collect the stamps needed to get to the next level, making everybody waste resources, and then we find ourselves unable to access our own publications for free! Sometimes (for formal applications) extra paper copies are even required, at a high cost!


A few things:

- AirBnB runs significantly larger infrastructure than a website hosting PDFs. GitHub with Jekyll can do the latter.

- Software development is really minimal for journals. Hosting can be a static blog, and review infrastructure could literally be replaced with emailing PDFs / text files back and forth, and often devolves to that anyway.

- The editor of the JFP uses their university and external grant budgets to cover most, if not all, of their operating time and expenses.

- Legal is likely the largest cost (ensuring the journal has sole publication rights, and contracts to that effect), but open access can also simplify this.

- Without customers (e.g., with open access), there is no need for customer support. HR and program management are a very small minimum as well.

- Nobody involved in the actual journal work (editing, reviewing, etc.) is paid.

The cost of $1,075 is, frankly, kind of absurd. What does it cost to host a PDF online forever? Volume 31 of the JFP published ~25 research articles, which would be ~$25k. When GitHub has the infrastructure to entirely eat the hosting cost, what justifies this much money?


So I think your assumption that academic publishing is "a website hosting PDFs" is the source of the confusion about "what justifies this much money." Just like AirBnB isn't just a website hosting JPEGs, while being worth 129 billion dollars. Hosting PDFs can already be done by arXiv or Google Drive or GitHub, as you said.

Customer support is for peer reviewers who can't log into their accounts, for managing issues dealing with misconduct, for handling issues with payments, for post-publication corrections and errata, for passing accounts from editors who become non-responsive to other editors, etc. Not just for dealing with readers or subscribers.


As someone who has been on both sides of journal peer review, let me assure you that there is no "customer support" framework like what you are describing. The model simply doesn't work that way. The primary editor of the JFP, in particular, plays an active role and individually manages feedback and reviewer corralling. They are not paid for that service. When submitting to the JFP, my feedback was hand-delivered by the editor via email.


Sure. As someone who has also been on both sides, though, that's not what I was describing at all. None of the examples I gave are handled by an editor, except maybe "managing issues dealing with misconduct" depending on the situation, and even then perhaps by the publisher's legal team. The other issues are not handled by an editor in my experience.


Before Airbnb there was Couchsurfing. I think academic publishers are a better argument against the necessity of Airbnb than Airbnb is an argument for the necessity of academic publishers.


The devil is in the details. This still does not justify the current model and the steep prices these institutions are paid (by taxpayers; FYI, a typical university pays millions of dollars to them). Consider OpenReview as used by ICLR [1] as a counterexample to your claim. The cost of maintenance, quality control, and software development is not as steep as the current gatekeepers advertise, yet the review quality is much higher (a network/transparency effect). Not to mention that the profit margin of scientific publishing is 3x that of Apple (38% for Elsevier), and it can only be so because science workers work for free for these institutions (fighting for credit).

The current business model as a whole is a legacy institution built on an early monopoly by a charlatan named Maxwell [2]. He basically lured scientists with shiny hotels and extra perks to build his journals' initial reputation, and then monopolized the entire industry for decades. It's interesting how the model rips off taxpayers twice (once for publishing and once for access) while the peer-review process remains free (of money, not credit).

You can find a good review of this scheme in the YouTube video below [3].

[1] https://openreview.net/group?id=ICLR.cc/2022/Conference

[2] https://en.wikipedia.org/wiki/Robert_Maxwell

[3] https://www.youtube.com/watch?v=PriwCi6SzLo


AirBnB is useful because it helps people wanting to rent out their homes for short periods find people willing to rent them, which would otherwise be harder for both sides; without it, people would have more trouble traveling on a budget or making spare money from their homes. The service it provides is a marketplace for the two parties to meet and communicate. I don't think this translates that well to journals; academics aren't getting paid by the readers of the articles published in journals, and readers wouldn't have much trouble finding papers on arbitrary websites thanks to search engines nowadays. Moreover, renting your home on AirBnB once doesn't mean you're forbidden to let a family friend use it for a week next summer, but many journals forbid authors from sharing their papers for free.


I would like to subscribe to your newsletter.

