Anna's Archive – LLM Training Data from Shadow Libraries (annas-archive.org)
219 points by all2 on Oct 19, 2023 | 19 comments


God bless you, Anna. Also, how do we make sure this and its mirrors can't be taken down?


They actually have a "premium" membership you can subscribe to.


Please seed our torrents! If someone could contribute an automated torrent tracker that shows which torrents are most in need of seeding, we'd really appreciate that (all our code is open source).


First time I've encountered this Anna's Archive. Is this just one person doing all the work or a group?

It seems similar to libgen; in fact, it mentions mirroring libgen. Doesn't it get targeted by publishing company lawyers and such?


I highly recommend giving his/her blog post [0] a read.

[0] https://annas-blog.org/how-to-run-a-shadow-library.html


Fantastic read, thanks for sharing that.


I’ve tried to dig a bit through the libgen fiction archive. To my surprise, I found that the vast majority was Romance novels. I’ve since heard from others too that most published books are Romance.

If this collection is trying to solve the problem of having large tagged and sorted data for training, even if it can’t be commercialized, it might also be introducing another problem: the data still needs to be weighted and filtered. Not to knock Romance specifically, but even training on all scientific papers ever written, whether peer-reviewed, highly cited, retracted, debunked, etc., might explain some of the hallucinations a model ends up making.


And not just romance in the mainstream Harlequin mode -- there are huge amounts of what is basically text porn there, which often shows up in searches for something else. I didn't know this was still being written -- I thought film and Internet video had killed off the "dirty book" decades ago.


These are called smut now, and most of the reader base is female. I believe it's often regarded as a more "socially-aligned"/less hardcore porn substitute.

https://www.urbandictionary.com/define.php?term=Smut


If anything, the internet made textual dirty stories explode as well.

It's just a different target audience. Women especially don't like the highly visual stuff. They tend to favor relaxing text.


How did you miss Fifty Shades of Grey?


humans daydream about romance, why can't computers?


I don’t think it’s as much about daydreaming as about garbage in, garbage out. We know LLMs perform way better when you use high-quality input data [0]. Having your input data be skewed towards one genre isn’t great. And that’s not even getting into the quality of these books.

[0] https://arxiv.org/abs/2306.11644


Humans don't daydream about dating a computer, why should a computer daydream about dating a human? ;)


Because that will sell.


Just went and donated to them. It's nice to have such products: scientific research papers are always behind paywalls, but sometimes just "ordinary" people want to read them to widen their scope and knowledge. Knowledge should not be paywalled like that. Also, there are situations where some books are just not available in some countries. In those cases such products can be used, I think. In other cases, sure, we need to support authors (and sometimes publishers/resellers) directly if we can.


Thank you!! Really appreciate it.



No way!?



