Please seed our torrents! If someone could contribute an automated torrent tracker that shows which torrents are most in need of seeding, we'd greatly appreciate it (all our code is open source).
I’ve tried to dig a bit through the libgen fiction archive. To my surprise, the vast majority of it was Romance novels. I’ve since heard from others too that most published books are Romance.
If this collection is trying to solve the problem of having large tagged and sorted data for training, even if it can’t be commercialized, it might also be introducing another problem: the data still needs to be weighted and filtered. Not to knock Romance specifically, but even training on all scientific papers ever written, whether peer-reviewed, highly cited, retracted, debunked, etc., might explain some of the hallucinations a model ends up making.
And not just romance in the mainstream Harlequin mode -- there are huge amounts of what is basically text porn in there, which often shows up in searches for something else. I didn't know this was still being written -- I thought film and Internet video had killed off the "dirty book" decades ago.
I don’t think it’s as much about daydreaming as it is about garbage in, garbage out. We know LLMs perform way better when you use high-quality input data [0]. Having your input data skewed towards one genre isn’t great. And that’s not even getting into the quality of these books.
Just went and donated to them. It's nice to have such projects: scientific research papers are always behind paywalls, but sometimes just "ordinary" people want to read them to widen their scope and knowledge. Knowledge should not be paywalled like that. There are also situations where some books are simply not available in some countries. In those cases such projects can be used, I think. In other cases, sure, we need to support authors (and sometimes publishers/resellers) directly if we can.