If anyone from the dolt team is reading this, I'd like to make an enquiry:
At bugout.dev, we have an ongoing crawl of public GitHub. We just created a dataset of code snippets crawled from popular GitHub repositories, labeled by language, license, GitHub repo, and commit hash, and we are looking to release it publicly and keep it up to date with our GitHub crawl.
The dataset for a single crawl comes in at about 60GB. We uploaded the data to Kaggle because we thought it would be a good place for people to work with it. Unfortunately, the Kaggle notebook experience is not tailored to datasets this large. Our dataset is a SQLite database. It takes a long time to load into Kaggle notebooks, and I don't think the notebooks are provisioned with SSDs, as queries take a long time. Our best workaround is to partition it into 3 datasets on Kaggle (train, eval, and development), but managing that for every update will be a pain, especially as we enrich the dataset with the results of static analysis, etc.
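For anyone curious, a minimal sketch of that partitioning workaround. The `snippets` table, its columns, and the simple rowid-modulo split rule are all illustrative assumptions here, not the actual pipeline:

```python
import sqlite3

# Build a toy source database standing in for the full 60GB crawl.
# Table name "snippets" and the id % 3 split rule are hypothetical.
src = sqlite3.connect("full.db")
src.execute("CREATE TABLE IF NOT EXISTS snippets (id INTEGER PRIMARY KEY, code TEXT)")
src.executemany("INSERT INTO snippets (code) VALUES (?)", [("x",)] * 100)
src.commit()

# Split into three separate database files via ATTACH, so each
# partition can be uploaded to Kaggle as its own dataset.
for i, name in enumerate(["train", "eval", "dev"]):
    src.execute(f"ATTACH DATABASE '{name}.db' AS part")
    src.execute(f"CREATE TABLE part.snippets AS SELECT * FROM snippets WHERE id % 3 = {i}")
    src.execute("DETACH DATABASE part")
src.close()
```

In practice you'd split on something stable (e.g. a hash of the repo name) so rows stay in the same partition across crawl updates.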
I'd like to explore hosting the public dataset on DoltHub. If this sounds interesting to you, please reach out to me - my email is in my HN profile.
Yeah, a database that size is likely to be a challenge unless the computer it's running on has scads of memory.
One of my projects (DBHub.io) is putting effort into working through the problems with larger SQLite databases (~10GB), mainly by using bare-metal hosts with lots of memory, e.g. 64GB, 128GB, etc.
Putting the same data into PostgreSQL, or even MySQL, would likely be much more efficient memory-wise. :)
We have 200 GB databases in Dolt format that are totally queryable. They don't work well for querying on the web, though - you need a local copy to query them effectively. Making web queries as fast as local ones is an ongoing project.
Uh, personal question here. Where does your ~10G number come from? I pretty much run my life on the Apple Notes app. My Notes database is about 12G and now I’m scared.
Oh, it's just that the vast majority of SQLite databases we see are pretty tiny, e.g. a few MB at most.
So resource usage is pretty much not a consideration when doing things with them.
Once people start doing things with multi-GB databases, though, if the system they're running on has a small amount of memory (say 4GB or 8GB) then things can start going poorly.
Not like "crash and burn" poorly (so far). More like "tries to read 10GB of data into 4GB of RAM" poorly. In other words, dog slow, not lightning quick.
If your machine has a bunch of RAM in it, you're likely safe. Though I'd be making damn sure there are tested and working backups of it, just to be safe. :)
I missed this answer the first time around. Thanks so much for the reply. I always get machines with the maximum amount of RAM precisely for this reason.
You have other options too. If I have time I can try to reduce the size with a columnar format that is designed for this use case (repeated values, static dataset).
That would be really great. Let me know if there's any way we can help. Maybe we could release a small version of the dataset for testing/benchmarking, and then I could take care of running the final processing on the full dataset?
Preliminary tests show a significant reduction in space usage when using Parquet over SQLite. This is not unexpected at all; Parquet is much better suited to analytical use.