If anyone from the dolt team is reading this, I'd like to make an enquiry:
At bugout.dev, we have an ongoing crawl of public GitHub. We just created a dataset of code snippets crawled from popular GitHub repositories, labeled by language, license, GitHub repo, and commit hash, and we are looking to release it publicly and keep it up to date with our GitHub crawl.
The dataset for a single crawl comes in at about 60GB. We uploaded the data to Kaggle because we thought it would be a good place for people to work with it. Unfortunately, the Kaggle notebook experience is not tailored to datasets this large. Our dataset is a SQLite database. It takes a long time to load into Kaggle notebooks, and I don't think the notebooks are provisioned with SSDs, as queries take a long time. Our best workaround is to partition it into 3 datasets on Kaggle (train, eval, and development), but managing that for every update will be a pain, especially as we enrich the dataset with the results of static analysis, etc.
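For anyone curious, a minimal sketch of that partitioning workaround. The `snippets` table, its columns, and the simple rowid-modulo split rule are all illustrative assumptions here, not the actual pipeline:

```python
import sqlite3

# Build a toy source database standing in for the full 60GB crawl.
# Table name "snippets" and the id % 3 split rule are hypothetical.
src = sqlite3.connect("full.db")
src.execute("CREATE TABLE IF NOT EXISTS snippets (id INTEGER PRIMARY KEY, code TEXT)")
src.executemany("INSERT INTO snippets (code) VALUES (?)", [("x",)] * 100)
src.commit()

# Split into three separate database files via ATTACH, so each
# partition can be uploaded to Kaggle as its own dataset.
for i, name in enumerate(["train", "eval", "dev"]):
    src.execute(f"ATTACH DATABASE '{name}.db' AS part")
    src.execute(f"CREATE TABLE part.snippets AS SELECT * FROM snippets WHERE id % 3 = {i}")
    src.execute("DETACH DATABASE part")
src.close()
```

In practice you'd split on something stable (e.g. a hash of the repo name) so rows stay in the same partition across crawl updates.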
I'd like to explore hosting the public dataset on DoltHub. If this sounds interesting to you, please reach out to me - my email is in my HN profile.
Yeah, a database that size is likely to be a challenge unless the computer it's running on has scads of memory.
One of my projects (DBHub.io) is putting effort into working through the problems with larger SQLite databases (~10GB), mainly by using bare-metal hosts with lots of memory, e.g. 64GB, 128GB, etc.
Putting the same data into PostgreSQL, or even MySQL, would likely be much more efficient memory-wise. :)
We have 200 GB databases in Dolt format that are totally queryable. They don't work well for querying on the web, though - you need a local copy to query them effectively. Making web queries as fast as local ones is an ongoing project.
Uh, personal question here. Where does your ~10G number come from? I pretty much run my life on the Apple Notes app. My Notes database is about 12G and now I’m scared.
Oh, it's just that the vast majority of SQLite databases we see are pretty tiny, e.g. a few MB at most.
So resource usage is pretty much not a consideration when doing things with them.
Once people start doing things with multi-GB databases, though, if the system they're running on has a small amount of memory (say 4GB or 8GB) then things can start going poorly.
Not like "crash and burn" poorly (so far). More like "tries to read 10GB of data into 4GB of RAM" poorly. In other words, dog slow, not lightning quick.
If your machine has a bunch of RAM in it, you're likely safe. Though I'd be making damn sure there are tested and working backups of it, just to be safe. :)
I missed this answer the first time around. Thanks so much for the reply. I always get machines with the maximum amount of RAM precisely for this reason.
You have other options too. If I have time I can try to reduce the size with a columnar format that is designed for this use case (repeated values, static dataset).
That would be really great. Let me know if there's any way we can help. Maybe we could release a small version of the dataset for testing/benchmarking, and then I could take care of running the final processing on the full dataset?
Preliminary tests show a significant reduction in space usage when using Parquet over SQLite. This is not unexpected at all; Parquet is much better suited to analytical use.