Correct me if I'm wrong on the following, as I have not yet used Metaflow (only read the docs):
It is conceivable to execute a Flow entirely locally yet achieve @step-wise "Batch-like" parallelism/distributed computation by doing `import dask` in the relevant @steps and using it as you would outside of Metaflow, correct?
Although, as I think of it, the `parallel_map` function would achieve much of what Dask offers on a single box, wouldn't it? But within a @step, using dask-distributed could replicate something a little closer to the AWS integration?
Tangentially related, but the docs note that data checkpointing is achieved using pickle. I've never compared them, but I've found parquet files to be extremely performant for pandas dataframes. Again, I'm assuming a lot here, but I'd expect @step results to be dataframes quite often. What was the design consideration associated with how certain kinds of objects get checkpointed?
To be clear, the fundamental motivation behind these lines of questioning is: how can I leverage Metaflow in conjunction with existing python-based _on-prem_ distributed (or parallel) computing utilities, e.g. Dask? In other words, can I expect to leverage Metaflow to execute batch ETL or model-building jobs that require distributed compute that isn't owned by $cloud_provider?
Yes - you should be able to use Dask the way you describe; the first part of your understanding matches my expectations too.
Dask's single-box parallelism is achieved via multiprocessing - akin to `parallel_map`.
And distributed compute is achieved by shipping the work to remote substrates.
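Concretely, inside a step it would look something like this (a rough, untested sketch - the flow/step names and the scheduler address are placeholders for whatever on-prem Dask cluster you run, not a prescription):

```python
from metaflow import FlowSpec, step, parallel_map

class DaskFlow(FlowSpec):

    @step
    def start(self):
        self.inputs = list(range(100))
        self.next(self.single_box)

    @step
    def single_box(self):
        # Single-box parallelism: parallel_map fans the work out over
        # local processes, much like Dask's local scheduler would.
        self.squares = parallel_map(lambda x: x * x, self.inputs)
        self.next(self.on_prem_cluster)

    @step
    def on_prem_cluster(self):
        # Distributed compute: inside the step, talk to an existing
        # dask.distributed cluster just as you would outside Metaflow.
        from dask.distributed import Client
        client = Client("tcp://dask-scheduler:8786")  # placeholder address
        futures = client.map(lambda x: x * x, self.inputs)
        self.squares_remote = client.gather(futures)
        client.close()
        self.next(self.end)

    @step
    def end(self):
        print(sum(self.squares), sum(self.squares_remote))

if __name__ == "__main__":
    DaskFlow()
```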
For your second comment - we leverage pickle mostly to keep things easy and simple for basic types.
For small dataframes we just pickle for simplicity. For larger dataframes we rely on users to directly store the data (probably encoded as parquet) and just pickle the path instead of the whole dataframe.
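For example, something along these lines (a rough, untested sketch - the `/shared/...` path is just a stand-in for whatever storage you use):

```python
from metaflow import FlowSpec, step
import pandas as pd

class ParquetFlow(FlowSpec):

    @step
    def start(self):
        df = pd.DataFrame({"x": range(1_000_000)})
        # For a large dataframe, write it out as parquet ourselves...
        path = "/shared/data/features.parquet"  # stand-in for your storage
        df.to_parquet(path)
        # ...and only the small, picklable path becomes a Metaflow artifact.
        self.df_path = path
        self.next(self.end)

    @step
    def end(self):
        # Downstream steps rehydrate the dataframe from the path artifact.
        df = pd.read_parquet(self.df_path)
        print(df.shape)

if __name__ == "__main__":
    ParquetFlow()
```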
As an aside: hot damn, the API - https://i.kym-cdn.com/photos/images/newsfeed/000/591/928/94f... Really cool product. Thanks for all the effort poured into this by you and others.