
You have not had to cast pandas DataFrames when using scikit-learn for many years already. Additionally, in recent versions there has been increasing support for returning DataFrames as well, at least for transformers, along with checking column names/order.
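For example, a minimal sketch of the DataFrame output support (assuming scikit-learn >= 1.2, where set_output was added):

    import pandas as pd
    from sklearn.preprocessing import StandardScaler

    df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 30.0]})

    # Ask the transformer to return a DataFrame instead of an ndarray
    scaler = StandardScaler().set_output(transform="pandas")
    out = scaler.fit_transform(df)     # DataFrame with columns "a" and "b"

    # Column names seen at fit time are stored on the estimator; transforming
    # with renamed or reordered columns warns or raises, depending on version
    print(scaler.feature_names_in_)    # ['a' 'b']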



Yes, but that support is still not complete. When you pass a pandas DataFrame into scikit-learn you are implicitly doing df.values, which loses all of the DataFrame metadata.
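A small illustration of that default behavior (just a DataFrame going through a standard transformer, nothing version-specific assumed):

    import pandas as pd
    from sklearn.preprocessing import StandardScaler

    df = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0]},
                      index=["row1", "row2"])

    out = StandardScaler().fit_transform(df)
    # By default the result is a plain numpy.ndarray: the column names,
    # index, and dtypes of the original DataFrame are gone.
    print(type(out))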

There is a library called sklearn-pandas, which doesn't seem to be mainstream, and development appears to have stopped in 2022.


From pandas-dataclasses #166 "ENH: pyarrow and optionally pydantic" https://github.com/astropenguin/pandas-dataclasses/issues/16... :

> What should be the API for working with pandas, pyarrow, and dataclasses and/or pydantic?

> Pandas 2.0 supports pyarrow for so many things now, and pydantic does data validation with a drop-in dataclasses.dataclass replacement at pydantic.dataclasses.dataclass.
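A tiny sketch of that drop-in replacement (assuming pydantic v2; the error classes differ slightly in v1):

    from pydantic import ValidationError
    from pydantic.dataclasses import dataclass

    @dataclass
    class Row:
        name: str
        value: float

    Row(name="a", value="1.5")             # "1.5" is coerced to 1.5

    try:
        Row(name="b", value="not a number")
    except ValidationError as e:
        print(e)                           # validation error on 'value'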

Model output may or may not converge depending on the enumeration ordering of Categorical (CSVW) columns, for example; so consistent round-trip (Linked Data) schema tool support would be essential.
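A concrete illustration of the ordering concern, assuming the categories are encoded by their position in the enumeration:

    import pandas as pd

    values = ["red", "green", "red", "blue"]

    # The integer codes a model sees depend on the category enumeration
    # order, so the schema has to round-trip consistently.
    a = pd.Categorical(values, categories=["blue", "green", "red"])
    b = pd.Categorical(values, categories=["red", "green", "blue"])
    print(a.codes)  # [2 1 2 0]
    print(b.codes)  # [0 1 0 2]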

cuML is scikit-learn API compatible and can use Dask for distributed and/or multi-GPU workloads. cuML is built on cuDF and CuPy; CuPy is a drop-in replacement for NumPy arrays on GPUs, with speedups of 100x or more on some operations.

CuPy: https://github.com/cupy/cupy :

> CuPy is a NumPy/SciPy-compatible array library for GPU-accelerated computing with Python. CuPy acts as a drop-in replacement to run existing NumPy/SciPy code on NVIDIA CUDA or AMD ROCm platforms.

https://cupy.dev/ :

> CuPy is an open-source array library for GPU-accelerated computing with Python. CuPy utilizes CUDA Toolkit libraries including cuBLAS, cuRAND, cuSOLVER, cuSPARSE, cuFFT, cuDNN and NCCL to make full use of the GPU architecture.

> The figure shows CuPy speedup over NumPy. Most operations perform well on a GPU using CuPy out of the box. CuPy speeds up some operations more than 100X. Read the original benchmark article Single-GPU CuPy Speedups on the RAPIDS AI Medium blog
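A minimal drop-in sketch (assuming a CUDA or ROCm GPU with CuPy installed):

    import cupy as cp

    # Same API as NumPy, but the arrays live in GPU memory
    x = cp.random.rand(1_000_000)
    y = cp.sin(x) ** 2 + cp.cos(x) ** 2

    # Copy the result back to host memory as a NumPy array
    print(cp.asnumpy(y).mean())    # ~1.0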

CuDF: https://github.com/rapidsai/cudf

CuML: https://github.com/rapidsai/cuml :

> cuML is a suite of libraries that implement machine learning algorithms and mathematical primitives functions that share compatible APIs with other RAPIDS projects.

> cuML enables data scientists, researchers, and software engineers to run traditional tabular ML tasks on GPUs without going into the details of CUDA programming. In most cases, cuML's Python API matches the API from scikit-learn.

> For large datasets, these GPU-based implementations can complete 10-50x faster than their CPU equivalents. For details on performance, see the cuML Benchmarks Notebook.
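A small sketch of that scikit-learn-style API (assuming cuML and a supported GPU; the parameters shown mirror scikit-learn's):

    import cudf
    from cuml.cluster import KMeans

    # A cuDF DataFrame lives on the GPU but looks a lot like a pandas one
    gdf = cudf.DataFrame({"x": [1.0, 1.1, 8.0, 8.1],
                          "y": [1.0, 0.9, 8.0, 8.2]})

    km = KMeans(n_clusters=2, random_state=0)
    km.fit(gdf)
    print(km.labels_)    # cluster assignment per row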

FWICS there's now a ROCm version of CuPy, so although the name suggests CUDA (NVIDIA only), it also builds for AMD GPUs. IDK whether there are plans to support Intel oneAPI, too.

Which of the non-Arrow parts of other DataFrame libraries (pandas-compatible or not) could be ported back to pandas (and R)?



