I looked at FastAPI a while back for a machine learning inference service API (you'll see a lot of blog/Medium posts for that use case).
The syntax and documentation are really good, and the Docker examples are great for getting started fast.
The only problem is that because FastAPI isn't process-based, some machine learning libraries (TensorFlow, for example) don't really play nice with it, and you'll get random errors or funky results, especially under load.
Obviously one solution is to use TensorFlow Serving (and deal with that craziness) and let FastAPI do the routing/data processing, but I bet there are production machine learning products using FastAPI that are spitting out random numbers while the developers are oblivious to the problem.
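For reference, that split looks roughly like this. It's only a sketch, not anyone's production setup: FastAPI handles routing and validation and forwards to TensorFlow Serving's REST API; the model name "my_model", the host/port, and the flat feature vector are placeholders.

```python
# Sketch: FastAPI does routing/validation, TensorFlow Serving does the inference.
# "my_model", the host/port, and the feature shape are placeholders.
import httpx
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
TF_SERVING_URL = "http://localhost:8501/v1/models/my_model:predict"

class PredictRequest(BaseModel):
    features: list[float]

@app.post("/predict")
async def predict(req: PredictRequest):
    # TF Serving's REST API takes {"instances": [...]} and returns {"predictions": [...]}
    async with httpx.AsyncClient() as client:
        resp = await client.post(TF_SERVING_URL, json={"instances": [req.features]})
    resp.raise_for_status()
    return {"prediction": resp.json()["predictions"][0]}
```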
Ideally, ML tasks would be executed as background tasks (e.g. with Celery) and the client would poll for results, or you'd use a non-blocking event loop. It's generally not a good idea to run such intensive work inside the context of a web request.
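The enqueue-and-poll pattern looks something like this sketch, assuming a Redis broker/backend; load_model() is a placeholder for whatever framework you actually use.

```python
# Sketch of background inference with Celery plus polling from FastAPI.
# Assumes Redis as broker/backend; load_model() is a placeholder.
from celery import Celery
from celery.result import AsyncResult
from fastapi import FastAPI

celery_app = Celery(
    "inference",
    broker="redis://localhost:6379/0",
    backend="redis://localhost:6379/1",
)

model = None  # loaded lazily in each Celery worker process

@celery_app.task
def run_inference(features):
    global model
    if model is None:
        model = load_model()  # placeholder: load/cache your model inside the worker
    return model.predict([features]).tolist()

api = FastAPI()

@api.post("/predict")
def submit(features: list[float]):
    task = run_inference.delay(features)  # enqueue and return immediately
    return {"task_id": task.id}

@api.get("/result/{task_id}")
def poll(task_id: str):
    result = AsyncResult(task_id, app=celery_app)
    if not result.ready():
        return {"status": "pending"}
    return {"status": "done", "prediction": result.get()}
```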
It really depends on your API clients and your latency requirements.
Celery is fine for long-running or batch work, but the queue and the fact that you need to store and retrieve the results from somewhere else aren't really ideal.
Non-blocking event loops (and reactive frameworks) really like small tasks and will even scream at you if you go above their blocking thresholds.
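To make the "scream at you" part concrete: asyncio in debug mode logs a warning whenever a callback or task step blocks the loop longer than slow_callback_duration (0.1 s by default). Quick illustration:

```python
# asyncio's version of "screaming": debug mode warns about anything that
# blocks the loop longer than slow_callback_duration.
import asyncio
import time

async def blocking_inference():
    time.sleep(0.5)  # CPU-bound work blocking the event loop (the thing to avoid)

async def main():
    loop = asyncio.get_running_loop()
    loop.slow_callback_duration = 0.1  # warn threshold; 0.1 s is the default
    await blocking_inference()
    # debug mode logs something like: "Executing <Task ...> took 0.5xx seconds"

asyncio.run(main(), debug=True)
```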
Honestly, I see a bright future for Apache Arrow Flight, but I think it's a bit immature right now.
But my main point is that 80% of the blog posts about FastAPI are about machine learning, and people will just copy the code, run one or two requests, see that it works, and move on...
I also ran into some performance issues with model serving using FastAPI and ended up using Ray Serve to properly distribute the workload and batch requests. With a bit of work I was able to 10x the throughput and cut response time in half.
Ray uses the Apache Arrow Plasma object store to avoid the copy and serialization costs that usually come with multiprocessing in Python.
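For anyone curious what the batching side looks like: here's a rough sketch against a recent Ray Serve API (the serve.deployment/serve.batch decorators below postdate the 1.x versions discussed in this thread, and load_model() is a placeholder for your own model loading).

```python
# Sketch of request batching with a recent Ray Serve API; load_model() is a placeholder.
import torch
from ray import serve
from starlette.requests import Request

@serve.deployment(num_replicas=2)
class Predictor:
    def __init__(self):
        self.model = load_model()  # placeholder: your torch model on the right device

    @serve.batch(max_batch_size=16, batch_wait_timeout_s=0.01)
    async def predict_batch(self, inputs: list[torch.Tensor]):
        # One forward pass for the whole batch instead of one per request.
        with torch.no_grad():
            outputs = self.model(torch.stack(inputs))
        return [o.tolist() for o in outputs]

    async def __call__(self, request: Request):
        payload = await request.json()
        return await self.predict_batch(torch.tensor(payload["features"]))

serve.run(Predictor.bind())
```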
Thanks, I looked at this way back before v1 and had a lot of issues. I guess I need to retest it.
Most of the models I use can be converted to more optimized formats for inference, like https://treelite.readthedocs.io/en/latest/ , so the code and interfaces stay pretty similar and the architecture is less complex to update and deploy.
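As a rough idea of what that conversion looks like (sketch only; the exact treelite API has shifted between releases, and this assumes a recent version with the sklearn importer and the GTIL predictor):

```python
# Sketch: convert a fitted sklearn ensemble to treelite and predict with GTIL.
# Assumes a recent treelite release; the API differs between versions.
import numpy as np
import treelite
from sklearn.ensemble import RandomForestRegressor

X = np.random.rand(100, 8).astype(np.float32)
y = np.random.rand(100)
clf = RandomForestRegressor(n_estimators=50).fit(X, y)

# Convert the fitted ensemble into treelite's representation.
tl_model = treelite.sklearn.import_model(clf)

# Predict with treelite's built-in GTIL runtime; the call mirrors the usual
# model.predict(X) interface, so surrounding code barely changes.
preds = treelite.gtil.predict(tl_model, X)
```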
To be honest it's still kind of a mess. I started out with a FastAPI service that wasn't performant enough and figured that putting Ray Serve behind it would take a day, but it turned out to be a pain in the ass. The docs are really lacking, and they're transitioning from Flask to Starlette in 1.2 (which is not out yet), so a lot of the information is wrong or misleading. I ran into a ton of random serialization issues: simple dataclasses holding a torch tensor and some plain metadata getting pickled and copied, and pydantic models failing to serialize and crashing the whole thing with assertion errors (broken recently in their nightly releases, which are needed to use 1.2).
Relying on string names and handles is also very hacky and completely cripples PyCharm autocomplete / type checking.
The core of the project is great though. We're working on a video processing system that has a ton of downstream models, and using Ray makes it possible to get away with running inference in PyTorch without going down the TensorRT/inference engine rabbit hole (which would be torture for some of our detection and sequential models).