The Machine Learning Compilation blog:

Significant progress has been made in the field of generative artificial intelligence and large language models… As it stands, the majority of these models necessitate the deployment of powerful servers to accommodate their extensive computational, memory, and hardware acceleration requirements.

[…]

MLC-LLM [is] a universal solution that takes the ML compilation approach and brings LLMs onto diverse set of consumer devices… To make our final model accelerated and broadly accessible, the solution maps the LLM models to vulkan API and metal, which covers the majority of consumer platforms including windows, linux and macOS… Finally, thanks to WebGPU, we can offload those language models directly onto web browsers. WebLLM is a companion project that leverages the ML compilation to bring these models onto browsers.
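To make the browser part concrete, here is a minimal sketch of what calling one of these models through WebLLM could look like from a web page. This is a hedged example: the package name, the `CreateMLCEngine` entry point, the OpenAI-style chat API, and the model ID are all assumptions based on the current @mlc-ai/web-llm package and may not match the version that accompanied this announcement.

```typescript
import { CreateMLCEngine } from "@mlc-ai/web-llm";

async function main() {
  // Fetches the pre-compiled model and runs it on the local GPU via WebGPU.
  // The model ID is an assumption; WebLLM publishes a list of prebuilt models.
  const engine = await CreateMLCEngine("Llama-3-8B-Instruct-q4f32_1-MLC", {
    initProgressCallback: (report) => console.log(report.text),
  });

  // OpenAI-style chat completion, generated entirely inside the browser tab.
  const reply = await engine.chat.completions.create({
    messages: [{ role: "user", content: "Explain WebGPU in one sentence." }],
  });
  console.log(reply.choices[0].message.content);
}

main();
```

If that holds, the weight download, the WebGPU kernels, and the token generation all happen client-side, with no server round trips.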

Their iOS app is powered by the Vicuna 7B language model. I was genuinely shocked by the inference speed on my iPhone 14 Pro. The response quality is roughly equivalent to MPT, StableLM, and other similar open source projects, which is to say not particularly great. But, again, all of this is running locally on a phone, and that is a truly impressive feat.

One of the example use cases from the linked announcement is a bespoke AI assistant trained on each individual user's private data. Now, this personalized assistant should run locally for privacy and security reasons, but it doesn't have to be particularly powerful as long as it can offload difficult tasks to a more powerful, centralized assistant in a privacy-preserving manner.

A pattern very similar to Simon Willison’s recent proposal for “Privileged” and “Quarantined” LLMs would be key here.
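Here is a rough, hypothetical sketch of how that split might work, with stub functions standing in for the two models. None of the names or the redaction step come from MLC, WebLLM, or Willison's post; they are only there to illustrate the shape of the pattern.

```typescript
// Hypothetical sketch: a small local "privileged" model sees the user's private
// context and makes the decisions; a larger remote "quarantined" model only
// ever receives sanitized prompts, and its output is treated as untrusted text.

// Stub standing in for an on-device model (e.g. one compiled with MLC).
async function localModel(prompt: string, privateContext: string): Promise<string> {
  return `[local reply to "${prompt}" using ${privateContext.length} chars of context]`;
}

// Stub standing in for a larger hosted model; it never receives private context.
async function remoteModel(prompt: string): Promise<string> {
  return `[remote draft for "${prompt}"]`;
}

async function answer(prompt: string, privateContext: string, isHard: boolean): Promise<string> {
  if (!isHard) {
    // Easy requests stay entirely on-device; private data never leaves the phone.
    return localModel(prompt, privateContext);
  }
  // Hard requests are offloaded, but only after a (here imaginary) redaction step.
  const sanitized = prompt.replace(/\[private:.*?\]/g, "[redacted]");
  const draft = await remoteModel(sanitized);
  // The remote output is quarantined: the local model may rewrite or summarize it
  // for the user, but never executes it as instructions or tool calls.
  return localModel(`Rewrite this draft for the user:\n${draft}`, privateContext);
}

// Example usage
answer("Draft a reply to my landlord [private: 12 Elm St] about the broken heater",
       "private emails and calendar go here", true).then(console.log);
```

One way to read the pattern here: the powerful remote model never sees private data directly and its output is never trusted as instructions, so the small local model only needs to be good enough to route, redact, and rewrite.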

In this scenario, it is less important for local models to be powerful than it is for them to be fast and energy efficient. MLC could be a step towards making this a reality.