By: Thomas Stahura
Compute is king in the age of AI. At least, that's what big tech wants you to believe. The truth is a little more complicated.
When you boil it down, AI inference is mostly a very large number of matrix multiplications. All computers do this kind of math all the time, so why can't any computer run an LLM or diffusion model?
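To make that concrete, here's a toy sketch (mine, not from any particular model) of what a single layer of inference boils down to: an activation vector multiplied by a weight matrix, passed through a nonlinearity. Real models just chain thousands of these.

```python
import numpy as np

# One toy "layer" of a neural net: activations times weights, plus a nonlinearity.
# An LLM repeats variations of this matrix multiply thousands of times per token.
x = np.random.rand(1, 4096)      # activation vector (e.g., one token's hidden state)
W = np.random.rand(4096, 4096)   # weight matrix (one layer's parameters)
y = np.maximum(x @ W, 0)         # matrix multiply, then ReLU

print(y.shape)  # (1, 4096)
```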
It's all about scale. Model scale is the number of parameters (tunable weights) in a model. Thanks to platforms like Hugging Face, developers now have access to well-performing open-source models at every scale: small models like moondream2 (1.93b) and llama 3.2 (3b), midsize ones like phi-4 (14b), and the largest models like bloom (176b). These models can run on anything from a Raspberry Pi to an A100 GPU server.
Sure, the smaller models take a performance hit, but only by 10-20% on most benchmarks. I got llama 3.2 (1b) to flawlessly generate and run a snake game in Python. So why, then, do most developers rely on big tech to generate their tokens? The short answer is speed and performance.
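For a sense of how little ceremony a small model needs, here's a minimal sketch of running llama 3.2 (1b) locally with Hugging Face's transformers pipeline. The checkpoint name and chat-message usage are assumptions based on the current transformers API, and the model is gated, so you'd need to accept Meta's license on Hugging Face and log in first.

```python
from transformers import pipeline

# Assumes you've accepted Meta's license for this gated checkpoint on Hugging Face
# and are logged in (`huggingface-cli login`). Falls back to CPU without a GPU.
generator = pipeline(
    "text-generation",
    model="meta-llama/Llama-3.2-1B-Instruct",
)

messages = [{"role": "user", "content": "Write a snake game in Python."}]
out = generator(messages, max_new_tokens=512)

# With chat-style input, generated_text is the message list plus the model's reply.
print(out[0]["generated_text"][-1]["content"])
```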
Models at the largest scale (100b+ parameters, like gpt-4o and the like) perform best and cost the most. That will probably be true for a long time, but maybe not forever. In my opinion, it would be good if everyone could contribute their compute to collectively run models at the largest scale.
I am by no means the first person to have this idea.
Folding@home launched in October 2000 as a first-of-its-kind distributed computing project aimed at simulating protein folding. The project peaked during the pandemic, reaching 2.43 exaflops of compute by April 2020 and becoming the first exaflop computing system ever.
The same idea exists in the generative AI community. Petals, a project by BigScience (the same team behind bloom 176b), lets developers run and fine-tune large models in a distributed fashion. (Check out the live network here.) Nous Research has its DisTrO system (distributed training over the internet). (Check its status here.) And there are plenty of others, like hivemind and exo.
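Petals exposes a transformers-style API: you load only the embeddings locally while the transformer blocks run on volunteer GPUs across the swarm. The sketch below is adapted from the project's documented usage; the exact model identifier and API details may have shifted since, so treat it as illustrative.

```python
from transformers import AutoTokenizer
from petals import AutoDistributedModelForCausalLM

# Model identifier is an assumption; check the Petals docs for currently hosted models.
model_name = "bigscience/bloom"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Only embeddings load locally; transformer blocks run on remote GPUs in the swarm.
model = AutoDistributedModelForCausalLM.from_pretrained(model_name)

inputs = tokenizer("Distributed inference means", return_tensors="pt")["input_ids"]
outputs = model.generate(inputs, max_new_tokens=16)
print(tokenizer.decode(outputs[0]))
```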
While there are many examples of distributed compute systems, none has taken off, largely because joining the network is too difficult.
I’ve done some experimenting, and I think a solution could be using the browser to join the network and running inference with WebLLM in pure JavaScript. I will write more about my findings, so stay tuned.
If you are interested in this topic, email me! Thomas @ ascend dot vc