LLM Benchmarking – databloom.net

I've been experimenting with LLMs of various types and sizes on a range of problems. It's fascinating how differently models can perform on the same task.

I started some basic benchmarking, which you can try out at apps.databloom.net/ai_benchmark. Right now these are all chat-tuned models accessed via API; I may add others over time.
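
As a rough illustration of what comparing models on the same task can look like, here is a minimal Python sketch. It is not the code behind the app: the stub "models", the task list, and the exact-match scoring are all hypothetical and made up for the example. A real run would replace the stubs with calls to whichever chat API you use.

```python
from typing import Callable, Dict, List, Tuple

# A "model" here is just a function from prompt to answer (an assumption
# for this sketch; a real one would wrap a chat-completion API call).
Model = Callable[[str], str]


def make_stub(answers: Dict[str, str]) -> Model:
    """Build a hypothetical stand-in model that answers from a lookup table."""
    def ask(prompt: str) -> str:
        return answers.get(prompt, "I don't know")
    return ask


# Hypothetical tasks: (prompt, expected answer) pairs.
TASKS: List[Tuple[str, str]] = [
    ("What is 2 + 2?", "4"),
    ("What is the capital of France?", "Paris"),
]

# Hypothetical models: the "small" stub only knows one of the answers.
MODELS: Dict[str, Model] = {
    "stub-small": make_stub({"What is 2 + 2?": "4"}),
    "stub-large": make_stub({
        "What is 2 + 2?": "4",
        "What is the capital of France?": "paris",
    }),
}


def score_models(models: Dict[str, Model],
                 tasks: List[Tuple[str, str]]) -> Dict[str, float]:
    """Return the fraction of tasks each model answers correctly.

    Scoring is a case-insensitive exact match, which is the simplest
    possible choice and too strict for free-form answers.
    """
    if not tasks:
        raise ValueError("need at least one task")
    scores: Dict[str, float] = {}
    for name, ask in models.items():
        correct = sum(
            1 for prompt, expected in tasks
            if ask(prompt).strip().lower() == expected.strip().lower()
        )
        scores[name] = correct / len(tasks)
    return scores


if __name__ == "__main__":
    for name, score in score_models(MODELS, TASKS).items():
        print(f"{name}: {score:.0%}")
```

Every model sees exactly the same prompts, so any difference in score comes from the model rather than the task. Exact matching keeps the sketch short; open-ended tasks would need a more forgiving scorer.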

As always, all code is on GitHub.