
# Results

## Tasks

La Perf is a collection of reproducible tests and community-submitted results for:

- Embeddings — ✅ Ready (sentence-transformers, IMDB dataset)
    - sts models:
        - modernbert-embed-base
- LLM inference — ✅ Ready (LM Studio and Ollama, Awesome Prompts dataset)
    - llm models:
        - LM Studio: gpt-oss-20b
            - macOS: mlx-community/gpt-oss-20b-MXFP4-Q8 (MLX MXFP4-Q8)
            - Other platforms: lmstudio-community/gpt-oss-20b-GGUF (GGUF)
        - Ollama: gpt-oss-20b
- VLM inference — ✅ Ready (LM Studio and Ollama, Hallucination_COCO dataset)
    - vlm models:
        - LM Studio: Qwen3-VL-8B-Thinking
            - macOS: mlx-community/Qwen3-VL-8B-Thinking-4bit (MLX 4-bit)
            - Other platforms: Qwen/Qwen3-VL-8B-Thinking-GGUF-Q4_K_M (Q4_K_M)
        - Ollama: qwen3-vl:8b
            - All platforms: qwen3-vl:8b (Q4_K_M)
- Diffusion image generation — 📋 Planned
- Speech to Text — 📋 Planned (Whisper)
- Classic ML — 📋 Planned (scikit-learn, XGBoost, LightGBM, CatBoost)

Note

For Mac users: when possible, prefer LM Studio with the MLX backend, which gives 10-20% more performance than GGUF. Since the benchmarks run both LM Studio and Ollama by default, the results let you compare the MLX and GGUF formats directly.

The MLX backend makes the benchmark harder to maintain, but it provides a more realistic performance picture, since it is easy to convert a safetensors model into a quantized x-bit MLX model.
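Such a conversion is roughly a one-liner with the mlx-lm package. A minimal sketch, assuming a recent mlx-lm release on Apple Silicon; the source repo and output path below are illustrative, not part of the La Perf tooling:

```python
# Sketch: convert a safetensors model from the Hugging Face Hub into a
# quantized MLX model. Requires `pip install mlx-lm` (Apple Silicon only);
# the convert() signature follows recent mlx-lm releases and may differ.
from mlx_lm import convert

convert(
    hf_path="Qwen/Qwen2.5-7B-Instruct",  # illustrative source repo
    mlx_path="qwen2.5-7b-instruct-4bit", # output directory for MLX weights
    quantize=True,                       # quantize during conversion
    q_bits=4,                            # target bit width ("x-bit")
)
```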

## Benchmark Results

Last Updated: 2025-11-19

| Device | Platform | CPU | GPU | VRAM | Emb RPS P50 | LLM TPS P50 (lms) | LLM TPS P50 (ollama) | VLM TPS P50 (lms) | VLM TPS P50 (ollama) | GPU Power P50 | CPU Power P50 | Emb Efficiency (RPS/W) | LLM Efficiency (TPS/W) lms | LLM Efficiency (TPS/W) ollama | VLM Efficiency (TPS/W) lms | VLM Efficiency (TPS/W) ollama |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ASUSTeK COMPUTER ASUS Vivobook Pro N6506MV | 🐧 Linux | Intel(R) Core(TM) Ultra 9 185H (16) | NVIDIA GeForce RTX 4060 Laptop GPU | 8 GB | 119.1 | 8.8 | 10.0 | 11.8 | 8.4 | 16.6 W | - | 7.18 | 0.53 | 0.60 | 0.71 | 0.51 |
| Mac16,6 | 🍏 macOS | Apple M4 Max (14) | Apple M4 Max (32 cores) | shared with system RAM | 56.2 | 61.5 | 61.4 | 55.4 | 45.7 | 11.7 W | 1.0 W | 4.79 | 5.24 | 5.24 | 4.72 | 3.89 |
| Mac16,6 (battery) | 🍏 macOS | Apple M4 Max (14) (battery) | Apple M4 Max (32 cores) (battery) | shared with system RAM | 56.2 | 59.1 | 60.6 | 54.8 | 44.9 | 11.4 W | 1.0 W | 4.94 | 5.21 | 5.33 | 4.83 | 3.95 |
| OpenStack Nova 26.0.7-1 A100 40GB | 🐧 Linux | Intel(R) Xeon(R) Gold 6240R CPU @ 2.40GHz | NVIDIA A100-PCIE-40GB | 39 GB | 453.6 | - | 113.5 | - | 108.0 | 218.2 W | - | 2.08 | - | 0.52 | - | 0.50 |
| OpenStack Nova A100 80GB | 🐧 Linux | Intel Xeon Processor (Icelake) | NVIDIA A100 80GB PCIe | 79 GB | 623.8 | - | 135.5 | - | 121.2 | 230.5 W | - | 2.71 | - | 0.59 | - | 0.53 |
| OpenStack Nova RTX3090 | 🐧 Linux | Intel Xeon Processor (Cascadelake) | NVIDIA GeForce RTX 3090 | 24 GB | 349.5 | - | 114.8 | - | 105.3 | 345.6 W | - | 1.01 | - | 0.33 | - | 0.30 |
| OpenStack Nova RTX4090 | 🐧 Linux | Intel Xeon Processor (Icelake) | NVIDIA GeForce RTX 4090 | 24 GB | 643.6 | - | 148.7 | - | 130.4 | 282.5 W | - | 2.28 | - | 0.53 | - | 0.46 |
| OpenStack Nova Tesla T4 | 🐧 Linux | Intel Xeon Processor (Cascadelake) | Tesla T4 | 15 GB | 133.7 | - | 41.5 | - | 32.6 | 68.9 W | - | 1.94 | - | 0.60 | - | 0.47 |

- RPS - Requests Per Second (embeddings throughput)
- TPS - Tokens Per Second (generation speed)
- W - Watts (power consumption)

Efficiency metrics (RPS/W, TPS/W) are calculated using GPU power consumption.
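For example, the LLM efficiency figure in the Mac16,6 row is simply its P50 throughput divided by its P50 GPU power:

```python
# Efficiency = P50 throughput / P50 GPU power, per the definition above.
# Values taken from the Mac16,6 row (LM Studio backend).
llm_tps_p50 = 61.5      # tokens per second
gpu_power_p50_w = 11.7  # watts

print(f"LLM efficiency: {llm_tps_p50 / gpu_power_p50_w:.2f} TPS/W")
# -> 5.26 from the rounded table values; the table's 5.24 is computed
#    from the unrounded measurements.
```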

## Power Metrics

| Device | CPU Usage (p50/p95) | RAM Used GB (p50/p95) | VRAM Used GB (p50/p95) | GPU Usage (p50/p95) | GPU Temp (p50/p95) | Battery (start/end/Δ) | Duration | GPU Power (p50/p95) | CPU Power (p50/p95) |
|---|---|---|---|---|---|---|---|---|---|
| ASUSTeK COMPUTER ASUS Vivobook Pro N6506MV | 35.6% / 43.4% | 12.3GB / 15.9GB | 7.1GB / 7.3GB | 12.0% / 32.0% | 56.0°C / 58.0°C | 100.0% / 100.0% / +0.0% | 3h 54m | 16.6W / 21.4W | N/A |
| Mac16,6 | 3.7% / 7.9% | 21.5GB / 24.3GB | 11.1GB / 14.4GB | 97.0% / 100.0% | N/A | 85% / 85% / +0.0% | 44m 29s | 11.7W / 33.4W | 1.0W / 2.3W |
| Mac16,6 (battery) | 3.7% / 8.0% | 19.9GB / 23.6GB | 10.6GB / 14.1GB | 97.0% / 100.0% | N/A | 85% / 19% / -66.0% | 48m 7s | 11.4W / 32.8W | 1.0W / 2.1W |
| OpenStack Nova 26.0.7-1 A100 40GB | 23.4% / 32.0% | 5.4GB / 6.2GB | 12.0GB / 13.6GB | 77.0% / 85.0% | 59.0°C / 66.0°C | N/A | 16m 44s | 218.2W / 256.2W | N/A |
| OpenStack Nova A100 80GB | 8.7% / 11.3% | 5.6GB / 6.3GB | 12.0GB / 13.6GB | 86.0% / 90.0% | 52.0°C / 55.0°C | N/A | 14m 38s | 230.5W / 274.4W | N/A |
| OpenStack Nova RTX3090 | 17.9% / 22.2% | 4.9GB / 5.6GB | 11.7GB / 13.2GB | 82.0% / 86.0% | 62.0°C / 62.0°C | N/A | 15m 10s | 345.6W / 348.7W | N/A |
| OpenStack Nova RTX4090 | 17.5% / 20.9% | 4.8GB / 5.6GB | 11.8GB / 13.5GB | 84.0% / 89.0% | 57.0°C / 60.0°C | N/A | 13m 12s | 282.5W / 331.8W | N/A |
| OpenStack Nova Tesla T4 | 14.7% / 16.7% | 3.8GB / 4.4GB | 10.7GB / 12.4GB | 95.0% / 96.0% | 49.0°C / 49.0°C | N/A | 44m 32s | 68.9W / 71.5W | N/A |

Note

For devices with unified memory (e.g. Apple Silicon), VRAM usage represents the portion of shared RAM allocated to the GPU — it does not indicate a separate dedicated memory pool as on discrete GPUs.

Duration shows the total monitoring time during benchmark execution.

## Embeddings

### Text Embeddings (3000 IMDB samples)

RPS = Rows Per Second — number of text samples encoded per second.
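A minimal sketch of how this metric can be reproduced with sentence-transformers; the timing loop below is illustrative, not the actual La Perf harness:

```python
import time

from datasets import load_dataset
from sentence_transformers import SentenceTransformer

# 3000 IMDB samples and batch size 32, matching the table below.
texts = load_dataset("imdb", split="train[:3000]")["text"]
model = SentenceTransformer("nomic-ai/modernbert-embed-base")

start = time.perf_counter()
embeddings = model.encode(texts, batch_size=32)
elapsed = time.perf_counter() - start

print(f"RPS: {len(texts) / elapsed:.2f}, "
      f"dim: {embeddings.shape[1]}, time: {elapsed:.2f}s")
```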

| Device | Model | RPS (mean ± std) | Time (s) (mean ± std) | Embedding Dim | Batch Size |
|---|---|---|---|---|---|
| ASUSTeK COMPUTER ASUS Vivobook Pro N6506MV | nomic-ai/modernbert-embed-base | 119.14 ± 17.84 | 25.53 ± 3.52 | 768 | 32 |
| Mac16,6 | nomic-ai/modernbert-embed-base | 56.18 ± 0.78 | 53.41 ± 0.75 | 768 | 32 |
| Mac16,6 (battery) | nomic-ai/modernbert-embed-base | 56.17 ± 0.70 | 53.42 ± 0.66 | 768 | 32 |
| OpenStack Nova 26.0.7-1 A100 40GB | nomic-ai/modernbert-embed-base | 453.58 ± 2.09 | 6.61 ± 0.03 | 768 | 32 |
| OpenStack Nova A100 80GB | nomic-ai/modernbert-embed-base | 623.81 ± 1.30 | 4.81 ± 0.01 | 768 | 32 |
| OpenStack Nova RTX3090 | nomic-ai/modernbert-embed-base | 349.50 ± 0.97 | 8.58 ± 0.02 | 768 | 32 |
| OpenStack Nova RTX4090 | nomic-ai/modernbert-embed-base | 643.55 ± 2.16 | 4.66 ± 0.02 | 768 | 32 |
| OpenStack Nova Tesla T4 | nomic-ai/modernbert-embed-base | 133.71 ± 1.22 | 22.44 ± 0.20 | 768 | 32 |

**Embeddings Performance Profile**

Throughput comparison for different embedding models across hardware. Higher values indicate better performance.

**Embeddings Efficiency**

Embeddings efficiency (RPS/W) across devices. Higher values indicate better performance per watt.

## LLMs

### LLM Inference (10 prompts from awesome-chatgpt-prompts)
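The TPS, TTFT, and latency percentiles below can be reproduced against any OpenAI-compatible endpoint. A minimal sketch, assuming LM Studio's local server on its default port 1234; the base URL, model name, and prompt are illustrative, and tokens are approximated by stream chunks rather than a tokenizer:

```python
import time

from openai import OpenAI

# LM Studio exposes an OpenAI-compatible API; api_key can be any string.
client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed")

start = time.perf_counter()
ttft = None
chunks = []
stream = client.chat.completions.create(
    model="openai/gpt-oss-20b",
    messages=[{"role": "user", "content": "Explain P50 vs P95 latency."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        if ttft is None:
            ttft = time.perf_counter() - start  # time to first token
        chunks.append(delta)
latency = time.perf_counter() - start  # end-to-end request time

# Chunk count only approximates output tokens; the benchmark counts real ones.
print(f"TTFT {ttft:.2f}s, latency {latency:.2f}s, "
      f"~TPS {len(chunks) / (latency - ttft):.1f}")
```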

**LM Studio**

| Device | Model | TPS P50 | TPS P95 | TTFT P50 (s) | TTFT P95 (s) | TG P50 (s) | TG P95 (s) | Latency P50 (s) | Latency P95 (s) | Input Tokens (total avg) | Output Tokens (total avg) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ASUSTeK COMPUTER ASUS Vivobook Pro N6506MV | openai/gpt-oss-20b | 8.82 ± 0.05 | 9.18 ± 0.07 | 5.65 ± 0.10 | 12.21 ± 0.26 | 1.59 ± 0.01 | 142.55 ± 7.24 | 11.58 ± 0.05 | 149.31 ± 7.54 | 1728 | 4192 |
| Mac16,6 | openai/gpt-oss-20b | 61.51 ± 1.13 | 83.84 ± 2.98 | 0.88 ± 0.01 | 1.42 ± 0.02 | 0.26 ± 0.00 | 16.00 ± 0.57 | 1.43 ± 0.03 | 17.04 ± 0.60 | 1728 | 4459 |
| Mac16,6 (battery) | openai/gpt-oss-20b | 59.15 ± 1.41 | 77.72 ± 1.14 | 0.94 ± 0.02 | 1.47 ± 0.02 | 0.27 ± 0.00 | 17.39 ± 0.30 | 1.48 ± 0.04 | 18.50 ± 0.29 | 1728 | 4459 |

**Ollama**

| Device | Model | TPS P50 | TPS P95 | TTFT P50 (s) | TTFT P95 (s) | TG P50 (s) | TG P95 (s) | Latency P50 (s) | Latency P95 (s) | Input Tokens (total avg) | Output Tokens (total avg) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ASUSTeK COMPUTER ASUS Vivobook Pro N6506MV | gpt-oss:20b | 10.04 ± 0.05 | 10.17 ± 0.03 | 26.74 ± 0.20 | 92.85 ± 43.15 | 3.14 ± 0.10 | 120.19 ± 4.75 | 37.79 ± 0.04 | 187.28 ± 54.75 | 1728 | 7201 |
| Mac16,6 | gpt-oss:20b | 61.43 ± 6.28 | 65.10 ± 5.17 | 3.79 ± 0.55 | 18.84 ± 1.57 | 0.45 ± 0.03 | 24.20 ± 3.50 | 5.89 ± 0.56 | 39.85 ± 3.58 | 1728 | 8535 |
| Mac16,6 (battery) | gpt-oss:20b | 60.56 ± 6.81 | 64.23 ± 5.94 | 4.01 ± 0.80 | 42.93 ± 42.00 | 0.51 ± 0.11 | 24.65 ± 3.90 | 9.34 ± 6.01 | 59.53 ± 34.44 | 1728 | 11877 |
| OpenStack Nova 26.0.7-1 A100 40GB | gpt-oss:20b | 113.51 ± 1.74 | 119.83 ± 0.78 | 1.92 ± 0.01 | 31.23 ± 15.21 | 0.56 ± 0.00 | 11.08 ± 0.85 | 5.24 ± 0.09 | 35.87 ± 15.85 | 1728 | 13042 |
| OpenStack Nova A100 80GB | gpt-oss:20b | 135.49 ± 0.36 | 141.08 ± 0.38 | 1.58 ± 0.01 | 26.31 ± 12.50 | 0.48 ± 0.01 | 9.41 ± 0.67 | 4.40 ± 0.01 | 30.23 ± 12.96 | 1728 | 13042 |
| OpenStack Nova RTX3090 | gpt-oss:20b | 114.83 ± 0.13 | 119.78 ± 0.46 | 3.24 ± 0.03 | 9.86 ± 0.04 | 0.24 ± 0.00 | 10.64 ± 0.07 | 5.30 ± 0.01 | 19.43 ± 0.09 | 1728 | 8926 |
| OpenStack Nova RTX4090 | gpt-oss:20b | 148.69 ± 0.54 | 153.80 ± 0.24 | 2.69 ± 0.02 | 13.65 ± 0.04 | 0.26 ± 0.00 | 8.68 ± 0.03 | 6.25 ± 0.12 | 18.51 ± 0.08 | 1728 | 11979 |
| OpenStack Nova Tesla T4 | gpt-oss:20b | 41.49 ± 0.23 | 42.17 ± 0.07 | 13.07 ± 2.85 | 52.33 ± 15.42 | 0.85 ± 0.11 | 35.54 ± 4.21 | 15.88 ± 1.51 | 84.51 ± 10.32 | 1728 | 12683 |

**LLM E2E Latency Performance**

End-to-End Latency P50 - Lower is better. Measures full request-to-response time.

**LLM Throughput Performance**

Token Generation per second (TPS) - Higher is better. Measures token generation speed.

**LLM Efficiency**

LLM inference efficiency (TPS/W) by backend. Higher values indicate better performance per watt.

## VLMs

### VLM Inference (10 questions from Hallucination_COCO)
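The VLM runs differ from the LLM runs mainly in that each request carries an image. A minimal sketch of a single request via the ollama Python package; the image path and question are placeholders, not actual Hallucination_COCO items:

```python
import ollama

# One VLM request: the message carries both a question and an image.
# Assumes `pip install ollama` and a local Ollama server with the model pulled.
response = ollama.chat(
    model="qwen3-vl:8b",
    messages=[{
        "role": "user",
        "content": "Is there a dog in this image? Answer yes or no.",
        "images": ["./coco_sample.jpg"],  # local path to the image file
    }],
)
print(response["message"]["content"])
```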

**LM Studio**

| Device | Model | TPS P50 | TPS P95 | TTFT P50 (s) | TTFT P95 (s) | TG P50 (s) | TG P95 (s) | Latency P50 (s) | Latency P95 (s) | Input Tokens (total avg) | Output Tokens (total avg) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ASUSTeK COMPUTER ASUS Vivobook Pro N6506MV | qwen3-vl-8b-thinking | 11.81 ± 0.04 | 12.01 ± 0.04 | 0.85 ± 0.06 | 1.02 ± 0.06 | 147.01 ± 0.75 | 174.51 ± 0.41 | 147.83 ± 0.79 | 175.48 ± 0.36 | 290 | 16709 |
| Mac16,6 | qwen3-vl-8b-thinking-mlx | 55.39 ± 0.96 | 56.20 ± 1.60 | 1.51 ± 0.02 | 1.65 ± 0.02 | 20.15 ± 1.15 | 24.70 ± 0.81 | 21.68 ± 1.14 | 26.29 ± 0.91 | 310 | 11899 |
| Mac16,6 (battery) | qwen3-vl-8b-thinking-mlx | 54.82 ± 1.12 | 55.60 ± 1.58 | 1.53 ± 0.02 | 1.66 ± 0.01 | 19.57 ± 0.35 | 29.45 ± 1.85 | 21.08 ± 0.33 | 31.03 ± 1.82 | 310 | 12377 |

**Ollama**

| Device | Model | TPS P50 | TPS P95 | TTFT P50 (s) | TTFT P95 (s) | TG P50 (s) | TG P95 (s) | Latency P50 (s) | Latency P95 (s) | Input Tokens (total avg) | Output Tokens (total avg) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ASUSTeK COMPUTER ASUS Vivobook Pro N6506MV | qwen3-vl:8b | 8.40 ± 0.13 | 8.87 ± 0.03 | 86.59 ± 7.58 | 117.63 ± 3.06 | 95.99 ± 2.91 | 136.66 ± 2.31 | 178.61 ± 12.51 | 244.59 ± 1.09 | 1814 | 14933 |
| Mac16,6 | qwen3-vl:8b | 45.66 ± 0.39 | 49.68 ± 6.80 | 15.68 ± 0.41 | 27.98 ± 0.55 | 16.31 ± 0.35 | 19.45 ± 1.09 | 33.54 ± 2.06 | 44.30 ± 0.94 | 1814 | 15577 |
| Mac16,6 (battery) | qwen3-vl:8b | 44.87 ± 0.09 | 49.33 ± 7.16 | 15.85 ± 0.47 | 28.47 ± 0.75 | 16.57 ± 0.33 | 19.74 ± 1.05 | 33.88 ± 2.23 | 45.10 ± 1.29 | 1814 | 15577 |
| OpenStack Nova 26.0.7-1 A100 40GB | qwen3-vl:8b | 108.03 ± 0.17 | 108.57 ± 0.57 | 7.09 ± 0.01 | 11.59 ± 0.70 | 6.97 ± 0.03 | 9.42 ± 0.01 | 14.03 ± 0.02 | 19.40 ± 0.46 | 1814 | 16212 |
| OpenStack Nova A100 80GB | qwen3-vl:8b | 121.16 ± 0.23 | 121.55 ± 0.26 | 6.26 ± 0.01 | 10.34 ± 0.70 | 6.25 ± 0.03 | 8.43 ± 0.02 | 12.53 ± 0.03 | 17.34 ± 0.52 | 1814 | 16212 |
| OpenStack Nova RTX3090 | qwen3-vl:8b | 105.30 ± 0.42 | 105.65 ± 0.28 | 7.44 ± 0.09 | 11.97 ± 0.06 | 7.59 ± 0.28 | 9.05 ± 0.02 | 14.17 ± 0.11 | 19.59 ± 0.06 | 1814 | 15940 |
| OpenStack Nova RTX4090 | qwen3-vl:8b | 130.42 ± 0.25 | 130.94 ± 0.22 | 5.53 ± 0.01 | 8.99 ± 0.01 | 5.85 ± 0.07 | 7.29 ± 0.01 | 10.97 ± 0.21 | 15.24 ± 0.03 | 1814 | 15258 |
| OpenStack Nova Tesla T4 | qwen3-vl:8b | 32.63 ± 0.01 | 32.74 ± 0.06 | 23.38 ± 0.04 | 32.20 ± 0.02 | 23.17 ± 0.10 | 32.79 ± 0.04 | 46.78 ± 0.51 | 63.59 ± 0.05 | 1814 | 15737 |

**VLM E2E Latency Performance**

End-to-End Latency P50 - Lower is better. Measures full request-to-response time.

**VLM Throughput Performance**

Token Generation per second (TPS) - Higher is better. Measures token generation speed.

**VLM Efficiency**

VLM inference efficiency (TPS/W) by backend. Higher values indicate better performance per watt.


*All metrics are shown as mean ± standard deviation across 3 runs.*