# Metrics
This section describes how La Perf calculates and evaluates metrics across different benchmark tasks.
## Embeddings

### Overview
Embedding benchmarks use the sentence-transformers library for encoding operations.
| Metric | Description | Unit |
|---|---|---|
| E2E Latency | Total time to encode full dataset | seconds |
| RPS | Rows Per Second (throughput) | rows/s |
### Measurement Methodology
The total encoding latency is measured around the .encode() call, which internally handles batching.
Each run includes device synchronization before and after encoding to ensure accurate timing.
Implementation details:
- Uses `torch.cuda.synchronize()` for NVIDIA GPUs
- Uses `torch.mps.synchronize()` for Apple Silicon GPUs
- Ensures complete device-side execution before measurement
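
A rough sketch of this timing pattern (the model name, dataset, and batch size below are placeholders, not La Perf's actual configuration):

```python
import time

import torch
from sentence_transformers import SentenceTransformer


def sync(device: str) -> None:
    # Block until all queued device work has finished so that
    # time.perf_counter() brackets real device-side execution.
    if device == "cuda":
        torch.cuda.synchronize()
    elif device == "mps":
        torch.mps.synchronize()


device = "cuda" if torch.cuda.is_available() else "cpu"
model = SentenceTransformer("all-MiniLM-L6-v2", device=device)  # placeholder model
texts = ["an example sentence to embed"] * 1_000                # placeholder dataset

sync(device)
t0 = time.perf_counter()
model.encode(texts, batch_size=64)  # .encode() handles batching internally
sync(device)
e2e_latency = time.perf_counter() - t0

rps = len(texts) / e2e_latency  # Rows Per Second (throughput)
print(f"E2E latency: {e2e_latency:.2f} s | RPS: {rps:.1f} rows/s")
```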
### Cross-Run Statistics
For multiple benchmark runs, simple mean and standard deviation are calculated:

    mean(run1, run2, run3) ± std(run1, run2, run3)

    # Example with RPS:
    # final_mean_rps = mean([run1_rps, run2_rps, run3_rps])
    # final_std_rps = std([run1_rps, run2_rps, run3_rps])
    # In the results table you see: final_mean_rps ± final_std_rps

Note: Embeddings use direct mean/std across runs, not percentile-based statistics.
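
The same aggregation expressed with NumPy as a sketch, using made-up RPS values for three runs:

```python
import numpy as np

run_rps = np.array([512.3, 498.7, 505.1])  # hypothetical per-run RPS results

final_mean_rps = run_rps.mean()
final_std_rps = run_rps.std()  # population std; use std(ddof=1) for sample std

print(f"{final_mean_rps:.1f} ± {final_std_rps:.1f} rows/s")
```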
## LLMs & VLMs

### Overview
| Metric | Description | Unit |
|---|---|---|
| TTFT | Time To First Token — prompt processing latency | seconds |
| TG | Token Generation — time spent generating output | seconds |
| TPS | Tokens Per Second — generation throughput | tokens/s |
| E2E Latency | End-to-end request latency | seconds |
### Measurement Methodology

#### Streaming & Token Counting
La Perf uses streaming APIs (Ollama, LM Studio via OpenAI SDK) to measure both latency and throughput.
Critical distinction: API chunks ≠ tokens
The server sends responses in chunks, but each chunk may contain multiple tokens. Token counts are obtained from server-side usage statistics.
#### Per-Request Measurements
For each prompt in the benchmark:
| Timestamp | Description |
|---|---|
| `t0_stream` | Request start time |
| `first_token_ts` | First chunk received (≈ first token) |
| `t1_stream` | Response complete |
| Token Count | Source |
|---|---|
| `input_tokens` | From server usage stats |
| `output_tokens` | From server usage stats |
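
A minimal sketch of capturing these values with the OpenAI Python SDK against a local OpenAI-compatible endpoint (the base URL, model name, and prompt are placeholders; this illustrates the approach rather than La Perf's exact code):

```python
import time

from openai import OpenAI

# Placeholder endpoint for a local OpenAI-compatible server (e.g. LM Studio)
client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed")

t0_stream = time.perf_counter()
first_token_ts = None
usage = None

stream = client.chat.completions.create(
    model="local-model",  # placeholder
    messages=[{"role": "user", "content": "Explain KV cache in one paragraph."}],
    stream=True,
    stream_options={"include_usage": True},  # ask the server to report token usage
)

for chunk in stream:
    if first_token_ts is None and chunk.choices and chunk.choices[0].delta.content:
        first_token_ts = time.perf_counter()  # first chunk ≈ first token
    if chunk.usage is not None:  # usage stats arrive with the final chunk
        usage = chunk.usage

t1_stream = time.perf_counter()

# Token counts come from server-side usage stats, never from counting chunks
input_tokens = usage.prompt_tokens
output_tokens = usage.completion_tokens
```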
#### Metric Calculations
| Metric | Formula | Notes |
|---|---|---|
| E2E Latency | `t1_stream - t0_stream` | Total request time |
| TTFT | `first_token_ts - t0_stream` | Prompt processing time |
| TG | `t1_stream - first_token_ts` | Generation phase time |
| TPS | `output_tokens / E2E Latency` | Client-side throughput metric |
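
Continuing the sketch above, the per-request metrics are then plain arithmetic:

```python
e2e_latency = t1_stream - t0_stream   # total request time
ttft = first_token_ts - t0_stream     # prompt processing time
tg = t1_stream - first_token_ts       # generation phase time
tps = output_tokens / e2e_latency     # client-side throughput
```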
#### Why TPS = output_tokens / E2E Latency?
Incorrect approach: `TPS = output_tokens / TG`
Fifty-two thousand tokens per second? Goodbye H100, my local PC just destroyed you!
Yeah, no. This calculation is hilariously wrong.
This vastly overestimates performance because TG measures only the time between first and last chunk, not the actual token generation time.
Correct approach: `TPS = output_tokens / E2E Latency`
This reflects real-world throughput from the client perspective.
Limitation: For very short outputs (1-2 chunks), TG may not accurately represent generation time. Server-side metrics would be more precise but are not currently collected.
### Per-Metric Percentiles
For each metric across all requests, La Perf computes:
| Percentile | Description |
|---|---|
| P25 | 25th percentile |
| P50 | Median |
| P75 | 75th percentile |
| P95 | 95th percentile |
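
As a sketch, these percentiles can be computed with NumPy over the per-request values (the TPS numbers here are made up):

```python
import numpy as np

tps_values = np.array([41.2, 39.8, 44.5, 40.1, 37.9, 43.0])  # hypothetical per-request TPS

p25, p50, p75, p95 = np.percentile(tps_values, [25, 50, 75, 95])
print(f"P25={p25:.1f}  P50={p50:.1f}  P75={p75:.1f}  P95={p95:.1f} tokens/s")
```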
### Cross-Run Statistics
For multiple benchmark runs, statistics are calculated from percentile values across runs:

    mean(run1_percentile, run2_percentile, run3_percentile) ± std(run1_percentile, run2_percentile, run3_percentile)

    # Example with P50 TPS:
    # final_p50_tps = mean([run1_p50_tps, run2_p50_tps, run3_p50_tps])
    # final_p50_tps_std = std([run1_p50_tps, run2_p50_tps, run3_p50_tps])
    # In the results table you see: final_p50_tps ± final_p50_tps_std

Note: LLM/VLM benchmarks compute percentiles per run first, then aggregate across runs. This differs from Embeddings, which use direct mean/std.
These aggregated values appear in the results tables.
## Power Metrics

### Overview
La Perf monitors system resource usage and power consumption during benchmarks. Power metrics are collected continuously throughout benchmark execution.
| Metric | Description | Unit |
|---|---|---|
| GPU Power | GPU power consumption | watts |
| CPU Power | CPU power consumption (macOS only) | watts |
| GPU VRAM Used | GPU memory used | MB |
| GPU VRAM Total | Total GPU memory available | MB |
| GPU Utilization | GPU compute utilization | % |
| GPU Temperature | GPU temperature | °C |
| CPU Utilization | CPU utilization across all cores | % |
| RAM Used | System RAM used by process | GB |
### Measurement Methodology

#### Data Collection
Power metrics are sampled continuously during benchmark execution:
- Sampling interval: 1 second (configurable)
- Background monitoring: Runs in parallel with benchmark workload
- Platform-specific tools (see the sketch after this list):
    - NVIDIA GPUs: `nvidia-smi` for power, utilization, memory, temperature
    - macOS (Apple Silicon): `sudo powermetrics` for GPU/CPU power + `ioreg` for GPU stats
    - CPU/RAM: `psutil` library for cross-platform monitoring
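
A minimal sketch of such a background sampler on an NVIDIA system (field names and parsing are illustrative; La Perf's actual monitor may differ):

```python
import subprocess
import threading
import time

import psutil

samples = []


def sample_once() -> dict:
    # Query GPU stats via nvidia-smi; fields map to the metrics table above
    query = "power.draw,utilization.gpu,memory.used,memory.total,temperature.gpu"
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={query}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    watts, util, vram_used, vram_total, temp = (float(v) for v in out.split(", "))
    return {
        "gpu_watts": watts,
        "gpu_util_percent": util,
        "gpu_vram_used_mb": vram_used,
        "gpu_vram_total_mb": vram_total,
        "gpu_temp_celsius": temp,
        "cpu_util_percent": psutil.cpu_percent(interval=None),
        "ram_used_gb": psutil.Process().memory_info().rss / 1e9,
    }


def monitor(stop: threading.Event, interval_s: float = 1.0) -> None:
    # Runs in a background thread alongside the benchmark workload
    while not stop.is_set():
        samples.append(sample_once())
        time.sleep(interval_s)


stop = threading.Event()
thread = threading.Thread(target=monitor, args=(stop,), daemon=True)
thread.start()
# ... run the benchmark workload ...
stop.set()
thread.join()
```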
#### macOS powermetrics

On macOS, when the `--use-sudo-powermetrics` flag is enabled, the benchmark collects:

- GPU Power: via `powermetrics --samplers gpu_power`
- CPU Power: via `powermetrics --samplers cpu_power`
The powermetrics process runs in the background and outputs to a log file, which is parsed after benchmark completion.
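
A rough sketch of that flow (the exact `powermetrics` log line format differs across macOS versions, so the regexes below are assumptions):

```python
import re
import subprocess

LOG_PATH = "powermetrics.log"

# Start powermetrics in the background, one sample per second (-i is in milliseconds)
log_file = open(LOG_PATH, "w")
proc = subprocess.Popen(
    ["sudo", "powermetrics", "--samplers", "gpu_power,cpu_power", "-i", "1000"],
    stdout=log_file, stderr=subprocess.DEVNULL,
)

# ... run the benchmark workload ...

proc.terminate()
log_file.close()

# Parse the log after benchmark completion; "GPU Power: NNNN mW" style lines
# are an assumption about Apple Silicon output and may need adjusting.
text = open(LOG_PATH).read()
gpu_watts = [int(v) / 1000 for v in re.findall(r"GPU Power:\s*(\d+)\s*mW", text)]
cpu_watts = [int(v) / 1000 for v in re.findall(r"CPU Power:\s*(\d+)\s*mW", text)]
```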
### Percentile Statistics
For each metric, La Perf computes percentiles across all samples collected during the benchmark:
| Percentile | Description |
|---|---|
| P50 | Median value (50th percentile) |
| P95 | 95th percentile (high load indicator) |
Example: If 2673 power samples were collected over 2672 seconds:
- gpu_watts_p50 = 68.92W means 50% of samples were ≤68.92W
- gpu_watts_p95 = 71.47W means 95% of samples were ≤71.47W
### Metrics Breakdown

#### Power Consumption

- `gpu_watts_p50/p95`: GPU power draw (NVIDIA: from `nvidia-smi`, macOS: from `powermetrics`)
- `cpu_watts_p50/p95`: CPU power draw (macOS only, requires sudo)

#### GPU Resource Usage

- `gpu_vram_used_mb_p50/p95`: GPU memory used by the process
- `gpu_vram_total_mb_p50/p95`: Total GPU memory (should be constant)
- `gpu_util_percent_p50/p95`: GPU compute utilization (0-100%)
- `gpu_temp_celsius_p50/p95`: GPU temperature

#### System Resource Usage

- `cpu_util_percent_p50/p95`: CPU utilization across all cores
- `ram_used_gb_p50/p95`: System RAM used by the benchmark process
### Cross-Run Statistics
When running multiple benchmark iterations, power metrics are aggregated:

    # For each power metric, compute mean and std across runs
    mean(run1_p50, run2_p50, run3_p50) ± std(run1_p50, run2_p50, run3_p50)

    # Example with GPU power P50:
    # final_gpu_watts_p50 = mean([run1_p50, run2_p50, run3_p50])
    # final_gpu_watts_p50_std = std([run1_p50, run2_p50, run3_p50])

### Metadata
Each power metrics result includes:
- `samples_collected`: Total number of samples taken
- `monitoring_duration_seconds`: Total monitoring duration
This allows verifying that sampling worked correctly throughout the benchmark.
## Notes

- All timing values are wall-clock times measured via `time.perf_counter()`.
- Benchmarks are repeated at least 3 times to compute mean and standard deviation.
- All metrics are device-synchronized and exclude warmup runs.