# Metrics
This section describes how La Perf calculates and evaluates metrics across different benchmark tasks.
## Embeddings

### Overview
Embedding benchmarks use the sentence-transformers library for encoding operations.
| Metric | Description | Unit |
|---|---|---|
| E2E Latency | Total time to encode full dataset | seconds |
| RPS | Rows Per Second (throughput) | rows/s |
### Measurement Methodology
The total encoding latency is measured around the .encode() call, which internally handles batching.
Each run includes device synchronization before and after encoding to ensure accurate timing.
Implementation details:
- Uses `torch.cuda.synchronize()` for NVIDIA GPUs
- Uses `torch.mps.synchronize()` for Apple Silicon GPUs
- Ensures complete device-side execution before measurement
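
A rough sketch of this timing pattern (the model name, dataset, and batch size below are placeholders, not La Perf's actual configuration):

```python
import time

import torch
from sentence_transformers import SentenceTransformer


def sync(device: str) -> None:
    # Block until all queued device work has finished so that
    # time.perf_counter() brackets real device-side execution.
    if device == "cuda":
        torch.cuda.synchronize()
    elif device == "mps":
        torch.mps.synchronize()


device = "cuda" if torch.cuda.is_available() else "cpu"
model = SentenceTransformer("all-MiniLM-L6-v2", device=device)  # placeholder model
texts = ["an example sentence to embed"] * 1_000                # placeholder dataset

sync(device)
t0 = time.perf_counter()
model.encode(texts, batch_size=64)  # .encode() handles batching internally
sync(device)
e2e_latency = time.perf_counter() - t0

rps = len(texts) / e2e_latency  # Rows Per Second (throughput)
print(f"E2E latency: {e2e_latency:.2f} s | RPS: {rps:.1f} rows/s")
```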
### Cross-Run Statistics
For multiple benchmark runs, simple mean and standard deviation are calculated:

    mean(run1, run2, run3) ± std(run1, run2, run3)

    # Example with RPS:
    # final_mean_rps = mean([run1_rps, run2_rps, run3_rps])
    # final_std_rps = std([run1_rps, run2_rps, run3_rps])
    # In the results table you see: final_mean_rps ± final_std_rps

Note: Embeddings use direct mean/std across runs, not percentile-based statistics.
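
The same aggregation expressed with NumPy as a sketch, using made-up RPS values for three runs:

```python
import numpy as np

run_rps = np.array([512.3, 498.7, 505.1])  # hypothetical per-run RPS results

final_mean_rps = run_rps.mean()
final_std_rps = run_rps.std()  # population std; use std(ddof=1) for sample std

print(f"{final_mean_rps:.1f} ± {final_std_rps:.1f} rows/s")
```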
## LLMs & VLMs

### Overview
| Metric | Description | Unit |
|---|---|---|
| TTFT | Time To First Token — prompt processing latency | seconds |
| TG | Token Generation — time spent generating output | seconds |
| TPS | Tokens Per Second — generation throughput | tokens/s |
| E2E Latency | End-to-end request latency | seconds |
### Measurement Methodology

#### Streaming & Token Counting
La Perf uses streaming APIs (Ollama, LM Studio via OpenAI SDK) to measure both latency and throughput.
Critical distinction: API chunks ≠ tokens
The server sends responses in chunks, but each chunk may contain multiple tokens. Token counts are obtained from server-side usage statistics.
#### Per-Request Measurements
For each prompt in the benchmark:
| Timestamp | Description |
|---|---|
| `t0_stream` | Request start time |
| `first_token_ts` | First chunk received (≈ first token) |
| `t1_stream` | Response complete |
| Token Count | Source |
|---|---|
| `input_tokens` | From server usage stats |
| `output_tokens` | From server usage stats |
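
A minimal sketch of capturing these values with the OpenAI Python SDK against a local OpenAI-compatible endpoint (the base URL, model name, and prompt are placeholders; this illustrates the approach rather than La Perf's exact code):

```python
import time

from openai import OpenAI

# Placeholder endpoint for a local OpenAI-compatible server (e.g. LM Studio)
client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed")

t0_stream = time.perf_counter()
first_token_ts = None
usage = None

stream = client.chat.completions.create(
    model="local-model",  # placeholder
    messages=[{"role": "user", "content": "Explain KV cache in one paragraph."}],
    stream=True,
    stream_options={"include_usage": True},  # ask the server to report token usage
)

for chunk in stream:
    if first_token_ts is None and chunk.choices and chunk.choices[0].delta.content:
        first_token_ts = time.perf_counter()  # first chunk ≈ first token
    if chunk.usage is not None:  # usage stats arrive with the final chunk
        usage = chunk.usage

t1_stream = time.perf_counter()

# Token counts come from server-side usage stats, never from counting chunks
input_tokens = usage.prompt_tokens
output_tokens = usage.completion_tokens
```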
#### Metric Calculations
| Metric | Formula | Notes |
|---|---|---|
| E2E Latency | `t1_stream - t0_stream` | Total request time |
| TTFT | `first_token_ts - t0_stream` | Prompt processing time |
| TG | `t1_stream - first_token_ts` | Generation phase time |
| TPS | `output_tokens / E2E Latency` | Client-side throughput metric |
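
Continuing the sketch above, the per-request metrics are then plain arithmetic:

```python
e2e_latency = t1_stream - t0_stream   # total request time
ttft = first_token_ts - t0_stream     # prompt processing time
tg = t1_stream - first_token_ts       # generation phase time
tps = output_tokens / e2e_latency     # client-side throughput
```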
#### Why TPS = output_tokens / E2E Latency?
Incorrect approach: `TPS = output_tokens / TG`
Fifty-two thousand tokens per second? Goodbye H100, my local PC just destroyed you!
Yeah, no. This calculation is hilariously wrong.
This vastly overestimates performance because TG measures only the time between first and last chunk, not the actual token generation time.
Correct approach: `TPS = output_tokens / E2E Latency`
This reflects real-world throughput from the client perspective.
Limitation: For very short outputs (1-2 chunks), TG may not accurately represent generation time. Server-side metrics would be more precise but are not currently collected.
### Per-Metric Percentiles
For each metric across all requests, La Perf computes:
| Percentile | Description |
|---|---|
| P25 | 25th percentile |
| P50 | Median |
| P75 | 75th percentile |
| P95 | 95th percentile |
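
As a sketch, these percentiles can be computed with NumPy over the per-request values (the TPS numbers here are made up):

```python
import numpy as np

tps_values = np.array([41.2, 39.8, 44.5, 40.1, 37.9, 43.0])  # hypothetical per-request TPS

p25, p50, p75, p95 = np.percentile(tps_values, [25, 50, 75, 95])
print(f"P25={p25:.1f}  P50={p50:.1f}  P75={p75:.1f}  P95={p95:.1f} tokens/s")
```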
### Cross-Run Statistics
For multiple benchmark runs, statistics are calculated from percentile values across runs:

    mean(run1_percentile, run2_percentile, run3_percentile) ± std(run1_percentile, run2_percentile, run3_percentile)

    # Example with P50 TPS:
    # final_p50_tps = mean([run1_p50_tps, run2_p50_tps, run3_p50_tps])
    # final_p50_tps_std = std([run1_p50_tps, run2_p50_tps, run3_p50_tps])
    # In the results table you see: final_p50_tps ± final_p50_tps_std

Note: LLM/VLM benchmarks compute percentiles per run first, then aggregate across runs. This differs from Embeddings, which use direct mean/std.
These aggregated values appear in the results tables.
## Power Metrics

### Overview
La Perf monitors system resource usage and power consumption during benchmarks. Power metrics are collected continuously throughout benchmark execution.
| Metric | Description | Unit |
|---|---|---|
| GPU Power | GPU power consumption | watts |
| CPU Power | CPU power consumption (macOS only) | watts |
| GPU VRAM Used | GPU memory used | MB |
| GPU VRAM Total | Total GPU memory available | MB |
| GPU Utilization | GPU compute utilization | % |
| GPU Temperature | GPU temperature | °C |
| CPU Utilization | CPU utilization across all cores | % |
| RAM Used | System RAM used by process | GB |
### Measurement Methodology

#### Data Collection
Power metrics are sampled continuously during benchmark execution:
- Sampling interval: 1 second (configurable)
- Background monitoring: Runs in parallel with benchmark workload
- Platform-specific tools (see the sketch after this list):
    - NVIDIA GPUs: `nvidia-smi` for power, utilization, memory, temperature
    - macOS (Apple Silicon): `sudo powermetrics` for GPU/CPU power + `ioreg` for GPU stats
    - CPU/RAM: `psutil` library for cross-platform monitoring
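
A minimal sketch of such a background sampler on an NVIDIA system (field names and parsing are illustrative; La Perf's actual monitor may differ):

```python
import subprocess
import threading
import time

import psutil

samples = []


def sample_once() -> dict:
    # Query GPU stats via nvidia-smi; fields map to the metrics table above
    query = "power.draw,utilization.gpu,memory.used,memory.total,temperature.gpu"
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={query}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    watts, util, vram_used, vram_total, temp = (float(v) for v in out.split(", "))
    return {
        "gpu_watts": watts,
        "gpu_util_percent": util,
        "gpu_vram_used_mb": vram_used,
        "gpu_vram_total_mb": vram_total,
        "gpu_temp_celsius": temp,
        "cpu_util_percent": psutil.cpu_percent(interval=None),
        "ram_used_gb": psutil.Process().memory_info().rss / 1e9,
    }


def monitor(stop: threading.Event, interval_s: float = 1.0) -> None:
    # Runs in a background thread alongside the benchmark workload
    while not stop.is_set():
        samples.append(sample_once())
        time.sleep(interval_s)


stop = threading.Event()
thread = threading.Thread(target=monitor, args=(stop,), daemon=True)
thread.start()
# ... run the benchmark workload ...
stop.set()
thread.join()
```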
#### macOS powermetrics

On macOS, when the `--use-sudo-powermetrics` flag is enabled, the benchmark collects:

- GPU Power: via `powermetrics --samplers gpu_power`
- CPU Power: via `powermetrics --samplers cpu_power`
The powermetrics process runs in the background and outputs to a log file, which is parsed after benchmark completion.
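
A rough sketch of that flow (the exact `powermetrics` log line format differs across macOS versions, so the regexes below are assumptions):

```python
import re
import subprocess

LOG_PATH = "powermetrics.log"

# Start powermetrics in the background, one sample per second (-i is in milliseconds)
log_file = open(LOG_PATH, "w")
proc = subprocess.Popen(
    ["sudo", "powermetrics", "--samplers", "gpu_power,cpu_power", "-i", "1000"],
    stdout=log_file, stderr=subprocess.DEVNULL,
)

# ... run the benchmark workload ...

proc.terminate()
log_file.close()

# Parse the log after benchmark completion; "GPU Power: NNNN mW" style lines
# are an assumption about Apple Silicon output and may need adjusting.
text = open(LOG_PATH).read()
gpu_watts = [int(v) / 1000 for v in re.findall(r"GPU Power:\s*(\d+)\s*mW", text)]
cpu_watts = [int(v) / 1000 for v in re.findall(r"CPU Power:\s*(\d+)\s*mW", text)]
```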
### Percentile Statistics
For each metric, La Perf computes percentiles across all samples collected during the benchmark:
| Percentile | Description |
|---|---|
| P50 | Median value (50th percentile) |
| P95 | 95th percentile (high load indicator) |
Example: If 2673 power samples were collected over 2672 seconds:
- gpu_watts_p50 = 68.92W means 50% of samples were ≤68.92W
- gpu_watts_p95 = 71.47W means 95% of samples were ≤71.47W
### Metrics Breakdown

#### Power Consumption

- `gpu_watts_p50/p95`: GPU power draw (NVIDIA: from `nvidia-smi`, macOS: from `powermetrics`)
- `cpu_watts_p50/p95`: CPU power draw (macOS only, requires sudo)

#### GPU Resource Usage

- `gpu_vram_used_mb_p50/p95`: GPU memory used by the process
- `gpu_vram_total_mb_p50/p95`: Total GPU memory (should be constant)
- `gpu_util_percent_p50/p95`: GPU compute utilization (0-100%)
- `gpu_temp_celsius_p50/p95`: GPU temperature

#### System Resource Usage

- `cpu_util_percent_p50/p95`: CPU utilization across all cores
- `ram_used_gb_p50/p95`: System RAM used by the benchmark process
### Cross-Run Statistics
When running multiple benchmark iterations, power metrics are aggregated:

    # For each power metric, compute mean and std across runs
    mean(run1_p50, run2_p50, run3_p50) ± std(run1_p50, run2_p50, run3_p50)

    # Example with GPU power P50:
    # final_gpu_watts_p50 = mean([run1_p50, run2_p50, run3_p50])
    # final_gpu_watts_p50_std = std([run1_p50, run2_p50, run3_p50])

### Metadata
Each power metrics result includes:
- `samples_collected`: Total number of samples taken
- `monitoring_duration_seconds`: Total monitoring duration
This allows verifying that sampling worked correctly throughout the benchmark.
## Notes

- All timing values are wall-clock times measured via `time.perf_counter()`.
- Benchmarks are repeated at least 3 times to compute mean and standard deviation.
- All metrics are device-synchronized and exclude warmup runs.