
Metrics#

This section describes how La Perf calculates and evaluates metrics across different benchmark tasks.

Embeddings#

Overview#

Embedding benchmarks use the sentence-transformers library for encoding operations.

| Metric | Description | Unit |
| --- | --- | --- |
| E2E Latency | Total time to encode the full dataset | seconds |
| RPS | Rows Per Second (throughput) | rows/s |

Measurement Methodology#

The total encoding latency is measured around the .encode() call, which internally handles batching. Each run includes device synchronization before and after encoding to ensure accurate timing.

Implementation details:

  • Uses torch.cuda.synchronize() for NVIDIA GPUs
  • Uses torch.mps.synchronize() for Apple Silicon GPUs
  • Ensures complete device-side execution before measurement
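
A minimal sketch of this timing pattern (the model name, dataset, and batch size below are illustrative placeholders, not La Perf's actual configuration):

```python
import time

import torch
from sentence_transformers import SentenceTransformer


def sync(device: str) -> None:
    # Block until all queued GPU work has finished, so the timer captures
    # device-side execution rather than just kernel launches.
    if device == "cuda":
        torch.cuda.synchronize()
    elif device == "mps":
        torch.mps.synchronize()


device = "cuda"  # or "mps" on Apple Silicon
model = SentenceTransformer("all-MiniLM-L6-v2", device=device)  # placeholder model
rows = ["example sentence"] * 10_000                            # placeholder dataset

sync(device)
t0 = time.perf_counter()
model.encode(rows, batch_size=64)  # .encode() handles batching internally
sync(device)

e2e_latency = time.perf_counter() - t0  # seconds
rps = len(rows) / e2e_latency           # rows per second
```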

Cross-Run Statistics#

For multiple benchmark runs, simple mean and standard deviation are calculated:

mean(run1, run2, run3) ± std(run1, run2, run3)

# Example with RPS:
# final_mean_rps = mean([run1_rps, run2_rps, run3_rps])
# final_std_rps = std([run1_rps, run2_rps, run3_rps])
# In the results table: final_mean_rps ± final_std_rps
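
As a runnable illustration (the run values are hypothetical, and whether La Perf uses the population or sample standard deviation is not specified here):

```python
import numpy as np

run_rps = [412.7, 405.1, 418.3]   # hypothetical RPS values from three runs
final_mean_rps = np.mean(run_rps)
final_std_rps = np.std(run_rps)   # population std; the sample std is also common
print(f"{final_mean_rps:.1f} ± {final_std_rps:.1f} rows/s")
```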

Note: Embeddings use direct mean/std across runs, not percentile-based statistics.


LLMs & VLMs#

Overview#

| Metric | Description | Unit |
| --- | --- | --- |
| TTFT | Time To First Token — prompt processing latency | seconds |
| TG | Token Generation — time spent generating output | seconds |
| TPS | Tokens Per Second — generation throughput | tokens/s |
| E2E Latency | End-to-end request latency | seconds |

Measurement Methodology#

Streaming & Token Counting#

La Perf uses streaming APIs (Ollama, LM Studio via OpenAI SDK) to measure both latency and throughput.

Critical distinction: API chunks ≠ tokens

The server sends responses in chunks, but each chunk may contain multiple tokens. Token counts are obtained from server-side usage statistics.

Per-Request Measurements#

For each prompt in the benchmark:

| Timestamp | Description |
| --- | --- |
| t0_stream | Request start time |
| first_token_ts | First chunk received (≈ first token) |
| t1_stream | Response complete |

| Token Count | Source |
| --- | --- |
| input_tokens | From server usage stats |
| output_tokens | From server usage stats |

Metric Calculations#

| Metric | Formula | Notes |
| --- | --- | --- |
| E2E Latency | t1_stream - t0_stream | Total request time |
| TTFT | first_token_ts - t0_stream | Prompt processing time |
| TG | t1_stream - first_token_ts | Generation phase time |
| TPS | output_tokens / E2E Latency | Client-side throughput metric |
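
Roughly, the per-request measurement looks like the sketch below, written against an OpenAI-compatible streaming API; the endpoint URL, model name, and the stream_options usage flag are assumptions and may differ from La Perf's actual client code:

```python
import time

from openai import OpenAI

# Hypothetical local OpenAI-compatible endpoint (e.g. Ollama or LM Studio).
client = OpenAI(base_url="http://localhost:11434/v1", api_key="none")

t0_stream = time.perf_counter()
first_token_ts = None
input_tokens = output_tokens = 0

stream = client.chat.completions.create(
    model="llama3.1:8b",                     # placeholder model name
    messages=[{"role": "user", "content": "Explain TTFT in one sentence."}],
    stream=True,
    stream_options={"include_usage": True},  # ask the server to report token usage
)

for chunk in stream:
    if first_token_ts is None and chunk.choices and chunk.choices[0].delta.content:
        first_token_ts = time.perf_counter()  # first chunk ≈ first token
    if chunk.usage is not None:               # final chunk carries usage statistics
        input_tokens = chunk.usage.prompt_tokens
        output_tokens = chunk.usage.completion_tokens

t1_stream = time.perf_counter()

e2e_latency = t1_stream - t0_stream
ttft = first_token_ts - t0_stream
tg = t1_stream - first_token_ts
tps = output_tokens / e2e_latency  # client-side throughput
```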

Why TPS = output_tokens / E2E Latency?#

Incorrect approach:

TPS = output_tokens / TG  # ❌ WRONG
# Example: 38 tokens / 0.0007s = 54 285.714 tokens/sec

Fifty-four thousand tokens per second? Goodbye H100, my local PC just destroyed you!

Yeah, no. This calculation is hilariously wrong.

This vastly overestimates performance because TG measures only the time between the first and last received chunks, not the actual token generation time; when a response arrives in just a couple of chunks, that interval shrinks to almost nothing.

Correct approach:

TPS = output_tokens / E2E Latency  # ✅ CORRECT
# Example: 38 tokens / 0.6668s = 56.988 tokens/sec

This reflects real-world throughput from the client perspective.

Limitation: For very short outputs (1-2 chunks), TG may not accurately represent generation time. Server-side metrics would be more precise but are not currently collected.


Per-Metric Percentiles#

For each metric across all requests, La Perf computes:

| Percentile | Description |
| --- | --- |
| P25 | 25th percentile |
| P50 | Median |
| P75 | 75th percentile |
| P95 | 95th percentile |

Cross-Run Statistics#

For multiple benchmark runs, statistics are calculated from percentile values across runs:

mean(run1_percentile, run2_percentile, run3_percentile) ± std(run1_percentile, run2_percentile, run3_percentile)

# Example with P50 TPS:
# final_p50_tps = mean([run1_p50_tps, run2_p50_tps, run3_p50_tps])
# final_p50_tps_std = std([run1_p50_tps, run2_p50_tps, run3_p50_tps])
# In the results table: final_p50_tps ± final_p50_tps_std

Note: LLM/VLM benchmarks compute percentiles per run first, then aggregate across runs. This differs from Embeddings, which use direct mean/std.
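
A small sketch of this two-step aggregation (the per-request TPS values below are hypothetical):

```python
import numpy as np

# Hypothetical per-request TPS values from three runs of the same benchmark.
runs_tps = [
    [54.1, 56.3, 57.0, 55.2],  # run 1
    [53.8, 55.9, 56.4, 54.7],  # run 2
    [54.5, 56.0, 57.3, 55.5],  # run 3
]

# Step 1: compute the percentile within each run.
p50_per_run = [np.percentile(run, 50) for run in runs_tps]

# Step 2: aggregate the per-run percentiles across runs.
final_p50_tps = np.mean(p50_per_run)
final_p50_tps_std = np.std(p50_per_run)
print(f"P50 TPS: {final_p50_tps:.2f} ± {final_p50_tps_std:.2f} tokens/s")
```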

These aggregated values appear in the results tables.


Power Metrics#

Overview#

La Perf monitors system resource usage and power consumption during benchmarks. Power metrics are collected continuously throughout benchmark execution.

| Metric | Description | Unit |
| --- | --- | --- |
| GPU Power | GPU power consumption | watts |
| CPU Power | CPU power consumption (macOS only) | watts |
| GPU VRAM Used | GPU memory used | MB |
| GPU VRAM Total | Total GPU memory available | MB |
| GPU Utilization | GPU compute utilization | % |
| GPU Temperature | GPU temperature | °C |
| CPU Utilization | CPU utilization across all cores | % |
| RAM Used | System RAM used by process | GB |

Measurement Methodology#

Data Collection#

Power metrics are sampled continuously during benchmark execution:

  • Sampling interval: 1 second (configurable)
  • Background monitoring: runs in parallel with the benchmark workload
  • Platform-specific tools:
    • NVIDIA GPUs: nvidia-smi for power, utilization, memory, temperature
    • macOS (Apple Silicon): sudo powermetrics for GPU/CPU power + ioreg for GPU stats
    • CPU/RAM: psutil library for cross-platform monitoring
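
A rough sketch of a single sampling iteration on an NVIDIA system (the field names and loop structure are illustrative, not La Perf's exact implementation):

```python
import subprocess
import time

import psutil


def sample_once() -> dict:
    # GPU stats via nvidia-smi; the queried fields mirror the metrics above.
    fields = "power.draw,memory.used,memory.total,utilization.gpu,temperature.gpu"
    raw = subprocess.run(
        ["nvidia-smi", f"--query-gpu={fields}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout.strip().split(", ")
    gpu_watts, vram_used_mb, vram_total_mb, gpu_util, gpu_temp = map(float, raw)

    return {
        "gpu_watts": gpu_watts,
        "gpu_vram_used_mb": vram_used_mb,
        "gpu_vram_total_mb": vram_total_mb,
        "gpu_util_percent": gpu_util,
        "gpu_temp_celsius": gpu_temp,
        # Cross-platform CPU/RAM readings via psutil.
        "cpu_util_percent": psutil.cpu_percent(interval=None),
        "ram_used_gb": psutil.Process().memory_info().rss / 1024**3,
    }


samples = []
for _ in range(5):    # in La Perf this runs in the background for the whole benchmark
    samples.append(sample_once())
    time.sleep(1)     # 1 second sampling interval
```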

macOS powermetrics#

On macOS, when the --use-sudo-powermetrics flag is enabled, the benchmark collects:

  • GPU Power: via powermetrics --samplers gpu_power
  • CPU Power: via powermetrics --samplers cpu_power

The powermetrics process runs in the background and outputs to a log file, which is parsed after benchmark completion.
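
For example, launching the collector in the background might look like this (the exact flags and log path are assumptions, not necessarily what La Perf uses):

```python
import subprocess

# macOS only, requires sudo. Samples GPU and CPU power once per second and
# writes the output to a log file that is parsed after the benchmark completes.
proc = subprocess.Popen(
    ["sudo", "powermetrics",
     "--samplers", "gpu_power,cpu_power",
     "-i", "1000",                           # sampling interval in milliseconds
     "-o", "/tmp/laperf_powermetrics.log"],  # hypothetical log path
)

# ... run the benchmark workload ...

proc.terminate()
```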

Percentile Statistics#

For each metric, La Perf computes percentiles across all samples collected during the benchmark:

| Percentile | Description |
| --- | --- |
| P50 | Median value (50th percentile) |
| P95 | 95th percentile (high load indicator) |

Example: if 2673 power samples were collected over 2672 seconds:

  • gpu_watts_p50 = 68.92W means 50% of samples were ≤ 68.92W
  • gpu_watts_p95 = 71.47W means 95% of samples were ≤ 71.47W

Metrics Breakdown#

Power Consumption#

  • gpu_watts_p50/p95: GPU power draw (NVIDIA: from nvidia-smi, macOS: from powermetrics)
  • cpu_watts_p50/p95: CPU power draw (macOS only, requires sudo)

GPU Resource Usage#

  • gpu_vram_used_mb_p50/p95: GPU memory used by the process
  • gpu_vram_total_mb_p50/p95: Total GPU memory (should be constant)
  • gpu_util_percent_p50/p95: GPU compute utilization (0-100%)
  • gpu_temp_celsius_p50/p95: GPU temperature

System Resource Usage#

  • cpu_util_percent_p50/p95: CPU utilization across all cores
  • ram_used_gb_p50/p95: System RAM used by the benchmark process

Cross-Run Statistics#

When running multiple benchmark iterations, power metrics are aggregated:

# For each power metric, compute mean and std across runs
mean(run1_p50, run2_p50, run3_p50) ± std(run1_p50, run2_p50, run3_p50)

# Example with GPU power P50:
# final_gpu_watts_p50 = mean([run1_p50, run2_p50, run3_p50])
# final_gpu_watts_p50_std = std([run1_p50, run2_p50, run3_p50])

Metadata#

Each power metrics result includes:

  • samples_collected: total number of samples taken
  • monitoring_duration_seconds: total monitoring duration

This allows verifying that sampling worked correctly throughout the benchmark.


Notes#

  • All timing values are wall-clock times measured via time.perf_counter().
  • Benchmarks are repeated at least 3 times to compute mean and standard deviation.
  • All metrics are device-synchronized and exclude warmup runs.