▶ code ▼ output ▶ uv-logs | Cell: nvidia_dump | deps: torch | 33.33s | Raw
NVIDIA GPU Information: Mon Sep 15 16:41:01 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.05              Driver Version: 560.35.05      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA L4                      Off |   00000000:38:00.0 Off |                    0 |
| N/A   46C    P0             28W /   72W |       1MiB /  23034MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA L4                      Off |   00000000:3A:00.0 Off |                    0 |
| N/A   46C    P0             28W /   72W |       1MiB /  23034MiB |      2%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA L4                      Off |   00000000:3C:00.0 Off |                    0 |
| N/A   49C    P0             31W /   72W |       1MiB /  23034MiB |      2%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA L4                      Off |   00000000:3E:00.0 Off |                    0 |
| N/A   48C    P0             29W /   72W |       1MiB /  23034MiB |      2%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
▶ code ▼ output ▶ uv-logs | Cell: utils | deps: torch, numpy | 31.77s | Raw
▶ code ▼ output ▶ uv-logs | Cell: config | deps: torch, numpy | 37.88s | Raw
Configuration:
  Experts: 128
  Hidden size: 1152
  Top-k: 4
  Batch size: 8
  Sequence length: 512
  Device: cuda
  Dtype: bfloat16
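Top-k routing in the configuration above means each of the 8 × 512 = 4096 tokens is scored against all 128 experts and dispatched to the 4 highest-scoring ones. A minimal sketch of that routing step (random weights for illustration; the variable names are illustrative, not the benchmark's actual code):

```python
import torch

# Mirrors the benchmark configuration above.
NUM_EXPERTS = 128
HIDDEN_SIZE = 1152
TOP_K = 4
BATCH_SIZE, SEQ_LEN = 8, 512

torch.manual_seed(0)
hidden = torch.randn(BATCH_SIZE * SEQ_LEN, HIDDEN_SIZE)
router_weight = torch.randn(NUM_EXPERTS, HIDDEN_SIZE) * 0.02

# Score every token against every expert, keep the top-k experts per token,
# and renormalize their scores so each token's routing weights sum to 1.
logits = hidden @ router_weight.t()                      # (tokens, experts)
topk_vals, topk_idx = torch.topk(logits, TOP_K, dim=-1)  # (tokens, TOP_K)
weights = torch.softmax(topk_vals, dim=-1)

print(topk_idx.shape)  # torch.Size([4096, 4])
```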
▼ code ▼ output ▶ uv-logs | Cell: save_data | deps: torch, numpy | 44.56s | Raw
"""Generate and save shared weights for consistent comparison."""
import torch
import numpy as np
from pathlib import Path

# Model configuration
NUM_EXPERTS = 128
HIDDEN_SIZE = 1152
INTERMEDIATE_SIZE = 3072
TOP_K = 4

# Input configuration
BATCH_SIZE = 1
SEQ_LEN = 100
DTYPE = "float32"
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

# Seeds for reproducibility
WEIGHT_SEED = 999
EXPERT_SEED = 777
INPUT_SEED = 123
GENERAL_SEED = 42

def set_seed(seed: int):
    """Set seeds for reproducibility."""
    torch.manual_seed(seed)
    np.random.seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)

# Generate shared weights for all implementations
print("Generating shared weights...")

# Router weights
set_seed(WEIGHT_SEED)
router_weight = torch.empty(NUM_EXPERTS, HIDDEN_SIZE)
torch.nn.init.kaiming_uniform_(router_weight)
router_bias = torch.zeros(NUM_EXPERTS)

# Expert weights - combined gate/up projection, 2 * HIDDEN_SIZE wide (INTERMEDIATE_SIZE is unused here)
set_seed(EXPERT_SEED)
gate_up_proj = torch.empty(NUM_EXPERTS, HIDDEN_SIZE, 2 * HIDDEN_SIZE).normal_(mean=0.0, std=0.02)
gate_up_proj_bias = torch.zeros(NUM_EXPERTS, 2 * HIDDEN_SIZE)
down_proj = torch.empty(NUM_EXPERTS, HIDDEN_SIZE, HIDDEN_SIZE).normal_(mean=0.0, std=0.02)
down_proj_bias = torch.zeros(NUM_EXPERTS, HIDDEN_SIZE)

# Save weights
torch.save(router_weight, 'router_weight.pt')
torch.save(router_bias, 'router_bias.pt')
torch.save(gate_up_proj, 'gate_up_proj.pt')
torch.save(gate_up_proj_bias, 'gate_up_proj_bias.pt')
torch.save(down_proj, 'down_proj.pt')
torch.save(down_proj_bias, 'down_proj_bias.pt')

print("Saved weights:")
print(f"  Router: {tuple(router_weight.shape)}")
print(f"  Gate/Up proj: {tuple(gate_up_proj.shape)}")
print(f"  Down proj: {tuple(down_proj.shape)}")
print(f"  Hidden size: {HIDDEN_SIZE}")
Generating shared weights...
Saved weights:
  Router: (128, 1152)
  Gate/Up proj: (128, 1152, 2304)
  Down proj: (128, 1152, 1152)
  Hidden size: 1152
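The `.pt` files above can be reloaded with `torch.load` in a later cell, and because the generator is reseeded before each weight group, regenerating with the same seed reproduces the tensors exactly. A small sketch verifying the round trip (seed and shape taken from the router weight in the cell above):

```python
import torch

# Regenerate the router weight with the same seed the save_data cell uses,
# then round-trip it through torch.save/torch.load.
torch.manual_seed(999)
router_weight = torch.empty(128, 1152)
torch.nn.init.kaiming_uniform_(router_weight)

torch.save(router_weight, "router_weight.pt")
reloaded = torch.load("router_weight.pt")
print(torch.equal(router_weight, reloaded))  # True: save/load is lossless
```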

GPT-OSS Implementation

This section benchmarks the GPT-OSS MoE implementation in non-training mode.

▶ code ▼ output ▶ uv-logs | Cell: gptoss_run | deps: torch, numpy | 43.29s | Raw
Configuration:
  Experts: 128
  Hidden size: 1152
  Top-k: 4
  Batch size: 8
  Sequence length: 512
  Device: cuda
  Dtype: bfloat16
Loading weights from: /home/ubuntu/Projects/uvnote-megablocks-bench/.uvnote/cache/f8d80463181591394d703c9cd286c7929b4a261ab3157d791f92a5933e5a011e
Files in directory: down_proj.pt, down_proj_bias.pt, stderr.txt, gate_up_proj.pt, gate_up_proj_bias.pt, result.json, stdout.txt, router_weight.pt, router_bias.pt
Loaded shared weights from artifacts
Router weight sum: 12.588732
Gate/up sum: 1026.601807
Down sum: 206.729263

=== GPT-OSS Implementation ===
Router weight sum: 12.562500
Gate/up proj sum: 1024.000000
Down proj sum: 207.000000
Average time: 62.308 ms
Throughput: 65737 tokens/sec
Memory allocated: 1.330 GB
Memory increase: 0.380 GB
Output sum: -4.968750

Artifacts:

gptoss_results.json
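The reported throughput is just the per-pass token count divided by the average latency; a quick sanity check of the GPT-OSS numbers above:

```python
# 8 sequences x 512 tokens are processed per forward pass.
tokens = 8 * 512                   # 4096
avg_time_ms = 62.308485            # GPT-OSS average latency from the results
throughput = tokens / (avg_time_ms / 1000.0)
print(f"{throughput:.0f} tokens/sec")  # 65737 tokens/sec, matching the report
```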

MegaBlocks Implementation

This section benchmarks the MegaBlocks MoE implementation.

▶ code ▼ output ▶ uv-logs | Cell: megablocks_run | deps: torch, numpy, kernels | 49.81s | Raw
Configuration:
  Experts: 128
  Hidden size: 1152
  Top-k: 4
  Batch size: 8
  Sequence length: 512
  Device: cuda
  Dtype: bfloat16
Loading weights from: /home/ubuntu/Projects/uvnote-megablocks-bench/.uvnote/cache/f8d80463181591394d703c9cd286c7929b4a261ab3157d791f92a5933e5a011e
Loaded shared weights from artifacts
Router weight sum: 12.588732
Gate/up sum: 1026.601807
Down sum: 206.729263

=== MegaBlocks Implementation ===
[MegaBlocks] Router weight sum: 12.562500
[MegaBlocks] Gate/up projection shape: (128, 1152, 2304), sum: 1024.000000
[MegaBlocks] Down projection shape: (128, 1152, 1152), sum: 207.000000
Average time: 26.933 ms
Throughput: 152084 tokens/sec
Memory allocated: 2.243 GB
Memory increase: 1.292 GB
Output sum: -4.968750

Performance Comparison

This section loads the benchmark results and creates visualizations comparing the two implementations.

▶ code ▼ output ▶ uv-logs | Cell: visualization | deps: matplotlib | 3.33s | Raw
Loading benchmark results from:
  GPT-OSS dir: /home/ubuntu/Projects/uvnote-megablocks-bench/.uvnote/cache/fc17d5998a27217e1676a638ddeceb18cab662c6e9b30c9a62218784604c9a26
  MegaBlocks dir: /home/ubuntu/Projects/uvnote-megablocks-bench/.uvnote/cache/6e3545a8e3c2ca65ca800a7e1c1824fded11e28258efcd83355514bb0646e166
Loading results from:
  GPT-OSS: /home/ubuntu/Projects/uvnote-megablocks-bench/.uvnote/cache/fc17d5998a27217e1676a638ddeceb18cab662c6e9b30c9a62218784604c9a26/gptoss_results.json
  MegaBlocks: /home/ubuntu/Projects/uvnote-megablocks-bench/.uvnote/cache/6e3545a8e3c2ca65ca800a7e1c1824fded11e28258efcd83355514bb0646e166/megablocks_results.json
GPT-OSS results keys: ['avg_time_ms', 'throughput_tokens_per_sec', 'memory_allocated_gb', 'memory_cached_gb', 'memory_increase_gb', 'device', 'dtype', 'tokens', 'warmup_iters', 'timing_iters']
MegaBlocks results keys: ['avg_time_ms', 'throughput_tokens_per_sec', 'memory_allocated_gb', 'memory_cached_gb', 'memory_increase_gb', 'device', 'dtype', 'tokens', 'warmup_iters', 'timing_iters']
Extracted metrics:
  Times (ms): [62.308485079556704, 26.93254135781899]
  Throughputs: [65737.43519474348, 152083.67994618745]
  Memory usage (GB): [1.329831600189209, 2.2425241470336914]
  Memory increase (GB): [0.3795137405395508, 1.2922062873840332]

============================================================
PERFORMANCE COMPARISON SUMMARY
============================================================
Metric                    GPT-OSS      MegaBlocks   Winner
------------------------------------------------------------
Latency (ms)              62.31        26.93        MegaBlocks
Throughput (tok/s)        65737        152084       MegaBlocks
Memory Usage (GB)         1.330        2.243        GPT-OSS
Memory Increase (GB)      0.380        1.292        GPT-OSS

MegaBlocks is 2.31x faster
MegaBlocks has 2.31x higher throughput
============================================================
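The bar-chart comparison can be reproduced with a few lines of matplotlib (metric values copied from the summary above; the output file name is illustrative):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend, renders straight to file
import matplotlib.pyplot as plt

impls = ["GPT-OSS", "MegaBlocks"]
latency_ms = [62.31, 26.93]
throughput = [65737, 152084]

# Two side-by-side bar charts: latency (lower is better), throughput (higher is better).
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.bar(impls, latency_ms, color=["tab:blue", "tab:orange"])
ax1.set_ylabel("Latency (ms)")
ax2.bar(impls, throughput, color=["tab:blue", "tab:orange"])
ax2.set_ylabel("Throughput (tokens/sec)")
fig.tight_layout()
fig.savefig("comparison.png")
```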

Conclusion

This focused benchmark compares the GPT-OSS (non-training mode) and MegaBlocks MoE implementations on the same hardware with identical weights and inputs. The comparison focuses on:

  1. Latency: Average forward pass time
  2. Throughput: Tokens processed per second
  3. Memory Usage: GPU memory consumption
  4. Memory Efficiency: Memory increase during execution

Both implementations use:
- 128 experts with top-4 routing
- 1152 hidden dimensions
- Batch size of 8, sequence length of 512
- bfloat16 precision
- Identical pre-generated weights for fair comparison
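The latency figures above come from a warmup-then-time loop. A minimal sketch of such a harness (the common CUDA benchmarking pattern, not the benchmark's exact code): it runs warmup iterations first, then synchronizes the CUDA stream around the timed region so asynchronously queued kernels are fully counted.

```python
import time
import torch

def benchmark(fn, warmup=10, iters=50):
    """Return the average wall-clock time of fn() in milliseconds."""
    for _ in range(warmup):       # warmup: amortize compilation/caching effects
        fn()
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # drain queued kernels before starting the clock
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # wait for the last kernel to finish
    return (time.perf_counter() - start) / iters * 1000.0

x = torch.randn(512, 512)
ms = benchmark(lambda: x @ x)
print(f"{ms:.3f} ms")
```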

The results show a clear trade-off: MegaBlocks is roughly 2.3x faster in both latency and throughput, but allocates about 0.9 GB more GPU memory and has a ~3.4x larger memory increase during execution. MegaBlocks is the better choice when speed matters and memory headroom is available; the GPT-OSS implementation is preferable under tight memory budgets.