Eclipse CDT approach (recommended)
Install Eclipse CDT + CDT GDB Hardware Debugging plugin. Create a C project, set cross-compiler prefix to riscv64-unknown-elf-. In Debug Configurations → GDB Hardware Debugging: GDB command = riscv64-unknown-elf-gdb, target = localhost:1234.
QEMU side (debug_qemu.sh)
```sh
qemu-system-riscv64 -machine virt -nographic \
    -bios fw_jump.elf -kernel payload.elf -S -s
```
-S = pause at start. -s = GDB server on port 1234. Connect Eclipse CDT GDB and step through OpenSBI + payload source simultaneously.
Initial GDB commands
```
set arch riscv:rv64
add-symbol-file fw_jump.elf 0x80000000
hbreak _fw_start
hbreak sbi_init
hbreak *0x80200000
continue
```
-mcmodel=medany is a GCC compiler flag only — not valid for the assembler (as). Remove it from ASFLAGS, keep only in CFLAGS.
```make
# Wrong:
ASFLAGS = -g $(ARCH) -mcmodel=medany -mno-relax
# Fixed:
ASFLAGS = -g $(ARCH)
```
riscv64-unknown-elf- (bare-metal toolchain) does not support PIE. OpenSBI uses Linux build conventions including PIE. Fix: use the Linux-target toolchain.
```sh
sudo apt install gcc-riscv64-linux-gnu binutils-riscv64-linux-gnu
make CROSS_COMPILE=riscv64-linux-gnu- PLATFORM=generic \
    FW_JUMP=y FW_JUMP_ADDR=0x80200000 FW_TEXT_START=0x80000000
```
| Toolchain | Target | PIE | Use for |
|---|---|---|---|
| riscv64-unknown-elf- | Bare metal | No | Your payload, zsbl.S |
| riscv64-linux-gnu- | Linux ABI | Yes | OpenSBI, Linux kernel |
Called the MROM (Mask ROM) — a small read-only region QEMU emulates at 0x1000 (not 0x0). Contains ~8 instructions: a few NOPs, then an auipc+jalr that jumps to 0x80000000 (where OpenSBI/BIOS is loaded). Also contains the DTB pointer in a register.
This is NOT a real-world zero-state bootloader — it is QEMU's simplified substitute for the chip-level boot ROM that real silicon has (e.g. SiFive's ZSBL which validates and decrypts firmware). On real hardware the boot ROM is vendor-specific, typically 128KB–4MB of mask-programmed or eFuse-configured code.
Repository: https://github.com/vwire/riscv-bare-metal-qemu
Documentation: https://vwire.github.io/riscv-bare-metal-qemu
Structure: docs/ folder with Jekyll or plain HTML. Each project has its own subdirectory with README.md, source files, and a blog page linked from the root index. GitHub Pages serves the docs/ folder automatically when enabled in repo Settings → Pages.
The Zephyr SDK includes its own QEMU binary at a non-standard path. Find it with:
find / -name "qemu-system-riscv64" 2>/dev/null
Returns: /home/vikram/zephyr-sdk-0.17.0/sysroots/x86_64-pokysdk-linux/usr/bin/qemu-system-riscv64. Set this full path in the Makefile's QEMU variable rather than relying on PATH.
fw_jump.S
First code at 0x80000000. Contains the jump target address (0x80200000) as a constant. Calls into fw_base.S after minimal register setup.
fw_base.S — Key Functions
_reset_regs: fence.i + zero all 32 registers. _start: amoswap lottery → one hart wins coldboot. _relocate: GOT relocation loop (adjusts global pointers if loaded at non-link address). _scratch_init: fills per-hart scratch structs for ALL harts. fence rw,rw → _boot_status=1 to release warmboot harts. _start_warm: all harts compute tp, csrw MSCRATCH=tp, call sbi_init().
sbi_init.c
Coldboot: sbi_scratch_init → sbi_heap_init → sbi_domain_init → sbi_hsm_init → sbi_hart_init → sbi_ecall_init → sbi_boot_print_banner. Warmboot: subset of init. Both paths end at sbi_hsm_hart_start_finish() → mret to payload.
sbi_hart.c
sbi_hart_init(): sets MIDELEG (0x1666) / MEDELEG (0xF0B509), calls mstatus_init() which programs MENVCFG for Zicbom/Zicboz/Sstc/Svpbmt. sbi_hart_switch_mode(): MEPC=payload, MSTATUS.MPP=S, mret.
sbi_trap.c
M-mode trap handler. Routes ecalls by (a7=ext, a6=fid) to registered handlers. Forwards unhandled exceptions to S-mode via sbi_trap_redirect(). Handles timer (inject STIP), IPI (set SSIP on target).
A div instruction takes ~20–40 cycles in the hardware divider (on RISC-V, divide-by-zero does not trap — it simply returns all-ones). Used as a multi-cycle delay NOP with zero memory bus transactions.
Without delay: Warmboot harts hammer _boot_status at full pipeline speed — millions of loads/sec on the same cache line, heavy MESI coherency traffic, slowing the boot hart that needs to write all four scratch spaces.
With 3× div: ~60–120 cycles of delay between each load. Bus mostly idle. Boot hart completes scratch init without contention. A standard pattern in multi-hart firmware init.
GOT = Global Offset Table. Compiler stores global variable addresses in a table rather than hardcoding them. If the binary loads at a different address than it was linked at, GOT entries need to be adjusted.
load_offset = actual_load_address − link_address (= 0 on QEMU)
for each Elf64_Rela entry (type R_RISCV_RELATIVE):
*( r_offset + load_offset ) = r_addend + load_offset
On QEMU load_offset=0 so the loop runs but changes nothing. On real hardware that loads OpenSBI at a different address, this loop fixes all global variable pointers to their correct runtime locations. The loop skips entries that are not R_RISCV_RELATIVE (type 3).
That example was hypothetical to illustrate the concept. On QEMU: load_offset = actual(0x80000000) − link(0x80000000) = 0. The example used a hypothetical platform loaded at 0x80100000 while linked at 0x80000000, giving offset=0x00100000. The r_addend in that case was also a link-time address (0x80012340) so adding the offset (0x00100000) correctly gives the runtime address (0x80112340). On QEMU none of this applies.
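The relocation loop can be modeled on the host. A sketch, assuming a flat `image[]` buffer stands in for the loaded firmware image and the standard `Elf64_Rela` layout; `apply_relocations` is a hypothetical name, not OpenSBI's:

```c
#include <stdint.h>
#include <stddef.h>
#include <assert.h>

#define R_RISCV_RELATIVE 3

typedef struct {
    uint64_t r_offset;   /* link-time address of the slot to patch */
    uint64_t r_info;     /* low 32 bits: relocation type */
    int64_t  r_addend;   /* link-time value to store */
} Elf64_Rela;

static void apply_relocations(uint8_t *image, uint64_t link_base,
                              uint64_t load_base,
                              const Elf64_Rela *rela, size_t count)
{
    int64_t load_offset = (int64_t)(load_base - link_base);
    for (size_t i = 0; i < count; i++) {
        if ((rela[i].r_info & 0xffffffffu) != R_RISCV_RELATIVE)
            continue;  /* skip anything that is not R_RISCV_RELATIVE */
        /* r_offset is a link-time address; translate it into image[] */
        uint64_t *slot = (uint64_t *)(image + (rela[i].r_offset - link_base));
        *slot = (uint64_t)(rela[i].r_addend + load_offset);
    }
}
```

With `load_base == link_base` the loop runs but writes back the unchanged addend, matching the QEMU case.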
Uses the Zkr (Entropy Source) extension. The CSR_SEED register delivers 16-bit hardware random entropy when status bits = ES16. The loop collects bits until a full pointer-width value is assembled.
```
csrrw t1, CSR_SEED, x0        ; read hardware RNG
check status == ES16          ; wait for valid entropy
accumulate into t0            ; shift and OR 16 bits at a time
store to __stack_chk_guard    ; GCC's stack canary variable
```
GCC places this canary between local variables and return address. On overflow detection, execution aborts. Hardware entropy makes it unguessable.
Graceful fallback: csrw MTVEC, __stack_chk_guard_done first — if Zkr absent, csrrw CSR_SEED traps (illegal instruction) and CPU jumps directly to done label, skipping the loop.
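A host-side sketch of the accumulation loop only. `read_seed16()` is a demo stub (a fixed LCG, NOT real entropy) standing in for the `csrrw t1, CSR_SEED, x0` read; the ES16 status check and the MTVEC fallback trap are omitted:

```c
#include <stdint.h>
#include <assert.h>

/* Stand-in for the seed CSR read. A real implementation reads the Zkr
   seed CSR and retries until status == ES16. */
static uint16_t read_seed16(void)
{
    static uint32_t s = 0x1234u;
    s = s * 1103515245u + 12345u;   /* NOT entropy: deterministic demo stub */
    return (uint16_t)(s >> 8);
}

/* Accumulate 16 bits at a time until a full pointer-width value exists. */
static uint64_t collect_stack_guard(void)
{
    uint64_t guard = 0;
    for (int i = 0; i < 4; i++)          /* 4 x 16 bits = 64-bit guard */
        guard = (guard << 16) | read_seed16();
    return guard;                        /* would be stored to __stack_chk_guard */
}
```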
```
_fw_end (0x80020000)   heap (grows up toward higher addresses)
┌─────────────────────────────────┐
│ hart 0 stack    8 KB            │ ← sp for hart 0 (grows down)
│ hart 0 scratch  4 KB            │ ← MSCRATCH for hart 0 (struct sbi_scratch)
├─────────────────────────────────┤
│ hart 1 stack    8 KB            │
│ hart 1 scratch  4 KB            │
├─────────────────────────────────┤
│ hart 2 stack    8 KB            │
│ hart 2 scratch  4 KB            │
├─────────────────────────────────┤
│ hart 3 stack    8 KB            │
│ hart 3 scratch  4 KB            │ ← lowest address block
└─────────────────────────────────┘
```
Boot hart fills ALL four scratch structs during _scratch_init before releasing warmboot harts. Each scratch contains: fw_start, fw_size, next_addr (0x80200000), next_mode (S), platform pointer, dynamic extension fields.
Three steps: _fw_end + hart_count × stack_size = jump past ALL stack blocks to the very top. − stack_size × hart_index = step back down to the TOP of this hart's specific block. − SBI_SCRATCH_SIZE = the scratch struct sits at the top of the block, so subtract to get its base address.
Result: tp = base address of this hart's struct sbi_scratch. Hart 0 gets the highest block, hart N gets progressively lower. Each hart then sets MSCRATCH = tp.
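The three steps written out as arithmetic. A sketch: the 8 KB/4 KB constants and the function name are assumptions matching the layout above, with the scratch struct occupying the top `SBI_SCRATCH_SIZE` bytes of each hart's block:

```c
#include <stdint.h>
#include <assert.h>

#define STACK_SIZE       0x2000u   /* 8 KB per-hart block (assumed) */
#define SBI_SCRATCH_SIZE 0x1000u   /* 4 KB scratch at the top of the block */

/* Mirrors _start_warm's tp computation for a given compact hart index. */
static uint64_t scratch_addr(uint64_t fw_end, uint32_t hart_count,
                             uint32_t hart_index)
{
    /* step 1: jump past ALL stack blocks to the very top */
    uint64_t top = fw_end + (uint64_t)hart_count * STACK_SIZE;
    /* step 2: step back down to the TOP of this hart's block */
    uint64_t block_top = top - (uint64_t)hart_index * STACK_SIZE;
    /* step 3: the scratch struct sits at the top of the block */
    return block_top - SBI_SCRATCH_SIZE;
}
```

Hart 0 lands at the highest block, hart 3 at the lowest, matching the diagram.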
Hardware MHARTID values may not be contiguous or start at 0 — a system could have harts 0, 2, 4, 6. OpenSBI needs a compact index (0, 1, 2, 3) for array subscripting. hart_index2id[] maps compact index → hardware hartid. The search finds which position matches our MHARTID, returning the compact index used for all array accesses including the scratch/stack address calculation.
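The lookup is a plain linear search. A minimal sketch of the mapping described above (function name is hypothetical):

```c
#include <stdint.h>
#include <assert.h>

/* Map a hardware MHARTID to the compact index OpenSBI uses for all
   array subscripting (scratch/stack address calculation included). */
static int hartid_to_index(const uint32_t *hart_index2id, int count,
                           uint32_t mhartid)
{
    for (int i = 0; i < count; i++)
        if (hart_index2id[i] == mhartid)
            return i;           /* compact index 0..count-1 */
    return -1;                  /* hartid not present on this platform */
}
```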
sbi_hart_pmp_init() walks the domain's memory region list and programs PMP entries in order (highest-priority first). For each region: converts region flags (SBI_DOMAIN_MEMREGION_M_*) to PMP permission bits (PMP_R/W/X), converts address + size to NAPOT encoding, calls pmp_set(index, prot, addr, log2_size) which writes pmpaddr CSR then pmpcfg CSR.
On QEMU the result is 7 entries: 0=firmware R/W data DENY, 1=firmware code RX, 2=test device RW, 3=CLINT DENY, 4=PLIC S-mode RW, 5=PLIC source RW, 6=all memory S/U RWX (catch-all).
Memory Protection (PMP)
Up to 64 PMP entries (the exact count is implementation-defined). Each = one physical address range + R/W/X permissions for S/U-mode. Entry 0 = highest priority. Default = deny-all for S/U if no entry matches. L=1 applies entry to M-mode too. Three modes: TOR (arbitrary range), NA4 (4-byte), NAPOT (power-of-2).
Interrupt Handling
Two trap-handling privilege levels, each with its own vector CSR: M (MTVEC), S (STVEC). MIDELEG/MEDELEG CSRs delegate specific causes to S-mode. scause: bit63=interrupt flag, bits 62:0=cause code. Timer=5, Software/IPI=1, External=9(S)/11(M). MSTATUS/SSTATUS global enable = master switch.
Timer (CLINT)
Shared mtime counter at 10 MHz (QEMU). Per-hart mtimecmp[N]: when mtime≥mtimecmp → MTIP fires on hart N. S-mode uses SBI_EXT_TIME ecall to set timer. Sstc extension adds direct stimecmp CSR access from S-mode.
Fence / Memory Ordering
RVWMO: stores may be seen out-of-order. fence rw,rw: full barrier. fence.i: D→I cache sync (local hart only). sfence.vma: TLB flush (local hart only). For cross-hart: IPI + fence on target hart required.
Cache
L1-I and L1-D separate (Harvard at L1). D-cache coherent between harts via MESI. I-cache NOT coherent with D-cache. Zicbom: cbo.clean/flush/inval. Zicboz: cbo.zero. MENVCFG gates S-mode CBO access.
All reads AND writes before this instruction are globally visible to ALL harts BEFORE any read or write after it begins.
Drains the store buffer. Forces all pending MESI coherency transactions to complete. No reordering can cross this barrier in either direction.
```
// writing hart:
store data        ← write data
fence rw, rw      ← drain → all prior writes visible everywhere
store flag = 1    ← guaranteed AFTER data is visible
// reading hart:
load flag         ← sees 1
fence r, r        ← flag load before data load
load data         ← guaranteed to see correct data
```
Variants: fence w,w (stores only, cheaper), fence r,r (loads only), fence.i (D-cache to I-cache, local only).
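The same producer/consumer pattern maps onto C11 atomics for a host-runnable sketch. This is an approximate analog, not RISC-V assembly: `atomic_thread_fence(memory_order_release)` plays the role of `fence w,w` and acquire plays `fence r,r`:

```c
#include <stdatomic.h>
#include <assert.h>

static int data;
static atomic_int flag;

static void producer(void)
{
    data = 42;                                              /* store data */
    atomic_thread_fence(memory_order_release);              /* "fence w,w" */
    atomic_store_explicit(&flag, 1, memory_order_relaxed);  /* store flag=1 */
}

static int consumer(void)
{
    while (atomic_load_explicit(&flag, memory_order_relaxed) == 0)
        ;                                                   /* spin on flag */
    atomic_thread_fence(memory_order_acquire);              /* "fence r,r" */
    return data;                                            /* sees 42 */
}
```

In real code these run on different harts (or threads); run sequentially the ordering is trivially satisfied, but the fence placement is the point.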
MSIP = Machine-mode Software Interrupt Pending. A 4-byte MMIO register in the CLINT at CLINT_BASE + 4×hartid. Writing 1 raises an M-mode software interrupt (MCAUSE=3) on that hart. Used for IPIs. Handler must write 0 to clear or the interrupt fires again immediately.
MSIP vs SSIP: MSIP is triggered by CLINT MMIO (M-mode only). SSIP (bit 1 of SIP) is S-mode visible. OpenSBI can inject SSIP from an MSIP handler to forward an IPI to the S-mode payload without delegating MSIP to S-mode.
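The msip slot address is pure arithmetic and can be sketched directly; the MMIO write itself only makes sense on target, so it is shown as a comment:

```c
#include <stdint.h>
#include <assert.h>

#define CLINT_BASE 0x02000000ul   /* QEMU virt CLINT base, from these notes */

/* msip[N] is a 4-byte MMIO slot at CLINT_BASE + 4*hartid. */
static uintptr_t msip_addr(uint32_t hartid)
{
    return (uintptr_t)(CLINT_BASE + 4ul * hartid);
}

/* On target, in M-mode, an IPI send and its handler-side clear would be:
 *   *(volatile uint32_t *)msip_addr(target) = 1;  // raise MSIP on target
 *   *(volatile uint32_t *)msip_addr(self)   = 0;  // clear, or it refires
 */
```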
ASID = Address Space Identifier. A tag on every TLB entry identifying which process's virtual address space it belongs to. Stored in SATP[59:44] (16 bits on RV64 Sv39).
Why it exists: Without ASIDs, every context switch requires a full TLB flush. With ASIDs, process A (ASID=5) and process B (ASID=7) entries coexist — no flush needed on context switch, just update SATP with the new ASID.
sfence.vma variants: x0,x0=flush all, t0,x0=flush one VA all ASIDs, x0,t1=flush all entries for one ASID, t0,t1=flush single VA+ASID. G-bit in PTE = Global (visible to all ASIDs — used for kernel mappings shared by all processes).
CLINT — Core Local Interruptor (0x02000000)
Single SoC-level peripheral. Two interrupt types: timer (MTIP) and software/IPI (MSIP). Registers: msip[N] at +4N (4 bytes), mtimecmp[N] at +0x4000+8N (8 bytes), mtime at +0xBFF8 (shared 64-bit). S-mode access blocked by PMP — use SBI ecalls.
PLIC — Platform Level Interrupt Controller (0x0C000000)
Up to 1023 external interrupt sources (UART, disk, NIC…). Per-hart contexts: M-mode (hartid×2), S-mode (hartid×2+1). Priority 1–7 per source. Threshold per context — only priority > threshold delivered. Claim/Complete protocol: read claim → handle → write complete. S-mode has direct PLIC access via PMP Regions 04+05.
Each hart has its own completely independent instruction pipeline. The pipeline IS what makes a hart a hardware thread.
| Resource | Scope |
|---|---|
| PC, register file (x0–x31), all CSRs | Per hart (private) |
| L1-I cache, L1-D cache | Per hart (private) |
| ALU, multiplier, FPU, load-store unit, L2 | Shared within core (SMT) |
| L3/LLC, DRAM, CLINT, PLIC, UART | Shared SoC-wide |
2-hart SMT core uses ~30–40% more silicon than a single-hart core, vs 100% more for two separate cores. Two harts share execution resources while maintaining completely independent execution state.
No — CLINT is shared across the entire SoC. It is a single MMIO peripheral on the system bus, physically outside all cores. It has per-hart register slots (msip[N], mtimecmp[N]) but all those slots live in one device. Any hart on any core can read/write any slot — that is how cross-hart IPIs work: hart 0 on core 0 writes msip[3] to interrupt hart 3 on core 1.
CLINT = Core Local Interruptor. "Core local" because each hart has dedicated register slots, though the hardware itself is SoC-wide.
Sstc = Supervisor-mode Standard Timer Comparison extension. Adds stimecmp CSR so S-mode can directly arm its own timer without an SBI ecall. OpenSBI enables via MENVCFG.STCE if detected at runtime.
| Prefix | Meaning | Examples |
|---|---|---|
| Sx | Supervisor standard | Sstc, Svpbmt, Svadu |
| Smx | Machine standard | Smepmp, Smaia |
| Zx | Standard sub-extension | Zicbom, Zicboz, Zkr |
| Xx | Vendor/non-standard | Xsifivecease |
Yes. OpenSBI is designed for this. All harts start simultaneously, amoswap lottery elects coldboot hart. Coldboot initialises scratch for ALL harts. All call sbi_init() — one coldboot path, others warmboot. All mret to payload with a0=hartid, a1=DTB.
SBI extensions for SMP: HSM (SBI_EXT_HSM — hart_start/stop/suspend), IPI (SBI_EXT_IPI — cross-hart interrupts), RFENCE (cross-hart TLB/I-cache shootdowns). Linux uses HSM to bring up CPUs one by one after the boot CPU initialises the kernel.
Hardware Atomics
AMO (amoswap, amoadd, amoand…): indivisible read-modify-write in one bus transaction. LR/SC: lr.w sets reservation; sc.w succeeds only if reservation not cancelled. .aq=acquire ordering, .rl=release ordering.
Spinlocks (built on amoswap)
Try to swap 1 in; if old=0 → lock acquired; else read until 0 (avoid bus contention) then retry. Release: amoswap 0 with .rl.
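A host analog of that spinlock using C11 atomics: `atomic_exchange` stands in for `amoswap.w.aq`, and a release store for the unlock's amoswap with `.rl`:

```c
#include <stdatomic.h>
#include <assert.h>

typedef struct { atomic_int v; } spinlock_t;

static void spin_lock(spinlock_t *l)
{
    for (;;) {
        /* try to swap 1 in; old value 0 means we acquired the lock */
        if (atomic_exchange_explicit(&l->v, 1, memory_order_acquire) == 0)
            return;
        /* read until 0 (no contended swaps while spinning), then retry */
        while (atomic_load_explicit(&l->v, memory_order_relaxed) != 0)
            ;
    }
}

static void spin_unlock(spinlock_t *l)
{
    atomic_store_explicit(&l->v, 0, memory_order_release);  /* ".rl" analog */
}
```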
Memory Ordering (RVWMO)
Stores visible out-of-order across harts. Producer: fence w,w between data write and flag write. Consumer: fence r,r between flag read and data read. AMO .aq/.rl suffixes embed ordering into the atomic itself.
Scheduling and Context Switch
Per-hart run queues (low contention). Timer interrupt (~1ms) preempts tasks. Empty queue → steal half tasks from busiest hart. Context switch: save all 32 regs + sepc + sstatus + satp → restore → sfence.vma → sret.
volatile — always read/write actual memory; compiler must never cache in a register. Necessary because multiple harts modify this concurrently and the compiler cannot see that from any single hart's perspective.
uint64_t — 64-bit unsigned integer per element. Matches RV64 natural register width.
timer_count[NUM_HARTS] — array of 4. Each element: how many times that hart's timer has fired. Each hart writes only its own slot (timer_count[hartid]++) so no spinlock needed for this array.
IN_DATA = __attribute__((section(".data"))) — forces into .data section (not .bss). Our runtime BSS clear had a GP-relative addressing bug. Forcing to .data means the ELF loader initialises values before _start runs — no runtime clear needed.
= {0,0,0,0} — initial values baked into the ELF binary, copied by the ELF loader before any instruction runs.
No. fence rw,rw controls memory ordering — when a hart's own stores become visible to others. It cannot stop other harts from independently accessing memory.
The LR/SC livelock was a cache line contention problem: primary's loads of hart_ready[] were generating MESI coherency transactions on the same cache line as uart_lock, cancelling secondary's LR reservation. The fence cannot prevent other harts from running their own load instructions independently. Only amoswap fixes it — an indivisible AMO has no reservation gap that can be cancelled.
CSR Format
pmpcfg CSRs: on RV64 only the even-numbered ones exist (pmpcfg0, 2, …, 14), each holding 8 config bytes (one per entry); on RV32, pmpcfg0–15 hold 4 each. Each byte: L(7)=lock, WPRI(6:5)=reserved, A(4:3)=address mode, X(2)=execute, W(1)=write, R(0)=read. pmpaddr0–63: stores PA[55:2] (address right-shifted by 2). Always write pmpaddr before pmpcfg to avoid activating with wrong address.
Address Modes (A field)
| Mode | Value | Range | Use |
|---|---|---|---|
| OFF | 00 | Disabled | Unused entries |
| TOR | 01 | [prev×4, this×4) | Arbitrary ranges, needs 2 entries |
| NA4 | 10 | Exactly 4 bytes | Single register |
| NAPOT | 11 | Power-of-2 aligned | Most common — pmpaddr=(base>>2)|((size/8)-1) |
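The NAPOT formula from the table, as a small helper. Assumes size is a power of two (at least 8 bytes) and base is size-aligned; the function name is ours:

```c
#include <stdint.h>
#include <assert.h>

/* NAPOT encoding: pmpaddr = (base >> 2) | ((size / 8) - 1).
   The trailing ones in pmpaddr encode log2(size). */
static uint64_t napot_pmpaddr(uint64_t base, uint64_t size)
{
    return (base >> 2) | ((size >> 3) - 1);
}
```

Example: a 128 KB region at 0x80000000 encodes as 0x20003FFF.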
Check Algorithm
Walk entries 0→63. First match wins. Match + check R/W/X against access type. M-mode bypasses unless L=1. No match: M-mode passes, S/U-mode = access fault (1=fetch, 5=load, 7=store).
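The first-match walk can be sketched for the S/U-mode case. Entries are modeled as flat [base, base+size) ranges with R/W/X bits; the NAPOT/TOR address decoding and the M-mode/L-bit path are omitted to keep the priority logic visible:

```c
#include <stdint.h>
#include <stdbool.h>
#include <assert.h>

#define PMP_R 1u
#define PMP_W 2u
#define PMP_X 4u

typedef struct { uint64_t base, size; uint8_t rwx; bool valid; } pmp_entry;

/* access uses the fault codes from the text: 1=fetch, 5=load, 7=store.
   Returns true if allowed; false means an access fault for S/U-mode. */
static bool pmp_check_su(const pmp_entry *e, int n, uint64_t pa, int access)
{
    uint8_t need = (access == 1) ? PMP_X : (access == 5) ? PMP_R : PMP_W;
    for (int i = 0; i < n; i++) {             /* entry 0 checked first */
        if (!e[i].valid)
            continue;
        if (pa >= e[i].base && pa < e[i].base + e[i].size)
            return (e[i].rwx & need) != 0;    /* first match wins */
    }
    return false;   /* no match: S/U-mode access fault */
}
```

With the layered entries from the next section (entry 0 = 128 KB DENY inside entry 1 = 256 KB RX), a load into the first 128 KB faults while a fetch in the second 128 KB succeeds.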
Intentional priority layering. Entry 0 (128KB, DENY) sits inside Entry 1's larger range (256KB, RX). First match wins, so Entry 0 blocks the sensitive 128KB while Entry 1 allows the remaining 128KB.
```
0x80000000–0x8001FFFF → Entry 0 matches first → DENY (firmware R/W data protected)
0x80020000–0x8003FFFF → Entry 0 no match → Entry 1 → RX (firmware code readable)
```
Standard PMP technique for "allow X, except deny this smaller sub-region."
Region = single hardware PMP entry: one address range + one permission set. The CPU knows only about flat PMP entries — checked in order on every memory access.
Domain = OpenSBI software concept: a named security partition owning a set of harts, a collection of memory regions, a next-boot address/mode, and allowed SBI extensions. OpenSBI programs PMP entries to enforce the domain's regions.
Domain (OpenSBI policy) → Region (software struct) → PMP Entry (hardware CSR)
Simple systems have one domain (root, all harts). Multi-tenant systems use multiple domains on different harts with different PMP rules enforcing hardware isolation between workloads.
They are two completely separate scratch spaces at different addresses.
| OpenSBI (M-mode) | Your payload (S-mode) | |
|---|---|---|
| Location | 0x80040000–0x8005FFFF | 0x80200Cxx–0x80201Cxx |
| PMP | DENY to S-mode | S-mode accessible |
| Scratch CSR | MSCRATCH (M-mode) | SSCRATCH (S-mode) |
| Stack pointer | OpenSBI's sp | Your payload's sp |
| Used by | sbi_init, trap handlers | primary_main, secondary_main |
Both exist simultaneously. When an ecall switches S→M, hardware saves S-mode sp and switches to M-mode's stack. They are two independent execution contexts on the same physical CPU.
Idempotent = doing something twice gives the same result as once. For memory reads: reading the same address twice returns the same value and changes nothing in the hardware.
Normal RAM is always idempotent. A UART receive buffer is NOT — reading consumes the byte (first read = 'H', second read = empty). The I flag tells the memory subsystem: reads here are safe to speculate, prefetch, or cache-check without triggering accidental side effects.
Regions NOT marked I must be accessed with strict ordering — no speculation or reordering allowed. On QEMU, Region02 (syscon 0x00100000) is idempotent — reads return a fixed value, writes trigger poweroff (0x5555) or reboot (0x7777).
| Region | Address | S/U Access | Meaning |
|---|---|---|---|
| Region00 | 0x80040000–0x8005FFFF | DENY | OpenSBI data/scratch/stacks — protected from S-mode |
| Region01 | 0x80000000–0x8003FFFF | R (via R06) | OpenSBI code — S-mode can read but not write |
| Region02 | 0x00100000–0x00100FFF | R W | Syscon test device — I=idempotent, writes trigger poweroff/reboot |
| Region03 | 0x02000000–0x0200FFFF | DENY | CLINT — timer and IPI hardware, M-mode only via SBI |
| Region04 | 0x0C400000–0x0C5FFFFF | R W | PLIC S-mode context registers — direct access for payload |
| Region05 | 0x0C000000–0x0C3FFFFF | R W | PLIC source/priority/enable registers — direct access |
| Region06 | 0x0–0xFFFF... | R W X | Catch-all — grants S/U access to everything not covered above |
Organisation
Address split: TAG | INDEX | OFFSET. 32KB 4-way: offset=6b(64-byte lines), index=7b(128 sets), tag=43b(for 56-bit PA). N-way set associative: index→set→N parallel tag comparisons. Direct-mapped=1 way (conflict misses). Fully assoc=1 set (no conflicts, expensive).
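The field widths fall out of the geometry. A small helper deriving the TAG | INDEX | OFFSET split (names are ours):

```c
#include <stdint.h>
#include <assert.h>

static unsigned log2u(uint64_t x)
{
    unsigned n = 0;
    while (x >>= 1)
        n++;
    return n;
}

typedef struct { unsigned offset_bits, index_bits, tag_bits; } cache_geom;

/* sets = size / (ways * line); index selects the set, offset the byte. */
static cache_geom cache_split(uint64_t size_bytes, unsigned ways,
                              unsigned line_bytes, unsigned pa_bits)
{
    cache_geom g;
    g.offset_bits = log2u(line_bytes);                        /* 64 B  → 6 */
    g.index_bits  = log2u(size_bytes / (ways * line_bytes));  /* 128 sets → 7 */
    g.tag_bits    = pa_bits - g.index_bits - g.offset_bits;   /* 56-7-6 = 43 */
    return g;
}
```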
Hierarchy
L1-I (per hart, fetch only, not coherent with D) + L1-D (per hart, load/store, MESI coherent) + L2 (shared per core) + L3 (shared all cores).
Write Policies
Write-back (default): update cache only, writeback on eviction. Write-through: update both cache and memory on every write. Write-allocate: on write miss, fetch line then update (pairs with write-back). LRU/PLRU/Random replacement when evicting.
RISC-V Mechanisms
fence.i: flush local I-cache. fence rw,rw: memory ordering. sfence.vma: TLB flush. cbo.clean (Zicbom): writeback dirty for DMA. cbo.inval: invalidate for DMA. cbo.zero (Zicboz): zero a cache line without fetching (~8× faster than sd loop). All gated by MENVCFG from M-mode.
Way = one complete cache slot: TAG + valid bit + dirty bit + 64 bytes of data. Like one parking space.
Set = a group of ways sharing the same index address. When the CPU accesses memory, index bits select one set and ALL ways in that set are checked simultaneously (parallel tag comparison). Like a row of parking spaces at the same address.
```
          Way 0        Way 1        Way 2        Way 3
Set 0   [ tag|data ][ tag|data ][ tag|data ][ tag|data ]
Set 1   [ tag|data ][ tag|data ][ tag|data ][ tag|data ]
...
Set 127 [ tag|data ][ tag|data ][ tag|data ][ tag|data ]
```
On access: index bits → select row (set) → compare all N tags → HIT returns data / MISS fetches from L2 into one way of that set (chosen by replacement policy).
It is a hardware design decision, not derived. The CPU designer chooses PA width based on: maximum RAM the chip needs to address, pin count, power budget. RISC-V specifies up to 56-bit PA for Sv39/Sv48/Sv57 paging modes. Real chips implement fewer: SiFive U74=56, StarFive JH7110=40, embedded=32.
Discover at runtime by writing all-ones to SATP.PPN and reading back how many bits stuck. In cache calculations: PA=56 is given as a specification → tag = 56 − index_bits − offset_bits. The tag must be wide enough that no two different PAs map to identical tag+index.
1024×1024 float matrix, row=4096 bytes, 32KB direct-mapped cache, line=64B, 512 sets. Index bits=PA[14:6]. Addresses 32KB apart map to the same set.
The thrash: Reading A[row][col] is sequential (fine). Writing B[col][row] jumps one row (4096B) per column. If base_A and base_B are 32KB apart: A[row][col] and B[col][row] map to the same cache set. Load A evicts B, load B evicts A — every access misses → 300× slower.
Fixes: Higher associativity (4-way: 4 addresses coexist in one set — eliminates most conflicts). Cache tiling: work on 32×32 submatrices fitting in cache so A and B tiles coexist without eviction.
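The tiling fix can be sketched directly. A minimal tiled transpose for the 1024×1024 float case above; TILE=32 is one reasonable choice (a 32×32 float tile is 4 KB, so the A and B tiles coexist in a 32 KB cache):

```c
#include <stdlib.h>
#include <stddef.h>
#include <assert.h>

#define N    1024
#define TILE 32   /* 32x32 floats = 4 KB per tile */

/* Cache-tiled transpose B = A^T: iterate TILE x TILE blocks so the source
   tile and destination tile stay resident while they are being touched,
   instead of evicting each other on every access. */
static void transpose_tiled(const float *A, float *B)
{
    for (size_t ii = 0; ii < N; ii += TILE)
        for (size_t jj = 0; jj < N; jj += TILE)
            for (size_t i = ii; i < ii + TILE; i++)
                for (size_t j = jj; j < jj + TILE; j++)
                    B[j * N + i] = A[i * N + j];
}
```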
L1-I and L1-D are physically separate SRAM arrays. Writing new code through D-cache does not propagate to I-cache automatically.
```
sd t0, 0x80001000     → new instruction in D-cache
                        I-cache at same address: STILL OLD CODE
jalr ra, 0x80001000   → I-cache HIT: executes OLD code → WRONG
```
With fence.i: (1) flush dirty D-cache lines to L2, (2) invalidate all I-cache lines, (3) flush pipeline. Next fetch goes to L2 and retrieves new code.
Cross-hart: fence.i only affects the LOCAL hart. For other harts: fence rw,rw to make D-write visible + send IPI via msip + target hart executes fence.i. OpenSBI provides this via SBI_EXT_RFENCE.
Yes — but implementation-defined. The spec guarantees the outcome (subsequent fetches see prior stores), not the method.
- Most common (SiFive U74): Full I-cache flush — all valid bits → 0. Cost ~15–30 cycles.
- Selective: Some cores invalidate only matching lines (faster, more hardware).
- No I-cache (embedded): Just a pipeline flush.
Write-back order matters: dirty D-cache lines must flush to L2 before I-cache invalidation — otherwise the I-cache miss would fetch stale data from L2. Correct order: flush dirty D → invalidate I → flush pipeline → resume fetch.
On QEMU: fence.i calls tb_flush() — discards all JIT translation blocks, forces re-translation from fresh memory.
T0 Power-on. All 4 harts: PC=0x1000 (MROM). _boot_lottery=0.
T1 All harts race to _start. amoswap on _boot_lottery.
Bus serialises 4 simultaneous requests. One hart reads
back 0 (winner). Memory: _boot_lottery → 1.
Other 3 harts fall through to warmboot spin loop.
T2 Boot hart writes: GOT relocation, BSS zeros,
ALL FOUR scratch structs (fw_start, fw_size,
next_addr=0x80200000, next_mode=S, platform ptr).
All writes dirty in boot hart's L1-D.
T3 fence rw,rw → drains store buffer → all writes
globally visible via MESI. _boot_status=1 →
MESI invalidates Shared line in warmboot harts →
they miss on next load → read 1 → exit spin.
T4 All harts in _start_warm. Each computes own tp
(no conflict — different addresses). All csrw MSCRATCH=tp.
T5 sbi_init() C-level atomic → coldboot/warmboot split.
Coldboot: full domain, HSM, ecall, banner init.
Warmboot: wait on coldboot_done flag.
T6 wake_coldboot_harts() → fence → coldboot_done=1.
Warmboot harts exit → each does per-hart init.
T7 All harts: sbi_hart_switch_mode() →
MEPC=0x80200000, MSTATUS.MPP=S, mret.
T8 All 4 harts at payload _start, S-mode.
a0=hartid, a1=DTB. Your code runs.
Yes. Project 3 had no -smp flag → QEMU default = 1 CPU, 1 hart. OpenSBI banner: Platform HART Count: 1. Only hart 0 ran. Warmboot path never taken. Payload entry.S had no multi-hart handling — one stack, one call to payload_main().
Project 4 added -smp cpus=4,cores=2,threads=2 → 4 harts → all four arrive at _start simultaneously with non-deterministic boot hart.
Layer 1 — AMO Instructions
amoswap/amoadd/amoand: indivisible read-modify-write in one bus transaction. LR/SC: reservation-based; cancelled if another hart touches the same cache line between LR and SC.
Layer 2 — Spinlocks
amoswap 1 in → old=0 means acquired, old=1 means wait. Read until 0 (avoids bus-locking contention while spinning), then retry. Release: amoswap 0 with .rl ordering. OpenSBI uses spinlocks for heap (sbi_malloc) and IPI queues.
Layer 3 — IPI Protocol
Hart A: writes message to shared memory + fence rw,rw + writes msip[B]=1 to CLINT. CLINT raises MSIP on hart B. Hart B's trap handler processes message + writes msip[B]=0 to clear.
Layer 4 — Scheduling + Work Stealing
Per-hart run queues (minimal cross-hart contention). Timer interrupt (~1ms) preempts. Empty queue → steal half of tasks from busiest hart. Context switch: save 32 regs + sepc + sstatus + satp → restore incoming task → sfence.vma → sret.
| Design Choice | Why |
|---|---|
| Atomic lottery for primary | OpenSBI boot hart is non-deterministic. Lottery ensures any hart can win and the code works correctly on every run. |
| HSM to start secondaries | Secondaries that lost lottery are in wfi with no stack. HSM gives them a clean S-mode entry with correct registers at _secondary_entry. |
| amoswap for UART lock | LR/SC reservations are cancelled by adjacent cache line accesses. amoswap is indivisible — cannot be interrupted between read and write. |
| uart_lock in padded struct | Ensures uart_lock occupies its own 64-byte cache line so hart_ready[] accesses cannot interfere with the spinlock. |
| All shared vars in .data | Runtime BSS clear had a GP-relative bug. .data means ELF loader initialises correctly before _start runs. |
| wfi in wait loops | Tight spin on hart_ready[] shared uart_lock's cache line, cancelling LR/SC reservations. wfi yields CPU and eliminates bus traffic. |
| primary_hartid variable | Primary can be any hart. Secondaries need to know which hart to IPI to wake it from wfi — cannot hardcode hart 0. |
Imagine a shared whiteboard in an office. Person A (secondary) wants to write on it and places a sticky note: "A is about to write — do not touch." That sticky note is the LR reservation.
Person B (primary) is not writing — they just keep walking past the same whiteboard every second to check a timetable pinned next to A's sticky note. Every time B touches that corner of the board, A's sticky note falls off. A puts a new one down. B walks past again. Falls off again. A can never write. Forever.
In hardware: uart_lock (at 0x80200b10) and hart_ready[] (at 0x80200b18) were 8 bytes apart — same 64-byte cache line. Every lw from primary loading hart_ready[] generated a MESI coherency transaction on that cache line, cancelling secondary's LR reservation. sc.w failed every retry.
The raw sb (direct store) still worked because it needs no reservation. The amoswap fix works because it does both read and write in one indivisible bus transaction — no gap where another hart can interfere.
Root cause: __global_pointer$ in linker.ld enabled GP-relative addressing. GCC emits lw a5, offset_from_GP(gp) for global variable access. The assembler, seeing that uart_lock happened to be at the same address as _bss_end (both 0x80200b60), substituted uart_lock's GP-relative address as the loop bound. The BSS clear started at hart_ready (_bss_start) and stopped at uart_lock (_bss_end) — but uart_lock was in that range and was NEVER zeroed. It contained garbage → uart_acquire() spinlock hung forever on first call.
Fix: Use callee-saved registers s2/s3 with explicit la for BSS loop bounds (assembler cannot substitute GP-relative). Force all shared vars to .data — ELF loader initialises them; no runtime BSS clear needed.
Root cause: uart_lock and hart_ready[] shared the same 64-byte cache line (uart_lock at 0x80200b10, hart_ready at 0x80200b18 — only 8 bytes apart). Primary's tight spin on hart_ready[] generated continuous MESI coherency transactions on that cache line, cancelling secondary's LR reservation on uart_lock before sc.w could complete. Livelock — secondary could never acquire the UART lock.
sb to UART printed 'A' (no reservation needed). hprint() hung (needs LR/SC lock). The diagnostic showed exactly which layer failed. Fix: (1) Wrap uart_lock in a struct with 60-byte padding, aligned(64) → own cache line. (2) Replace LR/SC with amoswap.w.aq (indivisible). (3) Replace the tight spin with wfi in wait_all_harts_ready.
Root cause: Code had beqz s0, _primary — only hart 0 became primary. When OpenSBI's boot hart was 1/2/3, that hart took the coldboot path and arrived at _start. Hart 0 was a warmboot hart, held in OpenSBI's sbi_hsm_hart_wait() waiting for an HSM hart_start call. Nobody called it — hart 0 never reached _start at all. The other harts' boot diagnostics (digits at _start) also did not print, confirming they never arrived.
Fix: Atomic lottery — amoswap.w.aq on _boot_lottery. Whichever hart arrives first becomes primary regardless of hartid. Primary then HSM-starts all other harts at _secondary_entry. This is how Linux boots secondary CPUs.
Root cause: GCC uses the GP (global pointer) register for GP-relative addressing of global variables. When OpenSBI delivered secondary harts to _secondary_entry via HSM, GP contained whatever OpenSBI left in it — pointing into firmware memory, not payload .data. The first global variable access (amoswap on &uart_lock in uart_acquire) used the garbage GP address → store access fault.
Fix: Add GP initialisation in _secondary_entry before any C call:
```
.option push
.option norelax        ← critical: prevents assembler from optimising away la gp
la gp, __global_pointer$
.option pop
```
Must also add the same in _primary. The same bug existed in the previous (pre-lottery) version but was hidden because only hart 0 ran primary and hart 0 happened to have a usable GP from OpenSBI.
Root cause: secondary_main had hardcoded send_ipi_to_hart(0). When primary was hart 1/2/3, all three secondaries IPId hart 0 (a secondary) instead of primary. Primary stayed stuck in wait_all_harts_ready() forever because nobody IPId it. Never woke up, never enabled its timer, never reached the event loop. Timer count for primary stayed at 0. Demo never completed.
Fix: Add volatile uint64_t primary_hartid shared variable. Primary writes its own hartid with a fence rw,rw before HSM-starting secondaries (guarantees secondaries can read it safely). Secondaries call send_ipi_to_hart(primary_hartid).
```
[Primary] Winning hart: N                        ← N = whichever hart won the amoswap lottery (any 0–3)
[Primary] SBI spec version: 3.0                  ← confirmed via SBI_EXT_BASE ecall
[Primary] Starting all other harts via HSM...
[Primary] HSM start hart X → OK                  ← sbi_hsm_hart_start() returned error=0 for each
[Hart X] Secondary started.                      ← secondary reached secondary_main, acquired UART lock
[Hart N] IPI received #1                         ← secondary sent IPI to wake primary from wfi
[Primary] All 4 harts running.                   ← all hart_ready[] flags set, wait loop exited
[Hart N] Starting timer...                       ← primary armed its timer via sbi_set_timer()
[Hart N] Timer IRQ #1                            ← trap_handler fired for STIP (scause=0x8000...0005)
[Hart N] mtime = 0x...                           ← current CLINT time counter value
[Hart N] shared_ctr = 1                          ← atomic amoadd.d.aqrl incremented shared_counter
... (12 timer IRQs total, 3 per hart, staggered by N+1 intervals)
[Primary] Sending IPI to all secondary harts...  ← primary sent SSIP to harts 1,2,3
[Hart X] IPI received #1                         ← SSIP delivered (scause=0x8000...0001), csrc sip cleared
[Hart X] Secondary done. Halting.                ← while() loop exited, hart enters infinite wfi
shared_counter = 12 (expected 12)                ← 4 harts × 3 fires = 12, atomic counter correct
Demo complete. System halted.
```