Eclipse CDT approach (recommended)
Install Eclipse CDT + CDT GDB Hardware Debugging plugin. Create a C project, set cross-compiler prefix to riscv64-unknown-elf-. In Debug Configurations → GDB Hardware Debugging: GDB command = riscv64-unknown-elf-gdb, target = localhost:1234.
QEMU side (debug_qemu.sh)
```sh
qemu-system-riscv64 -machine virt -nographic \
    -bios fw_jump.elf -kernel payload.elf -S -s
```
-S = pause at start. -s = GDB server on port 1234. Connect Eclipse CDT GDB and step through OpenSBI + payload source simultaneously.
Initial GDB commands
```
set arch riscv:rv64
add-symbol-file fw_jump.elf 0x80000000
hbreak _fw_start
hbreak sbi_init
hbreak *0x80200000
continue
```
-mcmodel=medany is a GCC compiler flag only — not valid for the assembler (as). Remove it from ASFLAGS, keep only in CFLAGS.
```make
# Wrong:
ASFLAGS = -g $(ARCH) -mcmodel=medany -mno-relax
# Fixed:
ASFLAGS = -g $(ARCH)
```
riscv64-unknown-elf- (bare-metal toolchain) does not support PIE. OpenSBI uses Linux build conventions including PIE. Fix: use the Linux-target toolchain.
```sh
sudo apt install gcc-riscv64-linux-gnu binutils-riscv64-linux-gnu
make CROSS_COMPILE=riscv64-linux-gnu- PLATFORM=generic \
    FW_JUMP=y FW_JUMP_ADDR=0x80200000 FW_TEXT_START=0x80000000
```
| Toolchain | Target | PIE | Use for |
|---|---|---|---|
| riscv64-unknown-elf- | Bare metal | No | Your payload, zsbl.S |
| riscv64-linux-gnu- | Linux ABI | Yes | OpenSBI, Linux kernel |
Called the MROM (Mask ROM) — a small read-only region QEMU emulates at 0x1000 (not 0x0). Contains ~8 instructions: a few NOPs, then an auipc+jalr that jumps to 0x80000000 (where OpenSBI/BIOS is loaded). Also contains the DTB pointer in a register.
This is NOT a real-world zero-state bootloader — it is QEMU's simplified substitute for the chip-level boot ROM that real silicon has (e.g. SiFive's ZSBL which validates and decrypts firmware). On real hardware the boot ROM is vendor-specific, typically 128KB–4MB of mask-programmed or eFuse-configured code.
Repository: https://github.com/vwire/riscv-bare-metal-qemu
Documentation: https://vwire.github.io/riscv-bare-metal-qemu
Structure: docs/ folder with Jekyll or plain HTML. Each project has its own subdirectory with README.md, source files, and a blog page linked from the root index. GitHub Pages serves the docs/ folder automatically when enabled in repo Settings → Pages.
The Zephyr SDK includes its own QEMU binary at a non-standard path. Find it with:
find / -name "qemu-system-riscv64" 2>/dev/null
Returns: /home/vikram/zephyr-sdk-0.17.0/sysroots/x86_64-pokysdk-linux/usr/bin/qemu-system-riscv64. Set this full path in the Makefile's QEMU variable rather than relying on PATH.
fw_jump.S
First code at 0x80000000. Contains the jump target address (0x80200000) as a constant. Calls into fw_base.S after minimal register setup.
fw_base.S — Key Functions
_reset_regs: fence.i + zero all 32 registers. _start: amoswap lottery → one hart wins coldboot. _relocate: GOT relocation loop (adjusts global pointers if loaded at non-link address). _scratch_init: fills per-hart scratch structs for ALL harts. fence rw,rw → _boot_status=1 to release warmboot harts. _start_warm: all harts compute tp, csrw MSCRATCH=tp, call sbi_init().
sbi_init.c
Coldboot: sbi_scratch_init → sbi_heap_init → sbi_domain_init → sbi_hsm_init → sbi_hart_init → sbi_ecall_init → sbi_boot_print_banner. Warmboot: subset of init. Both paths end at sbi_hsm_hart_start_finish() → mret to payload.
sbi_hart.c
sbi_hart_init(): sets MIDELEG (0x1666) / MEDELEG (0xF0B509), calls mstatus_init() which programs MENVCFG for Zicbom/Zicboz/Sstc/Svpbmt. sbi_hart_switch_mode(): MEPC=payload, MSTATUS.MPP=S, mret.
sbi_trap.c
M-mode trap handler. Routes ecalls by (a7=ext, a6=fid) to registered handlers. Forwards unhandled exceptions to S-mode via sbi_trap_redirect(). Handles timer (inject STIP), IPI (set SSIP on target).
A div instruction takes ~20–40 cycles in the hardware divider (on RISC-V, divide-by-zero does not trap — it simply returns all-ones). Used as a multi-cycle delay NOP with zero memory bus transactions.
Without delay: Warmboot harts hammer _boot_status at full pipeline speed — millions of loads/sec on the same cache line, heavy MESI coherency traffic, slowing the boot hart that needs to write all four scratch spaces.
With 3× div: ~60–120 cycles of delay between each load. Bus mostly idle. Boot hart completes scratch init without contention. A standard pattern in multi-hart firmware init.
GOT = Global Offset Table. Compiler stores global variable addresses in a table rather than hardcoding them. If the binary loads at a different address than it was linked at, GOT entries need to be adjusted.
load_offset = actual_load_address − link_address (= 0 on QEMU)
for each Elf64_Rela entry (type R_RISCV_RELATIVE):
*( r_offset + load_offset ) = r_addend + load_offset
On QEMU load_offset=0 so the loop runs but changes nothing. On real hardware that loads OpenSBI at a different address, this loop fixes all global variable pointers to their correct runtime locations. The loop skips entries that are not R_RISCV_RELATIVE (type 3).
That example was hypothetical to illustrate the concept. On QEMU: load_offset = actual(0x80000000) − link(0x80000000) = 0. The example used a hypothetical platform loaded at 0x80100000 while linked at 0x80000000, giving offset=0x00100000. The r_addend in that case was also a link-time address (0x80012340) so adding the offset (0x00100000) correctly gives the runtime address (0x80112340). On QEMU none of this applies.
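The relocation loop can be modeled on the host. A sketch, assuming a flat `image[]` buffer stands in for the loaded firmware image and the standard `Elf64_Rela` layout; `apply_relocations` is a hypothetical name, not OpenSBI's:

```c
#include <stdint.h>
#include <stddef.h>
#include <assert.h>

#define R_RISCV_RELATIVE 3

typedef struct {
    uint64_t r_offset;   /* link-time address of the slot to patch */
    uint64_t r_info;     /* low 32 bits: relocation type */
    int64_t  r_addend;   /* link-time value to store */
} Elf64_Rela;

static void apply_relocations(uint8_t *image, uint64_t link_base,
                              uint64_t load_base,
                              const Elf64_Rela *rela, size_t count)
{
    int64_t load_offset = (int64_t)(load_base - link_base);
    for (size_t i = 0; i < count; i++) {
        if ((rela[i].r_info & 0xffffffffu) != R_RISCV_RELATIVE)
            continue;  /* skip anything that is not R_RISCV_RELATIVE */
        /* r_offset is a link-time address; translate it into image[] */
        uint64_t *slot = (uint64_t *)(image + (rela[i].r_offset - link_base));
        *slot = (uint64_t)(rela[i].r_addend + load_offset);
    }
}
```

With `load_base == link_base` the loop runs but writes back the unchanged addend, matching the QEMU case.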
Uses the Zkr (Entropy Source) extension. The CSR_SEED register delivers 16-bit hardware random entropy when status bits = ES16. The loop collects bits until a full pointer-width value is assembled.
```
csrrw t1, CSR_SEED, x0        ; read hardware RNG
check status == ES16          ; wait for valid entropy
accumulate into t0            ; shift and OR 16 bits at a time
store to __stack_chk_guard    ; GCC's stack canary variable
```
GCC places this canary between local variables and return address. On overflow detection, execution aborts. Hardware entropy makes it unguessable.
Graceful fallback: csrw MTVEC, __stack_chk_guard_done first — if Zkr absent, csrrw CSR_SEED traps (illegal instruction) and CPU jumps directly to done label, skipping the loop.
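A host-side sketch of the accumulation loop only. `read_seed16()` is a demo stub (a fixed LCG, NOT real entropy) standing in for the `csrrw t1, CSR_SEED, x0` read; the ES16 status check and the MTVEC fallback trap are omitted:

```c
#include <stdint.h>
#include <assert.h>

/* Stand-in for the seed CSR read. A real implementation reads the Zkr
   seed CSR and retries until status == ES16. */
static uint16_t read_seed16(void)
{
    static uint32_t s = 0x1234u;
    s = s * 1103515245u + 12345u;   /* NOT entropy: deterministic demo stub */
    return (uint16_t)(s >> 8);
}

/* Accumulate 16 bits at a time until a full pointer-width value exists. */
static uint64_t collect_stack_guard(void)
{
    uint64_t guard = 0;
    for (int i = 0; i < 4; i++)          /* 4 x 16 bits = 64-bit guard */
        guard = (guard << 16) | read_seed16();
    return guard;                        /* would be stored to __stack_chk_guard */
}
```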
```
_fw_end (0x80020000)   heap (grows up toward higher addresses)
┌─────────────────────────────────┐
│ hart 0 stack    8 KB            │ ← sp for hart 0 (grows down)
│ hart 0 scratch  4 KB            │ ← MSCRATCH for hart 0 (struct sbi_scratch)
├─────────────────────────────────┤
│ hart 1 stack    8 KB            │
│ hart 1 scratch  4 KB            │
├─────────────────────────────────┤
│ hart 2 stack    8 KB            │
│ hart 2 scratch  4 KB            │
├─────────────────────────────────┤
│ hart 3 stack    8 KB            │
│ hart 3 scratch  4 KB            │ ← lowest address block
└─────────────────────────────────┘
```
Boot hart fills ALL four scratch structs during _scratch_init before releasing warmboot harts. Each scratch contains: fw_start, fw_size, next_addr (0x80200000), next_mode (S), platform pointer, dynamic extension fields.
Three steps: _fw_end + hart_count × stack_size = jump past ALL stack blocks to the very top. − stack_size × hart_index = step back down to the TOP of this hart's specific block. − SBI_SCRATCH_SIZE = the scratch struct sits at the top of the block, so subtract to get its base address.
Result: tp = base address of this hart's struct sbi_scratch. Hart 0 gets the highest block, hart N gets progressively lower. Each hart then sets MSCRATCH = tp.
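The three steps written out as arithmetic. A sketch: the 8 KB/4 KB constants and the function name are assumptions matching the layout above, with the scratch struct occupying the top `SBI_SCRATCH_SIZE` bytes of each hart's block:

```c
#include <stdint.h>
#include <assert.h>

#define STACK_SIZE       0x2000u   /* 8 KB per-hart block (assumed) */
#define SBI_SCRATCH_SIZE 0x1000u   /* 4 KB scratch at the top of the block */

/* Mirrors _start_warm's tp computation for a given compact hart index. */
static uint64_t scratch_addr(uint64_t fw_end, uint32_t hart_count,
                             uint32_t hart_index)
{
    /* step 1: jump past ALL stack blocks to the very top */
    uint64_t top = fw_end + (uint64_t)hart_count * STACK_SIZE;
    /* step 2: step back down to the TOP of this hart's block */
    uint64_t block_top = top - (uint64_t)hart_index * STACK_SIZE;
    /* step 3: the scratch struct sits at the top of the block */
    return block_top - SBI_SCRATCH_SIZE;
}
```

Hart 0 lands at the highest block, hart 3 at the lowest, matching the diagram.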
Hardware MHARTID values may not be contiguous or start at 0 — a system could have harts 0, 2, 4, 6. OpenSBI needs a compact index (0, 1, 2, 3) for array subscripting. hart_index2id[] maps compact index → hardware hartid. The search finds which position matches our MHARTID, returning the compact index used for all array accesses including the scratch/stack address calculation.
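The lookup is a plain linear search. A minimal sketch of the mapping described above (function name is hypothetical):

```c
#include <stdint.h>
#include <assert.h>

/* Map a hardware MHARTID to the compact index OpenSBI uses for all
   array subscripting (scratch/stack address calculation included). */
static int hartid_to_index(const uint32_t *hart_index2id, int count,
                           uint32_t mhartid)
{
    for (int i = 0; i < count; i++)
        if (hart_index2id[i] == mhartid)
            return i;           /* compact index 0..count-1 */
    return -1;                  /* hartid not present on this platform */
}
```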
sbi_hart_pmp_init() walks the domain's memory region list and programs PMP entries in order (highest-priority first). For each region: converts region flags (SBI_DOMAIN_MEMREGION_M_*) to PMP permission bits (PMP_R/W/X), converts address + size to NAPOT encoding, calls pmp_set(index, prot, addr, log2_size) which writes pmpaddr CSR then pmpcfg CSR.
On QEMU the result is 7 entries: 0=firmware R/W data DENY, 1=firmware code RX, 2=test device RW, 3=CLINT DENY, 4=PLIC S-mode RW, 5=PLIC source RW, 6=all memory S/U RWX (catch-all).
Memory Protection (PMP)
Up to 64 PMP entries (the exact count is implementation-defined). Each = one physical address range + R/W/X permissions for S/U-mode. Entry 0 = highest priority. Default = deny-all for S/U if no entry matches. L=1 applies entry to M-mode too. Three modes: TOR (arbitrary range), NA4 (4-byte), NAPOT (power-of-2).
Interrupt Handling
Two trap-handling privilege levels, each with its own vector CSR: M (MTVEC), S (STVEC). MIDELEG/MEDELEG CSRs delegate specific causes to S-mode. scause: bit63=interrupt flag, bits 62:0=cause code. Timer=5, Software/IPI=1, External=9(S)/11(M). MSTATUS/SSTATUS global enable = master switch.
Timer (CLINT)
Shared mtime counter at 10 MHz (QEMU). Per-hart mtimecmp[N]: when mtime≥mtimecmp → MTIP fires on hart N. S-mode uses SBI_EXT_TIME ecall to set timer. Sstc extension adds direct stimecmp CSR access from S-mode.
Fence / Memory Ordering
RVWMO: stores may be seen out-of-order. fence rw,rw: full barrier. fence.i: D→I cache sync (local hart only). sfence.vma: TLB flush (local hart only). For cross-hart: IPI + fence on target hart required.
Cache
L1-I and L1-D separate (Harvard at L1). D-cache coherent between harts via MESI. I-cache NOT coherent with D-cache. Zicbom: cbo.clean/flush/inval. Zicboz: cbo.zero. MENVCFG gates S-mode CBO access.
All reads AND writes before this instruction are globally visible to ALL harts BEFORE any read or write after it begins.
Drains the store buffer. Forces all pending MESI coherency transactions to complete. No reordering can cross this barrier in either direction.
```
// writing hart:
store data        ← write data
fence rw, rw      ← drain → all prior writes visible everywhere
store flag = 1    ← guaranteed AFTER data is visible
// reading hart:
load flag         ← sees 1
fence r, r        ← flag load before data load
load data         ← guaranteed to see correct data
```
Variants: fence w,w (stores only, cheaper), fence r,r (loads only), fence.i (D-cache to I-cache, local only).
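The same producer/consumer pattern maps onto C11 atomics for a host-runnable sketch. This is an approximate analog, not RISC-V assembly: `atomic_thread_fence(memory_order_release)` plays the role of `fence w,w` and acquire plays `fence r,r`:

```c
#include <stdatomic.h>
#include <assert.h>

static int data;
static atomic_int flag;

static void producer(void)
{
    data = 42;                                              /* store data */
    atomic_thread_fence(memory_order_release);              /* "fence w,w" */
    atomic_store_explicit(&flag, 1, memory_order_relaxed);  /* store flag=1 */
}

static int consumer(void)
{
    while (atomic_load_explicit(&flag, memory_order_relaxed) == 0)
        ;                                                   /* spin on flag */
    atomic_thread_fence(memory_order_acquire);              /* "fence r,r" */
    return data;                                            /* sees 42 */
}
```

In real code these run on different harts (or threads); run sequentially the ordering is trivially satisfied, but the fence placement is the point.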
MSIP = Machine-mode Software Interrupt Pending. A 4-byte MMIO register in the CLINT at CLINT_BASE + 4×hartid. Writing 1 raises an M-mode software interrupt (MCAUSE=3) on that hart. Used for IPIs. Handler must write 0 to clear or the interrupt fires again immediately.
MSIP vs SSIP: MSIP is triggered by CLINT MMIO (M-mode only). SSIP (bit 1 of SIP) is S-mode visible. OpenSBI can inject SSIP from an MSIP handler to forward an IPI to the S-mode payload without delegating MSIP to S-mode.
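The msip slot address is pure arithmetic and can be sketched directly; the MMIO write itself only makes sense on target, so it is shown as a comment:

```c
#include <stdint.h>
#include <assert.h>

#define CLINT_BASE 0x02000000ul   /* QEMU virt CLINT base, from these notes */

/* msip[N] is a 4-byte MMIO slot at CLINT_BASE + 4*hartid. */
static uintptr_t msip_addr(uint32_t hartid)
{
    return (uintptr_t)(CLINT_BASE + 4ul * hartid);
}

/* On target, in M-mode, an IPI send and its handler-side clear would be:
 *   *(volatile uint32_t *)msip_addr(target) = 1;  // raise MSIP on target
 *   *(volatile uint32_t *)msip_addr(self)   = 0;  // clear, or it refires
 */
```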
ASID = Address Space Identifier. A tag on every TLB entry identifying which process's virtual address space it belongs to. Stored in SATP[59:44] (16 bits on RV64 Sv39).
Why it exists: Without ASIDs, every context switch requires a full TLB flush. With ASIDs, process A (ASID=5) and process B (ASID=7) entries coexist — no flush needed on context switch, just update SATP with the new ASID.
sfence.vma variants: x0,x0=flush all, t0,x0=flush one VA all ASIDs, x0,t1=flush all entries for one ASID, t0,t1=flush single VA+ASID. G-bit in PTE = Global (visible to all ASIDs — used for kernel mappings shared by all processes).
CLINT — Core Local Interruptor (0x02000000)
Single SoC-level peripheral. Two interrupt types: timer (MTIP) and software/IPI (MSIP). Registers: msip[N] at +4N (4 bytes), mtimecmp[N] at +0x4000+8N (8 bytes), mtime at +0xBFF8 (shared 64-bit). S-mode access blocked by PMP — use SBI ecalls.
PLIC — Platform Level Interrupt Controller (0x0C000000)
Up to 1023 external interrupt sources (UART, disk, NIC…). Per-hart contexts: M-mode (hartid×2), S-mode (hartid×2+1). Priority 1–7 per source. Threshold per context — only priority > threshold delivered. Claim/Complete protocol: read claim → handle → write complete. S-mode has direct PLIC access via PMP Regions 04+05.
Each hart has its own completely independent instruction pipeline. The pipeline IS what makes a hart a hardware thread.
| Resource | Scope |
|---|---|
| PC, register file (x0–x31), all CSRs | Per hart (private) |
| L1-I cache, L1-D cache | Per hart (private) |
| ALU, multiplier, FPU, load-store unit, L2 | Shared within core (SMT) |
| L3/LLC, DRAM, CLINT, PLIC, UART | Shared SoC-wide |
2-hart SMT core uses ~30–40% more silicon than a single-hart core, vs 100% more for two separate cores. Two harts share execution resources while maintaining completely independent execution state.
No — CLINT is shared across the entire SoC. It is a single MMIO peripheral on the system bus, physically outside all cores. It has per-hart register slots (msip[N], mtimecmp[N]) but all those slots live in one device. Any hart on any core can read/write any slot — that is how cross-hart IPIs work: hart 0 on core 0 writes msip[3] to interrupt hart 3 on core 1.
CLINT = Core Local Interruptor. "Core local" because each hart has dedicated register slots, though the hardware itself is SoC-wide.
Sstc = Supervisor-mode Standard Timer Comparison extension. Adds stimecmp CSR so S-mode can directly arm its own timer without an SBI ecall. OpenSBI enables via MENVCFG.STCE if detected at runtime.
| Prefix | Meaning | Examples |
|---|---|---|
| Sx | Supervisor standard | Sstc, Svpbmt, Svadu |
| Smx | Machine standard | Smepmp, Smaia |
| Zx | Standard sub-extension | Zicbom, Zicboz, Zkr |
| Xx | Vendor/non-standard | Xsifivecease |
Yes. OpenSBI is designed for this. All harts start simultaneously, amoswap lottery elects coldboot hart. Coldboot initialises scratch for ALL harts. All call sbi_init() — one coldboot path, others warmboot. All mret to payload with a0=hartid, a1=DTB.
SBI extensions for SMP: HSM (SBI_EXT_HSM — hart_start/stop/suspend), IPI (SBI_EXT_IPI — cross-hart interrupts), RFENCE (cross-hart TLB/I-cache shootdowns). Linux uses HSM to bring up CPUs one by one after the boot CPU initialises the kernel.
Hardware Atomics
AMO (amoswap, amoadd, amoand…): indivisible read-modify-write in one bus transaction. LR/SC: lr.w sets reservation; sc.w succeeds only if reservation not cancelled. .aq=acquire ordering, .rl=release ordering.
Spinlocks (built on amoswap)
Try to swap 1 in; if old=0 → lock acquired; else read until 0 (avoid bus contention) then retry. Release: amoswap 0 with .rl.
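A host analog of that spinlock using C11 atomics: `atomic_exchange` stands in for `amoswap.w.aq`, and a release store for the unlock's amoswap with `.rl`:

```c
#include <stdatomic.h>
#include <assert.h>

typedef struct { atomic_int v; } spinlock_t;

static void spin_lock(spinlock_t *l)
{
    for (;;) {
        /* try to swap 1 in; old value 0 means we acquired the lock */
        if (atomic_exchange_explicit(&l->v, 1, memory_order_acquire) == 0)
            return;
        /* read until 0 (no contended swaps while spinning), then retry */
        while (atomic_load_explicit(&l->v, memory_order_relaxed) != 0)
            ;
    }
}

static void spin_unlock(spinlock_t *l)
{
    atomic_store_explicit(&l->v, 0, memory_order_release);  /* ".rl" analog */
}
```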
Memory Ordering (RVWMO)
Stores visible out-of-order across harts. Producer: fence w,w between data write and flag write. Consumer: fence r,r between flag read and data read. AMO .aq/.rl suffixes embed ordering into the atomic itself.
Scheduling and Context Switch
Per-hart run queues (low contention). Timer interrupt (~1ms) preempts tasks. Empty queue → steal half tasks from busiest hart. Context switch: save all 32 regs + sepc + sstatus + satp → restore → sfence.vma → sret.
volatile — always read/write actual memory; compiler must never cache in a register. Necessary because multiple harts modify this concurrently and the compiler cannot see that from any single hart's perspective.
uint64_t — 64-bit unsigned integer per element. Matches RV64 natural register width.
timer_count[NUM_HARTS] — array of 4. Each element: how many times that hart's timer has fired. Each hart writes only its own slot (timer_count[hartid]++) so no spinlock needed for this array.
IN_DATA = __attribute__((section(".data"))) — forces into .data section (not .bss). Our runtime BSS clear had a GP-relative addressing bug. Forcing to .data means the ELF loader initialises values before _start runs — no runtime clear needed.
= {0,0,0,0} — initial values baked into the ELF binary, copied by the ELF loader before any instruction runs.
No. fence rw,rw controls memory ordering — when a hart's own stores become visible to others. It cannot stop other harts from independently accessing memory.
The LR/SC livelock was a cache line contention problem: primary's loads of hart_ready[] were generating MESI coherency transactions on the same cache line as uart_lock, cancelling secondary's LR reservation. The fence cannot prevent other harts from running their own load instructions independently. Only amoswap fixes it — an indivisible AMO has no reservation gap that can be cancelled.
CSR Format
pmpcfg CSRs: on RV64 only the even-numbered ones exist (pmpcfg0, 2, …, 14), each holding 8 config bytes (one per entry); on RV32, pmpcfg0–15 hold 4 each. Each byte: L(7)=lock, WPRI(6:5)=reserved, A(4:3)=address mode, X(2)=execute, W(1)=write, R(0)=read. pmpaddr0–63: stores PA[55:2] (address right-shifted by 2). Always write pmpaddr before pmpcfg to avoid activating with wrong address.
Address Modes (A field)
| Mode | Value | Range | Use |
|---|---|---|---|
| OFF | 00 | Disabled | Unused entries |
| TOR | 01 | [prev×4, this×4) | Arbitrary ranges, needs 2 entries |
| NA4 | 10 | Exactly 4 bytes | Single register |
| NAPOT | 11 | Power-of-2 aligned | Most common — pmpaddr=(base>>2)|((size/8)-1) |
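The NAPOT formula from the table, as a small helper. Assumes size is a power of two (at least 8 bytes) and base is size-aligned; the function name is ours:

```c
#include <stdint.h>
#include <assert.h>

/* NAPOT encoding: pmpaddr = (base >> 2) | ((size / 8) - 1).
   The trailing ones in pmpaddr encode log2(size). */
static uint64_t napot_pmpaddr(uint64_t base, uint64_t size)
{
    return (base >> 2) | ((size >> 3) - 1);
}
```

Example: a 128 KB region at 0x80000000 encodes as 0x20003FFF.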
Check Algorithm
Walk entries 0→63. First match wins. Match + check R/W/X against access type. M-mode bypasses unless L=1. No match: M-mode passes, S/U-mode = access fault (1=fetch, 5=load, 7=store).
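The first-match walk can be sketched for the S/U-mode case. Entries are modeled as flat [base, base+size) ranges with R/W/X bits; the NAPOT/TOR address decoding and the M-mode/L-bit path are omitted to keep the priority logic visible:

```c
#include <stdint.h>
#include <stdbool.h>
#include <assert.h>

#define PMP_R 1u
#define PMP_W 2u
#define PMP_X 4u

typedef struct { uint64_t base, size; uint8_t rwx; bool valid; } pmp_entry;

/* access uses the fault codes from the text: 1=fetch, 5=load, 7=store.
   Returns true if allowed; false means an access fault for S/U-mode. */
static bool pmp_check_su(const pmp_entry *e, int n, uint64_t pa, int access)
{
    uint8_t need = (access == 1) ? PMP_X : (access == 5) ? PMP_R : PMP_W;
    for (int i = 0; i < n; i++) {             /* entry 0 checked first */
        if (!e[i].valid)
            continue;
        if (pa >= e[i].base && pa < e[i].base + e[i].size)
            return (e[i].rwx & need) != 0;    /* first match wins */
    }
    return false;   /* no match: S/U-mode access fault */
}
```

With the layered entries from the next section (entry 0 = 128 KB DENY inside entry 1 = 256 KB RX), a load into the first 128 KB faults while a fetch in the second 128 KB succeeds.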
Intentional priority layering. Entry 0 (128KB, DENY) sits inside Entry 1's larger range (256KB, RX). First match wins, so Entry 0 blocks the sensitive 128KB while Entry 1 allows the remaining 128KB.
```
0x80000000–0x8001FFFF → Entry 0 matches first → DENY (firmware R/W data protected)
0x80020000–0x8003FFFF → Entry 0 no match → Entry 1 → RX (firmware code readable)
```
Standard PMP technique for "allow X, except deny this smaller sub-region."
Region = single hardware PMP entry: one address range + one permission set. The CPU knows only about flat PMP entries — checked in order on every memory access.
Domain = OpenSBI software concept: a named security partition owning a set of harts, a collection of memory regions, a next-boot address/mode, and allowed SBI extensions. OpenSBI programs PMP entries to enforce the domain's regions.
Domain (OpenSBI policy) → Region (software struct) → PMP Entry (hardware CSR)
Simple systems have one domain (root, all harts). Multi-tenant systems use multiple domains on different harts with different PMP rules enforcing hardware isolation between workloads.
They are two completely separate scratch spaces at different addresses.
| OpenSBI (M-mode) | Your payload (S-mode) | |
|---|---|---|
| Location | 0x80040000–0x8005FFFF | 0x80200Cxx–0x80201Cxx |
| PMP | DENY to S-mode | S-mode accessible |
| Scratch CSR | MSCRATCH (M-mode) | SSCRATCH (S-mode) |
| Stack pointer | OpenSBI's sp | Your payload's sp |
| Used by | sbi_init, trap handlers | primary_main, secondary_main |
Both exist simultaneously. When an ecall switches S→M, hardware saves S-mode sp and switches to M-mode's stack. They are two independent execution contexts on the same physical CPU.
Idempotent = doing something twice gives the same result as once. For memory reads: reading the same address twice returns the same value and changes nothing in the hardware.
Normal RAM is always idempotent. A UART receive buffer is NOT — reading consumes the byte (first read = 'H', second read = empty). The I flag tells the memory subsystem: reads here are safe to speculate, prefetch, or cache-check without triggering accidental side effects.
Regions NOT marked I must be accessed with strict ordering — no speculation or reordering allowed. On QEMU, Region02 (syscon 0x00100000) is idempotent — reads return a fixed value, writes trigger poweroff (0x5555) or reboot (0x7777).
| Region | Address | S/U Access | Meaning |
|---|---|---|---|
| Region00 | 0x80040000–0x8005FFFF | DENY | OpenSBI data/scratch/stacks — protected from S-mode |
| Region01 | 0x80000000–0x8003FFFF | R (via R06) | OpenSBI code — S-mode can read but not write |
| Region02 | 0x00100000–0x00100FFF | R W | Syscon test device — I=idempotent, writes trigger poweroff/reboot |
| Region03 | 0x02000000–0x0200FFFF | DENY | CLINT — timer and IPI hardware, M-mode only via SBI |
| Region04 | 0x0C400000–0x0C5FFFFF | R W | PLIC S-mode context registers — direct access for payload |
| Region05 | 0x0C000000–0x0C3FFFFF | R W | PLIC source/priority/enable registers — direct access |
| Region06 | 0x0–0xFFFF... | R W X | Catch-all — grants S/U access to everything not covered above |
Organisation
Address split: TAG | INDEX | OFFSET. 32KB 4-way: offset=6b(64-byte lines), index=7b(128 sets), tag=43b(for 56-bit PA). N-way set associative: index→set→N parallel tag comparisons. Direct-mapped=1 way (conflict misses). Fully assoc=1 set (no conflicts, expensive).
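The field widths fall out of the geometry. A small helper deriving the TAG | INDEX | OFFSET split (names are ours):

```c
#include <stdint.h>
#include <assert.h>

static unsigned log2u(uint64_t x)
{
    unsigned n = 0;
    while (x >>= 1)
        n++;
    return n;
}

typedef struct { unsigned offset_bits, index_bits, tag_bits; } cache_geom;

/* sets = size / (ways * line); index selects the set, offset the byte. */
static cache_geom cache_split(uint64_t size_bytes, unsigned ways,
                              unsigned line_bytes, unsigned pa_bits)
{
    cache_geom g;
    g.offset_bits = log2u(line_bytes);                        /* 64 B  → 6 */
    g.index_bits  = log2u(size_bytes / (ways * line_bytes));  /* 128 sets → 7 */
    g.tag_bits    = pa_bits - g.index_bits - g.offset_bits;   /* 56-7-6 = 43 */
    return g;
}
```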
Hierarchy
L1-I (per hart, fetch only, not coherent with D) + L1-D (per hart, load/store, MESI coherent) + L2 (shared per core) + L3 (shared all cores).
Write Policies
Write-back (default): update cache only, writeback on eviction. Write-through: update both cache and memory on every write. Write-allocate: on write miss, fetch line then update (pairs with write-back). LRU/PLRU/Random replacement when evicting.
RISC-V Mechanisms
fence.i: flush local I-cache. fence rw,rw: memory ordering. sfence.vma: TLB flush. cbo.clean (Zicbom): writeback dirty for DMA. cbo.inval: invalidate for DMA. cbo.zero (Zicboz): zero a cache line without fetching (~8× faster than sd loop). All gated by MENVCFG from M-mode.
Way = one complete cache slot: TAG + valid bit + dirty bit + 64 bytes of data. Like one parking space.
Set = a group of ways sharing the same index address. When the CPU accesses memory, index bits select one set and ALL ways in that set are checked simultaneously (parallel tag comparison). Like a row of parking spaces at the same address.
```
          Way 0        Way 1        Way 2        Way 3
Set 0   [ tag|data ][ tag|data ][ tag|data ][ tag|data ]
Set 1   [ tag|data ][ tag|data ][ tag|data ][ tag|data ]
...
Set 127 [ tag|data ][ tag|data ][ tag|data ][ tag|data ]
```
On access: index bits → select row (set) → compare all N tags → HIT returns data / MISS fetches from L2 into one way of that set (chosen by replacement policy).
It is a hardware design decision, not derived. The CPU designer chooses PA width based on: maximum RAM the chip needs to address, pin count, power budget. RISC-V specifies up to 56-bit PA for Sv39/Sv48/Sv57 paging modes. Real chips implement fewer: SiFive U74=56, StarFive JH7110=40, embedded=32.
Discover at runtime by writing all-ones to SATP.PPN and reading back how many bits stuck. In cache calculations: PA=56 is given as a specification → tag = 56 − index_bits − offset_bits. The tag must be wide enough that no two different PAs map to identical tag+index.
1024×1024 float matrix, row=4096 bytes, 32KB direct-mapped cache, line=64B, 512 sets. Index bits=PA[14:6]. Addresses 32KB apart map to the same set.
The thrash: Reading A[row][col] is sequential (fine). Writing B[col][row] jumps one row (4096B) per column. If base_A and base_B are 32KB apart: A[row][col] and B[col][row] map to the same cache set. Load A evicts B, load B evicts A — every access misses → 300× slower.
Fixes: Higher associativity (4-way: 4 addresses coexist in one set — eliminates most conflicts). Cache tiling: work on 32×32 submatrices fitting in cache so A and B tiles coexist without eviction.
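The tiling fix can be sketched directly. A minimal tiled transpose for the 1024×1024 float case above; TILE=32 is one reasonable choice (a 32×32 float tile is 4 KB, so the A and B tiles coexist in a 32 KB cache):

```c
#include <stdlib.h>
#include <stddef.h>
#include <assert.h>

#define N    1024
#define TILE 32   /* 32x32 floats = 4 KB per tile */

/* Cache-tiled transpose B = A^T: iterate TILE x TILE blocks so the source
   tile and destination tile stay resident while they are being touched,
   instead of evicting each other on every access. */
static void transpose_tiled(const float *A, float *B)
{
    for (size_t ii = 0; ii < N; ii += TILE)
        for (size_t jj = 0; jj < N; jj += TILE)
            for (size_t i = ii; i < ii + TILE; i++)
                for (size_t j = jj; j < jj + TILE; j++)
                    B[j * N + i] = A[i * N + j];
}
```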
L1-I and L1-D are physically separate SRAM arrays. Writing new code through D-cache does not propagate to I-cache automatically.
```
sd t0, 0x80001000     → new instruction in D-cache
                        I-cache at same address: STILL OLD CODE
jalr ra, 0x80001000   → I-cache HIT: executes OLD code → WRONG
```
With fence.i: (1) flush dirty D-cache lines to L2, (2) invalidate all I-cache lines, (3) flush pipeline. Next fetch goes to L2 and retrieves new code.
Cross-hart: fence.i only affects the LOCAL hart. For other harts: fence rw,rw to make D-write visible + send IPI via msip + target hart executes fence.i. OpenSBI provides this via SBI_EXT_RFENCE.
Yes — but implementation-defined. The spec guarantees the outcome (subsequent fetches see prior stores), not the method.
- Most common (SiFive U74): Full I-cache flush — all valid bits → 0. Cost ~15–30 cycles.
- Selective: Some cores invalidate only matching lines (faster, more hardware).
- No I-cache (embedded): Just a pipeline flush.
Write-back order matters: dirty D-cache lines must flush to L2 before I-cache invalidation — otherwise the I-cache miss would fetch stale data from L2. Correct order: flush dirty D → invalidate I → flush pipeline → resume fetch.
On QEMU: fence.i calls tb_flush() — discards all JIT translation blocks, forces re-translation from fresh memory.
T0 Power-on. All 4 harts: PC=0x1000 (MROM). _boot_lottery=0.
T1 All harts race to _start. amoswap on _boot_lottery.
Bus serialises 4 simultaneous requests. One hart reads
back 0 (winner). Memory: _boot_lottery → 1.
Other 3 harts fall through to warmboot spin loop.
T2 Boot hart writes: GOT relocation, BSS zeros,
ALL FOUR scratch structs (fw_start, fw_size,
next_addr=0x80200000, next_mode=S, platform ptr).
All writes dirty in boot hart's L1-D.
T3 fence rw,rw → drains store buffer → all writes
globally visible via MESI. _boot_status=1 →
MESI invalidates Shared line in warmboot harts →
they miss on next load → read 1 → exit spin.
T4 All harts in _start_warm. Each computes own tp
(no conflict — different addresses). All csrw MSCRATCH=tp.
T5 sbi_init() C-level atomic → coldboot/warmboot split.
Coldboot: full domain, HSM, ecall, banner init.
Warmboot: wait on coldboot_done flag.
T6 wake_coldboot_harts() → fence → coldboot_done=1.
Warmboot harts exit → each does per-hart init.
T7 All harts: sbi_hart_switch_mode() →
MEPC=0x80200000, MSTATUS.MPP=S, mret.
T8 All 4 harts at payload _start, S-mode.
a0=hartid, a1=DTB. Your code runs.
Yes. Project 3 had no -smp flag → QEMU default = 1 CPU, 1 hart. OpenSBI banner: Platform HART Count: 1. Only hart 0 ran. Warmboot path never taken. Payload entry.S had no multi-hart handling — one stack, one call to payload_main().
Project 4 added -smp cpus=4,cores=2,threads=2 → 4 harts → all four arrive at _start simultaneously with non-deterministic boot hart.
Layer 1 — AMO Instructions
amoswap/amoadd/amoand: indivisible read-modify-write in one bus transaction. LR/SC: reservation-based; cancelled if another hart touches the same cache line between LR and SC.
Layer 2 — Spinlocks
amoswap 1 in → old=0 means acquired, old=1 means wait. Read until 0 (avoids bus-locking contention while spinning), then retry. Release: amoswap 0 with .rl ordering. OpenSBI uses spinlocks for heap (sbi_malloc) and IPI queues.
Layer 3 — IPI Protocol
Hart A: writes message to shared memory + fence rw,rw + writes msip[B]=1 to CLINT. CLINT raises MSIP on hart B. Hart B's trap handler processes message + writes msip[B]=0 to clear.
Layer 4 — Scheduling + Work Stealing
Per-hart run queues (minimal cross-hart contention). Timer interrupt (~1ms) preempts. Empty queue → steal half of tasks from busiest hart. Context switch: save 32 regs + sepc + sstatus + satp → restore incoming task → sfence.vma → sret.
| Design Choice | Why |
|---|---|
| Atomic lottery for primary | OpenSBI boot hart is non-deterministic. Lottery ensures any hart can win and the code works correctly on every run. |
| HSM to start secondaries | Secondaries that lost lottery are in wfi with no stack. HSM gives them a clean S-mode entry with correct registers at _secondary_entry. |
| amoswap for UART lock | LR/SC reservations are cancelled by adjacent cache line accesses. amoswap is indivisible — cannot be interrupted between read and write. |
| uart_lock in padded struct | Ensures uart_lock occupies its own 64-byte cache line so hart_ready[] accesses cannot interfere with the spinlock. |
| All shared vars in .data | Runtime BSS clear had a GP-relative bug. .data means ELF loader initialises correctly before _start runs. |
| wfi in wait loops | Tight spin on hart_ready[] shared uart_lock's cache line, cancelling LR/SC reservations. wfi yields CPU and eliminates bus traffic. |
| primary_hartid variable | Primary can be any hart. Secondaries need to know which hart to IPI to wake it from wfi — cannot hardcode hart 0. |
Imagine a shared whiteboard in an office. Person A (secondary) wants to write on it and places a sticky note: "A is about to write — do not touch." That sticky note is the LR reservation.
Person B (primary) is not writing — they just keep walking past the same whiteboard every second to check a timetable pinned next to A's sticky note. Every time B touches that corner of the board, A's sticky note falls off. A puts a new one down. B walks past again. Falls off again. A can never write. Forever.
In hardware: uart_lock (at 0x80200b10) and hart_ready[] (at 0x80200b18) were 8 bytes apart — same 64-byte cache line. Every lw from primary loading hart_ready[] generated a MESI coherency transaction on that cache line, cancelling secondary's LR reservation. sc.w failed every retry.
The raw sb (direct store) still worked because it needs no reservation. The amoswap fix works because it does both read and write in one indivisible bus transaction — no gap where another hart can interfere.
Root cause: __global_pointer$ in linker.ld enabled GP-relative addressing. GCC emits lw a5, offset_from_GP(gp) for global variable access. The assembler, seeing that uart_lock happened to be at the same address as _bss_end (both 0x80200b60), substituted uart_lock's GP-relative address as the loop bound. The BSS clear started at hart_ready (_bss_start) and stopped at uart_lock (_bss_end) — but uart_lock was in that range and was NEVER zeroed. It contained garbage → uart_acquire() spinlock hung forever on first call.
Fix: Use callee-saved registers s2/s3 with explicit la for BSS loop bounds (assembler cannot substitute GP-relative). Force all shared vars to .data — ELF loader initialises them; no runtime BSS clear needed.
Root cause: uart_lock and hart_ready[] shared the same 64-byte cache line (uart_lock at 0x80200b10, hart_ready at 0x80200b18 — only 8 bytes apart). Primary's tight spin on hart_ready[] generated continuous MESI coherency transactions on that cache line, cancelling secondary's LR reservation on uart_lock before sc.w could complete. Livelock — secondary could never acquire the UART lock.
sb to UART printed 'A' (no reservation needed). hprint() hung (needs LR/SC lock). The diagnostic showed exactly which layer failed. Fix: (1) Wrap uart_lock in a struct with 60-byte padding, aligned(64) → own cache line. (2) Replace LR/SC with amoswap.w.aq (indivisible). (3) Replace the tight spin with wfi in wait_all_harts_ready.
Root cause: Code had beqz s0, _primary — only hart 0 became primary. When OpenSBI's boot hart was 1/2/3, that hart took the coldboot path and arrived at _start. Hart 0 was a warmboot hart, held in OpenSBI's sbi_hsm_hart_wait() waiting for an HSM hart_start call. Nobody called it — hart 0 never reached _start at all. The other harts' boot diagnostics (digits at _start) also did not print, confirming they never arrived.
Fix: Atomic lottery — amoswap.w.aq on _boot_lottery. Whichever hart arrives first becomes primary regardless of hartid. Primary then HSM-starts all other harts at _secondary_entry. This is how Linux boots secondary CPUs.
Root cause: GCC uses the GP (global pointer) register for GP-relative addressing of global variables. When OpenSBI delivered secondary harts to _secondary_entry via HSM, GP contained whatever OpenSBI left in it — pointing into firmware memory, not payload .data. The first global variable access (amoswap on &uart_lock in uart_acquire) used the garbage GP address → store access fault.
Fix: Add GP initialisation in _secondary_entry before any C call:
```
.option push
.option norelax        ← critical: prevents assembler from optimising away la gp
la gp, __global_pointer$
.option pop
```
Must also add the same in _primary. The same bug existed in the previous (pre-lottery) version but was hidden because only hart 0 ran primary and hart 0 happened to have a usable GP from OpenSBI.
Root cause: secondary_main had hardcoded send_ipi_to_hart(0). When primary was hart 1/2/3, all three secondaries IPId hart 0 (a secondary) instead of primary. Primary stayed stuck in wait_all_harts_ready() forever because nobody IPId it. Never woke up, never enabled its timer, never reached the event loop. Timer count for primary stayed at 0. Demo never completed.
Fix: Add volatile uint64_t primary_hartid shared variable. Primary writes its own hartid with a fence rw,rw before HSM-starting secondaries (guarantees secondaries can read it safely). Secondaries call send_ipi_to_hart(primary_hartid).
```
[Primary] Winning hart: N                        ← N = whichever hart won the amoswap lottery (any 0–3)
[Primary] SBI spec version: 3.0                  ← confirmed via SBI_EXT_BASE ecall
[Primary] Starting all other harts via HSM...
[Primary] HSM start hart X → OK                  ← sbi_hsm_hart_start() returned error=0 for each
[Hart X] Secondary started.                      ← secondary reached secondary_main, acquired UART lock
[Hart N] IPI received #1                         ← secondary sent IPI to wake primary from wfi
[Primary] All 4 harts running.                   ← all hart_ready[] flags set, wait loop exited
[Hart N] Starting timer...                       ← primary armed its timer via sbi_set_timer()
[Hart N] Timer IRQ #1                            ← trap_handler fired for STIP (scause=0x8000...0005)
[Hart N] mtime = 0x...                           ← current CLINT time counter value
[Hart N] shared_ctr = 1                          ← atomic amoadd.d.aqrl incremented shared_counter
... (12 timer IRQs total, 3 per hart, staggered by N+1 intervals)
[Primary] Sending IPI to all secondary harts...  ← primary sent SSIP to harts 1,2,3
[Hart X] IPI received #1                         ← SSIP delivered (scause=0x8000...0001), csrc sip cleared
[Hart X] Secondary done. Halting.                ← while() loop exited, hart enters infinite wfi
shared_counter = 12 (expected 12)                ← 4 harts × 3 fires = 12, atomic counter correct
Demo complete. System halted.
```