OS and PCIe Driver Basics
Linux's first touch on SIDRA hardware — driver architecture.
Prerequisites
What you'll learn here
- Understand the Linux kernel-module vs user-space split
- State the PCIe enumeration process and the BAR (Base Address Register) concept
- Connect character device files (/dev/sidra0) with the software API
- Explain the interrupt vs polling design choice
- Identify the 5 layers of the SIDRA software stack (firmware → driver → SDK → compiler → app)
Hook: Hardware Without Software Is Inert
Module 5 finished SIDRA hardware. But a bare chip is useless. The software stack makes it usable:
[Application] → user code (PyTorch, app)
[SDK / API] → SIDRA-specific library
[Compiler] → PyTorch model → SIDRA assembly
[Driver] → Linux kernel module (PCIe communication)
[Firmware] → SIDRA on-chip RISC-V
[Hardware] → YILDIRIM chip

This module covers each layer across 10 chapters. This chapter is the bottom: driver + OS basics.
Intuition: Linux + PCIe + Device File
Linux hardware model:
User-space (apps) can’t touch hardware directly. A kernel interface (system call) is needed.
For PCIe devices:
- BIOS performs PCIe enumeration at boot (vendor + device ID).
- Linux kernel matches the device with a driver.
- Driver exposes the hardware to user-space as a /dev/sidra0 device file.
- App interacts via read()/write()/ioctl().
SIDRA driver responsibilities:
- Initialize the chip (boot calibration).
- Set up memory maps (BARs).
- Allocate DMA buffers.
- Handle interrupts (inference complete).
- Provide IOCTLs for software control.
Formalism: PCIe Driver Details
PCIe BAR (Base Address Register):
The chip declares its memory regions. Linux assigns physical addresses.
Y1 BAR layout:
- BAR0: 16 MB MMIO control registers.
- BAR1: 256 MB DMA buffer area.
- BAR2: 16 GB L3 SRAM mapping.
User-space code can mmap() the BARs and read/write directly.
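A user-space sketch of this zero-copy path, assuming the driver implements mmap() on /dev/sidra0; the 0x10 status-register offset is made up for illustration.

```c
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

#define BAR0_SIZE (16u * 1024 * 1024)   /* 16 MB MMIO control region (BAR0) */

/* Byte offset in BAR0 -> index into a uint32_t register array. */
static size_t reg_index(uint32_t byte_offset) { return byte_offset / 4; }

/* Map BAR0 into this process; returns NULL on failure.
 * Usage sketch:
 *   int fd = open("/dev/sidra0", O_RDWR);
 *   volatile uint32_t *regs = map_bar0(fd);
 *   uint32_t status = regs[reg_index(0x10)];   // zero-copy register read
 */
static volatile uint32_t *map_bar0(int fd) {
    void *p = mmap(NULL, BAR0_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    return p == MAP_FAILED ? NULL : (volatile uint32_t *)p;
}
```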
Kernel module skeleton (Linux):
static void __iomem *bar0;

static int sidra_probe(struct pci_dev *dev, const struct pci_device_id *id)
{
    int err = pci_enable_device(dev);
    if (err)
        return err;
    pci_request_regions(dev, "sidra");

    /* Map BAR0 (control registers) into kernel virtual address space. */
    bar0 = ioremap(pci_resource_start(dev, 0), pci_resource_len(dev, 0));

    err = request_irq(dev->irq, sidra_interrupt_handler, IRQF_SHARED, "sidra", dev);
    if (err)
        return err;

    cdev_add(...);  /* registers the character device behind /dev/sidra0 */
    return 0;
}

static struct pci_driver sidra_driver = {
    .name     = "sidra",
    .id_table = sidra_ids,
    .probe    = sidra_probe,
    .remove   = sidra_remove,
};

Module init: module_init(sidra_driver_init). Install with insmod sidra.ko.
Device file:
/dev/sidra0 for user-space access. Open:
fd = open("/dev/sidra0", O_RDWR);
Inference flow (driver perspective):
- App makes an SDK call → driver IOCTL.
- Driver copies input data to the DMA buffer.
- Driver writes a command to the chip (BAR0 register).
- Chip starts inference (hardware).
- Inference completes → chip raises an interrupt.
- Driver handles the interrupt, reads output from DMA buffer.
- Returns to the app.
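Steps 2-3 amount to filling a command descriptor and writing it to the chip. A sketch with an invented field layout; the real Y1 command format is not specified here:

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical command descriptor the driver writes to BAR0 (step 3). */
struct sidra_cmd {
    uint32_t opcode;      /* what to do */
    uint32_t dma_offset;  /* where the input sits inside the DMA buffer */
    uint32_t length;      /* input size in bytes */
    uint32_t flags;       /* e.g. bit 0 = raise an interrupt on completion */
};

#define SIDRA_OP_INFER 0x1
#define SIDRA_F_IRQ    (1u << 0)

/* Build the descriptor for one inference (steps 2-3 of the flow). */
static struct sidra_cmd make_infer_cmd(uint32_t dma_off, uint32_t len, int use_irq) {
    struct sidra_cmd c;
    memset(&c, 0, sizeof c);
    c.opcode     = SIDRA_OP_INFER;
    c.dma_offset = dma_off;
    c.length     = len;
    c.flags      = use_irq ? SIDRA_F_IRQ : 0;
    return c;
}
```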
Typical latency:
- IOCTL overhead: ~5 µs.
- DMA setup: ~1 µs.
- Chip inference: ~5 µs.
- Interrupt handle: ~10 µs.
- Total: ~21 µs/inference.
Polling alternative:
Driver polls a status register instead of waiting for interrupts. Lower latency (no interrupt round-trip), but the CPU spins while it waits.
SIDRA Y1: polling for small inferences, interrupts for big batches. Hybrid.
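A minimal polling sketch; the position of the "done" bit in the status register is an assumption:

```c
#include <stdint.h>

#define SIDRA_STATUS_DONE (1u << 0)   /* hypothetical "inference done" bit */

/* Has the chip signalled completion? Pure bit test on the status word. */
static int sidra_done(uint32_t status) {
    return (status & SIDRA_STATUS_DONE) != 0;
}

/* Spin on the mapped status register until done or the iteration budget
 * runs out. Returns 0 on completion, -1 on timeout (caller could then
 * fall back to the interrupt path — the Y1 hybrid). */
static int sidra_poll(volatile const uint32_t *status_reg, long max_iters) {
    while (max_iters-- > 0) {
        if (sidra_done(*status_reg))
            return 0;
    }
    return -1;
}
```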
DMA mapping:
Linux dma_alloc_coherent() allocates a buffer that’s coherent for both device and CPU. No cache flush issues.
Y1 DMA ring buffer: 4 MB (enough for ~1000 inferences).
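The ring arithmetic, assuming ~4 KB per inference slot (4 MB / 4 KB = 1024 slots, matching the "~1000 inferences" figure; the slot size is an assumption):

```c
#include <stdint.h>

#define RING_BYTES (4u * 1024 * 1024)        /* 4 MB DMA ring (from the text) */
#define SLOT_BYTES 4096u                     /* assumed per-inference slot    */
#define RING_SLOTS (RING_BYTES / SLOT_BYTES) /* = 1024 slots                  */

/* Byte offset of the slot used for the Nth inference; the ring wraps. */
static uint32_t slot_offset(uint64_t seq) {
    return (uint32_t)((seq % RING_SLOTS) * SLOT_BYTES);
}
```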
Kernel ↔ user-space communication:
3 ways:
- read/write: classic file syscalls. Slow (copy).
- mmap: BAR mapped into user-space. Zero-copy.
- ioctl: control commands (e.g. “load model”, “start inference”).
The SIDRA SDK uses all three together.
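A sketch of the shared ioctl header such a driver might expose. SIDRA_IOCTL_VERSION matches the hello-world example later in this chapter; the other two commands and their argument structs are hypothetical:

```c
#include <sys/ioctl.h>

/* Argument structs (hypothetical layouts, shared by driver and SDK). */
struct sidra_model_desc { unsigned long addr, size; };     /* "load model"  */
struct sidra_infer_args { unsigned long in_off, out_off, len; };

/* Command numbers: _IOR = driver writes to user, _IOW = user writes to
 * driver, _IOWR = both directions. */
#define SIDRA_IOCTL_VERSION    _IOR('s', 1, int)
#define SIDRA_IOCTL_LOAD_MODEL _IOW('s', 2, struct sidra_model_desc)
#define SIDRA_IOCTL_INFER      _IOWR('s', 3, struct sidra_infer_args)
```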
Multi-process:
Several apps may want SIDRA. Driver arbitrates.
Methods:
- Time-sharing (apps take turns).
- Hardware partitioning (clusters split among apps).
- Virtual SIDRA (each app sees its own “small” SIDRA, hardware shared).
Y1 uses time-sharing. Y10 will add virtualization.
ABI stability:
Driver API must be stable (old apps keep working with new drivers). SIDRA driver versioning: SIDRA_API_V1.0; a later V2.0 adds features but keeps V1 supported.
Security:
User-space shouldn’t directly send chip commands (malicious risk). Driver enforces:
- Linux capabilities (CAP_SYS_RAWIO).
- /dev/sidra0 permissions (root or sidra group).
Inference needs basic perms; programming (model load) needs sudo.
Error handling:
Chip overheats, ECC failure, etc. → driver raises an event. App receives a callback.
sysfs exposes status: /sys/class/sidra/sidra0/temperature, /error_count.
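Reading those sysfs files is plain file I/O. A sketch that assumes the temperature file reports millidegrees Celsius, as Linux hwmon drivers conventionally do (an assumption about the SIDRA format):

```c
#include <stdio.h>
#include <stdlib.h>

/* Parse a sysfs temperature string (assumed millidegrees C) into degrees. */
static double parse_temp_mC(const char *s) {
    return strtol(s, NULL, 10) / 1000.0;
}

/* Read the chip temperature; returns a negative value on error.
 * Usage: double t = read_sidra_temp(); */
static double read_sidra_temp(void) {
    FILE *f = fopen("/sys/class/sidra/sidra0/temperature", "r");
    char buf[32];
    if (!f)
        return -1.0;
    if (!fgets(buf, sizeof buf, f)) { fclose(f); return -1.0; }
    fclose(f);
    return parse_temp_mC(buf);
}
```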
Module 6 roadmap:
- 6.1 (this): PCIe driver basics.
- 6.2: Linux kernel and aether-driver.
- 6.3: SIDRA on-chip RISC-V firmware.
- 6.4: ISPP algorithm.
- 6.5: SDK layers.
- 6.6: PyTorch backend.
- 6.7: Compiler.
- 6.8: Digital twin / simulator.
- 6.9: Test, calibration, verification.
- 6.10: End-to-end production stack lab.
Experiment: SIDRA Driver Hello World
Scenario: open “/dev/sidra0”, read the hardware version.
#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>
#define SIDRA_IOCTL_VERSION _IOR('s', 1, int)
int main() {
int fd = open("/dev/sidra0", O_RDWR);
if (fd < 0) {
perror("open");
return 1;
}
int version;
if (ioctl(fd, SIDRA_IOCTL_VERSION, &version) < 0) {
perror("ioctl");
close(fd);
return 1;
}
printf("SIDRA hardware version: 0x%08x\n", version);
close(fd);
return 0;
}
Compile: gcc -o sidra_hello sidra_hello.c.
Run: ./sidra_hello.
Output: SIDRA hardware version: 0x00010000 (Y1 v1.0).
The driver catches this IOCTL, reads the version from BAR0, returns to user-space. Latency ~10 µs.
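The 0x00010000 value reads as v1.0 if the register packs major.minor into two 16-bit halves; that packing is an assumption consistent with the output above:

```c
#include <stdint.h>

/* Decode a version word assumed to be (major << 16) | minor. */
static unsigned version_major(uint32_t v) { return v >> 16; }
static unsigned version_minor(uint32_t v) { return v & 0xffff; }
```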
SDK alternative (higher level):
import sidra
chip = sidra.Chip(0)
print(f"Version: {chip.version}, Memristors: {chip.memristor_count}")
Same thing, cleaner API.
Lab Exercise
Plan a SIDRA Y1 inference benchmark.
Scenario:
- Y1 PCIe 5.0 × 4 = 16 GB/s.
- Inference time (hardware): 5 µs/MNIST.
- Driver overhead: 20 µs.
- Total: 25 µs/inference.
Questions:
(a) How many inferences/second? (b) For batch 32, time + bandwidth? (c) Will PCIe be a bottleneck? (d) Optimal batch size?
Solutions
(a) 1/25 µs = 40K inferences/second.
(b) Batch 32 input: 32 × 28×28 = 25 KB. Transfer: 25 KB / 16 GB/s = 1.6 µs. Inference parallel batch 32 → ~100 µs. Total: 100 + 20 = 120 µs/batch = 267K inferences/s (6.7× speedup with batching).
(c) At 40K single inferences/s with ~784 B of input each, PCIe traffic is only ~31 MB/s; even at the batched ~267K/s rate it is ~210 MB/s, far below 16 GB/s. PCIe has ample headroom; the bottleneck is the driver overhead and the inference itself.
(d) The optimum batch balances hardware utilization against bandwidth and latency. For Y1, 16-32 is ideal. Above 64 the driver overhead is fully amortized, but inference itself slows down (parallel capacity limit).
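The arithmetic from (a) and (b), parameterized. Assumes the 20 µs driver overhead is paid once per call and takes the batch inference time as an input:

```c
/* Throughput in inferences/second for a batch of the given size,
 * given the hardware inference time for the whole batch and the
 * per-call driver overhead (both in microseconds). */
static long inferences_per_sec(int batch, double batch_infer_us, double driver_us) {
    double total_us = batch_infer_us + driver_us;   /* one call per batch */
    return (long)(batch * 1e6 / total_us);
}
```

With the chapter's numbers: batch 1 at 5 µs hardware + 20 µs driver gives 40K/s; batch 32 at ~100 µs + 20 µs gives ~267K/s.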
Cheat Sheet
- Software stack 5 layers: firmware, driver, SDK, compiler, app.
- Linux driver: kernel module, PCIe enumerate, BAR map, /dev/sidra0.
- User-space access: open/read/write/ioctl, mmap.
- Inference latency: hardware 5 µs + driver 20 µs = 25 µs.
- DMA buffer: 4 MB ring, autonomous device transfer.
- Polling vs interrupt: Y1 hybrid.
Vision: SIDRA Beyond Linux?
- Y1: Linux + Windows driver.
- Y3: Android + iOS support (mobile).
- Y10: RTOS support (embedded, automotive).
- Y100: SIDRA-native OS (from Module 5.1’s vision).
- Y1000: Bio-compatible OS (brain-implant interface).
For Türkiye: upstreaming SIDRA to the Linux kernel = international contribution. Turkish developers get visibility for SIDRA via mainline kernel patches.
Further Reading
- Next chapter: 6.2 — Linux Kernel and aether-driver
- Previous module: 5.15 — Thermal and Packaging Deep Dive
- Linux device drivers: Corbet, Rubini, Linux Device Drivers, 3rd ed., O’Reilly.
- PCIe spec: PCI-SIG.
- Kernel hacking: kernel.org documentation.