💻 Module 6 · Software Stack · Chapter 6.1 · 11 min read

OS and PCIe Driver Basics

Linux's first touch on SIDRA hardware — driver architecture.

What you'll learn here

  • Understand the Linux kernel-module vs user-space split
  • State the PCIe enumeration process and the BAR (Base Address Register) concept
  • Connect character device files (/dev/sidra0) with the software API
  • Explain the interrupt vs polling design choice
  • Identify the 5 layers of the SIDRA software stack (firmware → driver → compiler → SDK → app)

Hook: Hardware Without Software Is Inert

Module 5 finished SIDRA hardware. But a bare chip is useless. The software stack makes it usable:

[Application]      → user code (PyTorch, app)
[SDK / API]        → SIDRA-specific library
[Compiler]         → PyTorch model → SIDRA assembly
[Driver]           → Linux kernel module (PCIe communication)
[Firmware]         → SIDRA on-chip RISC-V
[Hardware]         → YILDIRIM chip

This module covers each layer across 10 chapters. This chapter is the bottom: driver + OS basics.

Intuition: Linux + PCIe + Device File

Linux hardware model:

User-space (apps) can’t touch hardware directly. A kernel interface (system call) is needed.

For PCIe devices:

  1. Firmware (BIOS/UEFI) enumerates PCIe devices at boot, reading each one's vendor and device ID.
  2. Linux kernel matches the device with a driver.
  3. Driver exposes the hardware to user-space as a /dev/sidra0 device file.
  4. App interacts via read()/write()/ioctl().
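Step 2, the driver match, works off a table of (vendor, device) ID pairs, just like the kernel's pci_device_id tables. A minimal Python sketch of that matching logic; the SIDRA IDs below are hypothetical stand-ins:

```python
# Sketch of PCIe driver matching: the kernel compares each enumerated
# device's (vendor_id, device_id) pair against every driver's id_table.

SIDRA_ID_TABLE = [(0x1B2C, 0x0001)]  # hypothetical SIDRA vendor/device IDs

def match_driver(enumerated_devices, id_table):
    """Return the devices a driver would bind to (its probe() candidates)."""
    return [dev for dev in enumerated_devices
            if (dev["vendor"], dev["device"]) in id_table]

# Devices found during boot-time enumeration:
bus = [
    {"addr": "0000:01:00.0", "vendor": 0x1B2C, "device": 0x0001},  # SIDRA Y1
    {"addr": "0000:02:00.0", "vendor": 0x8086, "device": 0x15F3},  # some NIC
]

matched = match_driver(bus, SIDRA_ID_TABLE)
print(matched[0]["addr"])  # the device /dev/sidra0 will represent
```

The real kernel does this in C, of course; the point is that binding is a pure table lookup, which is why adding hardware support often starts with one new id_table entry.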

SIDRA driver responsibilities:

  • Initialize the chip (boot calibration).
  • Set up memory maps (BARs).
  • Allocate DMA buffers.
  • Handle interrupts (inference complete).
  • Provide IOCTLs for software control.

Formalism: PCIe Driver Details

L1 · Basics

PCIe BAR (Base Address Register):

The chip declares its memory regions. Linux assigns physical addresses.

Y1 BAR layout:

  • BAR0: 16 MB MMIO control registers.
  • BAR1: 256 MB DMA buffer area.
  • BAR2: 16 GB L3 SRAM mapping.

User-space code can mmap() the BARs and read/write directly.
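On Linux the BARs also surface as sysfs resource files (e.g. a resource0 file under the device's /sys/bus/pci/devices entry) that mmap() can map. The sketch below uses an ordinary temporary file as a stand-in for BAR0 so it runs without hardware; the register offset is hypothetical:

```python
import mmap, os, struct, tempfile

# Stand-in for BAR0: a 4 KB file instead of the real PCIe resource file.
# With hardware you would open the sysfs resource0 file O_RDWR and mmap it
# the same way.
path = os.path.join(tempfile.mkdtemp(), "resource0")
with open(path, "wb") as f:
    f.write(b"\x00" * 4096)              # pretend BAR0 page

fd = os.open(path, os.O_RDWR)
regs = mmap.mmap(fd, 4096)               # zero-copy view of the "registers"

VERSION_REG = 0x0                        # hypothetical register offset
regs[VERSION_REG:VERSION_REG + 4] = struct.pack("<I", 0x00010000)

(version,) = struct.unpack_from("<I", regs, VERSION_REG)
print(hex(version))                      # 0x10000

regs.close(); os.close(fd)
```

Once mapped, register access is plain memory access: no syscall per read/write, which is exactly the zero-copy benefit mmap gives over read()/write().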

Kernel module skeleton (Linux; error handling elided):

static int sidra_probe(struct pci_dev *dev, const struct pci_device_id *id) {
    pci_enable_device(dev);                     /* wake the device on the bus */
    pci_request_regions(dev, "sidra");          /* claim its BARs */
    bar0 = ioremap(pci_resource_start(dev, 0),  /* map BAR0 control registers */
                   pci_resource_len(dev, 0));
    request_irq(dev->irq, sidra_interrupt_handler, 0, "sidra", dev);
    cdev_add(...);                              /* exposes /dev/sidra0 */
    return 0;
}

static struct pci_driver sidra_driver = {
    .name = "sidra",
    .id_table = sidra_ids,
    .probe = sidra_probe,
    .remove = sidra_remove,
};

Module init: module_init(sidra_driver_init). Install with insmod sidra.ko.

Device file:

/dev/sidra0 for user-space access. Open:

fd = open("/dev/sidra0", O_RDWR);

L2 · Full

Inference flow (driver perspective):

  1. App makes an SDK call → driver IOCTL.
  2. Driver copies input data to the DMA buffer.
  3. Driver writes a command to the chip (BAR0 register).
  4. Chip starts inference (hardware).
  5. Inference completes → chip raises an interrupt.
  6. Driver handles the interrupt, reads output from DMA buffer.
  7. Returns to the app.
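A toy Python mock of steps 1-7, with a byte array standing in for the DMA buffer and a dict for the BAR0 registers (all offsets and sizes are hypothetical, chosen only to make the flow visible):

```python
# Minimal mock of the driver-side inference flow: IOCTL entry, DMA copy,
# command-register write, completion "interrupt", result readback.

DMA_BUF = bytearray(4096)          # stand-in for the DMA buffer
REGS = {"CMD": 0, "STATUS": 0}     # stand-in for BAR0 registers

def chip_run():                    # stand-in for the hardware itself
    # "Inference": invert the 4 input bytes into the output region.
    DMA_BUF[2048:2052] = bytes(b ^ 0xFF for b in DMA_BUF[0:4])
    REGS["STATUS"] = 1             # raise the "inference done" interrupt

def ioctl_infer(input_bytes):
    DMA_BUF[0:len(input_bytes)] = input_bytes   # step 2: copy input to DMA
    REGS["CMD"] = 1                             # step 3: write start command
    chip_run()                                  # steps 4-5: chip runs, IRQ
    assert REGS["STATUS"] == 1                  # step 6: handler sees done
    return bytes(DMA_BUF[2048:2052])            # step 6: read output

out = ioctl_infer(b"\x01\x02\x03\x04")          # steps 1 and 7: app call/return
print(out.hex())                                # fefdfcfb
```

The real driver differs in every detail, but the shape is the same: one IOCTL in, a DMA write, a doorbell register, a completion signal, a DMA read back out.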

Typical latency:

  • IOCTL overhead: ~5 µs.
  • DMA setup: ~1 µs.
  • Chip inference: ~5 µs.
  • Interrupt handle: ~10 µs.
  • Total: ~21 µs/inference.

Polling alternative:

The driver spins on a status register instead of waiting for an interrupt. Latency is lower (no interrupt-handler round trip), but a CPU core stays busy spinning.

SIDRA Y1: polling for small inferences, interrupts for big batches. Hybrid.
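The hybrid policy can be sketched as "spin briefly, then block". The budget below is an assumed number, not Y1's real threshold:

```python
import time

POLL_BUDGET_S = 50e-6   # spin up to ~50 us before falling back (assumed)

def wait_for_completion(is_done, slow_wait):
    """Poll is_done() for a short budget; then fall back to a blocking wait.

    is_done:   reads the chip's status register (stand-in: any callable)
    slow_wait: interrupt-style blocking wait (stand-in: any callable)
    """
    deadline = time.monotonic() + POLL_BUDGET_S
    while time.monotonic() < deadline:
        if is_done():
            return "polled"          # small inference: done inside the budget
    slow_wait()                      # big batch: sleep until the interrupt
    return "interrupt"

# A fast completion wins the polling path; a slow one takes the interrupt path:
print(wait_for_completion(lambda: True, lambda: None))    # polled
print(wait_for_completion(lambda: False, lambda: None))   # interrupt
```

The trade-off is explicit in the budget constant: raise it and small inferences get lower latency at the cost of more busy CPU time.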

DMA mapping:

Linux dma_alloc_coherent() allocates a buffer that’s coherent for both device and CPU. No cache flush issues.

Y1 DMA ring buffer: 4 MB (enough for ~1000 inferences).
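A DMA ring buffer is just head/tail indices over a fixed region: the driver writes at the head, the device consumes at the tail, and both wrap around. A toy sketch (4 KB instead of 4 MB, and with no overflow check, for brevity):

```python
class DmaRing:
    """Fixed-size ring: producer writes at head, consumer reads at tail."""

    def __init__(self, size):
        self.buf = bytearray(size)
        self.size = size
        self.head = 0            # next write offset (driver side)
        self.tail = 0            # next read offset (device side)

    def push(self, data):
        for b in data:
            self.buf[self.head] = b
            self.head = (self.head + 1) % self.size   # wrap around

    def pop(self, n):
        out = bytes(self.buf[(self.tail + i) % self.size] for i in range(n))
        self.tail = (self.tail + n) % self.size
        return out

ring = DmaRing(4096)
ring.push(b"A" * 4090)
ring.pop(4090)
ring.push(b"hello")              # this write wraps past the end of the buffer
print(ring.pop(5))               # b'hello'
```

The real ring adds full/empty accounting and descriptor entries, but the wrap-around index arithmetic is the core of it.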

L3 · Deep

Kernel ↔ user-space communication:

3 ways:

  1. read/write: classic file syscalls. Slow (copy).
  2. mmap: BAR mapped into user-space. Zero-copy.
  3. ioctl: control commands (e.g. “load model”, “start inference”).

The SIDRA SDK uses all three together.

Multi-process:

Several apps may want the chip at once; the driver arbitrates access.

Methods:

  • Time-sharing (apps take turns).
  • Hardware partitioning (clusters split among apps).
  • Virtual SIDRA (each app sees its own “small” SIDRA, hardware shared).

Y1 uses time-sharing. Y10 will add virtualization.
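Time-sharing in its simplest form is a round-robin queue over per-app job lists. A toy model of that policy, not the real arbiter:

```python
from collections import deque

def time_share(jobs_per_app, slice_per_turn=1):
    """Round-robin: each app runs slice_per_turn jobs, then yields the chip."""
    queue = deque(jobs_per_app.items())
    order = []
    while queue:
        app, jobs = queue.popleft()
        order.extend((app, j) for j in jobs[:slice_per_turn])
        if jobs[slice_per_turn:]:
            queue.append((app, jobs[slice_per_turn:]))  # back of the line
    return order

schedule = time_share({"appA": [1, 2], "appB": [1]})
print(schedule)   # [('appA', 1), ('appB', 1), ('appA', 2)]
```

Hardware partitioning and virtualization replace this queue with spatial splits; the scheduling question (who runs next, for how long) stays the same.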

ABI stability:

The driver API must stay stable: old apps must keep working with new drivers. SIDRA versions its interface as SIDRA_API_V1.0; a future V2.0 adds features while keeping every V1 call supported.

Security:

User-space must not send raw chip commands directly (a malicious app could damage or hijack the device). The driver enforces:

  • Linux capabilities (CAP_SYS_RAWIO).
  • /dev/sidra0 permissions (root or sidra group).

Inference needs only basic permissions; programming the chip (model load) requires root.

Error handling:

Chip overheats, ECC failure, etc. → driver raises an event. App receives a callback.

sysfs exposes status: /sys/class/sidra/sidra0/temperature, /error_count.
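Reading those sysfs attributes is plain file I/O. The sketch below uses a temporary directory as a stand-in for the /sys/class/sidra/sidra0 directory so it runs anywhere; the millidegree convention is borrowed from Linux hwmon and is an assumption here:

```python
import os, tempfile

# Stand-in for /sys/class/sidra/sidra0/ -- with real hardware you would
# read the same attribute names under that path.
sysfs = tempfile.mkdtemp()
for name, value in (("temperature", "47000"), ("error_count", "0")):
    with open(os.path.join(sysfs, name), "w") as f:
        f.write(value + "\n")

def read_attr(dev_dir, attr):
    """sysfs attributes are single-line text files."""
    with open(os.path.join(dev_dir, attr)) as f:
        return f.read().strip()

temp_mC = int(read_attr(sysfs, "temperature"))   # millidegrees C, hwmon-style
print(f"temp={temp_mC / 1000:.1f}C errors={read_attr(sysfs, 'error_count')}")
```

Because sysfs is just files, shell tools (cat, watch) and monitoring daemons can read device health with no SDK at all.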

Module 6 roadmap:

  • 6.1 (this): PCIe driver basics.
  • 6.2: Linux kernel and aether-driver.
  • 6.3: SIDRA on-chip RISC-V firmware.
  • 6.4: ISPP algorithm.
  • 6.5: SDK layers.
  • 6.6: PyTorch backend.
  • 6.7: Compiler.
  • 6.8: Digital twin / simulator.
  • 6.9: Test, calibration, verification.
  • 6.10: End-to-end production stack lab.

Experiment: SIDRA Driver Hello World

Scenario: open /dev/sidra0, read the hardware version.

#include <fcntl.h>
#include <sys/ioctl.h>
#include <stdio.h>

#define SIDRA_IOCTL_VERSION _IOR('s', 1, int)

int main() {
    int fd = open("/dev/sidra0", O_RDWR);
    if (fd < 0) {
        perror("open");
        return 1;
    }
    
    int version;
    if (ioctl(fd, SIDRA_IOCTL_VERSION, &version) < 0) {
        perror("ioctl");
        return 1;
    }
    
    printf("SIDRA hardware version: 0x%08x\n", version);
    close(fd);
    return 0;
}

Compile: gcc -o sidra_hello sidra_hello.c.

Run: ./sidra_hello.

Output: SIDRA hardware version: 0x00010000 (Y1 v1.0).

The driver catches this IOCTL, reads the version from BAR0, returns to user-space. Latency ~10 µs.

SDK alternative (higher level):

import sidra
chip = sidra.Chip(0)
print(f"Version: {chip.version}, Memristors: {chip.memristor_count}")

Same thing, cleaner API.

Quick Quiz

1/6 · Where does the SIDRA driver live in Linux?

Lab Exercise

Plan a SIDRA Y1 inference benchmark.

Scenario:

  • Y1 PCIe 5.0 × 4 = 16 GB/s.
  • Inference time (hardware): 5 µs/MNIST.
  • Driver overhead: 20 µs.
  • Total: 25 µs/inference.

Questions:

(a) How many inferences/second? (b) For batch 32, time + bandwidth? (c) Will PCIe be a bottleneck? (d) Optimal batch size?

Solutions

(a) 1/25 µs = 40K inferences/second.

(b) Batch 32 input: 32 × 28×28 = 25 KB. Transfer: 25 KB / 16 GB/s = 1.6 µs. Inference parallel batch 32 → ~100 µs. Total: 100 + 20 = 120 µs/batch = 267K inferences/s (6.7× speedup with batching).

(c) At 40K inferences/s and ~0.8 KB per MNIST input, the required PCIe bandwidth is only ~31 MB/s, a tiny fraction of 16 GB/s. PCIe has ample headroom; the bottleneck is driver overhead and the inference itself.

(d) The optimal batch balances overhead amortization against parallel capacity. For Y1, 16-32 is ideal: smaller batches leave the 20 µs driver overhead dominant, while above ~64 the per-batch inference time grows faster than the overhead savings (parallel capacity limit).
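The lab arithmetic can be checked with a few lines of Python, using the numbers from the scenario (MNIST inputs taken as 28×28 single-byte pixels):

```python
# (a) Single-inference throughput at 25 us each.
per_inf_s = 25e-6
singles_per_sec = 1 / per_inf_s
print(f"(a) {singles_per_sec:,.0f} inf/s")               # 40,000 inf/s

# (b) Batch 32: transfer time over PCIe 5.0 x4 (16 GB/s), then throughput
#     with ~100 us batch inference + 20 us driver overhead.
batch = 32
batch_bytes = batch * 28 * 28                             # 25,088 B ~ 25 KB
transfer_s = batch_bytes / 16e9
batch_total_s = 100e-6 + 20e-6
batched_per_sec = batch / batch_total_s
print(f"(b) transfer {transfer_s * 1e6:.2f} us, "
      f"{batched_per_sec:,.0f} inf/s")                    # ~1.57 us, ~266,667

# (c) Bandwidth actually needed at the unbatched rate.
needed_bw = singles_per_sec * 28 * 28                     # bytes/s
print(f"(c) {needed_bw / 1e6:.0f} MB/s of 16,000 MB/s")   # ~31 MB/s
```

The same three-line pattern (bytes, seconds, divide) answers most "will the bus be the bottleneck?" questions before any benchmark runs.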

Cheat Sheet

  • Software stack, 5 layers: firmware, driver, compiler, SDK, app.
  • Linux driver: kernel module, PCIe enumerate, BAR map, /dev/sidra0.
  • User-space access: open/read/write/ioctl, mmap.
  • Inference latency: hardware 5 µs + driver 20 µs = 25 µs.
  • DMA buffer: 4 MB ring, autonomous device transfer.
  • Polling vs interrupt: Y1 hybrid.

Vision: SIDRA Beyond Linux?

  • Y1: Linux + Windows driver.
  • Y3: Android + iOS support (mobile).
  • Y10: RTOS support (embedded, automotive).
  • Y100: SIDRA-native OS (from Module 5.1’s vision).
  • Y1000: Bio-compatible OS (brain-implant interface).

For Türkiye: upstreaming the SIDRA driver into the mainline Linux kernel is an international contribution, and mainline patches give Turkish developers visibility for SIDRA.

Further Reading