SAFi: A Self-Alignment Framework for Verifiable Runtime Governance of Large Language Models

Abstract

The deployment of powerful Large Language Models (LLMs) in high-stakes domains presents a critical challenge: ensuring reliable adherence to behavioral constraints at runtime. Existing alignment techniques, primarily focused on pre-deployment training, often fail to prevent model drift or rule violations in live, interactive environments. This paper introduces SAFi (Self-Alignment Framework Interface), a novel, closed-loop framework for the runtime governance of LLMs. SAFi is structured around four distinct faculties (Intellect, Will, Conscience, and Spirit) that separate content generation from rule validation, enabling a continuous cycle of generation, verification, auditing, and adaptation. The framework's key innovation is a stateful, adaptive memory, managed by the mathematical Spirit faculty, which allows the system to monitor its own performance and correct for behavioral drift over time. We present the results of two empirical benchmark studies comparing a SAFi-governed LLM against a standalone baseline in the high-stakes domains of finance and healthcare. The results demonstrate that SAFi achieves near-100% adherence to its configured safety rules, whereas the baseline model exhibits catastrophic failures. We conclude that runtime governance frameworks like SAFi are an essential component for building demonstrably safe and reliable AI agents.
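To make the four-faculty loop concrete, the sketch below illustrates one plausible shape for a generate-verify-audit-adapt cycle as described in the abstract. It is a minimal illustration only: all class and method names (Governor, Spirit, intellect, will, conscience, respond) and the use of an exponentially weighted adherence score as the "stateful, adaptive memory" are assumptions for exposition, not the paper's actual implementation.

```python
"""Hypothetical sketch of a SAFi-style closed-loop governance cycle.

Assumed design: Intellect generates a draft, Will validates it against
configured rules, Conscience audits the outcome, and Spirit maintains a
stateful adherence score used to detect and correct behavioral drift.
"""
from dataclasses import dataclass, field


@dataclass
class Spirit:
    """Stateful faculty: tracks rule adherence over time to detect drift."""
    alpha: float = 0.2       # smoothing weight for new observations (assumed)
    adherence: float = 1.0   # running adherence score in [0, 1]

    def update(self, passed: bool) -> None:
        # Exponentially weighted moving average of pass/fail outcomes.
        self.adherence = (1 - self.alpha) * self.adherence + self.alpha * float(passed)

    def drifting(self, threshold: float = 0.9) -> bool:
        return self.adherence < threshold


@dataclass
class Governor:
    rules: list                      # callables: draft -> violation message or None
    spirit: Spirit = field(default_factory=Spirit)

    def intellect(self, prompt: str) -> str:
        # Placeholder for the underlying LLM call (content generation).
        return f"draft response to: {prompt}"

    def will(self, draft: str) -> list:
        # Rule validation, kept separate from generation.
        return [msg for rule in self.rules if (msg := rule(draft))]

    def conscience(self, draft: str, violations: list) -> None:
        # Auditing: record the outcome for later inspection.
        print(f"audit: {len(violations)} violation(s) in {draft!r}")

    def respond(self, prompt: str) -> str:
        draft = self.intellect(prompt)
        violations = self.will(draft)
        self.conscience(draft, violations)
        self.spirit.update(passed=not violations)
        if violations:
            return "Response withheld: " + "; ".join(violations)
        if self.spirit.drifting():
            # Adaptation: tighten behavior when the adherence score decays.
            draft += " [regenerated under stricter constraints]"
        return draft


if __name__ == "__main__":
    # Toy rule for the finance domain: block drafts that give buy advice.
    no_advice = lambda d: "gives financial advice" if "buy" in d else None
    gov = Governor(rules=[no_advice])
    print(gov.respond("Should I buy this stock?"))
```

The point of the separation is that the Will faculty can refuse a draft without touching the generator, while Spirit's running score gives the loop memory across turns, which is what lets the system notice and respond to drift rather than treating each response in isolation.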
