Safety Research
Popular repositories
- persona_vectors Public
Persona Vectors: Monitoring and Controlling Character Traits in Language Models
- assistant-axis Public
The Assistant Axis is a direction in activation space that captures how "Assistant-like" a model's behavior is. Models can drift away from the Assistant during conversations, sometimes toward bizarr…
- safety-tooling Public
Inference API for many LLMs and other useful tools for empirical research
Repositories
- faithful-cot Public
Code for "Faithfulness as Information Flow: Evaluating and Training Faithful Chain-of-Thought Reasoning"
- legibility Public
Which models are illegible under what conditions, and why? How does that impact monitorability?
- persona_vectors Public
Persona Vectors: Monitoring and Controlling Character Traits in Language Models
- aligning-ai-orgs Public