AI Interpretability Researcher
Jun 2025 – Present
- Conduct AI safety research on hidden misalignment in language models, studying whether deceptive behavior can be detected from internal activation patterns.
- Build activation-based diagnostic tools using PyTorch, Hugging Face model hooks, and the bench-af framework to extract and classify model representations.
- Train linear and non-linear probes on honest vs. deceptive model outputs, analyzing layer-wise patterns and cross-model generalization across GPT-2, HAL9000, and LLaMA-3-70B.