Self-Recognition in Language Models
Nov 1, 2024·,,,,,·
0 min read
Tim R. Davidson
Viacheslav Surkov
Veniamin Veselovsky
Giuseppe Russo
Robert West
Caglar Gulcehre

Abstract
A rapidly growing number of applications rely on a small set of closed-source language models (LMs). This dependency might introduce novel security risks if LMs develop self-recognition capabilities. Inspired by human identity verification methods, we propose a novel approach for assessing self-recognition in LMs using model-generated “security questions.” Our test can be externally administered without access to internal model parameters or output probabilities. We apply this test to ten of the most capable open- and closed-source LMs available. Our experiments find no empirical evidence of consistent self-recognition. Instead, models tend to select the “best” answer, independent of authorship. We further find that preferences for certain model outputs are consistent across LMs, and we uncover new insights about position bias in multiple-choice tasks.
Type
Publication
In Findings of the Association for Computational Linguistics–EMNLP 2024