Self-Recognition in Language Models

Nov 1, 2024·

Tim R. Davidson

Viacheslav Surkov

Veniamin Veselovsky

Giuseppe Russo

Robert West

Caglar Gulcehre

· 0 min read

PDF Cite DOI

Abstract

A rapidly growing number of applications rely on a small set of closed-source language models (LMs). This dependency might introduce novel security risks if LMs develop self-recognition capabilities. Inspired by human identity verification methods, we propose a novel approach for assessing self-recognition in LMs using model-generated “security questions.” Our test can be externally administered without access to internal model parameters or output probabilities. We apply this test to ten of the most capable open- and closed-source LMs available. Our experiments find no empirical evidence of consistent self-recognition. Instead, models tend to select the “best” answer, independent of authorship. We further find that preferences for certain model outputs are consistent across LMs, and we uncover new insights about position bias in multiple-choice tasks.

Type

Conference paper

Publication

In Findings of the Association for Computational Linguistics–EMNLP 2024

Last updated on Nov 1, 2024

Language Models AI Safety Model Evaluation Identity Verification EMNLP

Stranger Danger! Cross-Community Interactions with Fringe Users Increase the Growth of Fringe Communities on Reddit May 28, 2024 →