Membership Inference Attacks on Fine-Tuned Large Language Models
A membership inference attack (MIA) aims to determine whether a given sample was part of a target model's training data. Most existing MIAs against fine-tuned large language models rely on strong assumptions, such as access to model internals or to data drawn from the training distribution. We propose ECT-MIA, an MIA method that exploits LLMs' memorization of entities and their contextual text. We guide a reference model to learn from the samples under detection, and then distinguish members from non-members by comparing the deviations between the target model's and the reference model's outputs, specifically on the entities and entity-related text within those samples. Our method operates as a purely black-box attack: it requires only the target model's plain-text outputs, with no access to internal parameters or to any data whose distribution resembles the target model's training set.
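The comparison step can be sketched roughly as follows. This is an illustrative simplification under assumed details, not the paper's actual scoring procedure: the function names, the substring-based entity check, and the agreement threshold are all hypothetical. The intuition it encodes is that the reference model, having learned the candidate sample, will reproduce its entities; if the target model's plain-text output agrees with the reference model's on those entities, the target model likely memorized the sample too, suggesting membership.

```python
def entity_agreement(target_text: str, reference_text: str, entities: list[str]) -> float:
    """Share of entities on which the two models' plain-text outputs agree,
    i.e. both reproduce the entity or both omit it. (Illustrative heuristic:
    a real attack would use a more robust matching of entity-related text.)"""
    agree = sum(1 for e in entities if (e in target_text) == (e in reference_text))
    return agree / len(entities)


def is_member(target_text: str, reference_text: str,
              entities: list[str], threshold: float = 0.8) -> bool:
    """Classify the candidate sample as a training member when the target
    model's output deviates little from the reference model's on entities.
    The threshold value here is an arbitrary placeholder."""
    return entity_agreement(target_text, reference_text, entities) >= threshold
```

Note that this sketch only consumes generated text, consistent with the black-box setting described above: no logits, parameters, or auxiliary datasets are required.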