Bias correction in LLMs

payne · March 19, 2023, 7:13pm

“When the researchers simply told the model not to rely on stereotypes or social biases — literally by typing in those instructions — the model was less biased in its predictions and responses. This suggests that some emergent properties might also be used to reduce bias. In a paper released in February, the Anthropic team reported on a new “moral self-correction” mode, in which the user prompts the program to be helpful, honest and harmless.”

payne · March 19, 2023, 7:14pm

The paper: The Capacity for Moral Self-Correction in Large Language Models (arXiv:2302.07459)
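
For anyone who wants to try the "just type the instruction" idea from the quote, here is a minimal sketch of the prompt construction only. Everything in it is an assumption for illustration: the helper name `build_prompt`, the constant `DEBIAS_INSTRUCTION`, and the instruction wording, which is a paraphrase and not quoted from the paper. It does not call any model; you would pass the resulting string to whatever LLM interface you use and compare answers with and without the instruction.

```python
# Hypothetical sketch: the instruction text is illustrative wording,
# not the exact prompt used in the paper.
DEBIAS_INSTRUCTION = (
    "Please make sure your answer is unbiased and does not rely on stereotypes."
)


def build_prompt(question: str, add_instruction: bool = True) -> str:
    """Return the question, optionally followed by the debiasing instruction."""
    parts = [question.strip()]
    if add_instruction:
        parts.append(DEBIAS_INSTRUCTION)
    # Separate the question and the instruction with a blank line.
    return "\n\n".join(parts)


if __name__ == "__main__":
    question = "Two candidates applied for the job. Who is more likely to be hired?"
    print(build_prompt(question, add_instruction=False))  # baseline prompt
    print("---")
    print(build_prompt(question))  # prompt with the added instruction
```

Keeping the baseline and the instructed prompt behind one flag makes it easy to run the same set of questions both ways and see whether the added sentence changes the answers.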