Sounds like a programming error to me.
LLMs believe false statements even after explicit warnings that they’re false
Imagine a kid who grows up reading history books where every page is stamped “WARNING: THIS BOOK IS LYING.” You’d expect them to come away skeptical, or at least uncertain. New research on so-called “negation neglect” finds that LLMs in a roughly analogous situation don’t behave that way. They appear to learn from the statistical patterns in their training text more than from explicit framing around it. Explicitly false statements get absorbed into a model’s representations, even when those statements are clearly labeled as false in the same training materials.
No comments:
Post a Comment