
Unveiling the Power of LLM Decontaminator in Language Models

Contamination

Imagine preparing a student for an exam by inadvertently providing them with all the answers beforehand. It sounds like a guaranteed success, right? However, when the real test comes, this student is unprepared to tackle questions they've never seen. This scenario mirrors a significant issue in the world of Large Language Models (LLMs) known as 'contamination.'

Contamination occurs when a language model, like a diligent student, is exposed to the test questions (or something very similar) during its training phase. The result? An inflated sense of the model's ability to generalize and solve new problems.

Objective and Solution: The LLM Decontaminator

A simple but effective solution to this problem is the LLM Decontaminator [1]. This tool is designed to identify and remove these 'pre-learned' test samples from the training data, ensuring that our LLM, like a properly trained student, truly understands the material and can apply it to novel situations.

How Does It Work?

1. Building the Dataset: The first step is to build a database that pairs each test sample with its top-k most similar training samples, using a lightweight Sentence Transformer for embedding-based similarity search.

# Code snippet for building dataset
import json
import torch
from sentence_transformers import SentenceTransformer, util

# Minimal versions of the helper functions used by build_database
def bert_encode(model, sentences, batch_size=32):
    return model.encode(sentences, batch_size=batch_size)

def top_k_similarity(test_embs, train_embs, top_k):
    # Indices of the top-k most similar (cosine) training samples per test sample
    cos_sim = util.cos_sim(test_embs, train_embs)
    return torch.topk(cos_sim, k=top_k, dim=1).indices.tolist()

def build_database(train_cases, test_cases, output_path, top_k=5):
    model = SentenceTransformer('multi-qa-MiniLM-L6-cos-v1')
    train_embs = bert_encode(model, train_cases)
    test_embs = bert_encode(model, test_cases)
    top_k_indices = top_k_similarity(test_embs, train_embs, top_k)

    db = []
    for i, test_case in enumerate(test_cases):
        top_k_cases = [train_cases[index] for index in top_k_indices[i]]
        db.append({"test": test_case, "train": top_k_cases})

    with open(output_path, "w") as f:
        json.dump(db, f, indent=4)

    return db
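
For illustration, here is one way the function above might be called. The toy questions and the db.json filename are invented for this example; each entry in the returned database pairs a test sample with its top-k nearest training samples.

# Example usage (toy data; the questions and file name are illustrative only)
train_cases = [
    "What is 7 multiplied by 8?",
    "Name the capital city of France.",
    "Explain photosynthesis in one sentence.",
]
test_cases = ["Compute the product of 7 and 8."]

db = build_database(train_cases, test_cases, "db.json", top_k=2)
print(db[0]["test"])   # the test question
print(db[0]["train"])  # its 2 most similar training samples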

2. Detecting Contamination: Next, we use GPT-4 to determine if a test sample and a training sample are essentially the same question rephrased. This is akin to asking a student if two exam questions are asking the same thing, just in different words.

# Code snippet for detecting contamination
from openai import OpenAI, OpenAIError

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def detect_contamination(openai_model, question1, question2):
    knowledge_instruct = """I will now give you two questions. I will enclose the two questions with curly braces {}.
        Please help me determine if the following two questions are the same.
        Disregard the names and minor changes in word order that appear within.
        If they are, please answer 'True', otherwise answer 'False'. Do not respond with anything else.
        If their question prompts are very similar and, without considering the solution process, they produce the same answer, we consider them to be the same question.
        """
    prompt = "part1: {\n" + question1 + "\n}\npart2: {\n" + question2 + "\n}"

    retries = 0
    while retries < 30:
        try:
            completion = client.chat.completions.create(
                model=openai_model,
                messages=[
                    {"role": "system", "content": knowledge_instruct},
                    {"role": "user", "content": prompt},
                ],
                timeout=3,
                temperature=0.3,
            )
        except OpenAIError:
            # Retry on API errors and timeouts
            retries += 1
            continue
        pred = completion.choices[0].message.content

        if pred == "True":
            return True
        elif pred == "False":
            return False
        retries += 1
    # Treat persistent failures or ambiguous replies as "not contaminated"
    return False
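
Putting the two steps together, flagged training samples can then be dropped from the dataset. The sketch below is one way to wire the pieces up, not the exact pipeline from the repository; the helper name filter_contaminated and the "gpt-4" model string are assumptions made for illustration.

# Sketch: combine both steps to remove flagged training samples (illustrative)
def filter_contaminated(train_cases, test_cases, openai_model="gpt-4"):
    # Step 1: pair each test sample with its top-k nearest training samples
    db = build_database(train_cases, test_cases, "db.json", top_k=5)

    # Step 2: ask the LLM which candidate pairs are rephrasings of each other
    contaminated = set()
    for entry in db:
        for train_case in entry["train"]:
            if detect_contamination(openai_model, entry["test"], train_case):
                contaminated.add(train_case)

    # Keep only training samples that were never flagged
    return [case for case in train_cases if case not in contaminated]

Because only the top-k nearest neighbours from step 1 are ever sent to the LLM, the number of expensive comparisons stays proportional to the size of the test set rather than the full training set.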

Complete working code is available in the GitHub repository [2].

Results and Impact

Applied to real-world training datasets, the LLM Decontaminator detects and removes significant overlap with widely used benchmarks, including rephrased test samples that simpler n-gram overlap checks can miss. This means we're not just training our LLM to pass the test, but to truly understand and generalize its knowledge.

Conclusion: A Leap Forward in Language Model Training

In conclusion, the LLM Decontaminator represents a significant step forward in the training of language models. By ensuring that our models are learning in a contamination-free environment, we're paving the way for more reliable, effective, and genuinely intelligent language processing tools.

This development marks not just a technical achievement, but a commitment to the integrity and advancement of AI language understanding. It's a promise that our models will be as ready for the unpredictable nature of real-world language as a well-prepared student is for the twists and turns of a challenging exam.


Special thanks to the original authors of the LLM Decontaminator paper and tool. For further details, I highly recommend visiting their paper and the GitHub repository linked in the references.


[1] LLM Decontaminator: https://arxiv.org/abs/2311.04850

[2] Decontamination tool: https://github.com/lm-sys/llm-decontaminator


Last update: November 18, 2023