Transcript:
Hello. This is the official video for “Can Large Language Models (or Humans) Disentangle Text?,” published as a short paper at the NLP and CSS Workshop at NAACL in 2024.
First, I want to give some motivation and context. This work is about the task of disentangling variables from text. Sometimes text is contaminated by a variable that we wish to remove or control for. For instance, if we are using text to do causal inference, we might want the text to be independent from certain variables. If we are using text as training data, we might want to remove personal information from it for fairness or ethical reasons.
Approaches in the past have focused on interventions at the text embedding or representation level. These approaches have often succeeded but require labeled data and result in transformations that are not necessarily interpretable by humans. Given the advances in large language models in the last few years, we asked if we can use them to directly rewrite text to remove a particular target variable while preserving everything else. Additionally, we wanted to see how humans perform on the same task.
In terms of representation disentanglement, prior work often focuses on learning a guarding function that takes a text representation and renders it independent of some target variable. In our case, we wanted to take raw text and transform it into other raw text so that it is independent of the target variable, while being minimally intrusive. A trivial way would be to just return an empty string, but that obviously removes all information. We want to remove only the variable of interest.
Our high-level goal is to use large language models to do this rewriting and make the text independent from the target variable, while preserving everything else.
For our experiments, we used a dataset of Amazon reviews, two thousand of them. Each has two labels: a binary sentiment label and a topic label from one of six possible topics. We chose sentiment as the target variable because it is fairly challenging: sentiment information is spread throughout the text.
We tested two language models: Mistral 7B and GPT-4. For each model, we had three prompt strategies. The first was a control strategy that simply asked the large language model to rewrite the text as closely as possible. The second was a few-shot strategy, where we asked the model to rewrite the review while removing any sentiment information, giving a few examples of how to do it. The third strategy was prompt chaining with two stages: first identifying the parts of the text that contain sentiment, then asking the model to remove those passages and rewrite the text.
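To make these strategies concrete, here is a minimal Python sketch of what the three prompts could look like. The wording, the few-shot examples, and the build_prompts helper are illustrative assumptions on our part; the exact prompts used in the paper may differ.

```python
# Illustrative sketch only: the prompt wording and few-shot examples are
# hypothetical, not the ones used in the paper.

FEW_SHOT_EXAMPLES = """\
Review: "This blender broke after two days, absolutely useless."
Rewrite: "This blender stopped working after two days."
Review: "Gorgeous lamp, lights up my whole living room beautifully!"
Rewrite: "This lamp lights up my whole living room."
"""

def build_prompts(review: str) -> dict:
    """Return one prompt (or prompt pair) per strategy for a single review."""
    control = (
        "Rewrite the following review as closely to the original as possible.\n\n"
        f"Review: {review}\nRewrite:"
    )
    few_shot = (
        "Rewrite the review so that it no longer contains any sentiment "
        "information, while keeping all other content. Examples:\n\n"
        f"{FEW_SHOT_EXAMPLES}\nReview: {review}\nRewrite:"
    )
    # Prompt chaining: stage 1 identifies sentiment-bearing passages,
    # stage 2 (a second model call) removes them and rewrites the text.
    chain_stage_1 = (
        "List the words or passages in this review that express sentiment.\n\n"
        f"Review: {review}\nSentiment passages:"
    )
    chain_stage_2_template = (
        "Rewrite the review, removing the passages listed below while "
        "preserving everything else.\n\n"
        "Review: {review}\nPassages to remove: {passages}\nRewrite:"
    )
    return {
        "control": control,
        "few_shot": few_shot,
        "chain_stage_1": chain_stage_1,
        "chain_stage_2_template": chain_stage_2_template,
    }
```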
We also did two comparison experiments. One used a classic representation-level method, the mean projection, and the other tested humans performing the same rewriting task to try to remove sentiment from the text.
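For readers unfamiliar with mean projection, the following NumPy sketch shows the general idea: remove from each embedding the component that lies along the direction between the two sentiment class means. This is a generic illustration under our own assumptions, not the paper's exact implementation.

```python
import numpy as np

def mean_projection(X: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Project out the direction between the two class-mean embeddings.

    X: (n_samples, dim) text embeddings.
    y: (n_samples,) binary labels, e.g. 0 = negative, 1 = positive sentiment.
    Returns embeddings with the class-mean direction removed.
    """
    mu_pos = X[y == 1].mean(axis=0)
    mu_neg = X[y == 0].mean(axis=0)
    w = mu_pos - mu_neg
    w = w / np.linalg.norm(w)          # unit vector between class means
    return X - np.outer(X @ w, w)      # remove the component along w

# Toy usage with random stand-in "embeddings"
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 768))
y = rng.integers(0, 2, size=2000)
X_clean = mean_projection(X, y)
```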
In our experimental setup, we started with the original reviews and trained two classifiers on them: one for sentiment and one for topic. Then we passed the reviews to a large language model to obtain rewritten reviews. We trained the same two classifiers on the rewritten text and compared performance.
A successful rewriting would achieve near-chance accuracy for the sentiment classifier, since the text should no longer contain sentiment signals, while preserving the topic classifier’s performance.
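As a rough sketch of this evaluation protocol, the snippet below trains TF-IDF plus logistic regression classifiers, which is an assumption on our part; the classifier and features used in the paper may differ. The point is only to illustrate the comparison of sentiment and topic accuracy before and after rewriting.

```python
# Sketch of the evaluation protocol, assuming scikit-learn is available.
# The classifier architecture and features in the paper may differ.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

def classifier_accuracy(texts, labels) -> float:
    """Cross-validated accuracy of a TF-IDF + logistic regression classifier."""
    clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    return cross_val_score(clf, texts, labels, cv=5, scoring="accuracy").mean()

def evaluate_rewriting(original, rewritten, sentiment_labels, topic_labels):
    """Compare sentiment/topic accuracy on original vs. rewritten reviews.

    A successful rewrite drives sentiment accuracy toward 50% (chance for a
    binary label) while leaving topic accuracy close to its original value.
    """
    return {
        "sentiment_original": classifier_accuracy(original, sentiment_labels),
        "sentiment_rewritten": classifier_accuracy(rewritten, sentiment_labels),
        "topic_original": classifier_accuracy(original, topic_labels),
        "topic_rewritten": classifier_accuracy(rewritten, topic_labels),
    }
```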
We found that the mean projection method, which operates on embeddings, did succeed at removing sentiment signal. The classifier on the transformed embeddings was close to chance for sentiment accuracy, and the topic classifier was unaffected.
However, our large language model approach did not do so well. The best we managed was a drop from 88.5% original sentiment accuracy down to 76%, which is still above the 50% chance level. Topic accuracy remained well preserved, but we could not fully remove the sentiment signal.
Interestingly, humans also struggled to rewrite the reviews to remove sentiment, achieving about 80% accuracy on the sentiment classifier.
Here are some numbers: the original sentiment accuracy was about 88.5%, and the topic accuracy was about 95%. Our best rewriting method that actually rewrote text, rather than operating on embeddings, reduced sentiment accuracy only to around 76%.
The implications are that neither large language models nor humans seem able to fully strip sentiment information out of text when rewriting it. This suggests that sentiment is thoroughly baked into the text. The representation-level approach was successful, but it does not necessarily provide a meaningful or interpretable text output.
This is a cautionary note for interpretability and fairness claims when transformations happen at the embedding level. It may successfully remove the targeted information, but the resulting representations do not translate clearly back to human-readable text.
In future work, it might be interesting to test more advanced prompt or rewriting strategies, or try different tasks. Maybe with personal information, which is often localized, the rewriting would be easier. Another direction would be to see what happens if two variables are more dependent on each other. That might make removing one while retaining the other even more difficult.
In conclusion, we see that current large language models do not reliably remove sentiment traces with these simple methods. Humans also struggle, indicating that it might just be a hard task.
Some limitations to note: some variables, such as localized personal information, might be easier to remove. Also, we relied on machine learning-based classifiers to evaluate our success. It might be interesting to see if a human could classify the sentiment in the rewritten text as accurately as a machine classifier.
That is the end of the talk. If you want more details, please read the paper or get in touch with the authors. Thank you for listening.