IBM is open sourcing an efficient AI model for detecting hateful, abusive, and profane text in LLM training data and LLM-generated text
Large language models inevitably pick up foul language during pre-training, as they ingest text scraped from the internet. But profanity, along with hateful and abusive speech, can also creep in later, during fine-tuning or at inference time, when LLMs interact with people in the real world.
Ideally, each stage of the generative AI pipeline would include checks for hate speech, abusive language, and profanity, or HAP, as researchers refer to it. That would be feasible if HAP filters were faster than they are today.
IBM’s new filter, granite-guardian-hap-38m, is built for speed: small enough to run on a CPU, and quick enough to filter data at each phase of the LLM lifecycle, from pre-processing to inferencing. IBM just open sourced the 38-million-parameter encoder model on Hugging Face.
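For developers who want to try it, here is a minimal sketch of running the filter as an off-the-shelf Hugging Face text classifier. The model ID and the shape of the output are assumptions based on IBM’s Granite naming; check the model card on Hugging Face for the exact identifiers and labels.

```python
# Minimal sketch: score text for HAP content with the 38M-parameter filter.
# The model ID below is an assumption; verify it on the Hugging Face Hub.
from transformers import pipeline

hap_filter = pipeline(
    "text-classification",
    model="ibm-granite/granite-guardian-hap-38m",  # assumed model ID
    device=-1,  # -1 = CPU; the model is small enough that this is practical
)

texts = ["Have a great day!", "You are a complete waste of space."]
for text in texts:
    result = hap_filter(text)[0]  # e.g. {"label": ..., "score": ...}
    print(f"{text!r} -> {result['label']} ({result['score']:.3f})")
```

Because the model is a compact encoder, batches of training data can be scored this way on ordinary CPUs, which is what makes checking at every pipeline stage practical.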
“Extensive checks are needed to make sure no toxic language slips through,” said Yang Zhao, an IBM research scientist focused on AI. “It’s only really practical to do extensive checking if your HAP filter is lightweight and fast.”
In addition to granite-guardian-hap-38m, IBM is open sourcing its close cousin, granite-guardian-hap-125m. In internal testing, the larger, 125-million-parameter model outperformed HAP detectors of comparable size, including Meta’s Dynabench HAP filter, on a suite of toxic-language benchmarks.
To improve the model further, IBM researchers set out to shrink it. Smaller models perform fewer computations, reducing the cost and carbon emissions associated with training and deploying them.
With the help of a technique called neural architecture search, researchers transferred the big model’s knowledge to a compact architecture with eight fewer layers of artificial neurons. Granite-guardian-hap-38m was born.
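IBM’s exact search-and-transfer recipe isn’t spelled out here, but knowledge transfer of this kind is commonly implemented with a distillation loss, in which the small “student” model learns to match the softened output distribution of the large “teacher.” The PyTorch sketch below shows a generic version of that objective; it is an illustration under those assumptions, not IBM’s actual training code.

```python
# Generic knowledge-distillation loss (an illustration, not IBM's recipe).
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    # Soft targets: push the student toward the teacher's softened
    # class distribution (KL divergence, scaled by temperature^2).
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    # Hard targets: keep the student anchored to the true HAP labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```

The temperature softens the teacher’s distribution so the student learns from the relative probabilities the teacher assigns to each class, not just its top prediction.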
In internal testing, the 38-million-parameter model ran eight times faster than its larger cousin on a CPU, and about twice as fast on a GPU.