Comprehend-it

Comprehend-it is a versatile NLP model developed by Knowledgator, designed for zero-shot and few-shot learning across various information extraction tasks. Built upon the DeBERTaV3-base architecture, it has been extensively trained on natural language inference (NLI) and diverse text classification datasets, enabling it to outperform larger models like BART-large-MNLI while maintaining a smaller footprint.


Overview

  • Architecture: Based on DeBERTaV3-base.
  • Model Size: 184 million parameters.
  • Input Capacity: Up to 3,000 tokens (see the length check below).
  • Languages Supported: English.
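
Because the input is capped at roughly 3,000 tokens, longer documents should be checked (and chunked if needed) before classification. A minimal length check, assuming the model's own tokenizer (the helper below is illustrative, not part of the library):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('knowledgator/comprehend_it-base')

MAX_TOKENS = 3000  # input capacity from the spec above

def within_limit(text: str) -> bool:
    # Count tokens, including special tokens, to verify the text fits.
    return len(tokenizer(text)['input_ids']) <= MAX_TOKENS

print(within_limit("One day I will see the world"))  # True for short texts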

Supported Information Extraction Tasks

  • Text Classification
  • Reranking of Search Results
  • Named Entity Recognition (NER)
  • Relation Extraction (see the sketch after this list)
  • Entity Linking
  • Question Answering
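
Each of these tasks can be reframed as natural language inference: the input text serves as the premise and a label-describing statement as the hypothesis. A minimal relation-extraction sketch using the zero-shot pipeline (the relation labels and hypothesis template here are illustrative assumptions, not a fixed API):

from transformers import pipeline

classifier = pipeline("zero-shot-classification",
                      model="knowledgator/comprehend_it-base")

text = "Elon Musk founded SpaceX in 2002."

# Each candidate relation becomes a label; the template turns it into a
# hypothesis such as "Elon Musk is the founder of SpaceX."
relations = ['founder', 'employee', 'investor']
result = classifier(text, relations,
                    hypothesis_template="Elon Musk is the {} of SpaceX.")
print(result)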

Benchmarking

Zero-shot F1 scores on various datasets:

Model                           IMDB   AG_NEWS   Emotions
BART-large-MNLI                 0.89   0.6887    0.3765
DeBERTa-base-v3                 0.85   0.6455    0.5095
Comprehend-it                   0.90   0.7982    0.5660
SetFit BAAI/bge-small-en-v1.5   0.86   0.5636    0.5754

Note: All models were evaluated in a zero-shot setting without fine-tuning.


Usage

Zero-Shot Classification with Hugging Face Pipeline

from transformers import pipeline

classifier = pipeline("zero-shot-classification",
                      model="knowledgator/comprehend_it-base")

sequence = "One day I will see the world"
labels = ['travel', 'cooking', 'dancing']

result = classifier(sequence, labels)
print(result)
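
In the single-label setting the pipeline returns a dictionary with the input sequence, the candidate labels sorted from most to least likely, and their scores, which sum to 1.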

Multi-Label Classification

labels = ['travel', 'cooking', 'dancing', 'exploration']
result = classifier(sequence, labels, multi_label=True)
print(result)
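
With multi_label=True, each label is scored independently as its own entailment decision, so the scores no longer sum to 1 and several labels can receive high scores at once.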

Manual Inference with PyTorch

from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

model = AutoModelForSequenceClassification.from_pretrained('knowledgator/comprehend_it-base')
tokenizer = AutoTokenizer.from_pretrained('knowledgator/comprehend_it-base')

premise = "One day I will see the world"
hypothesis = "This example is travel."

# Encode the premise/hypothesis pair and run the NLI head.
inputs = tokenizer.encode(premise, hypothesis, return_tensors='pt')
logits = model(inputs).logits

# Keep only the contradiction (index 0) and entailment (index 2) logits,
# discarding the neutral class, then convert to probabilities.
entail_contradiction_logits = logits[:, [0, 2]]
probs = entail_contradiction_logits.softmax(dim=1)
prob_label_is_true = probs[:, 1]  # probability of entailment, i.e. the label holds
print(prob_label_is_true)
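
The snippet above scores a single label; to mirror the pipeline over several labels, each one can be turned into its own hypothesis and scored in a batch. A minimal sketch, reusing model and tokenizer from above and assuming the same logit order (contradiction at index 0, entailment at index 2):

labels = ['travel', 'cooking', 'dancing']
hypotheses = [f"This example is {label}." for label in labels]

# One premise/hypothesis pair per candidate label, padded to equal length.
batch = tokenizer([premise] * len(labels), hypotheses,
                  return_tensors='pt', padding=True)
with torch.no_grad():
    batch_logits = model(**batch).logits

# Entailment probability per label, ignoring the neutral class.
batch_probs = batch_logits[:, [0, 2]].softmax(dim=1)[:, 1]
for label, prob in zip(labels, batch_probs):
    print(f"{label}: {prob.item():.4f}")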

Question Answering

Transform QA tasks into a multiple-choice classification format:

question = "What is the capital city of Ukraine?"
candidate_answers = ['Kyiv', 'London', 'Berlin', 'Warsaw']
result = classifier(question, candidate_answers)
print(result)
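
The question acts as the premise and each candidate answer is wrapped into a hypothesis; the highest-scoring candidate is the predicted answer.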

Few-Shot Fine-Tuning with LiqFit

Install LiqFit via pip:

pip install liqfit

from liqfit.modeling import LiqFitModel
from liqfit.losses import FocalLoss
from liqfit.collators import NLICollator
from transformers import AutoModelForSequenceClassification, AutoTokenizer, TrainingArguments, Trainer

tokenizer = AutoTokenizer.from_pretrained('knowledgator/comprehend_it-base')
backbone_model = AutoModelForSequenceClassification.from_pretrained('knowledgator/comprehend_it-base')
loss_func = FocalLoss()

model = LiqFitModel(backbone_model.config, backbone_model, loss_func=loss_func)
data_collator = NLICollator(tokenizer, max_length=128, padding=True, truncation=True)

training_args = TrainingArguments(
    output_dir='comprehend_it_model',
    learning_rate=3e-5,
    per_device_train_batch_size=3,
    per_device_eval_batch_size=3,
    num_train_epochs=9,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    save_steps=5000,
    save_total_limit=3,
    remove_unused_columns=False,
)

# `nli_train_dataset` and `nli_test_dataset` are assumed to be your own
# datasets already converted to NLI format (premise, hypothesis, label).
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=nli_train_dataset,
    eval_dataset=nli_test_dataset,
    tokenizer=tokenizer,
    data_collator=data_collator,
)

trainer.train()
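
Assuming LiqFitModel trains the wrapped backbone's weights in place (a reasonable reading of the setup above), the fine-tuned backbone can be saved and reused with the zero-shot pipeline. A minimal sketch; the output path is illustrative:

from transformers import pipeline

backbone_model.save_pretrained('comprehend_it_model/final')
tokenizer.save_pretrained('comprehend_it_model/final')

classifier = pipeline("zero-shot-classification",
                      model='comprehend_it_model/final')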