Prepared Datasets
General Purpose Datasets
knowledgator/GLINER-multi-task-synthetic-data
A synthetic dataset powering GLiNER's multi-task capabilities. Engineered to handle diverse NLP challenges in a unified framework.
Supported Tasks:
- Named Entity Recognition: Extract and categorize entities (names, organizations, dates)
- Relation Extraction: Identify relationships between text entities
- Text Summarization: Extract key sentences capturing essential information
- Sentiment Analysis: Detect positive, negative, or neutral sentiment regions
- Key-Phrase Extraction: Identify important phrases and keywords
- Question Answering: Locate answers within text given questions
- Open Information Extraction: Extract content based on custom user prompts
{
"total_examples": 48548,
"unique_entities": 50094
}
tner/tweebank_ner
Social media NER dataset from the TNER project, optimized for Twitter-style text processing with informal language patterns.
{
"total_examples": 1639,
"unique_entities": 4,
"entities": ["location", "organization", "other", "person"]
}
thunlp/docred
Document-level relation extraction dataset built from Wikipedia and Wikidata. Designed for complex inter-sentence relationship understanding across entire documents.
Key Features:
- Multi-sentence context processing
- Complex entity relationship inference
- Large-scale distantly supervised data available
{
"total_examples": 3053,
"unique_entities": 6,
"entities": [
"location", "numerical entity", "organization",
"other", "person", "time"
]
}
Multilingual Dataset Collection
Cross-lingual datasets supporting diverse languages and cultural contexts.
tner/multinerd
First language-agnostic methodology for creating multilingual, multi-genre NER annotations. Covers 10 languages with fine-grained entity categories.
Languages: Chinese, Dutch, English, French, German, Italian, Polish, Portuguese, Russian, Spanish
Genres: Wikipedia articles, WikiNews content
{
"total_examples": 2283360,
"unique_entities": 15,
"entities": [
"animal", "biological entity", "celestial body", "disease",
"event", "food", "institution", "location", "media",
"mythological entity", "organization", "person", "plant",
"time", "vehicle"
]
}
MultiCoNER/multiconer_v2
Large-scale multilingual NER dataset addressing contemporary challenges including low-context scenarios and syntactically complex entities.
Languages: Bangla, Chinese, English, Spanish, Farsi, French, German, Hindi, Italian, Portuguese, Swedish, Ukrainian
Domains: Wiki sentences, questions, search queries
{
"total_examples": 154046,
"unique_entities": 33,
"entities": [
"aerospace manufacturer", "anatomical structure", "art work",
"athlete", "car manufacturer", "cleric", "clothing", "disease",
"drink", "facility", "food", "human settlement", "location",
"medical procedure", "medication/vaccine", "musical group",
"organization", "person", "politician", "product", "software",
"sports group", "vehicle", "visual work", "written work"
]
}
\ TODO: add link to this kg dataset
Synthetic Multilingual Dataset
Synthetic multilingual dataset created using Qwen3-4B model annotations on HuggingFaceFW/fineweb-2 content.
{
"total_examples": 41197,
"unique_entities": 27123
}
unimelb-nlp/wikiann
WikiANN (PAN-X) multilingual NER dataset spanning 176 languages with balanced train/dev/test splits.
{
"total_examples": 678900,
"unique_entities": 3,
"entities": ["location", "organization", "person"]
}
Biomedical Dataset Collection
Specialized datasets for medical, biological, and chemical entity recognition.
bigbio/anat_em
Extended Anatomical Entity Mention corpus with 250K+ words manually annotated for anatomical entities using 12 granularity-based categories.
{
"total_examples": 606,
"unique_entities": 12,
"entities": [
"anatomical system", "cancer", "cell", "cellular component",
"developing anatomical structure", "immaterial anatomical entity",
"multi", "organ", "organism subdivision", "organism substance",
"pathological formation", "tissue"
]
}
bigbio/drugprot
DrugProt corpus featuring expert-labeled chemical and gene mentions with biologically relevant relationships from BioCreative VII.
{
"total_examples": 3500,
"unique_entities": 3,
"entities": ["chemical", "gene", "other"]
}
bigbio/muchmore
Parallel English-German medical abstracts corpus from 41 medical journals, each representing distinct medical sub-domains.
{
"total_examples": 7822,
"unique_entities": 2,
"entities": ["other", "umlsterm"]
}
bigbio/chia
Large annotated corpus of patient eligibility criteria from 1,000 Phase IV clinical trials with comprehensive entity and relationship annotations.
{
"total_examples": 2000,
"unique_entities": 17,
"entities": [
"condition", "device", "drug", "measurement", "mood",
"multiplier", "negation", "observation", "other", "person",
"procedure", "qualifier", "reference point", "scope",
"temporal", "value", "visit"
]
}
bigbio/bioinfer
Protein, gene, and RNA relationship corpus with 1,100 sentences annotated for relationships, entities, and syntactic dependencies.
{
"total_examples": 894,
"unique_entities": 5,
"entities": [
"gene", "individual protein", "other",
"protein complex", "protein family or group"
]
}
bigbio/scai_disease
Disease and adverse effects dataset with 400 MEDLINE abstracts annotated by life sciences Masters degree holders.
{
"total_examples": 400,
"unique_entities": 2,
"entities": ["adverse", "disease"]
}
bigbio/nlm_gene
Comprehensive gene recognition corpus with 550 PubMed articles covering 28 organisms and 15K+ unique gene names.
{
"total_examples": 450,
"unique_entities": 3,
"entities": ["gene", "gene or gene product", "other"]
}
bigbio/seth_corpus
SNP (Single Nucleotide Polymorphism) named entity recognition corpus from 630 PubMed citations.
{
"total_examples": 630,
"unique_entities": 4,
"entities": ["gene", "other", "rs", "snp"]
}
bigbio/ctebmsp
Spanish clinical trials corpus with 500 Creative Commons licensed abstracts from evidence-based medicine studies.
{
"total_examples": 300,
"unique_entities": 5,
"entities": ["anatomy", "chemical", "disorder", "other", "process"]
}
bigbio/mirna
MicroRNA corpus with 301 Medline citations manually annotated for gene, disease, and miRNA entities.
{
"total_examples": 201,
"unique_entities": 7,
"entities": [
"diseases", "genes", "non", "other",
"relation trigger", "species", "specific mirnas"
]
}
bigbio/gnormplus
Enhanced gene corpus combining BioCreative II GN and Citation GIA datasets with added gene family and protein domain annotations.
{
"total_examples": 432,
"unique_entities": 4,
"entities": ["Gene", "domain motif", "family name", "other"]
}
bigbio/osiris
MEDLINE abstracts manually annotated for human genetic variation mentions under Creative Commons licensing.
{
"total_examples": 105,
"unique_entities": 2,
"entities": ["gene", "variant"]
}