Components & Configs

GLiNERConfig ^[source]

The configuration class used to define the architecture and behavior of a GLiNER model. It inherits from PretrainedConfig.

This class is used to control key architectural aspects, including the encoder, label encoder, span representation strategy, and additional fusion or RNN layers.

Parameters

`model_name`

str, optional, defaults to "microsoft/deberta-v3-small"

Base encoder model identifier from Hugging Face Hub or local path.

`labels_encoder`

str, optional

Encoder model to be used for embedding label texts. Can be a model ID or local path.

`name`

str, optional, defaults to "span level gliner"

Optional display name for this model configuration.

`max_width`

int, optional, defaults to 12

Maximum span width (in number of tokens) allowed when generating candidate spans.
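
To illustrate what a width limit means for candidate generation, the following is a minimal sketch in plain Python. The function name `enumerate_spans` and the inclusive `(start, end)` convention are assumptions made for this illustration, not the library's own implementation.

```python
from typing import List, Tuple


def enumerate_spans(num_tokens: int, max_width: int = 12) -> List[Tuple[int, int]]:
    """Return (start, end) token index pairs (end inclusive) no wider than max_width."""
    spans = []
    for start in range(num_tokens):
        for width in range(1, max_width + 1):
            end = start + width - 1
            if end >= num_tokens:
                break  # the span would run past the last token
            spans.append((start, end))
    return spans


# Hypothetical input: a 5-token sentence with spans of at most 2 tokens.
spans = enumerate_spans(num_tokens=5, max_width=2)
print(spans)
```

The number of candidates grows roughly as the sequence length times `max_width`, so a larger value covers longer entities at a higher compute and memory cost.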

`hidden_size`

int, optional, defaults to 512

Dimensionality of hidden representations in internal layers.

`dropout`

float, optional, defaults to 0.4

Dropout rate applied to intermediate layers.

`fine_tune`

bool, optional, defaults to True

Whether to fine-tune the encoder during training.

`subtoken_pooling`

str, optional, defaults to "first"

Strategy used to pool subword token embeddings.
Choices: "first", "mean", "last"
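
The three choices can be pictured with a small sketch that pools the subword vectors of a single word. The helper `pool_subtokens` and the toy vectors are hypothetical; the library operates on batched tensors rather than Python lists.

```python
from typing import List


def pool_subtokens(subtoken_vectors: List[List[float]], strategy: str = "first") -> List[float]:
    """Collapse the subword vectors of one word into a single word vector."""
    if not subtoken_vectors:
        raise ValueError("a word needs at least one subtoken vector")
    if strategy == "first":
        return list(subtoken_vectors[0])
    if strategy == "last":
        return list(subtoken_vectors[-1])
    if strategy == "mean":
        count = len(subtoken_vectors)
        # zip(*...) iterates over embedding dimensions instead of subtokens
        return [sum(column) / count for column in zip(*subtoken_vectors)]
    raise ValueError(f"unknown subtoken_pooling strategy: {strategy!r}")


# Toy example: one word split into three subword pieces, 2-dimensional embeddings.
pieces = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(pool_subtokens(pieces, "first"))
print(pool_subtokens(pieces, "mean"))
print(pool_subtokens(pieces, "last"))
```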

`span_mode` ^[source]

str, optional, defaults to "markerV0"

Defines the strategy for constructing span representations from encoder outputs.

Available options:

"markerV0" — Projects the start and end token representations with MLPs, concatenates them, and then applies a final projection. Lightweight and default.
"marker" — Similar to markerV0 but with deeper two-layer projections; better for complex tasks.
"query" — Uses learned per-span-width query vectors and dot-product interaction.
"mlp" — Applies a feedforward MLP and reshapes output into span format; fast but position-agnostic.
"cat" — Concatenates token features with learned span width embeddings before projection.
"conv_conv" — Uses multiple 1D convolutions with increasing kernel sizes; captures internal structure.
"conv_max" — Max pooling over tokens in span; emphasizes the strongest token.
"conv_mean" — Mean pooling across span tokens.
"conv_sum" — Sum pooling; raw additive representation.
"conv_share" — Shared convolution kernel over span widths; parameter-efficient alternative.

`post_fusion_schema` ^[source]

str, optional, defaults to ""

Defines the multi-step attention schema used to fuse span and label embeddings. The value is a string with hyphen-separated tokens that determine the sequence of attention operations applied in the CrossFuser module.

Each token in the schema defines one of the following attention types:

"l2l" — label-to-label self-attention (intra-label interaction)
"t2t" — token-to-token self-attention (intra-span interaction)
"l2t" — label-to-token cross-attention (labels attend to span tokens)
"t2l" — token-to-label cross-attention (tokens attend to labels)

Examples:

"l2l-l2t-t2t" — apply label self-attention → label-to-token attention → token self-attention
"l2t" — a single step where labels attend to span tokens
"" — disables fusion entirely (no interaction is applied)

The number and order of operations affect both performance and computational cost.

Tip

The number of fusion layers (num_post_fusion_layers) controls how many times the entire schema is repeated.
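
As a rough illustration of how a schema string and a layer count combine, the sketch below expands them into a flat list of attention steps. The helper `expand_schema` is hypothetical and only mirrors the description above; it is not the CrossFuser code.

```python
from typing import List

VALID_STEPS = {"l2l", "t2t", "l2t", "t2l"}


def expand_schema(schema: str, num_layers: int = 1) -> List[str]:
    """Turn a hyphen-separated schema into the ordered list of attention steps."""
    if schema == "":
        return []  # an empty schema disables fusion
    steps = schema.split("-")
    unknown = [step for step in steps if step not in VALID_STEPS]
    if unknown:
        raise ValueError(f"unknown attention types in schema: {unknown}")
    # The whole schema is repeated once per fusion layer.
    return steps * num_layers


print(expand_schema("l2l-l2t-t2t", num_layers=2))
print(expand_schema(""))
```

Under this reading, the total number of attention operations is the number of schema tokens multiplied by `num_post_fusion_layers`.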

`num_post_fusion_layers`

int, optional, defaults to 1

Number of layers applied after span-label fusion.

`vocab_size`

int, optional, defaults to -1

Vocabulary size override if needed (e.g. for decoder components).

`max_neg_type_ratio`

int, optional, defaults to 1

Controls the ratio of negative (non-matching) types during training.

`max_types`

int, optional, defaults to 25

Maximum number of entity types supported.

`max_len`

int, optional, defaults to 384

Maximum sequence length accepted by the encoder.

`words_splitter_type`

str, optional, defaults to "whitespace"

Heuristic used for word-level splitting during inference.
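
A whitespace heuristic can be sketched as follows. This is an assumption about the general idea only: the library's actual splitter may treat punctuation and hyphens differently, and the function name `split_words` is made up for this example.

```python
import re
from typing import List, Tuple


def split_words(text: str) -> List[Tuple[str, int, int]]:
    """Split on runs of whitespace and keep character offsets (end exclusive)."""
    return [(match.group(), match.start(), match.end()) for match in re.finditer(r"\S+", text)]


words = split_words("Marie Curie  won.")
print(words)
```

Keeping character offsets next to each word makes it possible to map a predicted word-level span back to a substring of the original text.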

`has_rnn`

bool, optional, defaults to True

Whether to apply an LSTM on top of encoder outputs.

`fuse_layers`

bool, optional, defaults to False

If True, combine representations from multiple encoders (labels and main encoder).

`embed_ent_token`

bool, optional, defaults to True

If True, the <<ENT>> token preceding each label is pooled as that label's embedding. If False, the first token of each label is pooled instead.

`class_token_index`

int, optional, defaults to -1

Index of the classification token in the encoder (e.g. [CLS]). Set -1 if unused.

`encoder_config`

dict or PretrainedConfig, optional

A nested config dictionary for the encoder model. If a dict is passed, its model_type must be set or inferred.

`labels_encoder_config`

dict or PretrainedConfig, optional

Same as encoder_config, but used to configure the label encoder.

`ent_token`

str, optional, defaults to "<<ENT>>"

Token used to mark entity span boundaries.

`sep_token`

str, optional, defaults to "<<SEP>>"

Token used to separate entities or fields in span input.
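
One way to picture how the two special tokens fit together is the sketch below, which places an entity token before every label and a separator before the text. The exact input layout is an assumption made for illustration, and `build_prompt` is a hypothetical helper, not a library function.

```python
from typing import List


def build_prompt(labels: List[str], words: List[str],
                 ent_token: str = "<<ENT>>", sep_token: str = "<<SEP>>") -> List[str]:
    """Assumed layout: [ENT, label_1, ENT, label_2, ..., SEP, word_1, word_2, ...]."""
    prompt: List[str] = []
    for label in labels:
        prompt.extend([ent_token, label])
    prompt.append(sep_token)
    return prompt + list(words)


prompt = build_prompt(["person", "city"], ["Ada", "visited", "Turin"])
# Positions of the entity tokens, i.e. where label embeddings would be pooled
# when embed_ent_token is True.
label_positions = [i for i, token in enumerate(prompt) if token == "<<ENT>>"]
print(prompt)
print(label_positions)
```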

`_attn_implementation`

any, optional

Optional override for attention logic (used in advanced configurations or experimentation).

Example:

Passing "eager" turns off Flash Attention when flashdeberta or flash-attn is installed and the model supports Flash Attention.

model = GLiNER.from_pretrained("urchade/gliner_mediumv2.1", _attn_implementation="eager")

Examples

Initialize a model from a config

from gliner import GLiNERConfig, GLiNER

config = GLiNERConfig.from_pretrained("knowledgator/gliner-multitask-large-v0.5")
model = GLiNER(config)

TrainingArguments ^[source]

Custom extension of transformers.TrainingArguments with additional parameters for span-based models, focal loss control, and parameter-specific optimization.

Parameters

`cache_dir`

str, optional
Directory to store cache files. If specified, the model and tokenizer are loaded from the local cache_dir.

`optim`

str, optional, defaults to "adamw_torch"
Optimizer name used during training.

`others_lr`

float, optional
Overrides learning rate for non-encoder parameters (e.g. label encoder, token_rep_layer). Used to create separate optimizer groups in create_optimizer.

`others_weight_decay`

float, optional, defaults to 0.0
Weight decay used for non-encoder parameters. Only applied if others_lr is specified.
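
The idea of separate optimizer groups can be sketched without any framework. The parameter names, the `encoder.` prefix, and the helper `build_param_groups` are all hypothetical; the real create_optimizer may decide group membership differently.

```python
from typing import Dict, List, Optional


def build_param_groups(param_names: List[str], lr: float, weight_decay: float,
                       others_lr: Optional[float] = None,
                       others_weight_decay: float = 0.0,
                       encoder_prefix: str = "encoder.") -> List[Dict]:
    """Split parameters into encoder and non-encoder groups with their own settings."""
    if others_lr is None:
        # No override: a single group using the base learning rate and weight decay.
        return [{"params": list(param_names), "lr": lr, "weight_decay": weight_decay}]
    encoder = [name for name in param_names if name.startswith(encoder_prefix)]
    others = [name for name in param_names if not name.startswith(encoder_prefix)]
    return [
        {"params": encoder, "lr": lr, "weight_decay": weight_decay},
        {"params": others, "lr": others_lr, "weight_decay": others_weight_decay},
    ]


# Hypothetical parameter names, for illustration only.
names = ["encoder.layer.0.weight", "span_rep.proj.weight", "rnn.lstm.weight"]
groups = build_param_groups(names, lr=1e-5, weight_decay=0.01, others_lr=5e-5)
print(groups)
```

Dictionaries of this shape follow the per-parameter-group convention used by PyTorch optimizers, where each group carries its own `lr` and `weight_decay`.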

`focal_loss_alpha`

float, optional, defaults to -1
Alpha for focal loss. If ≥ 0, focal loss is activated.

Focal loss formula:
FL(pₜ) = -α × (1 - pₜ)^γ × log(pₜ)

`focal_loss_gamma`

float, optional, defaults to 0
Gamma for focal loss. Amplifies effect of hard examples.
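
The formula above can be evaluated for a single prediction with a few lines of Python. This sketch follows the documented formula term by term. Falling back to plain cross-entropy when alpha is negative is an assumption about what "not activated" means, and the function name `focal_loss` is chosen for this example only.

```python
import math


def focal_loss(p_t: float, alpha: float = -1.0, gamma: float = 0.0) -> float:
    """FL(p_t) = -alpha * (1 - p_t)^gamma * log(p_t) for one example."""
    cross_entropy = -math.log(p_t)
    if alpha < 0:
        return cross_entropy  # assumption: focal weighting is switched off
    return alpha * (1.0 - p_t) ** gamma * cross_entropy


easy = focal_loss(0.9, alpha=0.25, gamma=2.0)  # confident, correct prediction
hard = focal_loss(0.1, alpha=0.25, gamma=2.0)  # poorly classified example
print(easy, hard)
```

With gamma above zero, the factor (1 - pₜ)^γ shrinks the loss of well-classified examples far more than that of hard ones, which is the amplification effect described for `focal_loss_gamma`.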

`label_smoothing`

float, optional, defaults to 0.0
Smoothing factor ε for regularizing classification targets.

Smoothed label formula:
yᵢ_smooth = (1 - ε) × yᵢ + ε / N
Where:

ε is the label smoothing factor
N is the number of classes
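
As a worked example of the smoothing formula, the sketch below applies it to a one-hot target over four classes. The helper name `smooth_labels` is hypothetical.

```python
from typing import List


def smooth_labels(targets: List[float], epsilon: float = 0.0) -> List[float]:
    """Apply y_smooth = (1 - epsilon) * y + epsilon / N to every class target."""
    num_classes = len(targets)
    return [(1.0 - epsilon) * y + epsilon / num_classes for y in targets]


smoothed = smooth_labels([0.0, 1.0, 0.0, 0.0], epsilon=0.1)
print(smoothed)
```

With ε = 0.1 and N = 4, the positive class drops from 1 to 0.925 and every other class rises from 0 to 0.025, so the targets still sum to 1.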

`loss_reduction`

str, optional, defaults to "sum"
Specifies how loss is aggregated.
Choices: "sum", "mean"

`negatives`

float, optional, defaults to 1.0
Ratio of negative to positive spans during training. Controls sampling balance.

`masking`

str, optional, defaults to "global"
Controls masking strategy for spans.
Choices:

"global" — fixed mask
"softmax" — attention-based masking
"none" — no masking

Training Configuration ^[source]

Custom config.yaml used to control model initialization, architecture, and training behavior.

Parameters

`model_name`

str, required
Base encoder model identifier from Hugging Face Hub.

Note

If "prev_path" is not null, this value is implicitly inherited from the model at "prev_path".

Example: "microsoft/deberta-v3-small"

`labels_encoder`

str, optional
Model used to encode label descriptions, given as a model identifier from Hugging Face Hub.

Note

If "prev_path" is not null and the model is a bi-encoder ("labels_encoder" is not null), this value is implicitly inherited from the model at "prev_path".

`name`

str, optional
Optional display name for tracking or logging.

Note

If "prev_path" is not null, this value is implicitly inherited from the model at "prev_path".

`max_width`

int, optional, defaults to 12
Maximum span width (in tokens) considered during candidate generation.

Note

If "prev_path" is not null, this value is implicitly inherited from the model at "prev_path".

`hidden_size`

int, optional
Dimensionality of hidden layers and span projections. Should match the encoder's output.

Note

If "prev_path" is not null, this value is implicitly inherited from the model at "prev_path".

`dropout`

float, optional, defaults to 0.4
Dropout rate applied to internal projection layers.

Note

If "prev_path" is not null, this value is implicitly inherited from the model at "prev_path".

`fine_tune`

bool, optional, defaults to true
Whether to fine-tune encoder weights during training.

Note

If "prev_path" is not null, this value is implicitly inherited from the model at "prev_path".

`subtoken_pooling`

str, optional, defaults to "first"
Pooling strategy for subword embeddings.
Choices: "first", "mean", "last"

Note

If "prev_path" is not null, this value is implicitly inherited from the model at "prev_path".

`fuse_layers`

bool, optional, defaults to false
Whether to use additional attention fusion layers after span-label interaction.

Note

If "prev_path" is not null, this value is implicitly inherited from the model at "prev_path".

`post_fusion_schema`

str, optional, defaults to ""
Schema string for defining multi-step span-label fusion (e.g., "l2l-l2t-t2t").

Note

If "prev_path" is not null, this value is implicitly inherited from the model at "prev_path".

`span_mode`

str, optional, defaults to "markerV0"
Method for representing spans. See Span Representation Layers.

Note

If "prev_path" is not null, this value is implicitly inherited from the model at "prev_path".

`num_steps`

int, required
Total number of training steps.

`train_batch_size`

int, required
Training batch size per step.

`eval_every`

int, optional
Evaluate the model every N steps.

`warmup_ratio`

float, optional
Fraction of total steps used for LR warm-up.

`scheduler_type`

str, optional, defaults to "cosine"
Learning rate scheduler to use.
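
To show how `warmup_ratio` and a cosine scheduler could interact, here is a minimal sketch of a learning-rate multiplier with linear warm-up followed by cosine decay to zero. The exact schedule used in training is an assumption, the default of 0.1 for `warmup_ratio` is chosen only for this example, and `lr_multiplier` is a hypothetical helper.

```python
import math


def lr_multiplier(step: int, num_steps: int, warmup_ratio: float = 0.1) -> float:
    """Factor applied to the base learning rate at a given training step."""
    warmup_steps = int(num_steps * warmup_ratio)
    if step < warmup_steps:
        return step / max(1, warmup_steps)  # linear warm-up from 0 to 1
    progress = (step - warmup_steps) / max(1, num_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))  # cosine decay from 1 to 0


for step in (0, 50, 100, 550, 1000):
    print(step, lr_multiplier(step, num_steps=1000))
```

With `num_steps` of 1000 and a ratio of 0.1, the first 100 steps ramp the rate up and the remaining 900 steps decay it.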

`loss_alpha`

float, optional
Alpha parameter for focal loss. If ≥ 0, focal loss is applied.

`loss_gamma`

float, optional
Gamma parameter for focal loss.

`label_smoothing`

float, optional
Label smoothing coefficient.

`loss_reduction`

str, optional, defaults to "sum"
Aggregation method for loss.
Choices: "sum", "mean"

`lr_encoder`

float, required
Learning rate for encoder parameters.

`lr_others`

float, required
Learning rate for non-encoder parameters (e.g., span layers, label encoder).

`weight_decay_encoder`

float, optional
Weight decay for encoder parameters.

`weight_decay_other`

float, optional
Weight decay for non-encoder parameters.

`max_grad_norm`

float, optional
Maximum norm for gradient clipping.

`root_dir`

str, required
Root directory for logs and checkpoints.

`train_data`

str, required
Path to training data JSON file.

`val_data_dir`

str, optional
Path or identifier for validation data.

`prev_path`

str or null, optional
Path to pretrained model checkpoint. Set to "none" or null if not used.

`save_total_limit`

int, optional
Maximum number of checkpoints to keep.

`size_sup`

int, optional, defaults to -1
Max number of supervised examples to use. -1 means all.

`max_types`

int, optional, defaults to 25
Max number of types (labels) allowed during training.

`shuffle_types`

bool, optional, defaults to true
Randomly shuffle label types during training.

`random_drop`

bool, optional, defaults to true
Whether to randomly drop entities during training (for robustness).

`max_neg_type_ratio`

float, optional, defaults to 1.0
Controls the ratio of negative to positive entity types.

`max_len`

int, optional, defaults to 386
Maximum sequence length passed to the encoder.

`freeze_token_rep`

bool, optional, defaults to false
If true, freezes token-level span representation layers.
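
The required and optional parameters listed above can be summarized in a small validation sketch. It represents the loaded config.yaml as a plain Python dict. The helper names and all example values (learning rates, paths, step counts) are hypothetical, and only defaults stated in this section are filled in.

```python
from typing import Dict, List

REQUIRED_KEYS = ("model_name", "num_steps", "train_batch_size",
                 "lr_encoder", "lr_others", "root_dir", "train_data")

# Defaults as documented in this section; parameters without a stated default are omitted.
DEFAULTS = {
    "max_width": 12, "dropout": 0.4, "fine_tune": True, "subtoken_pooling": "first",
    "fuse_layers": False, "post_fusion_schema": "", "span_mode": "markerV0",
    "scheduler_type": "cosine", "loss_reduction": "sum", "size_sup": -1,
    "max_types": 25, "shuffle_types": True, "random_drop": True,
    "max_neg_type_ratio": 1.0, "max_len": 386, "freeze_token_rep": False,
}


def missing_required(user_config: Dict) -> List[str]:
    """List the required keys that the user configuration does not provide."""
    return [key for key in REQUIRED_KEYS if key not in user_config]


def resolve_config(user_config: Dict) -> Dict:
    """Fill in documented defaults; user-provided values take precedence."""
    missing = missing_required(user_config)
    if missing:
        raise KeyError(f"missing required config keys: {missing}")
    return {**DEFAULTS, **user_config}


# Hypothetical values, for illustration only.
user_config = {
    "model_name": "microsoft/deberta-v3-small", "num_steps": 1000, "train_batch_size": 8,
    "lr_encoder": 1e-5, "lr_others": 5e-5, "root_dir": "logs", "train_data": "data.json",
    "span_mode": "marker",
}
resolved = resolve_config(user_config)
print(resolved["span_mode"], resolved["max_len"])
```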

Span Representation Layers

GLiNER supports multiple span representation strategies that define how text spans (e.g., entity candidates) are encoded using the contextualized token embeddings from the encoder. These are selected via the span_mode parameter in GLiNERConfig.

Modes

`markerV0` ^[source]

Projects the start and end token embeddings with MLPs, concatenates them, then applies a final projection. Lightweight and effective.

Module: SpanMarkerV0
Recommended use: Default general-purpose span representation.

`marker` ^[source]

Similar to markerV0, but uses deeper two-layer projections for start and end positions separately before fusion. More expressive but slightly more computationally expensive.

Module: SpanMarker
Recommended use: When higher span-level abstraction is needed.

`query` ^[source]

Uses learned per-width query vectors to extract span representations via a dot-product, attention-like einsum. The resulting tensor is then projected.

Module: SpanQuery
Recommended use: When fixed span queries per width are desirable.

`mlp` ^[source]

Applies a feedforward MLP over token embeddings and reshapes output to [B, L, max_width, D].

Module: SpanMLP
Recommended use: When efficiency and parallelism matter more than positional structure.

`cat` ^[source]

Concatenates token embeddings with learned span-width embeddings, then projects the result. Explicitly models span width.

Module: SpanCAT
Recommended use: When span length is a relevant feature.

`conv_conv` ^[source]

Applies a series of 1D convolutions with increasing kernel sizes. Captures internal structure within spans.

Module: SpanConv
Recommended use: Tasks that benefit from compositional features inside spans.

`conv_max` / `conv_mean` / `conv_sum` ^[source]

Apply pooling (max, mean, sum) over spans using increasing kernel sizes.
Behavior differs only by pooling type.

Module: SpanConvBlock
Recommended use: When a fixed summary of span tokens is appropriate.
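
The difference between the three pooling types can be seen in a plain-Python sketch that pools token vectors over every span start and width. This is an illustration only: the actual module works on batched tensors, and filling spans that run past the end of the sequence with zeros is an assumption of this example, as is the helper name `pool_spans`.

```python
from typing import List


def pool_spans(tokens: List[List[float]], max_width: int, mode: str = "max") -> List[List[List[float]]]:
    """Return span vectors indexed as [start][width - 1][dim] (no batch dimension)."""
    dim = len(tokens[0])
    output = []
    for start in range(len(tokens)):
        row = []
        for width in range(1, max_width + 1):
            window = tokens[start:start + width]
            if len(window) < width:
                row.append([0.0] * dim)  # assumption: invalid spans are zero-filled
                continue
            columns = list(zip(*window))  # one tuple of values per embedding dimension
            if mode == "max":
                row.append([max(column) for column in columns])
            elif mode == "mean":
                row.append([sum(column) / width for column in columns])
            elif mode == "sum":
                row.append([sum(column) for column in columns])
            else:
                raise ValueError(f"unknown pooling mode: {mode!r}")
        output.append(row)
    return output


# Toy input: 3 tokens with 2-dimensional embeddings.
tokens = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0]]
print(pool_spans(tokens, max_width=2, mode="max")[0][1])
print(pool_spans(tokens, max_width=2, mode="mean")[0][1])
print(pool_spans(tokens, max_width=2, mode="sum")[1][1])
```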

`conv_share` ^[source]

Applies a shared convolutional kernel with increasing receptive fields. Parameter-efficient.

Module: ConvShare
Recommended use: When model size or shared pattern extraction is prioritized.

Each representation module returns a tensor of shape [B, L, max_width, D] and can be instantiated interchangeably via the SpanRepLayer interface.
