neural

pymusas.spacy_api.taggers.neural


NeuralTagger

class NeuralTagger(spacy.pipeline.pipe.Pipe):
| ...
| def __init__(
| self,
| name: str = 'pymusas_neural_tagger',
| pymusas_tags_token_attr: str = 'pymusas_tags',
| pymusas_mwe_indexes_attr: str = 'pymusas_mwe_indexes',
| top_n: int = 5,
| device: str = 'cpu',
| tokenizer_kwargs: dict[str, Any] | None = None
| ) -> None

spaCy pipeline component of the pymusas.taggers.neural.NeuralTagger.

The component creates a list of candidate semantic/sense tags for each token in the sequence and assigns them to the Token._.pymusas_tags attribute. In addition, a List of token indexes indicating whether the token is part of a Multi Word Expression (MWE) is assigned to the Token._.pymusas_mwe_indexes attribute. NOTE: at the moment only single word expressions are supported.

The number of candidate tags for each token is determined by the top_n parameter, which is stored in the top_n attribute.

Rule based exceptions

  • If the token is only whitespace, e.g. a space, \t, \n, etc., then the tagger returns a single tag, Z9, and no other tags, even if top_n is greater than 1.
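
To make the interaction between top_n and the whitespace rule concrete, here is a minimal sketch in plain Python. It is not the PyMUSAS implementation: the helper name select_candidate_tags and the scores are hypothetical, and it assumes only that the model yields one score per tag.

```python
from typing import Dict, List


def select_candidate_tags(token_text: str,
                          tag_scores: Dict[str, float],
                          top_n: int) -> List[str]:
    """Hypothetical helper (not part of PyMUSAS) mirroring the documented rules."""
    # Rule based exception: a whitespace-only token gets the Z9 tag and nothing else.
    if token_text.strip() == "":
        return ["Z9"]
    # Rank tags by descending score so the most likely tag comes first.
    ranked = sorted(tag_scores, key=tag_scores.get, reverse=True)
    # top_n == -1 means "return every tag"; otherwise keep the first top_n tags.
    return ranked if top_n == -1 else ranked[:top_n]


# Made-up scores, for illustration only.
scores = {"Q2.2": 0.61, "Z4": 0.22, "Q2": 0.09, "X3.2": 0.05}

print(select_candidate_tags("Hello", scores, 2))
print(select_candidate_tags("\t", scores, 5))
print(select_candidate_tags("Hello", scores, -1))
```

The whitespace check runs before any ranking, which is why the result for a tab character ignores top_n entirely.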

Assigned Attributes

| Location | Type | Value |
|---|---|---|
| Token._.pymusas_tags | List[str] | Predicted tags, the first tag in the List of tags is the most likely tag. |
| Token._.pymusas_mwe_indexes | List[Tuple[int, int]] | Each Tuple indicates the start and end token index of the associated Multi Word Expression (MWE). If the List contains more than one Tuple then the MWE is discontinuous. For single word expressions the List will only contain 1 Tuple which will be (token_start_index, token_start_index + 1). NOTE: at the moment only single word expressions are supported. |
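
The index tuples can be read as half-open ranges, which follows from the single word case (token_start_index, token_start_index + 1). The sketch below illustrates that reading. The helper names are hypothetical and not part of PyMUSAS, and the discontinuous example is made up, since only single word expressions are currently supported.

```python
from typing import List, Tuple

MweIndexes = List[Tuple[int, int]]


def single_word_indexes(token_index: int) -> MweIndexes:
    """Indexes for a single word expression: one (start, start + 1) tuple."""
    return [(token_index, token_index + 1)]


def is_discontinuous(mwe_indexes: MweIndexes) -> bool:
    """More than one tuple means the expression has gaps between its parts."""
    return len(mwe_indexes) > 1


def covered_token_indexes(mwe_indexes: MweIndexes) -> List[int]:
    """Expand each half-open (start, end) tuple into the token indexes it covers."""
    return [index for start, end in mwe_indexes for index in range(start, end)]


print(single_word_indexes(3))
print(is_discontinuous([(0, 2), (4, 5)]))
print(covered_token_indexes([(0, 2), (4, 5)]))
```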

Config and implementation

The default config is defined by the pipeline component factory and describes how the component should be configured. You can override its settings via the config argument on nlp.add_pipe or in your config.cfg for training.

| Setting | Description |
|---|---|
| pymusas_tags_token_attr | See parameters section below |
| pymusas_mwe_indexes_attr | See parameters section below |
| top_n | See parameters section below |
| device | See parameters section below |
| tokenizer_kwargs | See parameters section below |

Parameters

  • name : str, optional (default = pymusas_neural_tagger)
    The component name. Defaults to the same name as the class variable COMPONENT_NAME.
  • pymusas_tags_token_attr : str, optional (default = pymusas_tags)
    The name of the attribute to assign the predicted tags to under the Token._ class.
  • pymusas_mwe_indexes_attr : str, optional (default = pymusas_mwe_indexes)
    The name of the attribute to assign the start and end token index of the associated MWE to under the Token._ class.
  • top_n : int, optional (default = 5)
    The number of tags to predict. If -1, all tags will be predicted. If 0 or less than -1, a ValueError is raised.
  • device : str, optional (default = 'cpu')
    The device to load the model, wsd_model, on, e.g. 'cpu'. It has to be a string that can be passed to torch.device.
  • tokenizer_kwargs : dict[str, Any] | None, optional (default = None)
    Keyword arguments to pass to the tokenizer's transformers.AutoTokenizer.from_pretrained method. These keyword arguments are only passed to the tokenizer on initialization. NOTE: any value that is a custom object will not be serializable with to_bytes and from_bytes once these methods have been implemented. If you save this component to disk, this will be None when the component is loaded again, as the tokenizer itself, self.tokenizer, will contain the contents of tokenizer_kwargs.

Instance Attributes

  • name : str
    The component name.
  • pymusas_tags_token_attr : str, optional (default = pymusas_tags)
    The given pymusas_tags_token_attr
  • pymusas_mwe_indexes_attr : str, optional (default = pymusas_mwe_indexes)
    The given pymusas_mwe_indexes_attr
  • top_n : int, optional (default = 5)
    The number of tags to predict. If -1, all tags will be predicted. If 0 or less than -1, a ValueError is raised.
  • device : torch.device
    The device that the wsd_model will be loaded on, e.g. torch.device('cpu').
  • wsd_model : wsd_torch_models.bem.BEM | None, optional (default = None)
    The neural Word Sense Disambiguation (WSD) model. This is None until the component is initialized or has been loaded from disk or bytes.
  • tokenizer : transformers.PreTrainedTokenizerBase | None, optional (default = None)
    The sub-word tokenizer that the wsd_model uses. This tokenizer further tokenizes the tokens from the spaCy tokenizer, hence it being a sub-word tokenizer. This is None until the component is initialized or has been loaded from disk or bytes.
  • _tokenizer_kwargs : dict[str, Any] | None, optional (default = None)
    The keyword arguments that have or will be passed to the tokenizer's transformers.AutoTokenizer.from_pretrained method. These keyword arguments are only passed to the tokenizer on initialization.

Class Attributes

  • COMPONENT_NAME : str
    Name of component factory that this component is registered under. This is used as the first argument to Language.add_pipe if you want to add this component to your spaCy pipeline.

Raises

  • ValueError
    If top_n is 0 or less than -1.
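
The accepted range for top_n can be summarised as "-1 or any positive integer". The following sketch shows a check that matches the documented behaviour. The function validate_top_n and its error message are hypothetical and are not taken from the PyMUSAS source.

```python
def validate_top_n(top_n: int) -> int:
    """Hypothetical check: accept -1 (all tags) or a positive integer."""
    if top_n == 0 or top_n < -1:
        raise ValueError(
            f"top_n must be -1 or a positive integer, but got {top_n}"
        )
    return top_n


for candidate in (5, -1, 0, -2):
    try:
        print(candidate, "accepted:", validate_top_n(candidate))
    except ValueError as error:
        print(candidate, "rejected:", error)
```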

Examples

import spacy
from pymusas.spacy_api.taggers.neural import NeuralTagger
# Construction via spaCy pipeline
nlp = spacy.blank('en')
# Using default config
tagger = nlp.add_pipe('pymusas_neural_tagger')
tagger.initialize(pretrained_model_name_or_path="ucrelnlp/PyMUSAS-Neural-English-Small-BEM")
token = nlp('Hello')
assert token[0]._.pymusas_tags == ['Q2.2', 'Z4', 'Q2', 'X3.2', 'Q2.1']
assert token[0]._.pymusas_mwe_indexes == [(0, 1)]
# Custom config
custom_config = {'pymusas_tags_token_attr': 'semantic_tags',
                 'pymusas_mwe_indexes_attr': 'mwe_indexes',
                 'top_n': 2,
                 'tokenizer_kwargs': {'add_prefix_space': True}}
nlp = spacy.blank('en')
tagger = nlp.add_pipe('pymusas_neural_tagger', config=custom_config)
tagger.initialize(pretrained_model_name_or_path="ucrelnlp/PyMUSAS-Neural-English-Small-BEM")
token = nlp('Hello')
assert token[0]._.semantic_tags == ['Q2.2', 'Z4']
assert token[0]._.mwe_indexes == [(0, 1)]

COMPONENT_NAME

class NeuralTagger(spacy.pipeline.pipe.Pipe):
| ...
| COMPONENT_NAME = 'pymusas_neural_tagger'

initialize

class NeuralTagger(spacy.pipeline.pipe.Pipe):
| ...
| def initialize(
| self,
| get_examples: Optional[Callable[[], Iterable[Example]]] = None,
| *,
| nlp: Optional[Language] = None,
| pretrained_model_name_or_path: Optional[str | Path] = None
| ) -> None

Initialize the tagger and load any of the resources given. The method is typically called by Language.initialize and lets you customize the arguments it receives via the initialize.components block in the config. The loading only happens during initialization, typically before training. At runtime, all data is loaded from disk.

Parameters

  • pretrained_model_name_or_path : str | Path
    The string ID or path of the pretrained neural Word Sense Disambiguation (WSD) model to load.

    NOTE: currently we only support the wsd_torch_models.bem.BEM model

    • A string, the model id of a pretrained wsd-torch-models model that is hosted on the HuggingFace Hub.
    • A Path or str pointing to a directory that can be loaded through the from_pretrained method of a wsd-torch-models model.

    NOTE: this model name or path has to also be able to load the tokenizer using the function transformers.AutoTokenizer.from_pretrained(pretrained_model_name_or_path)

__call__

class NeuralTagger(spacy.pipeline.pipe.Pipe):
| ...
| def __call__(self, doc: Doc) -> Doc

Applies the tagger to the spaCy document, modifies it in place, and returns it. This usually happens under the hood when the nlp object is called on a text and all pipeline components are applied to the Doc in order.

Parameters

  • doc : Doc
    The spaCy document to apply the tagger to.

Returns

  • Doc

to_disk

class NeuralTagger(spacy.pipeline.pipe.Pipe):
| ...
| def to_disk(
| self,
| path: Union[str, Path],
| *,
| exclude: Iterable[str] = SimpleFrozenList()
| ) -> None

Serialises the tagger to the given path.

Parameters

  • path : Union[str, Path]
    Path to a directory. Path may be either string or Path-like object. If the directory does not exist it attempts to create a directory at the given path.

  • exclude : Iterable[str], optional (default = SimpleFrozenList())
    This currently does not do anything, please ignore it.

Returns

  • None

Examples

from tempfile import TemporaryDirectory
from pymusas.spacy_api.taggers.neural import NeuralTagger
tagger = NeuralTagger()
tagger.initialize(pretrained_model_name_or_path="ucrelnlp/PyMUSAS-Neural-English-Small-BEM")
with TemporaryDirectory() as temp_dir:
    _ = tagger.to_disk(temp_dir)

from_disk

class NeuralTagger(spacy.pipeline.pipe.Pipe):
| ...
| def from_disk(
| self,
| path: Union[str, Path],
| *,
| exclude: Iterable[str] = SimpleFrozenList()
| ) -> "NeuralTagger"

Loads the tagger from the given path in place and returns it.

Parameters

  • path : Union[str, Path]
    Path to an existing directory. Path may be either string or Path-like object.

  • exclude : Iterable[str], optional (default = SimpleFrozenList())
    This currently does not do anything, please ignore it.

Returns

  • NeuralTagger

Examples

from pathlib import Path
from tempfile import TemporaryDirectory
from pymusas.spacy_api.taggers.neural import NeuralTagger
tagger = NeuralTagger()
tagger_2 = NeuralTagger()
assert tagger_2.wsd_model is None
tagger.initialize(pretrained_model_name_or_path="ucrelnlp/PyMUSAS-Neural-English-Small-BEM")
with TemporaryDirectory() as temp_dir:
    _ = tagger.to_disk(temp_dir)
    _ = tagger_2.from_disk(temp_dir)

assert tagger_2.wsd_model.base_model_name == tagger.wsd_model.base_model_name