neural
pymusas.spacy_api.taggers.neural
NeuralTagger
class NeuralTagger(spacy.pipeline.pipe.Pipe):
| ...
| def __init__(
| self,
| name: str = 'pymusas_neural_tagger',
| pymusas_tags_token_attr: str = 'pymusas_tags',
| pymusas_mwe_indexes_attr: str = 'pymusas_mwe_indexes',
| top_n: int = 5,
| device: str = 'cpu',
| tokenizer_kwargs: dict[str, Any] | None = None
| ) -> None
spaCy pipeline component
of the pymusas.taggers.neural.NeuralTagger.
The component creates a list of candidate semantic/sense tags for each token
in the sequence and assigns them to the Token._.pymusas_tags attribute. In
addition, a List of token indexes indicating whether the token is part of a
Multi Word Expression (MWE) is assigned to Token._.pymusas_mwe_indexes.
NOTE: at the moment only single word expressions are supported.
The number of candidate tags for each token is determined by the
top_n parameter, which is then stored in the top_n attribute.
Rule based exceptions
- If the token is only whitespace, e.g. ' ', '\t', '\n', etc., then the tagger will return only one tag, the Z9 tag, and no other tags, even if top_n is greater than 1.
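The whitespace rule and the top_n cut-off can be illustrated with a small, self-contained sketch. It is not PyMUSAS's implementation: the helper candidate_tags and the tag scores are hypothetical, and the code only mirrors the behaviour described above (a whitespace-only token yields the single tag Z9, otherwise the top_n highest scoring tags are returned, or all tags when top_n is -1).

```python
from typing import Dict, List


def candidate_tags(token_text: str, tag_scores: Dict[str, float],
                   top_n: int = 5) -> List[str]:
    """Hypothetical helper, not part of PyMUSAS, mirroring the documented rules."""
    # Rule based exception: whitespace-only tokens always get exactly one tag.
    if token_text.isspace():
        return ['Z9']
    # Most likely tag first, matching the documented ordering of the tag list.
    ranked = sorted(tag_scores, key=tag_scores.get, reverse=True)
    # top_n == -1 means "return every tag"; top_n is assumed to be valid here.
    return ranked if top_n == -1 else ranked[:top_n]


# Made-up scores, purely for illustration.
scores = {'Q2.2': 0.7, 'Z4': 0.2, 'Q2': 0.1}
print(candidate_tags('Hello', scores, top_n=2))
print(candidate_tags('\t', scores, top_n=2))
```

The second call returns a single tag even though top_n is 2, which is the exception the rule describes.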
Assigned Attributes
| Location | Type | Value |
|---|---|---|
| Token._.pymusas_tags | List[str] | Predicted tags, the first tag in the List of tags is the most likely tag. |
| Token._.pymusas_mwe_indexes | List[Tuple[int, int]] | Each Tuple indicates the start and end token index of the associated Multi Word Expression (MWE). If the List contains more than one Tuple then the MWE is discontinuous. For single word expressions the List will only contain 1 Tuple which will be (token_start_index, token_start_index + 1). NOTE: at the moment only single word expressions are supported. |
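To make the index format concrete, the following sketch expands a pymusas_mwe_indexes value into the token positions it covers. The helper covered_token_indexes is hypothetical (it is not part of PyMUSAS), and it assumes the end index is exclusive, as implied by the single word case (token_start_index, token_start_index + 1).

```python
from typing import List, Tuple


def covered_token_indexes(mwe_indexes: List[Tuple[int, int]]) -> List[int]:
    """Hypothetical helper: token positions covered by the (start, end) spans.

    Assumes `end` is exclusive, as in the single word case (i, i + 1).
    """
    covered: List[int] = []
    for start, end in mwe_indexes:
        covered.extend(range(start, end))
    return covered


# Single word expression at token position 3.
print(covered_token_indexes([(3, 4)]))
# A discontinuous MWE would contain more than one Tuple (illustrative only,
# as only single word expressions are currently supported).
print(covered_token_indexes([(0, 1), (2, 4)]))
```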
Config and implementation
The default config is defined by the pipeline component factory and describes
how the component should be configured. You can override its settings via the config
argument on nlp.add_pipe or in your
config.cfg for training.
| Setting | Description |
|---|---|
| pymusas_tags_token_attr | See parameters section below |
| pymusas_mwe_indexes_attr | See parameters section below |
| top_n | See parameters section below |
| device | See parameters section below |
| tokenizer_kwargs | See parameters section below |
Parameters
- name : str, optional (default = pymusas_neural_tagger)
  The component name. Defaults to the same name as the class variable COMPONENT_NAME.
- pymusas_tags_token_attr : str, optional (default = pymusas_tags)
  The name of the attribute to assign the predicted tags to under the Token._ class.
- pymusas_mwe_indexes_attr : str, optional (default = pymusas_mwe_indexes)
  The name of the attribute to assign the start and end token index of the associated MWE to under the Token._ class.
- top_n : int, optional (default = 5)
  The number of tags to predict. If -1 all tags will be predicted. If 0 or less than -1 a ValueError will be raised.
- device : str, optional (default = 'cpu')
  The device to load the model, wsd_model, on, e.g. 'cpu'. It has to be a string that can be passed to torch.device.
- tokenizer_kwargs : dict[str, Any] | None, optional (default = None)
  Keyword arguments to pass to the tokenizer's transformers.AutoTokenizer.from_pretrained method. These keyword arguments are only passed to the tokenizer on initialization. NOTE: any value that is a custom object will not be serializable with to_bytes and from_bytes when these methods have been implemented. If you save this component to disk, then when it is loaded this will become None, as the tokenizer itself, self.tokenizer, will contain the contents of tokenizer_kwargs.
Instance Attributes
- name : str
  The component name.
- pymusas_tags_token_attr : str, optional (default = pymusas_tags)
  The given pymusas_tags_token_attr.
- pymusas_mwe_indexes_attr : str, optional (default = pymusas_mwe_indexes)
  The given pymusas_mwe_indexes_attr.
- top_n : int, optional (default = 5)
  The number of tags to predict. If -1 all tags will be predicted. If 0 or less than -1 a ValueError will be raised.
- device : torch.device
  The device that wsd_model will be loaded on, as a torch.device object.
- wsd_model : wsd_torch_models.bem.BEM | None, optional (default = None)
  The neural Word Sense Disambiguation (WSD) model. This is None until the component is initialized or has been loaded from disk or bytes.
- tokenizer : transformers.PreTrainedTokenizerBase | None, optional (default = None)
  The sub-word tokenizer that the wsd_model uses. This tokenizer further tokenizes the tokens from the spaCy tokenizer, hence it being a sub-word tokenizer. This is None until the component is initialized or has been loaded from disk or bytes.
- _tokenizer_kwargs : dict[str, Any] | None, optional (default = None)
  The keyword arguments that have been or will be passed to the tokenizer's transformers.AutoTokenizer.from_pretrained method. These keyword arguments are only passed to the tokenizer on initialization.
Class Attributes
- COMPONENT_NAME : str
  Name of the component factory that this component is registered under. This is used as the first argument to Language.add_pipe if you want to add this component to your spaCy pipeline.
Raises
- ValueError
  If top_n is 0 or less than -1.
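The constraint on top_n can be expressed as a short check. This is a sketch of the documented rule only: the function name validate_top_n and the error message are assumptions rather than PyMUSAS's actual code.

```python
def validate_top_n(top_n: int) -> int:
    """Hypothetical check: -1 (all tags) or any positive integer is accepted."""
    if top_n == 0 or top_n < -1:
        raise ValueError(f"top_n must be -1 or greater than 0, got {top_n}")
    return top_n


for value in (5, -1, 0, -2):
    try:
        validate_top_n(value)
        print(value, "accepted")
    except ValueError as error:
        print(value, "rejected:", error)
```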
Examples
import spacy
from pymusas.spacy_api.taggers.neural import NeuralTagger
# Construction via spaCy pipeline
nlp = spacy.blank('en')
# Using default config
tagger = nlp.add_pipe('pymusas_neural_tagger')
tagger.initialize(pretrained_model_name_or_path="ucrelnlp/PyMUSAS-Neural-English-Small-BEM")
token = nlp('Hello')
assert token[0]._.pymusas_tags == ['Q2.2', 'Z4', 'Q2', 'X3.2', 'Q2.1']
assert token[0]._.pymusas_mwe_indexes == [(0, 1)]
# Custom config
custom_config = {'pymusas_tags_token_attr': 'semantic_tags',
                 'pymusas_mwe_indexes_attr': 'mwe_indexes',
                 'top_n': 2,
                 'tokenizer_kwargs': {'add_prefix_space': True}}
nlp = spacy.blank('en')
tagger = nlp.add_pipe('pymusas_neural_tagger', config=custom_config)
tagger.initialize(pretrained_model_name_or_path="ucrelnlp/PyMUSAS-Neural-English-Small-BEM")
token = nlp('Hello')
assert token[0]._.semantic_tags == ['Q2.2', 'Z4']
assert token[0]._.mwe_indexes == [(0, 1)]
COMPONENT_NAME
class NeuralTagger(spacy.pipeline.pipe.Pipe):
| ...
| COMPONENT_NAME = 'pymusas_neural_tagger'
initialize
class NeuralTagger(spacy.pipeline.pipe.Pipe):
| ...
| def initialize(
| self,
| get_examples: Optional[Callable[[], Iterable[Example]]] = None,
| *,
| nlp: Optional[Language] = None,
| pretrained_model_name_or_path: Optional[str | Path] = None
| ) -> None
Initialize the tagger and load any of the resources given. The method is
typically called by
Language.initialize
and lets you customize arguments it receives via the
initialize.components
block in the config. The loading only happens during initialization,
typically before training. At runtime, all data is loaded from disk.
Parameters
- pretrained_model_name_or_path : str | Path
  The string ID or path of the pretrained neural Word Sense Disambiguation (WSD) model to load. NOTE: currently only the wsd_torch_models.bem.BEM model is supported. This can be either:
  - A string, the model ID of a pretrained wsd-torch-models model that is hosted on the HuggingFace Hub.
  - A Path or str that is a directory that can be loaded through the from_pretrained method of a wsd-torch-models model.
  NOTE: this model name or path must also be able to load the tokenizer using the function transformers.AutoTokenizer.from_pretrained(pretrained_model_name_or_path).
__call__
class NeuralTagger(spacy.pipeline.pipe.Pipe):
| ...
| def __call__(self, doc: Doc) -> Doc
Applies the tagger to the spaCy document, modifies it in place, and
returns it. This usually happens under the hood when the nlp object is
called on a text and all pipeline components are applied to the Doc in
order.
Parameters
- doc : Doc
  A spaCy Doc.
Returns
Doc
to_disk
class NeuralTagger(spacy.pipeline.pipe.Pipe):
| ...
| def to_disk(
| self,
| path: Union[str, Path],
| *,
| exclude: Iterable[str] = SimpleFrozenList()
| ) -> None
Serialises the tagger to the given path.
Parameters
- path : Union[str, Path]
  Path to a directory. Path may be either a string or a Path-like object. If the directory does not exist it attempts to create a directory at the given path.
- exclude : Iterable[str], optional (default = SimpleFrozenList())
  This currently does not do anything, please ignore it.
Returns
None
Examples
from tempfile import TemporaryDirectory
from pymusas.spacy_api.taggers.neural import NeuralTagger
tagger = NeuralTagger()
tagger.initialize(pretrained_model_name_or_path="ucrelnlp/PyMUSAS-Neural-English-Small-BEM")
with TemporaryDirectory() as temp_dir:
    _ = tagger.to_disk(temp_dir)
from_disk
class NeuralTagger(spacy.pipeline.pipe.Pipe):
| ...
| def from_disk(
| self,
| path: Union[str, Path],
| *,
| exclude: Iterable[str] = SimpleFrozenList()
| ) -> "NeuralTagger"
Loads the tagger from the given path in place and returns it.
Parameters
- path : Union[str, Path]
  Path to an existing directory. Path may be either a string or a Path-like object.
- exclude : Iterable[str], optional (default = SimpleFrozenList())
  This currently does not do anything, please ignore it.
Returns
NeuralTagger
Examples
from pathlib import Path
from tempfile import TemporaryDirectory
from pymusas.spacy_api.taggers.neural import NeuralTagger
tagger = NeuralTagger()
tagger_2 = NeuralTagger()
assert tagger_2.wsd_model is None
tagger.initialize(pretrained_model_name_or_path="ucrelnlp/PyMUSAS-Neural-English-Small-BEM")
with TemporaryDirectory() as temp_dir:
    _ = tagger.to_disk(temp_dir)
    _ = tagger_2.from_disk(temp_dir)
assert tagger_2.wsd_model.base_model_name == tagger.wsd_model.base_model_name