neural
pymusas.taggers.neural
NeuralTagger
class NeuralTagger:
| ...
| def __init__(
| self,
| pretrained_model_name_or_path: str | Path,
| top_n: int = -1,
| device: str = 'cpu',
| tokenizer_kwargs: dict[str, Any] | None = None
| ) -> None
The tagger, when called through __call__ with a sequence of tokens, creates a list of possible candidate tags for each token in the sequence.
NOTE: at the moment only single word expressions are supported.
The number of possible candidate tags for each token is determined by the top_n parameter, which is then stored in the top_n attribute.
Rule based exceptions
- If the token is only whitespace, e.g. ' ', '\t', '\n', etc., then the tagger will return only one tag, the Z9 tag, and no other tags, even if top_n is greater than 1.
Parameters¶
- pretrained_model_name_or_path : str | Path
The string ID or path of the pretrained neural Word Sense Disambiguation (WSD) model to load. NOTE: currently we only support the wsd_torch_models.bem.BEM model. This can be either:
- A string, the model ID of a pretrained wsd-torch-models model that is hosted on the HuggingFace Hub.
- A Path or str that is a directory that can be loaded through the from_pretrained method of a wsd-torch-models model.
NOTE: this model name or path also has to be able to load the tokenizer using the function transformers.AutoTokenizer.from_pretrained(pretrained_model_name_or_path)
- top_n : int, optional (default = -1)
The number of tags to predict per token. The default of -1 predicts all tags. A value of 0, or any value less than -1, will raise a ValueError.
- device : str, optional (default = 'cpu')
The device to load the model on, e.g. 'cpu'. It has to be a string that can be passed to torch.device.
- tokenizer_kwargs : dict[str, Any] | None, optional (default = None)
Keyword arguments to pass to the tokenizer's transformers.AutoTokenizer.from_pretrained method. These keyword arguments are only passed to the tokenizer on initialization.
Instance Attributes¶
- wsd_model : wsd_torch_models.bem.BEM
The neural Word Sense Disambiguation (WSD) model that was loaded using the pretrained_model_name_or_path.
- tokenizer : transformers.PreTrainedTokenizerBase
The tokenizer that was loaded using the pretrained_model_name_or_path.
- top_n : int
The number of tags to predict.
- device : torch.device
The device that the wsd_model was loaded on, e.g. torch.device('cpu').
- tokenizer_kwargs : dict[str, Any] | None
Keyword arguments to pass to the tokenizer's transformers.AutoTokenizer.from_pretrained method. Default None.
Raises¶
ValueError
If top_n is 0 or less than -1.
Examples¶
from pymusas.taggers.neural import NeuralTagger
tokenizer_kwargs = {"add_prefix_space": True}
neural_tagger = NeuralTagger("ucrelnlp/PyMUSAS-Neural-English-Small-BEM",
device="cpu", top_n=2, tokenizer_kwargs=tokenizer_kwargs)
tokens = ["The", "river", "bank", "was", "full", "of", "fish", " "]
tags_and_indices = neural_tagger(tokens)
expected_tags = [["Z5", "N5"], ["M4", "W3"], ["M4", "W3"], ["A3", "Z5"],
["N5.1", "I3.2"], ["Z5", "N5"], ["L2", "F1"], ["Z9"]]
expected_tag_indices = [[(0, 1)], [(1, 2)], [(2, 3)], [(3, 4)],
[(4, 5)], [(5, 6)], [(6, 7)], [(7, 8)]]
assert tags_and_indices == list(zip(expected_tags, expected_tag_indices))
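Since the first tag in each tag list is the most likely one, a single-best tagging can be derived from the returned structure without any further model calls. A minimal sketch that post-processes output of the shape shown above (the values here are illustrative and hand-written, so no model download is needed):

```python
# NeuralTagger output shape: List[Tuple[List[str], List[Tuple[int, int]]]]
# i.e. for each token, a ranked tag list plus its token-index span(s).
tags_and_indices = [
    (["Z5", "N5"], [(0, 1)]),
    (["M4", "W3"], [(1, 2)]),
    (["Z9"], [(7, 8)]),
]

# The first tag in each ranked tag list is the most likely tag.
best_tags = [tags[0] for tags, _spans in tags_and_indices]
assert best_tags == ["Z5", "M4", "Z9"]
```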
__call__
class NeuralTagger:
| ...
| @torch.inference_mode(mode=True)
| def __call__(
| self,
| tokens: List[str]
| ) -> List[Tuple[List[str], List[Tuple[int, int]]]]
Given a List of tokens it returns, for each token:
- A List of tags. The first tag in the List of tags is the most likely tag.
- A List of Tuples whereby each Tuple indicates the start and end token index of the associated Multi Word Expression (MWE). If the List contains more than one Tuple then the MWE is discontinuous. For single word expressions the List will only contain 1 Tuple, which will be (token_start_index, token_start_index + 1).
NOTE: we recommend that the tokens in the list represent a sentence. In addition, the more tokens in the list, the more memory the model requires and, on CPU at least, the longer it will take to predict the tags.
NOTE: Currently the Neural Tagger is limited to only tagging single word expressions.
This function is wrapped in the torch.inference_mode decorator, which makes the model run more efficiently.
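The span tuples index into the original token list, so the surface text covered by each tag list can be recovered by slicing. A hedged sketch for the single word expression case described above (the token and tag values are illustrative, not real model output):

```python
tokens = ["The", "river", "bank"]
# One (start, end) span per token in the single word expression case,
# where end = start + 1.
tags_and_indices = [
    (["Z5"], [(0, 1)]),
    (["M4"], [(1, 2)]),
    (["M4"], [(2, 3)]),
]

recovered = []
for tags, spans in tags_and_indices:
    # Collect the tokens covered by each (start, end) span; a
    # discontinuous MWE would contribute more than one span here.
    words = [tok for start, end in spans for tok in tokens[start:end]]
    recovered.append((" ".join(words), tags[0]))

assert recovered == [("The", "Z5"), ("river", "M4"), ("bank", "M4")]
```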
Parameters¶
- tokens :
List[str]
A List of the full text form of each token to be tagged.
Returns¶
List[Tuple[List[str], List[Tuple[int, int]]]]
Raises¶
ValueError
If the number of tokens given is not the same as the number of tags predicted/returned.