hybrid
pymusas.spacy_api.taggers.hybrid
HybridTagger
class HybridTagger(RuleBasedTagger, NeuralTagger):
| ...
| def __init__(
| self,
| name: str = 'pymusas_hybrid_tagger',
| pymusas_tags_token_attr: str = 'pymusas_tags',
| pymusas_mwe_indexes_attr: str = 'pymusas_mwe_indexes',
| pos_attribute: str = 'pos_',
| lemma_attribute: str = 'lemma_',
| top_n: int = 5,
| device: str = 'cpu',
| tokenizer_kwargs: dict[str, Any] | None = None
| ) -> None
spaCy pipeline component
of the pymusas.taggers.hybrid.HybridTagger.
This is a hybrid tagger which uses both the
pymusas.spacy_api.taggers.rule_based.RuleBasedTagger
and the pymusas.spacy_api.taggers.neural.NeuralTagger, and it
inherits from both the RuleBasedTagger and the NeuralTagger.
The difference between this and the
RuleBasedTagger is that this tagger uses the NeuralTagger to
tag the tokens that the RuleBasedTagger cannot tag, i.e. the tokens that
the RuleBasedTagger would otherwise assign the Z99 default tag.
When called, through __call__, with a sequence of
tokens and their associated linguistic data (lemma, Part Of Speech (POS)),
the tagger applies one or more pymusas.taggers.rules.rule.Rules
to create a list of candidate tags for each token in the sequence.
Each candidate, represented as a
pymusas.rankers.ranking_meta_data.RankingMetaData object, is then
ranked using a
pymusas.rankers.lexicon_entry.LexiconEntryRanker. The best
candidate and its associated tag(s) for each token are then returned along
with a List of token indexes indicating whether the token is part of a Multi
Word Expression (MWE).
If we cannot tag a token then the following process will happen:
- If the token's POS tag is in default_punctuation_tags then it will assign the tag PUNCT.
- If the token's POS tag is in default_number_tags then it will assign the tag N1.
- Use the NeuralTagger to tag the token. The tags generated by the NeuralTagger are determined by how you have initialised the NeuralTagger.
Assigned Attributes¶
| Location | Type | Value |
|---|---|---|
| Token._.pymusas_tags | List[str] | Predicted tags; the first tag in the List is the most likely. |
| Token._.pymusas_mwe_indexes | List[Tuple[int, int]] | Each Tuple indicates the start and end token index of the associated Multi Word Expression (MWE). If the List contains more than one Tuple then the MWE is discontinuous. For single word expressions the List will only contain 1 Tuple, which will be (token_start_index, token_start_index + 1). |
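As a quick illustration of how to interpret a pymusas_mwe_indexes value, the helper below classifies the three cases described in the table. It is a hypothetical helper, not part of PyMUSAS:

```python
def describe_mwe(mwe_indexes: list[tuple[int, int]]) -> str:
    """Classify a Token._.pymusas_mwe_indexes value."""
    if len(mwe_indexes) > 1:
        # More than one (start, end) span means the MWE is discontinuous.
        return 'discontinuous MWE'
    start, end = mwe_indexes[0]
    # A single span covering one token is a single word expression.
    return 'single word' if end - start == 1 else 'contiguous MWE'
```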
Config and implementation¶
The default config is defined by the pipeline component factory and describes
how the component should be configured. You can override its settings via the config
argument on nlp.add_pipe or in your
config.cfg for training.
| Setting | Description |
|---|---|
| pymusas_tags_token_attr | See parameters section below |
| pymusas_mwe_indexes_attr | See parameters section below |
| pos_attribute | See parameters section below |
| lemma_attribute | See parameters section below |
| top_n | See parameters section below |
| device | See parameters section below |
| tokenizer_kwargs | See parameters section below |
Parameters¶
- name :
str, optional (default = pymusas_hybrid_tagger)
The component name. Defaults to the same name as the class variable COMPONENT_NAME.
- pymusas_tags_token_attr :
str, optional (default = pymusas_tags)
The name of the attribute under the Token._ class to assign the predicted tags to.
- pymusas_mwe_indexes_attr :
str, optional (default = pymusas_mwe_indexes)
The name of the attribute under the Token._ class to assign the start and end token index of the associated MWE to.
- pos_attribute :
str, optional (default = pos_)
The name of the attribute within the Token class that the Part Of Speech (POS) tag is assigned to. The POS tag value that comes from this attribute has to be of type str. With the current default we take the POS tag from Token.pos_. The POS tag can be an empty string if you do not require POS information or if you do not have a POS tagger. NOTE that if you do not have a POS tagger the default value for Token.pos_ is an empty string.
- lemma_attribute :
str, optional (default = lemma_)
The name of the attribute within the Token class that the lemma is assigned to. The lemma value that comes from this attribute has to be of type str. With the current default we take the lemma from Token.lemma_. The lemma can be an empty string if you do not require lemma information or if you do not have a lemmatiser. NOTE that if you do not have a lemmatiser the default value for Token.lemma_ is an empty string.
- top_n :
int, optional (default = 5)
The number of tags the NeuralTagger will predict. If -1, all tags will be predicted. A value of 0 or less than -1 will raise a ValueError.
- device :
str, optional (default = 'cpu')
The device to load the NeuralTagger model, wsd_model, on, e.g. 'cpu'. It has to be a string that can be passed to torch.device.
- tokenizer_kwargs :
dict[str, Any] | None, optional (default = None)
Keyword arguments to pass to the transformers.AutoTokenizer.from_pretrained method of the NeuralTagger's sub-word tokenizer. These keyword arguments are only passed to the tokenizer on initialization.
Instance Attributes¶
- name :
str
The component name.
- pymusas_tags_token_attr :
str, optional (default = pymusas_tags)
The given pymusas_tags_token_attr.
- pymusas_mwe_indexes_attr :
str, optional (default = pymusas_mwe_indexes)
The given pymusas_mwe_indexes_attr.
- rules :
List[pymusas.taggers.rules.rule.Rule], optional (default = None)
For the RuleBasedTagger. The rules are set through the initialize method; before they are set by the initialize method the value of this attribute is None.
- ranker :
pymusas.rankers.lexicon_entry.LexiconEntryRanker, optional (default = None)
For the RuleBasedTagger. The ranker is set through the initialize method; before it is set by the initialize method the value of this attribute is None.
- default_punctuation_tags :
Set[str]
For the RuleBasedTagger. The default_punctuation_tags are set through the initialize method.
- default_number_tags :
Set[str]
For the RuleBasedTagger. The default_number_tags are set through the initialize method.
- pos_attribute :
str, optional (default = pos_)
For the RuleBasedTagger. The given pos_attribute.
- lemma_attribute :
str, optional (default = lemma_)
For the RuleBasedTagger. The given lemma_attribute.
- top_n :
int, optional (default = 5)
For the NeuralTagger. The number of tags to predict. If -1, all tags will be predicted. A value of 0 or less than -1 will raise a ValueError.
- device :
torch.device
For the NeuralTagger. The device that the wsd_model will be loaded on, e.g. torch.device('cpu').
- wsd_model :
wsd_torch_models.bem.BEM | None, optional (default = None)
For the NeuralTagger. The neural Word Sense Disambiguation (WSD) model. This is None until the component is initialized or has been loaded from disk or bytes.
- tokenizer :
transformers.PreTrainedTokenizerBase | None, optional (default = None)
For the NeuralTagger. The sub-word tokenizer that the wsd_model uses. This tokenizer further tokenizes the tokens from the spaCy tokenizer, hence it being a sub-word tokenizer. This is None until the component is initialized or has been loaded from disk or bytes.
- _tokenizer_kwargs :
dict[str, Any] | None, optional (default = None)
For the NeuralTagger. The keyword arguments that have been or will be passed to the tokenizer's transformers.AutoTokenizer.from_pretrained method. These keyword arguments are only passed to the tokenizer on initialization.
Class Attributes¶
- COMPONENT_NAME :
str
Name of the component factory that this component is registered under. This is used as the first argument to Language.add_pipe if you want to add this component to your spaCy pipeline.
Raises¶
ValueError
If top_n is 0 or less than -1.
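The top_n validation described above can be sketched as a small standalone check (a hypothetical helper, not the PyMUSAS code; -1 means "predict all tags"):

```python
def validate_top_n(top_n: int) -> int:
    """Valid values are -1 (predict all tags) or any positive integer."""
    if top_n == 0 or top_n < -1:
        raise ValueError(f'top_n must be -1 or a positive integer, got {top_n}')
    return top_n
```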
Examples¶
import spacy
from pymusas.taggers.rules.single_word import SingleWordRule
from pymusas.rankers.lexicon_entry import ContextualRuleBasedRanker
from pymusas.lexicon_collection import LexiconCollection
english_lexicon_url = 'https://raw.githubusercontent.com/UCREL/Multilingual-USAS/e5cef7be2aa6182e300152f4f55152310007f051/English/semantic_lexicon_en.tsv'
lexicon_lookup = LexiconCollection.from_tsv(english_lexicon_url, include_pos=True)
lemma_lexicon_lookup = LexiconCollection.from_tsv(english_lexicon_url, include_pos=False)
single_word_rule = SingleWordRule(lexicon_lookup, lemma_lexicon_lookup)
ranker = ContextualRuleBasedRanker(1, 0)
# Construction via spaCy pipeline
nlp = spacy.blank('en')
# Using default config
tagger = nlp.add_pipe('pymusas_hybrid_tagger')
tagger.initialize(rules=[single_word_rule],
                  ranker=ranker,
                  pretrained_model_name_or_path="ucrelnlp/PyMUSAS-Neural-English-Small-BEM")
tokens = nlp('The river full of creaturez')
all_tags = [token._.pymusas_tags for token in tokens]
all_indexes = [token._.pymusas_mwe_indexes for token in tokens]
assert all_tags == [['Z5'], ['W3/M4', 'N5+'], ['N5.1+'], ['Z5'], ['Z1', 'S2', 'S2.2', 'S3.2', 'S2.1']]
assert all_indexes == [[(0, 1)], [(1, 2)], [(2, 3)], [(3, 4)], [(4, 5)]]
# Custom config
custom_config = {'pymusas_tags_token_attr': 'semantic_tags',
                 'pymusas_mwe_indexes_attr': 'mwe_indexes',
                 'top_n': 2,
                 'tokenizer_kwargs': {'add_prefix_space': True}}
nlp = spacy.blank('en')
tagger = nlp.add_pipe('pymusas_hybrid_tagger', config=custom_config)
tagger.initialize(rules=[single_word_rule],
                  ranker=ranker,
                  pretrained_model_name_or_path="ucrelnlp/PyMUSAS-Neural-English-Small-BEM")
tokens = nlp('The river full of creaturez')
all_tags = [token._.semantic_tags for token in tokens]
all_indexes = [token._.mwe_indexes for token in tokens]
assert all_tags == [['Z5'], ['W3/M4', 'N5+'], ['N5.1+'], ['Z5'], ['Z1', 'S2']]
assert all_indexes == [[(0, 1)], [(1, 2)], [(2, 3)], [(3, 4)], [(4, 5)]]
COMPONENT_NAME
class HybridTagger(RuleBasedTagger, NeuralTagger):
| ...
| COMPONENT_NAME = 'pymusas_hybrid_tagger'
initialize
class HybridTagger(RuleBasedTagger, NeuralTagger):
| ...
| def initialize(
| self,
| get_examples: Optional[Callable[[], Iterable[Example]]] = None,
| *,
| nlp: Optional[Language] = None,
| rules: Optional[List[Rule]] = None,
| ranker: Optional[LexiconEntryRanker] = None,
| default_punctuation_tags: Optional[List[str]] = None,
| default_number_tags: Optional[List[str]] = None,
| pretrained_model_name_or_path: Optional[str | Path] = None
| ) -> None
Initialize the tagger and load any of the resources given. The method is
typically called by
Language.initialize
and lets you customize arguments it receives via the
initialize.components
block in the config. The loading only happens during initialization,
typically before training. At runtime, all data is loaded from disk.
Parameters¶
-
rules :
List[pymusas.taggers.rules.rule.Rule]
A list of rules to apply to the sequence of tokens in __call__. The output from each rule is concatenated and given to the ranker.
-
ranker :
pymusas.rankers.lexicon_entry.LexiconEntryRanker
A ranker to rank the output from all of the rules.
-
default_punctuation_tags :
List[str], optional (default = None)
The POS tags that represent punctuation. If None then we will use ['punc']. The list will be converted into a Set before being assigned to the default_punctuation_tags attribute.
-
default_number_tags :
List[str], optional (default = None)
The POS tags that represent numbers. If None then we will use ['num']. The list will be converted into a Set before being assigned to the default_number_tags attribute.
-
pretrained_model_name_or_path :
str | Path
The string ID or path of the pretrained neural Word Sense Disambiguation (WSD) model to load. NOTE: currently we only support the wsd_torch_models.bem.BEM model.
- A string: the model ID of a pretrained wsd-torch-models model that is hosted on the Hugging Face Hub.
- A Path or str that is a directory that can be loaded through the from_pretrained method of a wsd-torch-models model.
NOTE: this model name or path also has to be able to load the tokenizer using transformers.AutoTokenizer.from_pretrained(pretrained_model_name_or_path).
__call__
class HybridTagger(RuleBasedTagger, NeuralTagger):
| ...
| def __call__(self, doc: Doc) -> Doc
Applies the tagger to the spaCy document, modifies it in place, and
returns it. This usually happens under the hood when the nlp object is
called on a text and all pipeline components are applied to the Doc in
order.
Parameters¶
- doc :
Doc
A spaCy Doc
Returns¶
Doc
to_bytes
class HybridTagger(RuleBasedTagger, NeuralTagger):
| ...
| def to_bytes(
| self,
| *,
| exclude: Iterable[str] = SimpleFrozenList()
| ) -> bytes
Not Implemented
Even though the HybridTagger inherits from the RuleBasedTagger, which has implemented this method, the NeuralTagger has not; therefore it is not implemented for the HybridTagger.
from_bytes
class HybridTagger(RuleBasedTagger, NeuralTagger):
| ...
| def from_bytes(
| self,
| bytes_data: bytes,
| *,
| exclude: Iterable[str] = SimpleFrozenList()
| ) -> "HybridTagger"
Not Implemented
Even though the HybridTagger inherits from the RuleBasedTagger, which has implemented this method, the NeuralTagger has not; therefore it is not implemented for the HybridTagger.
to_disk
class HybridTagger(RuleBasedTagger, NeuralTagger):
| ...
| def to_disk(
| self,
| path: Union[str, Path],
| *,
| exclude: Iterable[str] = SimpleFrozenList()
| ) -> None
Serialises the tagger to the given path.
Parameters¶
-
path :
Union[str, Path]
Path to a directory. Path may be either a string or a Path-like object. If the directory does not exist it attempts to create a directory at the given path. -
exclude :
Iterable[str], optional (default = SimpleFrozenList())
This currently does not do anything, please ignore it.
Returns¶
None
Examples¶
from tempfile import TemporaryDirectory
from pymusas.spacy_api.taggers.hybrid import HybridTagger
from pymusas.taggers.rules.single_word import SingleWordRule
from pymusas.rankers.lexicon_entry import ContextualRuleBasedRanker
from pymusas.lexicon_collection import LexiconCollection
english_lexicon_url = 'https://raw.githubusercontent.com/UCREL/Multilingual-USAS/e5cef7be2aa6182e300152f4f55152310007f051/English/semantic_lexicon_en.tsv'
lexicon_lookup = LexiconCollection.from_tsv(english_lexicon_url, include_pos=True)
lemma_lexicon_lookup = LexiconCollection.from_tsv(english_lexicon_url, include_pos=False)
single_word_rule = SingleWordRule(lexicon_lookup, lemma_lexicon_lookup)
ranker = ContextualRuleBasedRanker(1, 0)
tagger = HybridTagger()
tagger.initialize(rules=[single_word_rule],
                  ranker=ranker,
                  pretrained_model_name_or_path="ucrelnlp/PyMUSAS-Neural-English-Small-BEM")
with TemporaryDirectory() as temp_dir:
    _ = tagger.to_disk(temp_dir)
from_disk
class HybridTagger(RuleBasedTagger, NeuralTagger):
| ...
| def from_disk(
| self,
| path: Union[str, Path],
| *,
| exclude: Iterable[str] = SimpleFrozenList()
| ) -> "HybridTagger"
Loads the tagger from the given path in place and returns it.
Parameters¶
-
path :
Union[str, Path]
Path to an existing directory. Path may be either a string or a Path-like object. -
exclude :
Iterable[str], optional (default =SimpleFrozenList())
This currently does not do anything, please ignore it.
Returns¶
HybridTagger
Examples¶
from tempfile import TemporaryDirectory
from pymusas.spacy_api.taggers.hybrid import HybridTagger
from pymusas.taggers.rules.single_word import SingleWordRule
from pymusas.rankers.lexicon_entry import ContextualRuleBasedRanker
from pymusas.lexicon_collection import LexiconCollection
english_lexicon_url = 'https://raw.githubusercontent.com/UCREL/Multilingual-USAS/e5cef7be2aa6182e300152f4f55152310007f051/English/semantic_lexicon_en.tsv'
lexicon_lookup = LexiconCollection.from_tsv(english_lexicon_url, include_pos=True)
lemma_lexicon_lookup = LexiconCollection.from_tsv(english_lexicon_url, include_pos=False)
single_word_rule = SingleWordRule(lexicon_lookup, lemma_lexicon_lookup)
ranker = ContextualRuleBasedRanker(1, 0)
tagger = HybridTagger()
tagger.initialize(rules=[single_word_rule],
                  ranker=ranker,
                  pretrained_model_name_or_path="ucrelnlp/PyMUSAS-Neural-English-Small-BEM")
tagger_2 = HybridTagger()
assert tagger_2.wsd_model is None
assert tagger_2.ranker is None
assert tagger_2.rules is None
with TemporaryDirectory() as temp_dir:
    _ = tagger.to_disk(temp_dir)
    _ = tagger_2.from_disk(temp_dir)
assert tagger_2.wsd_model.base_model_name == tagger.wsd_model.base_model_name
assert tagger_2.ranker == ranker
assert tagger_2.rules == [single_word_rule]