hybrid

pymusas.spacy_api.taggers.hybrid


HybridTagger

class HybridTagger(RuleBasedTagger, NeuralTagger):
| ...
| def __init__(
| self,
| name: str = 'pymusas_hybrid_tagger',
| pymusas_tags_token_attr: str = 'pymusas_tags',
| pymusas_mwe_indexes_attr: str = 'pymusas_mwe_indexes',
| pos_attribute: str = 'pos_',
| lemma_attribute: str = 'lemma_',
| top_n: int = 5,
| device: str = 'cpu',
| tokenizer_kwargs: dict[str, Any] | None = None
| ) -> None

spaCy pipeline component of the pymusas.taggers.hybrid.HybridTagger.

This is a hybrid tagger that uses both the pymusas.spacy_api.taggers.rule_based.RuleBasedTagger and the pymusas.spacy_api.taggers.neural.NeuralTagger, and it inherits from both. The difference between this tagger and the RuleBasedTagger is that it uses the NeuralTagger to tag the tokens that the RuleBasedTagger cannot tag, i.e. the tokens that the RuleBasedTagger alone would assign the default Z99 tag.

When called, through __call__, and given a sequence of tokens and their associated linguistic data (lemma and Part Of Speech (POS) tag), the tagger applies one or more pymusas.taggers.rules.rule.Rules to create a list of candidate tags for each token in the sequence. Each candidate, represented as a pymusas.rankers.ranking_meta_data.RankingMetaData object, is then ranked using a pymusas.rankers.lexicon_entry.LexiconEntryRanker. The best candidate and its associated tag(s) for each token are then returned, along with a list of token indexes indicating whether the token is part of a Multi Word Expression (MWE).

If a token cannot be tagged by the rules, the following fallback process is applied:

  1. If the token's POS tag is in default_punctuation_tags then the token is assigned the tag PUNCT.
  2. If the token's POS tag is in default_number_tags then the token is assigned the tag N1.
  3. Otherwise the NeuralTagger tags the token. The tags it generates are determined by how you have initialised the NeuralTagger.
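The fallback order above can be sketched in plain Python (a standalone sketch with hypothetical helper names, not the actual implementation):

```python
def fallback_tag(pos_tag, punctuation_tags, number_tags, neural_tagger):
    """Sketch of the fallback order for a token the rules could not tag."""
    if pos_tag in punctuation_tags:  # 1. punctuation POS tags get PUNCT
        return ['PUNCT']
    if pos_tag in number_tags:  # 2. number POS tags get N1
        return ['N1']
    return neural_tagger()  # 3. otherwise defer to the NeuralTagger

# With the default POS tag sets and a stub in place of the NeuralTagger:
print(fallback_tag('punc', {'punc'}, {'num'}, lambda: ['Z99']))  # ['PUNCT']
```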

Assigned Attributes

Location Type Value
Token._.pymusas_tags List[str] Predicted tags, the first tag in the List of tags is the most likely tag.
Token._.pymusas_mwe_indexes List[Tuple[int, int]] Each Tuple indicates the start and end token index of the associated Multi Word Expression (MWE). If the List contains more than one Tuple then the MWE is discontinuous. For single word expressions the List will only contain 1 Tuple which will be (token_start_index, token_start_index + 1).
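The MWE index convention above can be illustrated with a small standalone sketch (the helper functions are hypothetical, not part of the API):

```python
def is_single_word(mwe_indexes):
    """Single word expressions are exactly one (i, i + 1) range."""
    return (len(mwe_indexes) == 1
            and mwe_indexes[0][1] == mwe_indexes[0][0] + 1)

def is_discontinuous(mwe_indexes):
    """A MWE is discontinuous when it spans more than one (start, end) range."""
    return len(mwe_indexes) > 1

print(is_single_word([(4, 5)]))            # True
print(is_discontinuous([(0, 1), (2, 3)]))  # True
```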

Config and implementation

The default config is defined by the pipeline component factory and describes how the component should be configured. You can override its settings via the config argument on nlp.add_pipe or in your config.cfg for training.
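For example, within a training config.cfg the component could be configured as follows (a sketch following spaCy's config conventions; the values shown are the defaults):

```ini
[components.pymusas_hybrid_tagger]
factory = "pymusas_hybrid_tagger"
pymusas_tags_token_attr = "pymusas_tags"
pymusas_mwe_indexes_attr = "pymusas_mwe_indexes"
pos_attribute = "pos_"
lemma_attribute = "lemma_"
top_n = 5
device = "cpu"
```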

Setting Description
pymusas_tags_token_attr See the parameters section below
pymusas_mwe_indexes_attr See the parameters section below
pos_attribute See the parameters section below
lemma_attribute See the parameters section below
top_n See the parameters section below
device See the parameters section below
tokenizer_kwargs See the parameters section below

Parameters

  • name : str, optional (default = pymusas_hybrid_tagger)
    The component name. Defaults to the same name as the class variable COMPONENT_NAME.
  • pymusas_tags_token_attr : str, optional (default = pymusas_tags)
    The name of the attribute under the Token._ class to assign the predicted tags to.
  • pymusas_mwe_indexes_attr : str, optional (default = pymusas_mwe_indexes)
    The name of the attribute under the Token._ class to assign the start and end token index of the associated MWE to.
  • pos_attribute : str, optional (default = pos_)
    The name of the attribute that the Part Of Speech (POS) tag is assigned to within the Token class. The POS tag value that comes from this attribute has to be of type str. With the current default we take the POS tag from Token.pos_. The POS tag can be an empty string if you do not require POS information or if you do not have a POS tagger. NOTE that if you do not have a POS tagger the default value for Token.pos_ is an empty string.
  • lemma_attribute : str, optional (default = lemma_)
    The name of the attribute that the lemma is assigned to within the Token class. The lemma value that comes from this attribute has to be of type str. With the current default we take the lemma from Token.lemma_. The lemma can be an empty string if you do not require lemma information or if you do not have a lemmatiser. NOTE that if you do not have a lemmatiser the default value for Token.lemma_ is an empty string.
  • top_n : int, optional (default = 5)
    The number of tags the NeuralTagger will predict. If -1, all tags will be predicted. If top_n is 0 or less than -1 a ValueError will be raised.
  • device : str, optional (default = 'cpu')
    The device to load the NeuralTagger model, wsd_model, on, e.g. 'cpu'. It has to be a string that can be passed to torch.device.
  • tokenizer_kwargs : dict[str, Any] | None, optional (default = None)
    Keyword arguments to pass to the NeuralTagger's sub-word tokenizer's transformers.AutoTokenizer.from_pretrained method. These keyword arguments are only passed to the tokenizer on initialization.

Instance Attributes

  • name : str
    The component name.
  • pymusas_tags_token_attr : str, optional (default = pymusas_tags)
    The given pymusas_tags_token_attr.
  • pymusas_mwe_indexes_attr : str, optional (default = pymusas_mwe_indexes)
    The given pymusas_mwe_indexes_attr.
  • rules : List[pymusas.taggers.rules.rule.Rule], optional (default = None)
    For the RuleBasedTagger. The rules is set through the initialize method. Before it is set by the initialize method the value of this attribute is None.
  • ranker : pymusas.rankers.lexicon_entry.LexiconEntryRanker, optional (default = None)
    For the RuleBasedTagger. The ranker is set through the initialize method. Before it is set by the initialize method the value of this attribute is None.
  • default_punctuation_tags : Set[str]
    For the RuleBasedTagger. The default_punctuation_tags is set through the initialize method.
  • default_number_tags : Set[str]
    For the RuleBasedTagger. The default_number_tags is set through the initialize method.
  • pos_attribute : str, optional (default = pos_)
    For the RuleBasedTagger. The given pos_attribute.
  • lemma_attribute : str, optional (default = lemma_)
    For the RuleBasedTagger. The given lemma_attribute.
  • top_n : int, optional (default = 5)
    For the NeuralTagger. The number of tags to predict. If -1, all tags will be predicted. If top_n is 0 or less than -1 a ValueError will be raised.
  • device : torch.device
    For the NeuralTagger. The device that the wsd_model will be loaded on, as a torch.device.
  • wsd_model : wsd_torch_models.bem.BEM | None, optional (default = None)
    For the NeuralTagger. The neural Word Sense Disambiguation (WSD) model. This is None until the component is initialized or has been loaded from disk or bytes.
  • tokenizer : transformers.PreTrainedTokenizerBase | None, optional (default = None)
    For the NeuralTagger. The sub-word tokenizer that the wsd_model uses. This tokenizer further tokenizes the tokens from the spaCy tokenizer, hence it being a sub-word tokenizer. This is None until the component is initialized or has been loaded from disk or bytes.
  • _tokenizer_kwargs : dict[str, Any] | None, optional (default = None)
    For the NeuralTagger. The keyword arguments that have or will be passed to the tokenizer's transformers.AutoTokenizer.from_pretrained method. These keyword arguments are only passed to the tokenizer on initialization.

Class Attributes

  • COMPONENT_NAME : str
    Name of component factory that this component is registered under. This is used as the first argument to Language.add_pipe if you want to add this component to your spaCy pipeline.

Raises

  • ValueError
    If top_n is 0 or less than -1.

Examples

import spacy
from pymusas.lexicon_collection import LexiconCollection
from pymusas.rankers.lexicon_entry import ContextualRuleBasedRanker
from pymusas.taggers.rules.single_word import SingleWordRule
english_lexicon_url = 'https://raw.githubusercontent.com/UCREL/Multilingual-USAS/e5cef7be2aa6182e300152f4f55152310007f051/English/semantic_lexicon_en.tsv'
lexicon_lookup = LexiconCollection.from_tsv(english_lexicon_url, include_pos=True)
lemma_lexicon_lookup = LexiconCollection.from_tsv(english_lexicon_url, include_pos=False)
single_word_rule = SingleWordRule(lexicon_lookup, lemma_lexicon_lookup)
ranker = ContextualRuleBasedRanker(1, 0)
# Construction via spaCy pipeline
nlp = spacy.blank('en')
# Using default config
tagger = nlp.add_pipe('pymusas_hybrid_tagger')
tagger.initialize(rules=[single_word_rule],
                  ranker=ranker,
                  pretrained_model_name_or_path="ucrelnlp/PyMUSAS-Neural-English-Small-BEM")
tokens = nlp('The river full of creaturez')
all_tags = [token._.pymusas_tags for token in tokens]
all_indexes = [token._.pymusas_mwe_indexes for token in tokens]
assert all_tags == [['Z5'], ['W3/M4', 'N5+'], ['N5.1+'], ['Z5'], ['Z1', 'S2', 'S2.2', 'S3.2', 'S2.1']]
assert all_indexes == [[(0, 1)], [(1, 2)], [(2, 3)], [(3, 4)], [(4, 5)]]
# Custom config
custom_config = {'pymusas_tags_token_attr': 'semantic_tags',
                 'pymusas_mwe_indexes_attr': 'mwe_indexes',
                 'top_n': 2,
                 'tokenizer_kwargs': {'add_prefix_space': True}}
nlp = spacy.blank('en')
tagger = nlp.add_pipe('pymusas_hybrid_tagger', config=custom_config)
tagger.initialize(rules=[single_word_rule],
                  ranker=ranker,
                  pretrained_model_name_or_path="ucrelnlp/PyMUSAS-Neural-English-Small-BEM")
tokens = nlp('The river full of creaturez')
all_tags = [token._.semantic_tags for token in tokens]
all_indexes = [token._.mwe_indexes for token in tokens]
assert all_tags == [['Z5'], ['W3/M4', 'N5+'], ['N5.1+'], ['Z5'], ['Z1', 'S2']]
assert all_indexes == [[(0, 1)], [(1, 2)], [(2, 3)], [(3, 4)], [(4, 5)]]

COMPONENT_NAME

class HybridTagger(RuleBasedTagger, NeuralTagger):
| ...
| COMPONENT_NAME = 'pymusas_hybrid_tagger'

initialize

class HybridTagger(RuleBasedTagger, NeuralTagger):
| ...
| def initialize(
| self,
| get_examples: Optional[Callable[[], Iterable[Example]]] = None,
| *,
| nlp: Optional[Language] = None,
| rules: Optional[List[Rule]] = None,
| ranker: Optional[LexiconEntryRanker] = None,
| default_punctuation_tags: Optional[List[str]] = None,
| default_number_tags: Optional[List[str]] = None,
| pretrained_model_name_or_path: Optional[str | Path] = None
| ) -> None

Initialize the tagger and load any of the resources given. The method is typically called by Language.initialize and lets you customize arguments it receives via the initialize.components block in the config. The loading only happens during initialization, typically before training. At runtime, all data is loaded from disk.

Parameters

  • rules : List[pymusas.taggers.rules.rule.Rule]
    A list of rules to apply to the sequence of tokens in __call__. The output from each rule is concatenated and given to the ranker.

  • ranker : pymusas.rankers.lexicon_entry.LexiconEntryRanker
    A ranker to rank the output from all of the rules.

  • default_punctuation_tags : List[str], optional (default = None)
    The POS tags that represent punctuation. If None then we will use ['punc']. The list will be converted into a Set before assigning to the default_punctuation_tags attribute.

  • default_number_tags : List[str], optional (default = None)
    The POS tags that represent numbers. If None then we will use ['num']. The list will be converted into a Set before assigning to the default_number_tags attribute.

  • pretrained_model_name_or_path : str | Path
    The string ID or path of the pretrained neural Word Sense Disambiguation (WSD) model to load.

    NOTE: currently we only support the wsd_torch_models.bem.BEM model

    • A string: the model ID of a pretrained wsd-torch-models model hosted on the Hugging Face Hub.
    • A Path or str pointing to a directory that can be loaded through the from_pretrained method of a wsd-torch-models model.

    NOTE: the model name or path also has to be loadable as a tokenizer via transformers.AutoTokenizer.from_pretrained(pretrained_model_name_or_path)

__call__

class HybridTagger(RuleBasedTagger, NeuralTagger):
| ...
| def __call__(self, doc: Doc) -> Doc

Applies the tagger to the spaCy document, modifies it in place, and returns it. This usually happens under the hood when the nlp object is called on a text and all pipeline components are applied to the Doc in order.

Parameters

  • doc : Doc
    The spaCy Doc to tag.

Returns

  • Doc
    The given Doc, modified in place.

to_bytes

class HybridTagger(RuleBasedTagger, NeuralTagger):
| ...
| def to_bytes(
| self,
| *,
| exclude: Iterable[str] = SimpleFrozenList()
| ) -> bytes

Not Implemented

Even though the HybridTagger inherits from RuleBasedTagger, which has implemented this method, NeuralTagger has not; therefore it is not implemented for the HybridTagger.

from_bytes

class HybridTagger(RuleBasedTagger, NeuralTagger):
| ...
| def from_bytes(
| self,
| bytes_data: bytes,
| *,
| exclude: Iterable[str] = SimpleFrozenList()
| ) -> "HybridTagger"

Not Implemented

Even though the HybridTagger inherits from RuleBasedTagger, which has implemented this method, NeuralTagger has not; therefore it is not implemented for the HybridTagger.

to_disk

class HybridTagger(RuleBasedTagger, NeuralTagger):
| ...
| def to_disk(
| self,
| path: Union[str, Path],
| *,
| exclude: Iterable[str] = SimpleFrozenList()
| ) -> None

Serialises the tagger to the given path.

Parameters

  • path : Union[str, Path]
    Path to a directory. Path may be either a string or a Path-like object. If the directory does not exist, one is created at the given path.

  • exclude : Iterable[str], optional (default = SimpleFrozenList())
    This currently does not do anything; please ignore it.

Returns

  • None

Examples

from tempfile import TemporaryDirectory
from pymusas.spacy_api.taggers.hybrid import HybridTagger
from pymusas.lexicon_collection import LexiconCollection
from pymusas.rankers.lexicon_entry import ContextualRuleBasedRanker
from pymusas.taggers.rules.single_word import SingleWordRule
english_lexicon_url = 'https://raw.githubusercontent.com/UCREL/Multilingual-USAS/e5cef7be2aa6182e300152f4f55152310007f051/English/semantic_lexicon_en.tsv'
lexicon_lookup = LexiconCollection.from_tsv(english_lexicon_url, include_pos=True)
lemma_lexicon_lookup = LexiconCollection.from_tsv(english_lexicon_url, include_pos=False)
single_word_rule = SingleWordRule(lexicon_lookup, lemma_lexicon_lookup)
ranker = ContextualRuleBasedRanker(1, 0)
tagger = HybridTagger()
tagger.initialize(rules=[single_word_rule],
                  ranker=ranker,
                  pretrained_model_name_or_path="ucrelnlp/PyMUSAS-Neural-English-Small-BEM")
with TemporaryDirectory() as temp_dir:
    _ = tagger.to_disk(temp_dir)

from_disk

class HybridTagger(RuleBasedTagger, NeuralTagger):
| ...
| def from_disk(
| self,
| path: Union[str, Path],
| *,
| exclude: Iterable[str] = SimpleFrozenList()
| ) -> "HybridTagger"

Loads the tagger from the given path in place and returns it.

Parameters

  • path : Union[str, Path]
    Path to an existing directory. Path may be either a string or a Path-like object.

  • exclude : Iterable[str], optional (default = SimpleFrozenList())
    This currently does not do anything; please ignore it.

Returns

  • HybridTagger
    The tagger, loaded in place.

Examples

from tempfile import TemporaryDirectory
from pymusas.spacy_api.taggers.hybrid import HybridTagger
from pymusas.lexicon_collection import LexiconCollection
from pymusas.rankers.lexicon_entry import ContextualRuleBasedRanker
from pymusas.taggers.rules.single_word import SingleWordRule
english_lexicon_url = 'https://raw.githubusercontent.com/UCREL/Multilingual-USAS/e5cef7be2aa6182e300152f4f55152310007f051/English/semantic_lexicon_en.tsv'
lexicon_lookup = LexiconCollection.from_tsv(english_lexicon_url, include_pos=True)
lemma_lexicon_lookup = LexiconCollection.from_tsv(english_lexicon_url, include_pos=False)
single_word_rule = SingleWordRule(lexicon_lookup, lemma_lexicon_lookup)
ranker = ContextualRuleBasedRanker(1, 0)
tagger = HybridTagger()
tagger.initialize(rules=[single_word_rule],
                  ranker=ranker,
                  pretrained_model_name_or_path="ucrelnlp/PyMUSAS-Neural-English-Small-BEM")
tagger_2 = HybridTagger()
assert tagger_2.wsd_model is None
assert tagger_2.ranker is None
assert tagger_2.rules is None
with TemporaryDirectory() as temp_dir:
    _ = tagger.to_disk(temp_dir)
    _ = tagger_2.from_disk(temp_dir)

assert tagger_2.wsd_model.base_model_name == tagger.wsd_model.base_model_name
assert tagger_2.ranker == ranker
assert tagger_2.rules == [single_word_rule]