hybrid

pymusas.taggers.hybrid


HybridTagger

class HybridTagger(RuleBasedTagger):
| ...
| def __init__(
| self,
| rules: List[Rule],
| ranker: LexiconEntryRanker,
| neural_tagger: NeuralTagger,
| default_punctuation_tags: Optional[Set[str]] = None,
| default_number_tags: Optional[Set[str]] = None
| ) -> None

This is a hybrid tagger that combines the pymusas.taggers.rule_based.RuleBasedTagger and pymusas.taggers.neural.NeuralTagger taggers, and it inherits from the RuleBasedTagger. The difference from the RuleBasedTagger is that this tagger uses the NeuralTagger to tag the tokens the RuleBasedTagger cannot tag, i.e. the tokens the RuleBasedTagger would assign the default Z99 tag.

When called, through __call__, with a sequence of tokens and their associated linguistic data (lemma, Part Of Speech (POS)), the tagger applies one or more pymusas.taggers.rules.rule.Rules to create a list of candidate tags for each token in the sequence. Each candidate, represented as a pymusas.rankers.ranking_meta_data.RankingMetaData object, is then ranked using a pymusas.rankers.lexicon_entry.LexiconEntryRanker. The best candidate and its associated tag(s) for each token are then returned, along with a List of token indexes indicating whether the token is part of a Multi Word Expression (MWE).

If a token cannot be tagged, the following process applies:

  1. If the token's POS tag is in default_punctuation_tags, the token is assigned the tag PUNCT.
  2. If the token's POS tag is in default_number_tags, the token is assigned the tag N1.
  3. Otherwise, the NeuralTagger tags the token. The tags the NeuralTagger generates depend on how you initialised it.
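The fallback order above can be sketched as follows. This is a minimal illustration, not part of the API: `neural_fn` is a stand-in for the NeuralTagger's prediction, and the tag values are invented for the example.

```python
from typing import Callable, List, Optional, Set


def fallback_tags(
    pos_tag: str,
    neural_fn: Callable[[], List[str]],
    punctuation_tags: Optional[Set[str]] = None,
    number_tags: Optional[Set[str]] = None,
) -> List[str]:
    # Mirror the documented fallback order for an untaggable token:
    # PUNCT for punctuation POS tags, N1 for number POS tags,
    # otherwise whatever the neural tagger predicts.
    punctuation_tags = {'punc'} if punctuation_tags is None else punctuation_tags
    number_tags = {'num'} if number_tags is None else number_tags
    if pos_tag in punctuation_tags:
        return ['PUNCT']
    if pos_tag in number_tags:
        return ['N1']
    return neural_fn()
```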

Parameters

  • rules : List[pymusas.taggers.rules.rule.Rule]
    A list of rules to apply to the sequence of tokens in the __call__. The output from each rule is concatenated and given to the ranker.
  • ranker : pymusas.rankers.lexicon_entry.LexiconEntryRanker
    A ranker to rank the output from all of the rules.
  • neural_tagger : pymusas.taggers.neural.NeuralTagger
    The NeuralTagger that will be used to tag tokens that the RuleBasedTagger cannot tag.
  • default_punctuation_tags : Set[str], optional (default = None)
    The POS tags that represent punctuation. If None, the Set {'punc'} is used.
  • default_number_tags : Set[str], optional (default = None)
    The POS tags that represent numbers. If None, the Set {'num'} is used.

Instance Attributes

  • rules : List[pymusas.taggers.rules.rule.Rule]
    The given rules.
  • ranker : pymusas.rankers.lexicon_entry.LexiconEntryRanker
    The given ranker.
  • neural_tagger : pymusas.taggers.neural.NeuralTagger
    The NeuralTagger that will be used to tag tokens that the RuleBasedTagger cannot tag.
  • default_punctuation_tags : Set[str]
    The given default_punctuation_tags.
  • default_number_tags : Set[str]
    The given default_number_tags.

Examples

from pymusas.lexicon_collection import LexiconCollection
from pymusas.taggers.neural import NeuralTagger
from pymusas.taggers.hybrid import HybridTagger
from pymusas.taggers.rules.single_word import SingleWordRule
from pymusas.rankers.lexicon_entry import ContextualRuleBasedRanker
english_lexicon_url = 'https://raw.githubusercontent.com/UCREL/Multilingual-USAS/64dbdf19d8d090c6f4183984ff16529d09f77b02/English/semantic_lexicon_en.tsv'
lexicon_lookup = LexiconCollection.from_tsv(english_lexicon_url, include_pos=True)
lemma_lexicon_lookup = LexiconCollection.from_tsv(english_lexicon_url, include_pos=False)
single_word_rule = SingleWordRule(lexicon_lookup, lemma_lexicon_lookup)
ranker = ContextualRuleBasedRanker(1, 0)
tokenizer_kwargs = {"add_prefix_space": True}
neural_tagger = NeuralTagger("ucrelnlp/PyMUSAS-Neural-English-Small-BEM",
                             device="cpu", top_n=2,
                             tokenizer_kwargs=tokenizer_kwargs)
tagger = HybridTagger([single_word_rule], ranker, neural_tagger)
expected_tags_indices = [(['Z5'], [(0, 1)]), (['W3/M4', 'N5+'], [(1, 2)]),
                         (['N5.1+', 'I3.2+'], [(2, 3)]), (['Z5'], [(3, 4)]),
                         (['Z1', 'S2'], [(4, 5)])]
assert tagger(["The", "river", "full", "of", "creaturez"],
              ["the", "river", "full", "of", "creaturez"],
              ["DET", "NOUN", "ADJ", "ADP", "NOUN"]) == expected_tags_indices

__call__

class HybridTagger(RuleBasedTagger):
| ...
| def __call__(
| self,
| tokens: List[str],
| lemmas: List[str],
| pos_tags: List[str]
| ) -> List[Tuple[List[str],
| List[Tuple[int, int]]]]

Given a List of tokens, their associated lemmas, and Part Of Speech (POS) tags, it returns for each token:

  1. A List of tags. The first tag in the List is the most likely tag.
  2. A List of Tuples, whereby each Tuple indicates the start and end token index of the associated Multi Word Expression (MWE). If the List contains more than one Tuple, the MWE is discontinuous. For single word expressions the List will contain only one Tuple, which will be (token_start_index, token_start_index + 1).
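To make the return shape concrete, here is a sketch of unpacking one token's entry; the tag values and spans are invented for illustration and are not real tagger output.

```python
# Hypothetical per-token output in the documented shape:
# (List of tags, List of (start, end) token index Tuples).
output = [
    (['Z5'], [(0, 1)]),              # single word expression: one (i, i + 1) span
    (['A1.1.1'], [(1, 2), (3, 4)]),  # discontinuous MWE: more than one span
]

tags, mwe_spans = output[1]
most_likely_tag = tags[0]            # the first tag is the most likely
is_discontinuous = len(mwe_spans) > 1
```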

All the generated tags and MWEs are based on the rules, ranker, and NeuralTagger given to this tagger.

NOTE: this tagger has been designed to be flexible with the amount of resources available; if you do not have POS or lemma information, pass a List of empty strings in their place.

NOTE: for the NeuralTagger we recommend that the list of tokens represent a sentence. The more tokens in the list, the more memory the NeuralTagger model requires and, on CPU at least, the longer it takes to predict the tags.
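Following that recommendation, one way to tag a document is one sentence at a time, collecting the per-sentence results. This helper is an illustrative sketch, not part of the pymusas API; `tagger` is any callable with the __call__ signature described here.

```python
from typing import Callable, List, Tuple

# The documented per-token return shape: (tags, MWE index spans).
TaggerOutput = List[Tuple[List[str], List[Tuple[int, int]]]]


def tag_sentences(
    tagger: Callable[[List[str], List[str], List[str]], TaggerOutput],
    sentences: List[Tuple[List[str], List[str], List[str]]],
) -> List[TaggerOutput]:
    # One tagger call per (tokens, lemmas, pos_tags) sentence keeps the
    # NeuralTagger's memory use and CPU prediction time bounded.
    return [tagger(tokens, lemmas, pos_tags)
            for tokens, lemmas, pos_tags in sentences]
```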

Parameters

  • tokens : List[str]
    The List of the full text forms of the tokens to be tagged.
  • lemmas : List[str]
    The List of the lemma/base forms of the tokens to be tagged.
  • pos_tags : List[str]
    The List of the POS tags of the tokens to be tagged.

Returns

  • List[Tuple[List[str], List[Tuple[int, int]]]]

Raises

  • ValueError
    If tokens, lemmas, and pos_tags are not all of the same length.

  • ValueError
    If the number of tokens given is not the same as the number of tags predicted/returned.