pymusas.taggers.hybrid
HybridTagger
class HybridTagger(RuleBasedTagger):
| ...
| def __init__(
| self,
| rules: List[Rule],
| ranker: LexiconEntryRanker,
| neural_tagger: NeuralTagger,
| default_punctuation_tags: Optional[Set[str]] = None,
| default_number_tags: Optional[Set[str]] = None
| ) -> None
This is a hybrid tagger which uses both the pymusas.taggers.rule_based.RuleBasedTagger
and the pymusas.taggers.neural.NeuralTagger taggers. This tagger
inherits from the RuleBasedTagger. The difference between this and the
RuleBasedTagger is that this tagger will use the NeuralTagger to
tag tokens that the RuleBasedTagger cannot tag; these are the tokens that
would be tagged with the Z99 default tag by the RuleBasedTagger.
When called, through __call__, with a sequence of
tokens and their associated linguistic data (lemma, Part Of Speech (POS)),
the tagger will apply one or more pymusas.taggers.rules.rule.Rules
to create a list of possible candidate tags for each token in the sequence.
Each candidate, represented as a
pymusas.rankers.ranking_meta_data.RankingMetaData object, is then
ranked for each token using a
pymusas.rankers.lexicon_entry.LexiconEntryRanker. The best
candidate and its associated tag(s) for each token are then returned along
with a List of token indexes indicating if the token is part of a Multi
Word Expression (MWE).
If we cannot tag a token then the following process will happen:
- If the token's POS tag is in default_punctuation_tags then it will assign the tag PUNCT.
- If the token's POS tag is in default_number_tags then it will assign the tag N1.
- Otherwise, use the NeuralTagger to tag the token. The tags generated by the NeuralTagger are determined by how you have initialised the NeuralTagger.
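The fallback order above can be sketched in plain Python. Note that fallback_tag and its arguments are illustrative stand-ins for the internal logic, not part of the pymusas API; only the tag values and default POS sets come from this documentation:

```python
def fallback_tag(pos_tag, neural_tag,
                 default_punctuation_tags=frozenset({'punc'}),
                 default_number_tags=frozenset({'num'})):
    """Return the tag for a token the rule-based component could not tag."""
    if pos_tag in default_punctuation_tags:
        return 'PUNCT'
    if pos_tag in default_number_tags:
        return 'N1'
    # Otherwise defer to the neural tagger's prediction.
    return neural_tag

print(fallback_tag('punc', 'Z99'))  # PUNCT
print(fallback_tag('num', 'Z99'))   # N1
print(fallback_tag('noun', 'S2'))   # S2
```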
Parameters
- rules : List[pymusas.taggers.rules.rule.Rule]
  A list of rules to apply to the sequence of tokens in the __call__. The output from each rule is concatenated and given to the ranker.
- ranker : pymusas.rankers.lexicon_entry.LexiconEntryRanker
  A ranker to rank the output from all of the rules.
- neural_tagger : pymusas.taggers.neural.NeuralTagger
  The NeuralTagger that will be used to tag tokens that the RuleBasedTagger cannot tag.
- default_punctuation_tags : Set[str], optional (default = None)
  The POS tags that represent punctuation. If None then we will use the Set: set(['punc']).
- default_number_tags : Set[str], optional (default = None)
  The POS tags that represent numbers. If None then we will use the Set: set(['num']).
Instance Attributes
- rules : List[pymusas.taggers.rules.rule.Rule]
  The given rules.
- ranker : pymusas.rankers.lexicon_entry.LexiconEntryRanker
  The given ranker.
- neural_tagger : pymusas.taggers.neural.NeuralTagger
  The NeuralTagger that will be used to tag tokens that the RuleBasedTagger cannot tag.
- default_punctuation_tags : Set[str]
  The given default_punctuation_tags.
- default_number_tags : Set[str]
  The given default_number_tags.
Examples
from pymusas.lexicon_collection import LexiconCollection
from pymusas.taggers.neural import NeuralTagger
from pymusas.taggers.hybrid import HybridTagger
from pymusas.taggers.rules.single_word import SingleWordRule
from pymusas.rankers.lexicon_entry import ContextualRuleBasedRanker
english_lexicon_url = 'https://raw.githubusercontent.com/UCREL/Multilingual-USAS/64dbdf19d8d090c6f4183984ff16529d09f77b02/English/semantic_lexicon_en.tsv'
lexicon_lookup = LexiconCollection.from_tsv(english_lexicon_url, include_pos=True)
lemma_lexicon_lookup = LexiconCollection.from_tsv(english_lexicon_url, include_pos=False)
single_word_rule = SingleWordRule(lexicon_lookup, lemma_lexicon_lookup)
ranker = ContextualRuleBasedRanker(1, 0)
tokenizer_kwargs = {"add_prefix_space": True}
neural_tagger = NeuralTagger("ucrelnlp/PyMUSAS-Neural-English-Small-BEM",
device="cpu", top_n=2, tokenizer_kwargs=tokenizer_kwargs)
tagger = HybridTagger([single_word_rule], ranker, neural_tagger)
expected_tags_indices = [(['Z5'], [(0, 1)]), (['W3/M4', 'N5+'], [(1, 2)]),
(['N5.1+', 'I3.2+'], [(2, 3)]), (['Z5'], [(3, 4)]),
(['Z1', 'S2'], [(4, 5)])]
assert tagger(["The", "river", "full", "of", "creaturez"],
["the", "river", "full", "of", "creaturez"],
["DET", "NOUN", "ADJ", "ADP", "NOUN"]) == expected_tags_indices
__call__
class HybridTagger(RuleBasedTagger):
| ...
| def __call__(
| self,
| tokens: List[str],
| lemmas: List[str],
| pos_tags: List[str]
| ) -> List[Tuple[List[str],
| List[Tuple[int, int]]]]
Given a List of tokens, their associated lemmas and
Part Of Speech (POS) tags, it returns for each token:
- A List of tags. The first tag in the List of tags is the most likely tag.
- A List of Tuples whereby each Tuple indicates the start and end token index of the associated Multi Word Expression (MWE). If the List contains more than one Tuple then the MWE is discontinuous. For single word expressions the List will only contain 1 Tuple, which will be (token_start_index, token_start_index + 1).
All the generated tags and MWEs are based on the rules, ranker, and NeuralTagger given to this model.
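As an illustration of this return structure, here is how one might unpack the output. The output values below are hypothetical examples written in the documented format, not real tagger output:

```python
# Hypothetical output: a list of (tags, mwe_spans) tuples, one per token.
output = [(['Z5'], [(0, 1)]),
          (['W3/M4', 'N5+'], [(1, 2)]),
          (['A1.1.1'], [(2, 4), (5, 6)])]  # hypothetical discontinuous MWE

# The first tag in each tag list is the most likely tag.
best_tags = [tags[0] for tags, _ in output]
print(best_tags)  # ['Z5', 'W3/M4', 'A1.1.1']

# A token is part of a multi-token (or discontinuous) MWE if its spans
# cover more than one token or are split over several Tuples.
def is_mwe(spans):
    return len(spans) > 1 or any(end - start > 1 for start, end in spans)

print([is_mwe(spans) for _, spans in output])  # [False, False, True]
```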
NOTE this tagger has been designed to be flexible with the amount of
resources available: if you do not have POS or lemma information, assign
them a List of empty strings.
NOTE for the NeuralTagger we recommend that the number of tokens
in the list represents a sentence. In addition, the more tokens
in the list, the more memory the NeuralTagger model requires, and on
CPU at least, the more time it will take to predict the tags.
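The per-sentence recommendation above can be followed by batching a document into sentences and calling the tagger once per sentence. The tag_document helper and the stand-in dummy_tagger below are illustrative, not part of the pymusas API; a real HybridTagger instance would take the place of the stand-in:

```python
def tag_document(tagger, sentences):
    """Tag a document one sentence at a time.

    sentences: list of (tokens, lemmas, pos_tags) triples, one per sentence.
    Returns the concatenated per-token output in document order.
    """
    results = []
    for tokens, lemmas, pos_tags in sentences:
        results.extend(tagger(tokens, lemmas, pos_tags))
    return results

# Stand-in tagger that assigns Z99 to every token, for illustration only.
def dummy_tagger(tokens, lemmas, pos_tags):
    return [(['Z99'], [(i, i + 1)]) for i in range(len(tokens))]

doc = [(["Hello", "world"], ["hello", "world"], ["INTJ", "NOUN"]),
       (["Bye"], ["bye"], ["INTJ"])]
print(len(tag_document(dummy_tagger, doc)))  # 3
```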
Parameters
- tokens : List[str]
  A List of the full text form of the tokens to be tagged.
- lemmas : List[str]
  The List of lemma/base forms of the tokens to be tagged.
- pos_tags : List[str]
  The List of POS tags of the tokens to be tagged.
Returns
List[Tuple[List[str], List[Tuple[int, int]]]]
Raises
- ValueError
  If the tokens, lemmas, and pos_tags are not all of the same length.
- ValueError
  If the number of tokens given is not the same as the number of tags predicted/returned.