

pymusas.spacy_api.taggers.rule_based



RuleBasedTagger

class RuleBasedTagger(spacy.pipeline.pipe.Pipe):
| ...
| def __init__(
| self,
| name: str = 'pymusas_rule_based_tagger',
| pymusas_tags_token_attr: str = 'pymusas_tags',
| pymusas_mwe_indexes_attr: str = 'pymusas_mwe_indexes',
| pos_attribute: str = 'pos_',
| lemma_attribute: str = 'lemma_'
| ) -> None

spaCy pipeline component of the pymusas.taggers.rule_based.RuleBasedTagger.

This component applies one or more pymusas.taggers.rules.rule.Rules to create a list of candidate tags for each token in the sequence. Each candidate, represented as a pymusas.rankers.ranking_meta_data.RankingMetaData object, is then ranked using a pymusas.rankers.lexicon_entry.LexiconEntryRanker. The best candidate and its associated tag(s) for each token are assigned to the Token._.pymusas_tags attribute; in addition, a List of token indexes indicating whether the token is part of a Multi Word Expression (MWE) is assigned to Token._.pymusas_mwe_indexes.

If the tagger cannot tag a token, the following fallback process applies:

  1. If the token's POS tag is in default_punctuation_tags, assign the tag PUNCT.
  2. Otherwise, if the token's POS tag is in default_number_tags, assign the tag N1.
  3. Otherwise, assign the default tag Z99.
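The fallback steps above can be sketched in plain Python (a minimal illustration of the documented behaviour, not the library's actual implementation; the tag names and defaults are taken from this page):

```python
def fallback_tag(pos_tag: str,
                 default_punctuation_tags: frozenset = frozenset({'punc'}),
                 default_number_tags: frozenset = frozenset({'num'})) -> str:
    # Step 1: punctuation POS tags get the PUNCT tag
    if pos_tag in default_punctuation_tags:
        return 'PUNCT'
    # Step 2: number POS tags get the N1 tag
    if pos_tag in default_number_tags:
        return 'N1'
    # Step 3: everything else gets the default Z99 tag
    return 'Z99'

assert fallback_tag('punc') == 'PUNCT'
assert fallback_tag('num') == 'N1'
assert fallback_tag('noun') == 'Z99'
```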

NOTE: this tagger has been designed to be flexible with the amount of resources available. For example, if you do not have a POS tagger or lemmatiser in your spaCy pipeline that is fine; just use the default pos_attribute and lemma_attribute.

Assigned Attributes

  • Token._.pymusas_tags : `List[str]`
    Predicted tags; the first tag in the `List` is the most likely tag.
  • Token._.pymusas_mwe_indexes : `List[Tuple[int, int]]`
    Each `Tuple` indicates the start and end token index of the associated Multi Word Expression (MWE). If the `List` contains more than one `Tuple` then the MWE is discontinuous. For single word expressions the `List` will only contain 1 `Tuple`, which will be (token_start_index, token_start_index + 1).
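To make the index convention concrete, here is a small hypothetical helper (not part of pymusas) that recovers the surface tokens of an MWE from a list of (start, end) Tuples; a discontinuous MWE simply yields tokens from more than one span:

```python
from typing import List, Tuple

def mwe_tokens(tokens: List[str],
               mwe_indexes: List[Tuple[int, int]]) -> List[str]:
    # Each (start, end) Tuple is a half-open token span: end is exclusive,
    # so a single word expression is represented as (i, i + 1)
    return [token
            for start, end in mwe_indexes
            for token in tokens[start:end]]

tokens = ['put', 'it', 'off']
# A discontinuous MWE ("put ... off") is represented by two Tuples
assert mwe_tokens(tokens, [(0, 1), (2, 3)]) == ['put', 'off']
# A single word expression uses one Tuple of (i, i + 1)
assert mwe_tokens(tokens, [(1, 2)]) == ['it']
```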

Config and implementation

The default config is defined by the pipeline component factory and describes how the component should be configured. You can override its settings via the config argument on nlp.add_pipe or in your config.cfg for training.

  • pymusas_tags_token_attr
    See parameters section below.
  • pymusas_mwe_indexes_attr
    See parameters section below.
  • pos_attribute
    See parameters section below.
  • lemma_attribute
    See parameters section below.
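For example, the settings above could be overridden in a training config.cfg like this (a sketch; the attribute values chosen here are illustrative, not defaults):

```ini
[components.pymusas_rule_based_tagger]
factory = "pymusas_rule_based_tagger"
pymusas_tags_token_attr = "semantic_tags"
pymusas_mwe_indexes_attr = "mwe_indexes"
pos_attribute = "pos_"
lemma_attribute = "lemma_"
```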

Parameters

  • name : str, optional (default = pymusas_rule_based_tagger)
    The component name. Defaults to the same name as the class variable COMPONENT_NAME.
  • pymusas_tags_token_attr : str, optional (default = pymusas_tags)
    The name of the attribute to assign the predicted tags to under the Token._ class.
  • pymusas_mwe_indexes_attr : str, optional (default = pymusas_mwe_indexes)
    The name of the attribute to assign the start and end token index of the associated MWE to under the Token._ class.
  • pos_attribute : str, optional (default = pos_)
    The name of the attribute that the Part Of Speech (POS) tag is assigned to within the Token class. The POS tag value that comes from this attribute has to be of type str. With the current default we take the POS tag from Token.pos_. The POS tag can be an empty string if you do not require POS information or do not have a POS tagger. NOTE that if you do not have a POS tagger the default value of Token.pos_ is an empty string.
  • lemma_attribute : str, optional (default = lemma_)
    The name of the attribute that the lemma is assigned to within the Token class. The lemma value that comes from this attribute has to be of type str. With the current default we take the lemma from Token.lemma_. The lemma can be an empty string if you do not require lemma information or do not have a lemmatiser. NOTE that if you do not have a lemmatiser the default value of Token.lemma_ is an empty string.

Instance Attributes

  • name : str
    The component name.
  • rules : List[pymusas.taggers.rules.rule.Rule], optional (default = None)
    The rules are set through the initialize method; until then the value of this attribute is None.
  • ranker : pymusas.rankers.lexicon_entry.LexiconEntryRanker, optional (default = None)
    The ranker is set through the initialize method; until then the value of this attribute is None.
  • default_punctuation_tags : Set[str]
    Set through the initialize method.
  • default_number_tags : Set[str]
    Set through the initialize method.
  • pymusas_tags_token_attr : str, optional (default = pymusas_tags)
    The given pymusas_tags_token_attr.
  • pymusas_mwe_indexes_attr : str, optional (default = pymusas_mwe_indexes)
    The given pymusas_mwe_indexes_attr.
  • pos_attribute : str, optional (default = pos_)
    The given pos_attribute.
  • lemma_attribute : str, optional (default = lemma_)
    The given lemma_attribute.

Class Attributes

  • COMPONENT_NAME : str
    Name of component factory that this component is registered under. This is used as the first argument to Language.add_pipe if you want to add this component to your spaCy pipeline.

Examples

import spacy
from pymusas.rankers.lexicon_entry import ContextualRuleBasedRanker
from pymusas.pos_mapper import BASIC_CORCENCC_TO_USAS_CORE
from pymusas.lexicon_collection import LexiconCollection
from pymusas.taggers.rules.single_word import SingleWordRule
from pymusas.spacy_api.taggers.rule_based import RuleBasedTagger

# Construction via spaCy pipeline
nlp = spacy.blank('en')

# Using default config
single_lexicon_url = 'https://raw.githubusercontent.com/UCREL/Multilingual-USAS/master/Welsh/semantic_lexicon_cy.tsv'
single_lexicon = LexiconCollection.from_tsv(single_lexicon_url)
single_lemma_lexicon = LexiconCollection.from_tsv(single_lexicon_url,
                                                 include_pos=False)
single_rule = SingleWordRule(single_lexicon, single_lemma_lexicon,
                             pos_mapper=BASIC_CORCENCC_TO_USAS_CORE)
rules = [single_rule]
ranker = ContextualRuleBasedRanker(*ContextualRuleBasedRanker.get_construction_arguments(rules))
tagger = nlp.add_pipe('pymusas_rule_based_tagger')
tagger.rules = rules
tagger.ranker = ranker
doc = nlp('aberth')
assert doc[0]._.pymusas_tags == ['S9', 'A9-']
assert doc[0]._.pymusas_mwe_indexes == [(0, 1)]

# Custom config
custom_config = {'pymusas_tags_token_attr': 'semantic_tags',
                 'pymusas_mwe_indexes_attr': 'mwe_indexes'}
nlp = spacy.blank('en')
tagger = nlp.add_pipe('pymusas_rule_based_tagger', config=custom_config)
tagger.rules = rules
tagger.ranker = ranker
doc = nlp('aberth')
assert doc[0]._.semantic_tags == ['S9', 'A9-']
assert doc[0]._.mwe_indexes == [(0, 1)]

COMPONENT_NAME

class RuleBasedTagger(spacy.pipeline.pipe.Pipe):
| ...
| COMPONENT_NAME = 'pymusas_rule_based_tagger'

initialize

class RuleBasedTagger(spacy.pipeline.pipe.Pipe):
| ...
| def initialize(
| self,
| get_examples: Optional[Callable[[], Iterable[Example]]] = None,
| *,
| nlp: Optional[Language] = None,
| rules: Optional[List[Rule]] = None,
| ranker: Optional[LexiconEntryRanker] = None,
| default_punctuation_tags: Optional[List[str]] = None,
| default_number_tags: Optional[List[str]] = None
| ) -> None

Initialize the tagger and load any of the resources given. This method is typically called by Language.initialize and lets you customize the arguments it receives via the initialize.components block in the config. The loading only happens during initialization, typically before training; at runtime, all data is loaded from disk.

Parameters

  • rules : List[pymusas.taggers.rules.rule.Rule]
    A list of rules to apply to the sequence of tokens in __call__. The output from each rule is concatenated and given to the ranker.
  • ranker : pymusas.rankers.lexicon_entry.LexiconEntryRanker
    A ranker to rank the combined output from all of the rules.
  • default_punctuation_tags : List[str], optional (default = None)
    The POS tags that represent punctuation. If None then ['punc'] is used. The list is converted into a Set before being assigned to the default_punctuation_tags attribute.
  • default_number_tags : List[str], optional (default = None)
    The POS tags that represent numbers. If None then ['num'] is used. The list is converted into a Set before being assigned to the default_number_tags attribute.

__call__

class RuleBasedTagger(spacy.pipeline.pipe.Pipe):
| ...
| def __call__(doc: Doc) -> Doc

Applies the tagger to the spaCy document, modifies it in place, and returns it. This usually happens under the hood when the nlp object is called on a text and all pipeline components are applied to the Doc in order.

Parameters

  • doc : Doc
    The Doc to process.

Returns

  • Doc

to_bytes

class RuleBasedTagger(spacy.pipeline.pipe.Pipe):
| ...
| def to_bytes(
| self,
| *,
| exclude: Iterable[str] = SimpleFrozenList()
| ) -> bytes

Serialises the tagger to a bytestring.

Parameters

  • exclude : Iterable[str], optional (default = SimpleFrozenList())
    This parameter currently has no effect and can be ignored.

Returns

  • bytes

Examples

from pymusas.rankers.lexicon_entry import ContextualRuleBasedRanker
from pymusas.taggers.rules.single_word import SingleWordRule
from pymusas.spacy_api.taggers.rule_based import RuleBasedTagger
rules = [SingleWordRule({'example|noun': ['Z1']}, {})]
ranker = ContextualRuleBasedRanker(*ContextualRuleBasedRanker.get_construction_arguments(rules))
tagger = RuleBasedTagger()
tagger.initialize(rules=rules, ranker=ranker)
tagger_bytes = tagger.to_bytes()

from_bytes

class RuleBasedTagger(spacy.pipeline.pipe.Pipe):
| ...
| def from_bytes(
| self,
| bytes_data: bytes,
| *,
| exclude: Iterable[str] = SimpleFrozenList()
| ) -> "RuleBasedTagger"

Loads the tagger from the given bytestring in place and returns it.

Parameters

  • bytes_data : bytes
    The bytestring to load.
  • exclude : Iterable[str], optional (default = SimpleFrozenList())
    This parameter currently has no effect and can be ignored.

Returns

  • RuleBasedTagger

Examples

from pymusas.rankers.lexicon_entry import ContextualRuleBasedRanker
from pymusas.taggers.rules.single_word import SingleWordRule
from pymusas.spacy_api.taggers.rule_based import RuleBasedTagger
rules = [SingleWordRule({'example|noun': ['Z1']}, {})]
ranker = ContextualRuleBasedRanker(*ContextualRuleBasedRanker.get_construction_arguments(rules))
tagger = RuleBasedTagger()
tagger.initialize(rules=rules, ranker=ranker)
# Create a new tagger, tagger 2
tagger_2 = RuleBasedTagger()
# Show that it is not the same as the original tagger
assert tagger_2.rules != rules
# Tagger 2 will now load in the data from the original tagger
_ = tagger_2.from_bytes(tagger.to_bytes())
assert tagger_2.rules == rules
assert tagger_2.ranker == ranker

to_disk

class RuleBasedTagger(spacy.pipeline.pipe.Pipe):
| ...
| def to_disk(
| self,
| path: Union[str, Path],
| *,
| exclude: Iterable[str] = SimpleFrozenList()
| ) -> None

Serialises the tagger to the given path.

Parameters

  • path : Union[str, Path]
    Path to a directory; may be either a string or a Path-like object. If the directory does not exist, a directory is created at the given path.

  • exclude : Iterable[str], optional (default = SimpleFrozenList())
    This parameter currently has no effect and can be ignored.

Returns

  • None

Examples

from pathlib import Path
from tempfile import TemporaryDirectory
from pymusas.rankers.lexicon_entry import ContextualRuleBasedRanker
from pymusas.taggers.rules.single_word import SingleWordRule
from pymusas.spacy_api.taggers.rule_based import RuleBasedTagger
rules = [SingleWordRule({'example|noun': ['Z1']}, {})]
ranker = ContextualRuleBasedRanker(*ContextualRuleBasedRanker.get_construction_arguments(rules))
tagger = RuleBasedTagger()
tagger.initialize(rules=rules, ranker=ranker)
with TemporaryDirectory() as temp_dir:
    _ = tagger.to_disk(temp_dir)

from_disk

class RuleBasedTagger(spacy.pipeline.pipe.Pipe):
| ...
| def from_disk(
| self,
| path: Union[str, Path],
| *,
| exclude: Iterable[str] = SimpleFrozenList()
| ) -> "RuleBasedTagger"

Loads the tagger from the given path in place and returns it.

Parameters

  • path : Union[str, Path]
    Path to an existing directory; may be either a string or a Path-like object.

  • exclude : Iterable[str], optional (default = SimpleFrozenList())
    This parameter currently has no effect and can be ignored.

Returns

  • RuleBasedTagger

Examples

from pathlib import Path
from tempfile import TemporaryDirectory
from pymusas.rankers.lexicon_entry import ContextualRuleBasedRanker
from pymusas.taggers.rules.single_word import SingleWordRule
from pymusas.spacy_api.taggers.rule_based import RuleBasedTagger
rules = [SingleWordRule({'example|noun': ['Z1']}, {})]
ranker = ContextualRuleBasedRanker(*ContextualRuleBasedRanker.get_construction_arguments(rules))
tagger = RuleBasedTagger()
tagger.initialize(rules=rules, ranker=ranker)
# Create an empty second tagger
tagger_2 = RuleBasedTagger()
assert tagger_2.rules is None
with TemporaryDirectory() as temp_dir:
    _ = tagger.to_disk(temp_dir)
    _ = tagger_2.from_disk(temp_dir)

assert tagger_2.rules is not None
assert tagger_2.rules == tagger.rules