pymusas.spacy_api.taggers.rule_based
RuleBasedTagger
class RuleBasedTagger(spacy.pipeline.pipe.Pipe):
| ...
| def __init__(
| self,
| name: str = 'pymusas_rule_based_tagger',
| pymusas_tags_token_attr: str = 'pymusas_tags',
| pymusas_mwe_indexes_attr: str = 'pymusas_mwe_indexes',
| pos_attribute: str = 'pos_',
| lemma_attribute: str = 'lemma_'
| ) -> None
spaCy pipeline component of the pymusas.taggers.rule_based.RuleBasedTagger.
This component applies one or more pymusas.taggers.rules.rule.Rules to create a list of possible candidate tags for each token in the sequence. Each candidate, represented as a pymusas.rankers.ranking_meta_data.RankingMetaData object, is then ranked using a pymusas.rankers.lexicon_entry.LexiconEntryRanker. The best candidate and its associated tag(s) for each token are assigned to the `Token._.pymusas_tags` attribute. In addition, a `List` of token indexes indicating whether the token is part of a Multi Word Expression (MWE) is assigned to the `Token._.pymusas_mwe_indexes` attribute.
If a token cannot be tagged, the following fallback process is applied:
- If the token's POS tag is in `default_punctuation_tags`, the tag `PUNCT` is assigned.
- If the token's POS tag is in `default_number_tags`, the tag `N1` is assigned.
- Otherwise, the default tag `Z99` is assigned.
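The fallback steps can be sketched in plain Python. This is a minimal illustration of the decision order, not the actual pymusas implementation; the tag names and default POS tag lists are those documented on this page:

```python
def fallback_tag(pos_tag: str,
                 default_punctuation_tags: frozenset = frozenset({'punc'}),
                 default_number_tags: frozenset = frozenset({'num'})) -> list:
    """Return the fallback tag list for a token that no rule could tag."""
    if pos_tag in default_punctuation_tags:
        return ['PUNCT']  # token is punctuation
    if pos_tag in default_number_tags:
        return ['N1']     # token is a number
    return ['Z99']        # default "unmatched" tag

assert fallback_tag('punc') == ['PUNCT']
assert fallback_tag('num') == ['N1']
assert fallback_tag('NOUN') == ['Z99']
```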
NOTE: this tagger has been designed to be flexible with the amount of resources available; for example, if you do not have a POS tagger or lemmatiser in your spaCy pipeline that is fine, just use the default `pos_attribute` and `lemma_attribute`.
Assigned Attributes
Location | Type | Value |
---|---|---|
Token._.pymusas_tags | `List[str]` | Predicted tags; the first tag in the `List` of tags is the most likely tag. |
Token._.pymusas_mwe_indexes | `List[Tuple[int, int]]` | Each `Tuple` indicates the start and end token index of the associated Multi Word Expression (MWE). If the `List` contains more than one `Tuple` then the MWE is discontinuous. For single word expressions the `List` will only contain 1 `Tuple` which will be (token_start_index, token_start_index + 1). |
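The shape of `pymusas_mwe_indexes` described in the table above can be interpreted with plain Python. This sketch uses made-up index values and is not tied to any particular text:

```python
from typing import List, Tuple

def describe_mwe(mwe_indexes: List[Tuple[int, int]]) -> str:
    """Classify a token's MWE indexes as described in the table above."""
    if len(mwe_indexes) > 1:
        # More than one (start, end) span means the MWE is discontinuous.
        return 'discontinuous MWE'
    start, end = mwe_indexes[0]
    if end - start == 1:
        # A single span of length 1: (token_start_index, token_start_index + 1).
        return 'single word expression'
    return 'continuous MWE'

assert describe_mwe([(0, 1)]) == 'single word expression'
assert describe_mwe([(2, 5)]) == 'continuous MWE'
assert describe_mwe([(0, 1), (3, 4)]) == 'discontinuous MWE'
```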
Config and implementation
The default config is defined by the pipeline component factory and describes
how the component should be configured. You can override its settings via the config
argument on nlp.add_pipe or in your
config.cfg for training.
Setting | Description |
---|---|
pymusas_tags_token_attr | See parameters section below |
pymusas_mwe_indexes_attr | See parameters section below |
pos_attribute | See parameters section below |
lemma_attribute | See parameters section below |
Parameters
- name : `str`, optional (default = `pymusas_rule_based_tagger`)
  The component name. Defaults to the same name as the class variable `COMPONENT_NAME`.
- pymusas_tags_token_attr : `str`, optional (default = `pymusas_tags`)
  The name of the attribute to assign the predicted tags to under the `Token._` class.
- pymusas_mwe_indexes_attr : `str`, optional (default = `pymusas_mwe_indexes`)
  The name of the attribute to assign the start and end token indexes of the associated MWE to under the `Token._` class.
- pos_attribute : `str`, optional (default = `pos_`)
  The name of the attribute that the Part Of Speech (POS) tag is assigned to within the `Token` class. The POS tag value that comes from this attribute has to be of type `str`. With the current default we take the POS tag from `Token.pos_`. The POS tag can be an empty string if you do not require POS information or if you do not have a POS tagger. NOTE: if you do not have a POS tagger, the default value for `Token.pos_` is an empty string.
- lemma_attribute : `str`, optional (default = `lemma_`)
  The name of the attribute that the lemma is assigned to within the `Token` class. The lemma value that comes from this attribute has to be of type `str`. With the current default we take the lemma from `Token.lemma_`. The lemma can be an empty string if you do not require lemma information or if you do not have a lemmatiser. NOTE: if you do not have a lemmatiser, the default value for `Token.lemma_` is an empty string.
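Conceptually, `pos_attribute` and `lemma_attribute` are just attribute names that the tagger looks up on each token. The sketch below illustrates the idea with a plain stand-in object rather than a real spaCy `Token`; the helper function is hypothetical, not part of the pymusas API:

```python
from types import SimpleNamespace

def read_token_attributes(token, pos_attribute='pos_', lemma_attribute='lemma_'):
    """Look up the POS tag and lemma on a token by attribute name."""
    pos = getattr(token, pos_attribute)
    lemma = getattr(token, lemma_attribute)
    return pos, lemma

# A stand-in for a spaCy Token whose fine-grained tag lives under `tag_`;
# `pos_` is empty, as it would be without a POS tagger in the pipeline.
token = SimpleNamespace(pos_='', tag_='NN', lemma_='example')

# With the defaults, `pos_` is read, which is an empty string here.
assert read_token_attributes(token) == ('', 'example')

# Pointing `pos_attribute` at `tag_` reads the fine-grained tag instead.
assert read_token_attributes(token, pos_attribute='tag_') == ('NN', 'example')
```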
Instance Attributes
- name : `str`
  The component name.
- rules : `List[pymusas.taggers.rules.rule.Rule]`, optional (default = `None`)
  The `rules` attribute is set through the `initialize` method. Before it is set by the `initialize` method the value of this attribute is `None`.
- ranker : `pymusas.rankers.lexicon_entry.LexiconEntryRanker`, optional (default = `None`)
  The `ranker` attribute is set through the `initialize` method. Before it is set by the `initialize` method the value of this attribute is `None`.
- default_punctuation_tags : `Set[str]`
  The `default_punctuation_tags` attribute is set through the `initialize` method.
- default_number_tags : `Set[str]`
  The `default_number_tags` attribute is set through the `initialize` method.
- pymusas_tags_token_attr : `str`, optional (default = `pymusas_tags`)
  The given `pymusas_tags_token_attr`.
- pymusas_mwe_indexes_attr : `str`, optional (default = `pymusas_mwe_indexes`)
  The given `pymusas_mwe_indexes_attr`.
- pos_attribute : `str`, optional (default = `pos_`)
  The given `pos_attribute`.
- lemma_attribute : `str`, optional (default = `lemma_`)
  The given `lemma_attribute`.
Class Attributes
- COMPONENT_NAME : `str`
  Name of the component factory that this component is registered under. This is used as the first argument to `Language.add_pipe` if you want to add this component to your spaCy pipeline.
Examples
import spacy
from pymusas.rankers.lexicon_entry import ContextualRuleBasedRanker
from pymusas.pos_mapper import BASIC_CORCENCC_TO_USAS_CORE
from pymusas.lexicon_collection import LexiconCollection
from pymusas.taggers.rules.single_word import SingleWordRule
from pymusas.spacy_api.taggers.rule_based import RuleBasedTagger
# Construction via spaCy pipeline
nlp = spacy.blank('en')
# Using default config
single_lexicon_url = 'https://raw.githubusercontent.com/UCREL/Multilingual-USAS/master/Welsh/semantic_lexicon_cy.tsv'
single_lexicon = LexiconCollection.from_tsv(single_lexicon_url)
single_lemma_lexicon = LexiconCollection.from_tsv(single_lexicon_url,
include_pos=False)
single_rule = SingleWordRule(single_lexicon, single_lemma_lexicon,
pos_mapper=BASIC_CORCENCC_TO_USAS_CORE)
rules = [single_rule]
ranker = ContextualRuleBasedRanker(*ContextualRuleBasedRanker.get_construction_arguments(rules))
tagger = nlp.add_pipe('pymusas_rule_based_tagger')
tagger.rules = rules
tagger.ranker = ranker
doc = nlp('aberth')
assert doc[0]._.pymusas_tags == ['S9', 'A9-']
assert doc[0]._.pymusas_mwe_indexes == [(0, 1)]
# Custom config
custom_config = {'pymusas_tags_token_attr': 'semantic_tags',
'pymusas_mwe_indexes_attr': 'mwe_indexes'}
nlp = spacy.blank('en')
tagger = nlp.add_pipe('pymusas_rule_based_tagger', config=custom_config)
tagger.rules = rules
tagger.ranker = ranker
doc = nlp('aberth')
assert doc[0]._.semantic_tags == ['S9', 'A9-']
assert doc[0]._.mwe_indexes == [(0, 1)]
COMPONENT_NAME
class RuleBasedTagger(spacy.pipeline.pipe.Pipe):
| ...
| COMPONENT_NAME = 'pymusas_rule_based_tagger'
initialize
class RuleBasedTagger(spacy.pipeline.pipe.Pipe):
| ...
| def initialize(
| self,
| get_examples: Optional[Callable[[], Iterable[Example]]] = None,
| *,
| nlp: Optional[Language] = None,
| rules: Optional[List[Rule]] = None,
| ranker: Optional[LexiconEntryRanker] = None,
| default_punctuation_tags: Optional[List[str]] = None,
| default_number_tags: Optional[List[str]] = None
| ) -> None
Initialize the tagger and load any of the resources given. This method is typically called by Language.initialize and lets you customize arguments it receives via the initialize.components block in the config. The loading only happens during initialization, typically before training. At runtime, all data is loaded from disk.
Parameters
- rules : `List[pymusas.taggers.rules.rule.Rule]`
  A list of rules to apply to the sequence of tokens in `__call__`. The output from each rule is concatenated and given to the `ranker`.
- ranker : `pymusas.rankers.lexicon_entry.LexiconEntryRanker`
  A ranker to rank the output from all of the `rules`.
- default_punctuation_tags : `List[str]`, optional (default = `None`)
  The POS tags that represent punctuation. If `None` then `['punc']` is used. The list will be converted into a `Set` before being assigned to the `default_punctuation_tags` attribute.
- default_number_tags : `List[str]`, optional (default = `None`)
  The POS tags that represent numbers. If `None` then `['num']` is used. The list will be converted into a `Set` before being assigned to the `default_number_tags` attribute.
__call__
class RuleBasedTagger(spacy.pipeline.pipe.Pipe):
| ...
| def __call__(doc: Doc) -> Doc
Applies the tagger to the spaCy document, modifies it in place, and returns it. This usually happens under the hood when the nlp object is called on a text and all pipeline components are applied to the Doc in order.
Parameters
- doc : `Doc`
  A spaCy `Doc`.
Returns
Doc
to_bytes
class RuleBasedTagger(spacy.pipeline.pipe.Pipe):
| ...
| def to_bytes(
| self,
| *,
| exclude: Iterable[str] = SimpleFrozenList()
| ) -> bytes
Serialises the tagger to a bytestring.
Parameters
- exclude : `Iterable[str]`, optional (default = `SimpleFrozenList()`)
  This currently does not do anything, please ignore it.
Returns
bytes
Examples
from pymusas.rankers.lexicon_entry import ContextualRuleBasedRanker
from pymusas.taggers.rules.single_word import SingleWordRule
from pymusas.spacy_api.taggers.rule_based import RuleBasedTagger
rules = [SingleWordRule({'example|noun': ['Z1']}, {})]
ranker = ContextualRuleBasedRanker(*ContextualRuleBasedRanker.get_construction_arguments(rules))
tagger = RuleBasedTagger()
tagger.initialize(rules=rules, ranker=ranker)
tagger_bytes = tagger.to_bytes()
from_bytes
class RuleBasedTagger(spacy.pipeline.pipe.Pipe):
| ...
| def from_bytes(
| self,
| bytes_data: bytes,
| *,
| exclude: Iterable[str] = SimpleFrozenList()
| ) -> "RuleBasedTagger"
Loads the tagger from the given bytestring in place and returns it.
Parameters
- bytes_data : `bytes`
  The bytestring to load.
- exclude : `Iterable[str]`, optional (default = `SimpleFrozenList()`)
  This currently does not do anything, please ignore it.
Returns
RuleBasedTagger
Examples
from pymusas.rankers.lexicon_entry import ContextualRuleBasedRanker
from pymusas.taggers.rules.single_word import SingleWordRule
from pymusas.spacy_api.taggers.rule_based import RuleBasedTagger
rules = [SingleWordRule({'example|noun': ['Z1']}, {})]
ranker = ContextualRuleBasedRanker(*ContextualRuleBasedRanker.get_construction_arguments(rules))
tagger = RuleBasedTagger()
tagger.initialize(rules=rules, ranker=ranker)
# Create a new tagger, tagger 2
tagger_2 = RuleBasedTagger()
# Show that it is not the same as the original tagger
assert tagger_2.rules != rules
# Tagger 2 will now load in the data from the original tagger
_ = tagger_2.from_bytes(tagger.to_bytes())
assert tagger_2.rules == rules
assert tagger_2.ranker == ranker
to_disk
class RuleBasedTagger(spacy.pipeline.pipe.Pipe):
| ...
| def to_disk(
| self,
| path: Union[str, Path],
| *,
| exclude: Iterable[str] = SimpleFrozenList()
| ) -> None
Serialises the tagger to the given `path`.
Parameters
- path : `Union[str, Path]`
  Path to a directory. Path may be either a string or `Path`-like object. If the directory does not exist it attempts to create a directory at the given `path`.
- exclude : `Iterable[str]`, optional (default = `SimpleFrozenList()`)
  This currently does not do anything, please ignore it.
Returns
None
Examples
from pathlib import Path
from tempfile import TemporaryDirectory
from pymusas.rankers.lexicon_entry import ContextualRuleBasedRanker
from pymusas.taggers.rules.single_word import SingleWordRule
from pymusas.spacy_api.taggers.rule_based import RuleBasedTagger
rules = [SingleWordRule({'example|noun': ['Z1']}, {})]
ranker = ContextualRuleBasedRanker(*ContextualRuleBasedRanker.get_construction_arguments(rules))
tagger = RuleBasedTagger()
tagger.initialize(rules=rules, ranker=ranker)
with TemporaryDirectory() as temp_dir:
_ = tagger.to_disk(temp_dir)
from_disk
class RuleBasedTagger(spacy.pipeline.pipe.Pipe):
| ...
| def from_disk(
| self,
| path: Union[str, Path],
| *,
| exclude: Iterable[str] = SimpleFrozenList()
| ) -> "RuleBasedTagger"
Loads the tagger from the given `path` in place and returns it.
Parameters
- path : `Union[str, Path]`
  Path to an existing directory. Path may be either a string or `Path`-like object.
- exclude : `Iterable[str]`, optional (default = `SimpleFrozenList()`)
  This currently does not do anything, please ignore it.
Returns
RuleBasedTagger
Examples
from pathlib import Path
from tempfile import TemporaryDirectory
from pymusas.rankers.lexicon_entry import ContextualRuleBasedRanker
from pymusas.taggers.rules.single_word import SingleWordRule
from pymusas.spacy_api.taggers.rule_based import RuleBasedTagger
rules = [SingleWordRule({'example|noun': ['Z1']}, {})]
ranker = ContextualRuleBasedRanker(*ContextualRuleBasedRanker.get_construction_arguments(rules))
tagger = RuleBasedTagger()
tagger.initialize(rules=rules, ranker=ranker)
# Create an empty second tagger
tagger_2 = RuleBasedTagger()
assert tagger_2.rules is None
with TemporaryDirectory() as temp_dir:
_ = tagger.to_disk(temp_dir)
_ = tagger_2.from_disk(temp_dir)
assert tagger_2.rules is not None
assert tagger_2.rules == tagger.rules