lexicon_entry
pymusas.rankers.lexicon_entry
LexiconEntryRanker
class LexiconEntryRanker(Serialise)
An abstract class that defines the basic methods, __call__,
to_bytes, and from_bytes, that are required for all
LexiconEntryRankers.
Each lexicon entry match is represented by a
pymusas.rankers.ranking_meta_data.RankingMetaData object.
Lower ranked lexicon entry matches should be given priority when making tagging decisions. A rank of 0 is better than a rank of 1.
A LexiconEntryRanker, when called via __call__, returns a tuple of two Lists
whereby each entry in each list corresponds to a token:
- A List[int] containing the ranks of the lexicon entry matches for that token. Note that the List can be empty if a token has no lexicon entry matches.
- An Optional[RankingMetaData] that is the global lowest ranked entry match for that token. If the value is None then no global lowest ranked entry can be found for that token. If the RankingMetaData represents more than one token, like a Multi Word Expression (MWE) match, then those associated tokens will have the same RankingMetaData object as the global lowest ranked entry match.
The tagger will have to decide how to handle global lowest ranked
matches with a value of None; a suggested approach would be to assign an
unmatched/unknown semantic tag to those tokens.
The reason for adding the second list is that the global lowest
ranked match is not the same as the local/token lowest ranked match, due
to the potential of overlapping matches. For example, North East London brewery
can have matches of North East, North, and East London brewery. The local
lowest rank for North would be North East, but as we have a lower ranked
match that uses East, namely East London brewery, the global lowest rank
for North would be North.
__call__
class LexiconEntryRanker(Serialise):
| ...
| @abstractmethod
| def __call__(
| self,
| token_ranking_data: List[List[RankingMetaData]]
| ) -> Tuple[List[List[int]], List[Optional[RankingMetaData]]]
For each token it returns a List of rankings for each lexicon entry
match and the optional pymusas.rankers.ranking_meta_data.RankingMetaData
object of the global lowest ranked match for each token.
Parameters¶
- token_ranking_data :
List[List[RankingMetaData]]
For each token a List of pymusas.rankers.ranking_meta_data.RankingMetaData representing the lexicon entry matches.
Returns¶
Tuple[List[List[int]], List[Optional[RankingMetaData]]]
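To make the contract concrete, below is a minimal sketch of a custom ranker. FirstMatchRanker is hypothetical, not part of pymusas, and it assumes that to_bytes and from_bytes are the only other abstract methods that need implementing; it also skips the overlapping-match bookkeeping a real ranker, such as ContextualRuleBasedRanker below, must do.
from typing import List, Optional, Tuple
from pymusas.rankers.lexicon_entry import LexiconEntryRanker
from pymusas.rankers.ranking_meta_data import RankingMetaData

class FirstMatchRanker(LexiconEntryRanker):
    # Toy ranker: a match's rank is its order of appearance for that token.
    def __call__(
        self,
        token_ranking_data: List[List[RankingMetaData]]
    ) -> Tuple[List[List[int]], List[Optional[RankingMetaData]]]:
        # Rank 0 is the best, so earlier matches win.
        token_rankings = [list(range(len(matches)))
                          for matches in token_ranking_data]
        # Naively take each token's first match as its global lowest; a real
        # ranker must also resolve overlapping matches across tokens.
        global_lowest = [matches[0] if matches else None
                         for matches in token_ranking_data]
        return token_rankings, global_lowest

    def to_bytes(self) -> bytes:
        return b''  # No state to serialise in this toy example.

    @staticmethod
    def from_bytes(bytes_data: bytes) -> 'FirstMatchRanker':
        return FirstMatchRanker()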
ContextualRuleBasedRanker
class ContextualRuleBasedRanker(LexiconEntryRanker):
| ...
| def __init__(
| self,
| maximum_n_gram_length: int,
| maximum_number_wildcards: int
| ) -> None
The contextual rule based ranker creates ranks based on the rules stated below.
Each lexicon entry match is represented by a
pymusas.rankers.ranking_meta_data.RankingMetaData object.
Lower ranked lexicon entry matches should be given priority when making
tagging decisions. See the LexiconEntryRanker class docstring for
more details on the returned value of the __call__ method.
Ranking Rules:
The ranking of lexicon entries is based on the following rules, which follow the six heuristics stated at the top of column 2 on page 4 of Piao et al. 2003:
First we create an initial ranking based on lexicon entry type:
- Multi Word Expression (MWE) entries are ranked lower than single word entries, and Non-Special entries are ranked lower than wildcard entries.
Then within these rankings we further rank based on:
- Longer entries, based on n-gram length, are ranked lower.
- Entries with fewer wildcards are ranked lower.
Then we apply the following contextual ranking rules:
- Whether the POS information was excluded in the match; if so, these are ranked higher. This is only True when the match ignores the POS information for single word lexicon entries, and is always False for a MWE lexicon entry match.
- Whether the lexicon entry was matched on: Token < Lemma < Lower cased token < Lower cased lemma. Token is ranked lowest and lower cased lemma highest.
- The lexicon entry that first appears in the text is ranked lowest; this is required for matches that do not apply to the same sequence of tokens.
In the case whereby the global lowest ranked lexicon entry match is joint ranked with another entry, it is random which lexicon entry match is chosen.
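To make the first two rules concrete, here is a hedged sketch using the classes documented on this page; the two token text ski boot, the Z1 semantic tag, and the match values are illustrative assumptions, not library output:
from pymusas.rankers.lexicon_entry import ContextualRuleBasedRanker
from pymusas.rankers.ranking_meta_data import RankingMetaData
from pymusas.lexicon_collection import LexiconType
from pymusas.rankers.lexical_match import LexicalMatch
# A 2-gram MWE match over both tokens and a single word match over the first.
mwe_match = RankingMetaData(LexiconType.MWE_NON_SPECIAL, 2, 0, False,
                            LexicalMatch.TOKEN, 0, 2,
                            'ski_noun boot_noun', ('Z1',))
single_match = RankingMetaData(LexiconType.SINGLE_NON_SPECIAL, 1, 0, False,
                               LexicalMatch.TOKEN, 0, 1, 'ski_noun', ('Z1',))
token_ranking_data = [[mwe_match, single_match], [mwe_match]]
ranker = ContextualRuleBasedRanker(2, 0)
token_rankings, global_lowest = ranker(token_ranking_data)
# The MWE match is ranked lower (better) than the single word match, so it
# should become the global lowest ranked match for both tokens.
assert token_rankings[0][0] < token_rankings[0][1]
assert global_lowest == [mwe_match, mwe_match]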
Parameters¶
- maximum_n_gram_length :
int
The largest n-gram rule match that will be encountered, e.g. a match of ski_noun boot_noun will have an n-gram length of 2. - maximum_number_wildcards :
int
The number of wildcards in the rule that contains the most wildcards, e.g. the rule ski_* *_noun would contain 2 wildcards. This can be 0 if you have no wildcard rules.
Instance Attributes¶
- n_gram_number_indexes :
int
The number of indexes that each n-gram length value should have when converting the n-gram length to a string using pymusas.rankers.lexicon_entry.ContextualRuleBasedRanker.int_2_str. - wildcards_number_indexes :
int
The number of indexes that each wildcard count value should have when converting the wildcard count value to a string using pymusas.rankers.lexicon_entry.ContextualRuleBasedRanker.int_2_str. - n_gram_ranking_dictionary :
Dict[int, int]
Maps each n-gram length to its rank value; the larger the n-gram length, the lower its rank.
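As a quick illustrative check (the exact mapping is an assumption based on the inverse relationship described above):
from pymusas.rankers.lexicon_entry import ContextualRuleBasedRanker
ranker = ContextualRuleBasedRanker(3, 0)
# Larger n-gram lengths map to lower (better) ranks.
assert ranker.n_gram_ranking_dictionary == {3: 1, 2: 2, 1: 3}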
to_bytes
class ContextualRuleBasedRanker(LexiconEntryRanker):
| ...
| def to_bytes(self) -> bytes
Serialises the ContextualRuleBasedRanker to a bytestring.
Returns¶
bytes
from_bytes
class ContextualRuleBasedRanker(LexiconEntryRanker):
| ...
| @staticmethod
| def from_bytes(bytes_data: bytes) -> "ContextualRuleBasedRanker"
Loads ContextualRuleBasedRanker from the given bytestring and
returns it.
Parameters¶
- bytes_data :
bytes
The bytestring to load.
Returns¶
ContextualRuleBasedRanker
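For example, a serialisation round trip returns an equal ranker:
from pymusas.rankers.lexicon_entry import ContextualRuleBasedRanker
ranker = ContextualRuleBasedRanker(3, 0)
loaded_ranker = ContextualRuleBasedRanker.from_bytes(ranker.to_bytes())
assert ranker == loaded_ranker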
get_construction_arguments
class ContextualRuleBasedRanker(LexiconEntryRanker):
| ...
| @staticmethod
| def get_construction_arguments(
| rules: List['Rule']
| ) -> Tuple[int, int]
Given a List of rules it will return the maximum_n_gram_length and
maximum_number_wildcards from the lexicon collections that those
pymusas.taggers.rules.rule.Rule(s) are based on. The output from
this function can then be used as the arguments to the constructor of
ContextualRuleBasedRanker.
Parameters¶
- rules :
List[Rule]
A List of rules. This List is typically required when creating a pymusas.taggers.rule_based.RuleBasedTagger tagger.
Returns¶
Tuple[int, int]
Examples¶
from pymusas.rankers.lexicon_entry import ContextualRuleBasedRanker
from pymusas.taggers.rules.mwe import MWERule
from pymusas.lexicon_collection import MWELexiconCollection
# Load the Portuguese MWE lexicon and build a MWE rule from it.
pt_mwe_lexicon_url = "https://raw.githubusercontent.com/UCREL/Multilingual-USAS/master/Portuguese/mwe-pt.tsv"
mwe_dict = MWELexiconCollection.from_tsv(pt_mwe_lexicon_url)
mwe_rule = MWERule(mwe_dict)
# Derive the ranker's constructor arguments from the rules.
ranker_construction_arguments = ContextualRuleBasedRanker.get_construction_arguments([mwe_rule])
ranker = ContextualRuleBasedRanker(*ranker_construction_arguments)
int_2_str
class ContextualRuleBasedRanker(LexiconEntryRanker):
| ...
| @staticmethod
| def int_2_str(int_value: int, number_indexes: int) -> str
Converts the integer, int_value, to a string with number_indexes,
e.g. 10 and 05 both have number_indexes of 2 and 001, 020,
and 211 have number_indexes of 3.
Parameters¶
- int_value :
int
The integer to convert to a string with the given number_indexes. - number_indexes :
int
The number of indexes the int_value should have in the returned string.
Returns¶
str
Raises¶
- ValueError
If the number of indexes of int_value, when converted to a string, is greater than the given number_indexes.
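A short sketch of the expected behaviour, assuming the zero padding implied by the 05 and 001 examples above:
from pymusas.rankers.lexicon_entry import ContextualRuleBasedRanker
assert ContextualRuleBasedRanker.int_2_str(5, 2) == '05'
assert ContextualRuleBasedRanker.int_2_str(10, 2) == '10'
# 100 needs 3 indexes, so requesting 2 should raise a ValueError.
try:
    ContextualRuleBasedRanker.int_2_str(100, 2)
except ValueError:
    pass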
get_global_lowest_ranks
class ContextualRuleBasedRanker(LexiconEntryRanker):
| ...
| @staticmethod
| def get_global_lowest_ranks(
| token_ranking_data: List[List[RankingMetaData]],
| token_rankings: List[List[int]],
| ranking_data_to_exclude: Optional[Set[RankingMetaData]] = None
| ) -> List[Optional[RankingMetaData]]
Returns the global lowest ranked entry match for each token. If the value
is None then no global lowest ranked entry can be found for that token.
If the RankingMetaData represents more than one token, like a Multi
Word Expression (MWE) match, then those associated tokens will have the
same RankingMetaData object as the global lowest ranked entry match.
Time complexity: given N, the number of tokens, M, the number of unique ranking data, and P, the total number of ranking data (non-unique), the time complexity is:
O(N + P) + O(M log M) + O(M)
Parameters¶
- token_ranking_data :
List[List[RankingMetaData]]
For each token a List of pymusas.rankers.ranking_meta_data.RankingMetaData representing the lexicon entry matches. - token_rankings :
List[List[int]]
For each token, the ranks of the lexicon entry matches. Note that the List can be empty if a token has no lexicon entry matches. - ranking_data_to_exclude :
Set[RankingMetaData], optional (default = None)
Any pymusas.rankers.ranking_meta_data.RankingMetaData to exclude from the ranking selection; this can be useful when wanting to get the next best global rank for each token.
Raises¶
AssertionError
If the length of token_ranking_data is not equal to the length of token_rankings, for both the outer and inner Lists.
Examples¶
from pymusas.rankers.lexicon_entry import ContextualRuleBasedRanker
from pymusas.rankers.ranking_meta_data import RankingMetaData
from pymusas.lexicon_collection import LexiconType
from pymusas.rankers.lexical_match import LexicalMatch
# Two overlapping MWE matches over the text `North East London brewery`:
# north_east spans token indexes 0-2 and east_london_brewery spans 1-4
# (end indexes are exclusive).
north_east = RankingMetaData(LexiconType.MWE_NON_SPECIAL, 2, 0,
                             False, LexicalMatch.TOKEN, 0, 2,
                             'North_noun East_noun', ('Z1',))
east_london_brewery = RankingMetaData(LexiconType.MWE_NON_SPECIAL, 3, 0,
                                      False, LexicalMatch.TOKEN, 1, 4,
                                      'East_noun London_noun brewery_noun', ('Z1',))
token_ranking_data = [
[
north_east
],
[
north_east,
east_london_brewery
],
[
east_london_brewery
],
[
east_london_brewery
]
]
token_rankings = [[120110], [120110, 110111], [110111], [110111]]
expected_lowest_ranked_matches = [None, east_london_brewery,
east_london_brewery, east_london_brewery]
assert (ContextualRuleBasedRanker.get_global_lowest_ranks(token_ranking_data, token_rankings, None)
== expected_lowest_ranked_matches)
Following on from the previous example, we now want to find the next best
global match for each token, so we exclude the current best global match
for each token, which is the east_london_brewery match:
expected_lowest_ranked_matches = [north_east, north_east, None, None]
ranking_data_to_exclude = {east_london_brewery}
assert (ContextualRuleBasedRanker.get_global_lowest_ranks(token_ranking_data, token_rankings,
ranking_data_to_exclude)
== expected_lowest_ranked_matches)
__call__
class ContextualRuleBasedRanker(LexiconEntryRanker):
| ...
| def __call__(
| self,
| token_ranking_data: List[List[RankingMetaData]]
| ) -> Tuple[List[List[int]], List[Optional[RankingMetaData]]]
For each token it returns a List of rankings for each lexicon entry
match and the optional pymusas.rankers.ranking_meta_data.RankingMetaData
object of the global lowest ranked match for each token.
See the ranking rules in the class docstring for details on how each lexicon entry match is ranked.
Time complexity: given N, the number of tokens, M, the number of unique ranking data, and P, the total number of ranking data (non-unique), the time complexity is:
O(3(N + P)) + O(M log M) + O(M)
Parameters¶
- token_ranking_data :
List[List[RankingMetaData]]
For each token a List of pymusas.rankers.ranking_meta_data.RankingMetaData representing the lexicon entry matches.
Returns¶
Tuple[List[List[int]], List[Optional[RankingMetaData]]]
Examples¶
from pymusas.rankers.lexicon_entry import ContextualRuleBasedRanker
from pymusas.rankers.ranking_meta_data import RankingMetaData
from pymusas.lexicon_collection import LexiconType
from pymusas.rankers.lexical_match import LexicalMatch
north_east = RankingMetaData(LexiconType.MWE_NON_SPECIAL, 2, 0,
False, LexicalMatch.TOKEN, 0, 2,
'North_noun East_noun', ('Z1',))
east_london_brewery = RankingMetaData(LexiconType.MWE_NON_SPECIAL, 3, 0,
False, LexicalMatch.TOKEN, 1, 4,
'East_noun London_noun brewery_noun', ('Z1',))
token_ranking_data = [
[
north_east
],
[
north_east,
east_london_brewery
],
[
east_london_brewery
],
[
east_london_brewery
]
]
expected_ranks = [[120110], [120110, 110111], [110111], [110111]]
expected_lowest_ranked_matches = [None, east_london_brewery,
east_london_brewery, east_london_brewery]
ranker = ContextualRuleBasedRanker(3, 0)
assert ((expected_ranks, expected_lowest_ranked_matches)
== ranker(token_ranking_data))
__eq__
class ContextualRuleBasedRanker(LexiconEntryRanker):
| ...
| def __eq__(self, other: object) -> bool
Given another object to compare to, it will return True if the other
object is of the same class and was initialised with the same
maximum_n_gram_length and maximum_number_wildcards values.
Parameters¶
- other :
object
The object to compare to.
Returns¶
bool
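For example:
from pymusas.rankers.lexicon_entry import ContextualRuleBasedRanker
assert ContextualRuleBasedRanker(3, 0) == ContextualRuleBasedRanker(3, 0)
assert ContextualRuleBasedRanker(3, 0) != ContextualRuleBasedRanker(2, 1)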