util
pymusas.taggers.rules.util
n_gram_indexes
def n_gram_indexes(
    sequence: Sequence[Any],
    min_n: int,
    max_n: int
) -> Iterator[Tuple[int, int]]
Returns n-grams as indexes of the sequence, in the range from max_n to min_n,
in order of largest n-grams first. Each index is a (start, end) pair in which
end is exclusive. If you only want one n-gram size then set min_n equal to
max_n; for example, to get bi-gram indexes set both min_n and max_n to 2.
Parameters
- sequence : Sequence[Any]
  The sequence to generate n-gram indexes from.
- min_n : int
  Minimum size n-gram. Has to be greater than 0.
- max_n : int
  Maximum size n-gram. This has to be equal to or greater than min_n. If this is greater than the length of the sequence then it is set to the length of the sequence.
Returns
Iterator[Tuple[int, int]]
Raises
ValueError
If min_n is less than 1 or max_n is less than min_n.
Examples
from pymusas.taggers.rules.util import n_gram_indexes
tokens = ['hello', 'how', 'are', 'you', ',']
token_n_gram_indexes = n_gram_indexes(tokens, 2, 3)
expected_n_grams_indexes = [(0, 3), (1, 4), (2, 5), (0, 2), (1, 3), (2, 4), (3, 5)]
assert expected_n_grams_indexes == list(token_n_gram_indexes)
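Each returned pair can be used directly as a slice into the original sequence. The sketch below is a standalone re-implementation written only from the behaviour documented for n_gram_indexes; it is not the PyMUSAS source, and the name sketch_n_gram_indexes is hypothetical. It is followed by the slices the indexes select.

```python
from typing import Any, Iterator, Sequence, Tuple


def sketch_n_gram_indexes(
    sequence: Sequence[Any], min_n: int, max_n: int
) -> Iterator[Tuple[int, int]]:
    # Hypothetical stand-in for n_gram_indexes, based on the documented
    # contract only; the real implementation may differ in detail.
    if min_n < 1:
        raise ValueError("min_n has to be greater than 0")
    if max_n < min_n:
        raise ValueError("max_n has to be equal to or greater than min_n")
    length = len(sequence)
    # A max_n larger than the sequence is capped at the sequence length.
    max_n = min(max_n, length)
    # Largest n-gram size first, then left to right within each size.
    for size in range(max_n, min_n - 1, -1):
        for start in range(length - size + 1):
            yield (start, start + size)


tokens = ['hello', 'how', 'are', 'you', ',']
indexes = list(sketch_n_gram_indexes(tokens, 2, 3))
assert indexes[:3] == [(0, 3), (1, 4), (2, 5)]

# Slicing with a pair recovers the tokens that the pair refers to.
spans = [tokens[start:end] for start, end in indexes]
assert spans[0] == ['hello', 'how', 'are']
assert spans[-1] == ['you', ',']
```

Working with indexes rather than the n-grams themselves keeps the position of each n-gram available, which is useful when a match has to be mapped back onto the original tokens.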
n_grams
def n_grams(
    sequence: Sequence[Any],
    min_n: int,
    max_n: int
) -> Iterator[Sequence[Any]]
Returns n-grams, in the range from max_n to min_n, of the sequence in
order of largest n-grams first. If you only want one n-gram size then set
min_n equal to max_n, for example to get bi-grams set both min_n and
max_n to 2.
Parameters
- sequence : Sequence[Any]
  The sequence to generate n-grams from.
- min_n : int
  Minimum size n-gram. Has to be greater than 0.
- max_n : int
  Maximum size n-gram. This has to be equal to or greater than min_n. If this is greater than the length of the sequence then it is set to the length of the sequence.
Returns
Iterator[Sequence[Any]]
Raises
ValueError
If min_n is less than 1 or max_n is less than min_n.
Examples
from pymusas.taggers.rules.util import n_grams
tokens = ['hello', 'how', 'are', 'you', ',']
token_n_grams = n_grams(tokens, 2, 3)
expected_n_grams = [['hello', 'how', 'are'], ['how', 'are', 'you'], ['are', 'you', ','],
                    ['hello', 'how'], ['how', 'are'], ['are', 'you'], ['you', ',']]
assert expected_n_grams == list(token_n_grams)
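The single-size case and the capping of max_n described in the parameter notes can be illustrated without installing the library. The following is a minimal sketch that assumes only the documented contract; sketch_n_grams is a hypothetical name and not the PyMUSAS implementation.

```python
from typing import Any, Iterator, Sequence


def sketch_n_grams(
    sequence: Sequence[Any], min_n: int, max_n: int
) -> Iterator[Sequence[Any]]:
    # Hypothetical stand-in for n_grams, based on the documented
    # contract only; the real implementation may differ in detail.
    if min_n < 1 or max_n < min_n:
        raise ValueError("require 1 <= min_n <= max_n")
    length = len(sequence)
    # A max_n larger than the sequence is capped at the sequence length.
    max_n = min(max_n, length)
    for size in range(max_n, min_n - 1, -1):
        for start in range(length - size + 1):
            yield sequence[start:start + size]


tokens = ['hello', 'how', 'are', 'you', ',']

# One size only: min_n == max_n == 2 gives just the bi-grams.
bi_grams = list(sketch_n_grams(tokens, 2, 2))
assert bi_grams == [['hello', 'how'], ['how', 'are'], ['are', 'you'], ['you', ',']]

# A max_n of 10 exceeds the 5 tokens, so it is capped at 5.
capped = list(sketch_n_grams(tokens, 4, 10))
assert capped == [tokens, ['hello', 'how', 'are', 'you'], ['how', 'are', 'you', ',']]

# An invalid range is rejected. Because this sketch is a generator, the
# error surfaces when iteration starts rather than when it is called.
try:
    list(sketch_n_grams(tokens, 0, 2))
except ValueError as error:
    print(error)
```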