utils

usas_validator.utils.parse_usas_token_group(usas_tag_group_text, strict=False)[source]

Converts a string of whitespace-separated USAS tags into a structured format.

Whitespace-separated USAS tags are the format produced by the original C version of the USAS tagger when it outputs USAS tags for a given token or other meaningful word unit, such as a Multi Word Expression (MWE).

The whitespace separation can be one or more spaces, e.g. a single space or two consecutive spaces.

A USAS tag can also be PUNCT, which represents punctuation. It can also be Df, either on its own or with an affix such as +++ or mf.
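
As a rough illustration of how affixes relate to fields such as number_positive_markers, female and male, the sketch below separates a single tag into its base tag and a few flags. It is not the package's implementation: describe_affixes is a hypothetical helper, the base-tag pattern is an assumption based on the regex quoted under Raises, and only the +, -, m and f affixes are handled.

```python
import re

# Assumed base-tag pattern: a letter plus dotted numbers, or Df, or PUNCT.
_BASE_TAG = re.compile(r"[A-Z]\d+(?:\.\d+)*|Df|PUNCT")


def describe_affixes(tag_text: str) -> dict:
    """Hypothetical helper: split one tag into a base tag and affix flags."""
    match = _BASE_TAG.match(tag_text)
    if match is None:
        raise ValueError(f"No base tag found in: {tag_text!r}")
    # Everything after the base tag is treated as the affix.
    affix = tag_text[match.end():]
    return {
        "tag": match.group(0),
        "number_positive_markers": affix.count("+"),
        "number_negative_markers": affix.count("-"),
        "female": "f" in affix,
        "male": "m" in affix,
    }


print(describe_affixes("S2mf"))
print(describe_affixes("Df+++"))
```

Because the flags are read only from the text after the base tag, the f in Df is not mistaken for a female marker.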

Complex examples of usas_tag_group_text: L1 E3- O4.2- X5.2+ A6.2- A1.7- A7- W3 L2 F1 S1.2.4- Z2 Z2/S2mf Z3 O4.3 G1.2 G1.2/S2mf
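
The two levels of separation described above (whitespace between tag groups, / between tags within a group) can be sketched with plain string splitting. This is a minimal sketch only: split_tag_groups is a hypothetical name, and it performs no validation or affix handling.

```python
def split_tag_groups(usas_tag_group_text: str) -> list[list[str]]:
    """Hypothetical helper: split text into tag groups, then into tags."""
    # str.split() with no argument splits on any run of whitespace,
    # so one space and several spaces are treated the same way.
    return [group.split("/") for group in usas_tag_group_text.split()]


groups = split_tag_groups("Z2  Z2/S2mf Z3")
print(groups)  # [['Z2'], ['Z2', 'S2mf'], ['Z3']]
```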

Parameters:
  • usas_tag_group_text (str) – The string that represents the USAS tags produced by the USAS tagger for one token.

  • strict (bool, default: False) – If True, the function raises an error when any USAS tag within the given text cannot be parsed as a USAS tag (see ValueError below).

Returns:

Structured format of the USAS tags that can be parsed from the given text. When strict is False, any text that cannot be parsed as a USAS tag is ignored, which can result in an empty list being returned.

Return type:

list[USASTagGroup]

Raises:

ValueError – If strict is True and a USAS tag within the given text cannot be parsed as a USAS tag. After splitting on whitespace and /, each USAS tag should match one of the following: the regex [A-Z](\d+)((\.\d+)+)?, Df, or PUNCT.

Examples

>>> from usas_validator.utils import parse_usas_token_group
>>> usas_token_groups = parse_usas_token_group("Z2/S2mf Z3")
>>> for usas_token_group in usas_token_groups:
...     print(usas_token_group)
tags=[USASTag(tag='Z2', number_positive_markers=0, number_negative_markers=0, rarity_marker_1=False, rarity_marker_2=False, female=False, male=False, antecedents=False, neuter=False, idiom=False), USASTag(tag='S2', number_positive_markers=0, number_negative_markers=0, rarity_marker_1=False, rarity_marker_2=False, female=True, male=True, antecedents=False, neuter=False, idiom=False)]
tags=[USASTag(tag='Z3', number_positive_markers=0, number_negative_markers=0, rarity_marker_1=False, rarity_marker_2=False, female=False, male=False, antecedents=False, neuter=False, idiom=False)]

When using strict=True:

>>> from usas_validator.utils import parse_usas_token_group
>>> parse_usas_token_group("Invalid", strict=True)
ValueError: Cannot find the tag for this USAS tag text: Invalid

When using strict=False (the default), invalid USAS tags within the text you are parsing are ignored. In the example below, Z1 and Z2 are parsed successfully while NONE is ignored:

>>> from usas_validator.utils import parse_usas_token_group
>>> parse_usas_token_group("Z1/NONE Z2", strict=False)
[USASTagGroup(tags=[USASTag(tag='Z1', number_positive_markers=0, number_negative_markers=0, rarity_marker_1=False, rarity_marker_2=False, female=False, male=False, antecedents=False, neuter=False, idiom=False)]), USASTagGroup(tags=[USASTag(tag='Z2', number_positive_markers=0, number_negative_markers=0, rarity_marker_1=False, rarity_marker_2=False, female=False, male=False, antecedents=False, neuter=False, idiom=False)])]
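
The difference between strict and non-strict handling can be approximated with the standard library alone. The sketch below is an assumption-laden illustration, not the package's code: keep_parsable_tags is a hypothetical name, it assumes a tag is accepted when it starts with the documented pattern, it assumes groups left with no parsable tags are dropped, and its error message is made up.

```python
import re

# Assumed check: the tag must start with a letter plus dotted numbers,
# or with Df, or with PUNCT. Any affix after that is left untouched.
_TAG_START = re.compile(r"[A-Z]\d+(?:\.\d+)*|Df|PUNCT")


def keep_parsable_tags(
    usas_tag_group_text: str, strict: bool = False
) -> list[list[str]]:
    """Hypothetical helper: keep parsable tags, raise or skip the rest."""
    groups = []
    for group_text in usas_tag_group_text.split():
        tags = []
        for tag_text in group_text.split("/"):
            if _TAG_START.match(tag_text):
                tags.append(tag_text)
            elif strict:
                raise ValueError(f"Cannot parse USAS tag text: {tag_text}")
            # In non-strict mode the unparsable text is skipped.
        if tags:
            groups.append(tags)
    return groups


print(keep_parsable_tags("Z1/NONE Z2"))  # [['Z1'], ['Z2']]
```

Calling keep_parsable_tags("Invalid", strict=True) raises ValueError in this sketch, while the same call with strict=False returns an empty list.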

usas_validator.utils.load_usas_mapper(usas_tag_descriptions_file, tags_to_filter_out)[source]

Returns a dictionary of USAS tags and their descriptions.

Parameters:
  • usas_tag_descriptions_file (Path | None) – The path to the YAML file that contains the USAS tags and their descriptions. If None then the function will use the USAS tags and description file that is located within the package at usas_csv_auto_labeling/data/usas/usas_mapper.yaml.

  • tags_to_filter_out (set[str] | None) – A set of USAS tags to filter out.

Returns:

A dictionary of USAS tags and their descriptions.

Return type:

dict[str, str]

Raises:
  • FileNotFoundError – If the usas_tag_descriptions_file is not found.

  • ValueError – If the usas_tag_descriptions_file is not a file.

Examples

>>> from usas_validator.utils import load_usas_mapper
>>> usas_tag_descriptions = load_usas_mapper(None, None)
>>> usas_tag_descriptions["X1"]
title: General description: General terms relating to psychological actions, states and processes
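
The documented error conditions and the tag filtering can be sketched without the package or a YAML parser. Both helpers below are hypothetical, the order of the path checks is an assumption, and the descriptions are placeholders rather than the packaged data.

```python
import tempfile
from pathlib import Path
from typing import Optional


def validate_descriptions_path(path: Path) -> None:
    """Hypothetical helper mirroring the documented error conditions."""
    if not path.exists():
        raise FileNotFoundError(f"File not found: {path}")
    if not path.is_file():
        raise ValueError(f"Not a file: {path}")


def filter_tags(
    descriptions: dict[str, str], tags_to_filter_out: Optional[set[str]]
) -> dict[str, str]:
    """Hypothetical helper: drop the tags listed in tags_to_filter_out."""
    excluded = tags_to_filter_out or set()
    return {tag: text for tag, text in descriptions.items() if tag not in excluded}


# Placeholder descriptions for illustration only.
example = {"Z1": "placeholder description A", "Z2": "placeholder description B"}
print(filter_tags(example, {"Z2"}))  # {'Z1': 'placeholder description A'}

with tempfile.TemporaryDirectory() as tmp_dir:
    try:
        validate_descriptions_path(Path(tmp_dir))  # a directory, not a file
    except ValueError as error:
        print(f"ValueError: {error}")
```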