citextract.models package
Submodules
citextract.models.refxtract module
RefXtract package.
-
class citextract.models.refxtract.BiRNN(input_size, hidden_size, num_layers=1, num_classes=2, device=None)
Bases: sphinx.ext.autodoc.importer._MockObject
Bidirectional RNN model.
-
forward(x)
Forward-propagate the given input.
Parameters: x (torch.Tensor) – The tensor of size [batch_size, sequence_length, input_size] to forward-propagate.
Returns: The output, which has a shape of [batch_size, sequence_length, num_classes].
Return type: torch.Tensor
-
-
class citextract.models.refxtract.RefXtractPreprocessor(device=None)
Bases: object
Preprocessor class for preprocessing textual data.
-
get_vocab_size()
Compute the size of the vocabulary.
Returns: Size of the vocabulary.
Return type: int
-
map_char(char)
Map a given character to a normalized class representative.
Parameters: char (str) – The character to map.
Returns: The mapped character.
Return type: str
-
mapped_char_to_id(mapped_char)
Map a character to a numerical identifier.
Parameters: mapped_char (str) – The mapped character that should be converted to its numerical representation.
Returns: The numerical representation of the character.
Return type: int
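The two mapping steps and the vocabulary size can be illustrated with a minimal, standalone sketch. The character classes used here (digits collapse to "0", lowercase letters to "a", uppercase letters to "A") and the tiny vocabulary are assumptions chosen for illustration; they are not taken from the library's actual mapping.

```python
# Hypothetical vocabulary of normalized class representatives (an assumption).
VOCAB = ["<unk>", "a", "A", "0", " ", ".", ","]
CHAR_TO_ID = {ch: i for i, ch in enumerate(VOCAB)}


def map_char(char: str) -> str:
    """Map a character to a representative of its (assumed) class."""
    if char.isdigit():
        return "0"
    if char.isalpha():
        return "A" if char.isupper() else "a"
    if char.isspace():
        return " "
    return char  # punctuation and other symbols are kept as-is


def mapped_char_to_id(mapped_char: str) -> int:
    """Look up the numerical identifier; unknown characters map to <unk>."""
    return CHAR_TO_ID.get(mapped_char, CHAR_TO_ID["<unk>"])


def get_vocab_size() -> int:
    """The vocabulary size is the number of distinct identifiers."""
    return len(VOCAB)


# Normalize each character first, then convert it to its identifier.
ids = [mapped_char_to_id(map_char(c)) for c in "Vol. 12"]
```

Normalizing before the lookup keeps the vocabulary small, which in turn bounds the size of the embedding table a model built on top of it would need.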
-
-
class citextract.models.refxtract.RefXtractText(text, idx)
Bases: object
Simple helper class which contains the text and char indices of a given input.
-
class citextract.models.refxtract.RefXtractor(model=None, preprocessor=None, device=None)
Bases: object
RefXtractor class.
-
load(model_uri=None, ignore_cache=False)
Load model parameters from the internet.
Parameters:
- model_uri (str) – The model URI to load from.
- ignore_cache (bool) – When true, all caches are ignored and the model parameters are forcefully downloaded.
Returns: The wrapper itself.
Return type:
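How a cache and the ignore_cache flag might interact can be sketched as follows. Everything in this block is hypothetical: the helper load_bytes, the fetch callback that stands in for the download, the example URI, and the hashed cache file name are assumptions for illustration, not the library's documented behaviour.

```python
import hashlib
import os
import tempfile


def load_bytes(model_uri, fetch, cache_dir, ignore_cache=False):
    """Return the bytes behind model_uri, using a simple file cache."""
    os.makedirs(cache_dir, exist_ok=True)
    # Hash the URI so that the cache file name is filesystem-safe.
    name = hashlib.sha256(model_uri.encode("utf-8")).hexdigest()
    path = os.path.join(cache_dir, name)
    if not ignore_cache and os.path.exists(path):
        with open(path, "rb") as handle:
            return handle.read()
    data = fetch(model_uri)  # forced or first-time download
    with open(path, "wb") as handle:
        handle.write(data)
    return data


calls = []


def fake_fetch(uri):
    """Stand-in for a network download; records every call."""
    calls.append(uri)
    return b"weights"


cache_dir = tempfile.mkdtemp()
uri = "https://example.org/model.bin"  # placeholder URI
load_bytes(uri, fake_fetch, cache_dir)                     # downloads
load_bytes(uri, fake_fetch, cache_dir)                     # served from cache
load_bytes(uri, fake_fetch, cache_dir, ignore_cache=True)  # downloads again
```

In this sketch only the first and third calls reach fake_fetch, which mirrors the description of ignore_cache above.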
-
-
citextract.models.refxtract.build_refxtract_model(preprocessor, embed_size=128, hidden_size=128, device=None)
Build an instance of the RefXtract model.
Parameters:
- preprocessor (RefXtractPreprocessor) – The preprocessor to use.
- embed_size (int) – The number of embedding neurons to use.
- hidden_size (int) – The number of hidden neurons to use.
- device (torch.device) – The device to compute on.
Returns: A RefXtract model instance.
Return type: torch.nn.modules.container.Sequential
-
citextract.models.refxtract.extract_references(text, preprocessor, model)
Extract references from a given text.
Parameters:
- text (str) – The text to extract the references from.
- preprocessor (RefXtractPreprocessor) – The preprocessor to use.
- model (torch.nn.modules.container.Sequential) – The model to use.
Returns: A list containing the found references.
Return type: list
-
citextract.models.refxtract.preprocess_reference_text(text)
Preprocess a PDF text.
Parameters: text (str) – The text (possibly from a converted PDF) to preprocess.
Returns: A tuple consisting of the following elements:
- has_reference_section : A boolean which is true when the text contains the string ‘reference’ (case-insensitive), false otherwise.
- reference_section : A string containing the reference section.
- non_reference_section : A string containing the text which was not in the reference section.
Return type: tuple
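The shape of the returned tuple can be illustrated with a small standalone sketch. The helper name split_reference_section is hypothetical, and two choices in it are assumptions not stated in this documentation: it splits at the last occurrence of "reference", and it returns an empty reference section when the string is absent.

```python
import re


def split_reference_section(text: str):
    """Split text into (has_reference_section, reference_section, rest)."""
    # Case-insensitive search on the original text keeps indices valid.
    matches = list(re.finditer("reference", text, flags=re.IGNORECASE))
    if not matches:
        # Assumption: no match means an empty reference section.
        return False, "", text
    start = matches[-1].start()  # assumption: last occurrence wins
    return True, text[start:], text[:start]


sample = "Intro text.\nReferences\n[1] A. Author. Title."
has_refs, refs, rest = split_reference_section(sample)
```

Splitting at the last occurrence is one way to avoid cutting the text at an earlier, incidental use of the word in the body.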
citextract.models.titlextract module
The TitleXtract model.
-
class citextract.models.titlextract.TitleTagging(input_size, hidden_size, n_layers, n_classes, device)
Bases: sphinx.ext.autodoc.importer._MockObject
TitleTagging model.
-
forward(x)
Forward-propagate the input data.
Parameters: x (torch.Tensor) – The input tensor of size (batch_size, sequence_length, input_size).
Returns: The output tensor of size (batch_size, sequence_length, n_classes).
Return type: torch.Tensor
-
-
class citextract.models.titlextract.TitleXtractPreprocessor(device=None)
Bases: object
TitleXtract preprocessor.
-
map_text_chars(text)
Map text to numerical character representations.
Parameters: text (str) – The text to map.
Returns: The tensor representing the mapped characters.
Return type: torch.Tensor
-
map_text_targets(text, title)
Align and map the targets of a text.
Parameters:
- text (str) – The text to map.
- title (str) – The title (substring of the text) to map.
Returns: A tensor with one element per character of the text; an element is 1 if and only if the corresponding character also belongs to the title, and 0 otherwise.
Return type: torch.Tensor
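The target encoding described above can be sketched in plain Python. This sketch makes two simplifying assumptions: it returns a list rather than a torch.Tensor, and it locates the title by exact substring match, whereas the actual alignment step may be more tolerant. The helper name title_targets is hypothetical.

```python
def title_targets(text: str, title: str) -> list:
    """Return one 0/1 target per character of text (1 = part of the title)."""
    targets = [0] * len(text)
    start = text.find(title)  # assumption: exact substring match
    if title and start != -1:
        for i in range(start, start + len(title)):
            targets[i] = 1
    return targets


text = "Doe, J. Deep Nets. 2019."
targets = title_targets(text, "Deep Nets")
```

A per-character target of this kind has the same length as the input sequence, which matches the per-position output shape described for TitleTagging.forward.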
-
-
class citextract.models.titlextract.TitleXtractor(model=None, preprocessor=None, device=None)
Bases: object
TitleXtractor wrapper class.
-
load(model_uri=None, ignore_cache=False)
Load model parameters from the internet.
Parameters:
- model_uri (str) – The model URI to load from.
- ignore_cache (bool) – When true, all caches are ignored and the model parameters are forcefully downloaded.
Returns: The wrapper itself.
Return type:
-
-
citextract.models.titlextract.build_titlextract_model(preprocessor, embed_size=32, hidden_size=64, device=None)
Build an instance of the TitleXtract model.
Parameters:
- preprocessor (TitleXtractPreprocessor) – The preprocessor to use.
- embed_size (int) – The number of embedding neurons to use.
- hidden_size (int) – The number of hidden neurons to use.
- device (torch.device) – The device to compute on.
Returns: A TitleXtract model instance.
Return type: torch.nn.modules.container.Sequential
Module contents
Model definitions for the CiteXtract project.