citextract.models package

Submodules

citextract.models.refxtract module

RefXtract package.

class citextract.models.refxtract.BiRNN(input_size, hidden_size, num_layers=1, num_classes=2, device=None)

Bases: torch.nn.Module

Bidirectional RNN model.

forward(x)

Forward-propagate the given input.

Parameters:x (torch.Tensor) – The tensor of size [batch_size, sequence_length, input_size] to forward-propagate.
Returns:The output, which has a shape of [batch_size, sequence_length, num_classes].
Return type:torch.Tensor
class citextract.models.refxtract.RefXtractPreprocessor(device=None)

Bases: object

Preprocessor class for preprocessing textual data.

get_vocab_size()

Compute the size of the vocabulary.

Returns:Size of the vocabulary.
Return type:int
map_char(char)

Map a given character to a normalized class representative.

Parameters:char (str) – The char to map.
Returns:The mapped character.
Return type:str
mapped_char_to_id(mapped_char)

Map a character to a numerical identifier.

Parameters:mapped_char (str) – The mapped character that should be converted to its numerical representation.
Returns:The numerical representation of the character.
Return type:int
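
The three preprocessor methods above form a small pipeline: a raw character is collapsed into a class representative, the representative is looked up in a fixed vocabulary, and the vocabulary size determines the model's input dimension. The following is a minimal sketch of that idea, not the citextract implementation. The character classes ("0" for digits, "a" and "A" for letters), the `VOCAB` list, and the reserved unknown index are all assumptions made for illustration.

```python
import string


def map_char(char: str) -> str:
    """Collapse a character into a representative of its class.

    The classes used here are hypothetical; the real preprocessor
    may group characters differently.
    """
    if char.isdigit():
        return "0"
    if char.isalpha():
        return "A" if char.isupper() else "a"
    if char.isspace():
        return " "
    return char  # punctuation and other symbols are kept as-is


# Assumed vocabulary of mapped characters; index 0 is reserved for unknown symbols.
VOCAB = ["<unk>", "0", "a", "A", " "] + list(string.punctuation)
CHAR_TO_ID = {c: i for i, c in enumerate(VOCAB)}


def mapped_char_to_id(mapped_char: str) -> int:
    """Look up the numerical identifier of an already mapped character."""
    return CHAR_TO_ID.get(mapped_char, 0)


def get_vocab_size() -> int:
    """Number of distinct identifiers the model has to embed."""
    return len(VOCAB)


# Map a short string to identifiers, one per character.
ids = [mapped_char_to_id(map_char(c)) for c in "Vol. 12"]
```

Normalizing characters into classes keeps the vocabulary small, so the model sees the shape of a reference (capitals, digits, punctuation) rather than its exact spelling.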
class citextract.models.refxtract.RefXtractText(text, idx)

Bases: object

Simple helper class which contains the text and char indices of a given input.

class citextract.models.refxtract.RefXtractor(model=None, preprocessor=None, device=None)

Bases: object

RefXtractor wrapper class.

load(model_uri=None, ignore_cache=False)

Load model parameters from the internet.

Parameters:
  • model_uri (str) – The model URI to load from.
  • ignore_cache (bool) – When true, all caches are ignored and the model parameters are forcefully downloaded.
Returns:

The wrapper itself.

Return type:

RefXtractor
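
The interaction between `ignore_cache` and the returned wrapper can be sketched as follows. This is an illustration under stated assumptions, not the library's code: `CachedLoader`, `fake_fetch`, the cache layout, and the example.org URI are hypothetical stand-ins for the real download and caching logic.

```python
import tempfile
from pathlib import Path
from typing import Callable, Optional


class CachedLoader:
    """Hypothetical wrapper illustrating cache-aware parameter loading."""

    def __init__(self, cache_dir: Path, fetch: Callable[[str], bytes]):
        self.cache_dir = cache_dir
        self.fetch = fetch  # stand-in for the real download routine
        self.params: Optional[bytes] = None

    def load(self, model_uri: str, ignore_cache: bool = False) -> "CachedLoader":
        cache_file = self.cache_dir / Path(model_uri).name
        # Download when forced to, or when nothing has been cached yet.
        if ignore_cache or not cache_file.exists():
            cache_file.write_bytes(self.fetch(model_uri))
        self.params = cache_file.read_bytes()
        return self  # returning the wrapper allows chained calls


downloads = []


def fake_fetch(uri: str) -> bytes:
    """Record each download instead of touching the network."""
    downloads.append(uri)
    return b"weights"


with tempfile.TemporaryDirectory() as tmp:
    uri = "https://example.org/model.pt"  # hypothetical URI
    loader = CachedLoader(Path(tmp), fake_fetch)
    loader.load(uri).load(uri)            # second call is served from the cache
    first_count = len(downloads)
    loader.load(uri, ignore_cache=True)   # forces a fresh download
    final_count = len(downloads)
```

Because `load` returns the wrapper itself, construction and loading can be written as a single expression.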

citextract.models.refxtract.build_refxtract_model(preprocessor, embed_size=128, hidden_size=128, device=None)

Build an instance of the RefXtract model.

Parameters:
  • preprocessor (RefXtractPreprocessor) – The preprocessor to use.
  • embed_size (int) – The number of embedding neurons to use.
  • hidden_size (int) – The number of hidden neurons to use.
  • device (torch.device) – The device to compute on.
Returns:

A RefXtract model instance.

Return type:

torch.nn.modules.container.Sequential

citextract.models.refxtract.extract_references(text, preprocessor, model)

Extract references from a given text.

Parameters:
  • text (str) – The text to extract the references from.
  • preprocessor (RefXtractPreprocessor) – The preprocessor to use.
  • model (torch.nn.modules.container.Sequential) – The model to use.
Returns:

A list containing the found references.

Return type:

list

citextract.models.refxtract.preprocess_reference_text(text)

Preprocess text extracted from a PDF.

Parameters:text (str) – The text (possibly from a converted PDF) to preprocess.
Returns:A tuple consisting of the following elements:
  • has_reference_section : A boolean which is true when the text contains the string ‘reference’ (not case-sensitive), false otherwise.
  • reference_section : A string containing the reference section.
  • non_reference_section : A string containing the text which was not in the reference section.
Return type:tuple
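
The shape of the returned tuple can be illustrated with a small stand-alone sketch. The documentation does not say where the text is cut, so this sketch assumes a split at the last case-insensitive occurrence of "reference" and an empty reference section when the word is absent. The function name `split_reference_section` is hypothetical.

```python
import re
from typing import Tuple


def split_reference_section(text: str) -> Tuple[bool, str, str]:
    """Return (has_reference_section, reference_section, non_reference_section).

    Assumption: the reference section starts at the last occurrence of
    'reference'; the real function may choose the cut point differently.
    """
    matches = list(re.finditer("reference", text, flags=re.IGNORECASE))
    if not matches:
        return False, "", text
    cut = matches[-1].start()
    return True, text[cut:], text[:cut]


has_refs, ref_section, body = split_reference_section(
    "Intro text.\nReferences\n[1] A. Author. A title. 2019."
)
```

Here `has_refs` is true, `ref_section` begins at the "References" heading, and `body` holds everything before it.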

citextract.models.titlextract module

The TitleXtract model.

class citextract.models.titlextract.TitleTagging(input_size, hidden_size, n_layers, n_classes, device)

Bases: torch.nn.Module

TitleTagging model.

forward(x)

Forward-propagate the input data.

Parameters:x (torch.Tensor) – The input tensor of size (batch_size, sequence_length, input_size).
Returns:The output tensor of size (batch_size, sequence_length, n_classes).
Return type:torch.Tensor
class citextract.models.titlextract.TitleXtractPreprocessor(device=None)

Bases: object

TitleXtract preprocessor.

map_text_chars(text)

Map text to numerical character representations.

Parameters:text (str) – The text to map.
Returns:The tensor representing the mapped characters.
Return type:torch.Tensor
map_text_targets(text, title)

Align and map the targets of a text.

Parameters:
  • text (str) – The text to map.
  • title (str) – The title (substring of the text) to map.
Returns:

A tensor with one element per character of the text. An element is 1 if and only if the corresponding character of the text is also part of the title, and 0 otherwise.

Return type:

torch.Tensor
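
The target encoding described for `map_text_targets` amounts to a per-character binary mask. The sketch below shows the idea with a plain Python list. It assumes the title occurs verbatim in the text, whereas the real method aligns the two and returns a `torch.Tensor`. The helper name `title_mask` is hypothetical.

```python
from typing import List


def title_mask(text: str, title: str) -> List[int]:
    """Return one 0/1 label per character of text; 1 marks characters inside the title."""
    mask = [0] * len(text)
    start = text.find(title)  # assumption: exact substring match
    if title and start != -1:
        mask[start:start + len(title)] = [1] * len(title)
    return mask


mask = title_mask("Doe, J. Deep nets. 2018.", "Deep nets")
```

The resulting list has the same length as the input text, so it lines up one-to-one with the character identifiers fed to the model.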

class citextract.models.titlextract.TitleXtractor(model=None, preprocessor=None, device=None)

Bases: object

TitleXtractor wrapper class.

load(model_uri=None, ignore_cache=False)

Load model parameters from the internet.

Parameters:
  • model_uri (str) – The model URI to load from.
  • ignore_cache (bool) – When true, all caches are ignored and the model parameters are forcefully downloaded.
Returns:

The wrapper itself.

Return type:

TitleXtractor

citextract.models.titlextract.build_titlextract_model(preprocessor, embed_size=32, hidden_size=64, device=None)

Build an instance of the TitleXtract model.

Parameters:
  • preprocessor (TitleXtractPreprocessor) – The preprocessor to use.
  • embed_size (int) – The number of embedding neurons to use.
  • hidden_size (int) – The number of hidden neurons to use.
  • device (torch.device) – The device to compute on.
Returns:

A TitleXtract model instance.

Return type:

torch.nn.modules.container.Sequential

Module contents

Model definitions for the CiteXtract project.