Getting started

Installation

The installation of CiteXtract is done by the following command:

pip install citextract

CiteXtract is automatically tested on Python 3.5 or newer.

Extracting references

In order to extract references, the following code is used:

from citextract.models.refxtract import RefXtractor

refxtractor = RefXtractor().load()
text = """This is a test sentence.\n[1] Jacobs, K. 2019. This is a test title. In Proceedings of Some Journal."""
refs = refxtractor(text)
print(refs)

This might produce the following output:

['[1] Jacobs, K. 2019. This is a test title. In Proceedings of Some Journal.']

In the code, the RefXtract model is initialized and the model parameters are downloaded from the internet. Then, the model is executed on an example text. It returns a list of the found references in the text.

Extracting titles

In order to extract titles from references, the following code is used:

from citextract.models.titlextract import TitleXtractor

titlextractor = TitleXtractor().load()
ref = """[1] Jacobs, K. 2019. This is a test title. In Proceedings of Some Journal."""
title = titlextractor(ref)
print(title)

This might produce the following output:

'This is a test title.'

In the code, the TitleXtract model is initialized and the model parameters are downloaded from the internet. Then, the model is executed on an example text. It returns a string of the found title in the reference.

Converting arXiv PDF to text

In order to get content for the RefXtract model, one can download a PDF from arXiv by using the following code:

from citextract.utils.pdf import convert_pdf_url_to_text

pdf_url = 'https://arxiv.org/pdf/some_file.pdf'
text = convert_pdf_url_to_text(pdf_url)