Zachary Colburn



PubTator is an NCBI product that provides detailed annotations of the abstracts found on PubMed, which makes it a useful research tool. PubTator does provide an API, but an API is inconvenient for high-throughput analyses and requires a reliable internet connection. Querying a local copy of the data is better suited to such analyses. The pubtatordb package makes it easy to set up and start using a local copy of PubTator’s data.


You can install the released version of pubtatordb from CRAN with:

install.packages("pubtatordb")

The version on GitHub can be installed with the install_github function of the devtools package.



Load the package.

# Load the package.
library(pubtatordb)

After loading the package, database setup and querying can be accomplished in four steps.

After manually creating a folder to store the data, define the path to that folder and download the data to that location:

# Download the data.
# Use the full path. The temp directory is used here for illustration only;
# writing to it is not recommended.
download_dir <- tempdir()
download_pt(download_dir)
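
The folder can also be created from R rather than by hand. The following is a minimal base R sketch; the helper name make_data_dir and the folder name "pubtator_data" are hypothetical and are not part of pubtatordb. It creates the folder if it is missing and returns its full path, which is the form the comment above asks for.

```r
# Hypothetical helper (not part of pubtatordb): create a data folder
# under `parent` if it does not exist yet and return its full path.
make_data_dir <- function(parent, name = "pubtator_data") {
  data_dir <- file.path(path.expand(parent), name)
  if (!dir.exists(data_dir)) {
    dir.create(data_dir, recursive = TRUE)
  }
  # normalizePath() resolves the folder to an absolute path.
  normalizePath(data_dir, winslash = "/", mustWork = TRUE)
}

# tempdir() is used here only so the sketch runs anywhere;
# replace it with a permanent location for real use.
download_dir <- make_data_dir(tempdir())
```

The returned path can then be used wherever download_dir appears in the steps that follow.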

After defining the path to the data directory inside the download directory created above, the database can be created with:

# Define the data directory, a subdirectory of the above directory.
pubtator_path <- file.path(download_dir, "PubTator")

# Create the database.
pt_to_sql(
  pubtator_path,
  skip_behavior = FALSE,
  remove_behavior = TRUE
)

If the .gz files from PubTator have already been extracted, their extraction can be skipped with the skip_behavior argument. After their insertion into the database, both the .gz and uncompressed files can be removed using the remove_behavior argument.

A connection can be created to the database using pt_connector. Note that this is a wrapper for the dbConnect function of the DBI package.

# Create a connection to the database.
db_con <- pt_connector(pubtator_path)
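
Because the connection comes from DBI's dbConnect, the usual DBI housekeeping functions should apply to it. The sketch below assumes this and uses an in-memory SQLite database as a stand-in, so it runs without the PubTator data; RSQLite is chosen here only as an example backend.

```r
library(DBI)

# Stand-in for the PubTator database: an empty in-memory SQLite database.
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# The connection remains valid until it is explicitly closed.
is_open <- dbIsValid(con)

# Close the connection once querying is finished.
dbDisconnect(con)
is_closed <- !dbIsValid(con)
```

Under the same assumption, dbDisconnect(db_con) would close the PubTator connection once it is no longer needed.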

Querying the data is accomplished using the pt_select function. The first five rows of the gene table can be selected with:

# Query the data.
pt_select(
  db_con,
  "gene",
  columns = NULL,
  keys = NULL,
  keytype = NULL,
  limit = 5
)

The first five results for PMIDs in which the genes with ENTREZ IDs 7356 or 4199 were mentioned can be selected with:

# Query the data.
pt_select(
  db_con,
  "gene",
  columns = c("PMID", "ENTREZID"),
  keys = c("7356", "4199"),
  keytype = "ENTREZID",
  limit = 5
)
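
For readers who think in SQL, a selection like the one above corresponds roughly to a SELECT with a WHERE ... IN filter and a LIMIT. The sketch below illustrates that idea on a toy in-memory table with made-up PMIDs. It is not a description of how pt_select is implemented, and the table contents are invented for illustration only.

```r
library(DBI)

# Toy stand-in for the gene table; the PMIDs are made up.
con <- dbConnect(RSQLite::SQLite(), ":memory:")
gene <- data.frame(
  PMID = c(101L, 102L, 103L, 104L),
  ENTREZID = c("7356", "4199", "1234", "7356"),
  stringsAsFactors = FALSE
)
dbWriteTable(con, "gene", gene)

# Select two columns, keep rows whose ENTREZID matches either key,
# and return at most five rows. The keys are passed as parameters.
res <- dbGetQuery(
  con,
  "SELECT PMID, ENTREZID FROM gene WHERE ENTREZID IN (?, ?) LIMIT 5",
  params = list("7356", "4199")
)

dbDisconnect(con)
```

Here res holds the three toy rows whose ENTREZID is 7356 or 4199.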

Other tables

PubTator has several datasets. The names of tables in the database can be obtained with:

pt_tables(db_con)

The column names for a particular table can be accessed with:

pt_columns(db_con, "species")
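
Since the connection is a DBI connection, the generic DBI listing functions offer another way to inspect a database. The sketch below shows them on a toy in-memory database. The species table here and its TAXID column are hypothetical placeholders, not the real PubTator schema.

```r
library(DBI)

# Toy database with a single, hypothetical species table.
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(
  con,
  "species",
  data.frame(PMID = 101L, TAXID = "9606", stringsAsFactors = FALSE)
)

# List the tables in the database and the columns of one table.
table_names <- dbListTables(con)
species_columns <- dbListFields(con, "species")

dbDisconnect(con)
```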


The citation information for PubTator can be found on the PubTator website or with:

pubtator_citations()
#> Please cite PubTator in any publications:
#> 1. Wei CH et. al., PubTator: a Web-based text mining tool for assisting Biocuration, Nucleic acids research, 2013, 41 (W1): W518-W522. doi: 10.1093/nar/gkt441
#> 2. Wei CH et. al., Accelerating literature curation with text-mining tools: a case study of using PubTator to curate genes in PubMed abstracts, Database (Oxford), bas041, 2012
#> 3. Wei CH et. al., PubTator: A PubMed-like interactive curation system for document triage and literature curation, in Proceedings of BioCreative 2012 workshop, Washington DC, USA, 145-150, 2012


The views expressed are those of the author(s) and do not reflect the official policy of the Department of the Army, the Department of Defense or the U.S. Government.