genbank module

Description

genbank provides some light wrappers around Biopython functions to download records from the GenBank server. The module can be used from within Python, or through the command line tool pygenbank.

Tutorial

This is a simple tutorial to learn how to use the genbank module.

Setup the environment

import genbank as gb

# Setup your email address for Entrez
gb.Entrez.email = "yourname@youraddress"

Search GenBank and retrieve record ids

Performing a GenBank search is as simple as:

# Perform a GenBank search
mySearch = gb.search(term = "hemocyanin")

The returned value is a Bio.Entrez.Parser.DictionaryElement which contains information about the returned results:

mySearch.keys() # Available information
mySearch["Count"] # Number of entries found
mySearch["QueryTranslation"] # How the query was understood by GenBank
mySearch["IdList"] # List of GenBank identifiers returned

Any query string that you would be using on the GenBank web page can be used as the term:

mySearch = gb.search(term = "hemocyanin AND lito* [ORGN]")
mySearch["Count"]
mySearch = gb.search(term = "citrate synthase AND mus m* [ORGN]

To get more details about the returned entries, you can fetch the record summaries using the previous search result:

summaries = gb.getDocSum(mySearch)

The search results can be used to get summaries of the results and apply some simple filtering on the record id before proceeding to the actual record downloading:

# Get the summaries from the results
summaries = gb.getDocSum(mySearch)

# Extract the id of interest ("Gi" field)
myId = [x["Gi"] for x in summaries if int(x["Length"]) < 10000]
len(myId)

Download the GenBank records

# Download the GenBank records
genbank.downloadRecords(idList = myId, destDir = ".", batchSize = 20)

Functions

digraph G {
rankdir=LR;
subgraph cluster_1 {
node[shape=box,style=filled,fillcolor="#dfaf8f"];
_processOutfmtArg;
_fileLinesToList;
_makeSummaryForCDS;
_checkEmailOption;
_getRecordBatch;
_recordIsWGS;
_downloadBatch;
_processArgsToLogic_extract_CDS;
_parseDocSumXML;
_getProteinHashFromCDS;
_getDocSumXML;
_makeParser_search;
_checkRetmax;
_summarizeRecord;
_downloadWGS;
_makeWGSurl;
_makeParser_extract_CDS;
_processArgsToLogic_search;
node[shape=box,style=filled,fillcolor="#7cb8bb"];
getDocSumFromId;
downloadRecords;
getDocSum;
search;
downloadWGS;
writeDocSums;
node[shape=box,style=filled,fillcolor="#9fc59f"];
_main_search;
_main_extract_CDS;
getDocSum -> _getDocSumXML;
getDocSum -> _parseDocSumXML;
getDocSumFromId -> _getDocSumXML;
getDocSumFromId -> _parseDocSumXML;
downloadRecords -> _downloadBatch;
_processArgsToLogic_extract_CDS -> _processOutfmtArg;
downloadWGS -> _downloadWGS;
downloadWGS -> _makeWGSurl;
downloadWGS -> _recordIsWGS;
_downloadBatch -> downloadWGS;
_downloadBatch -> _getRecordBatch;
_downloadBatch -> _recordIsWGS;
_main_search -> search;
_main_search -> getDocSum;
_main_search -> _makeParser_search;
_main_search -> downloadRecords;
_main_search -> getDocSumFromId;
_main_search -> _fileLinesToList;
_main_search -> _processArgsToLogic_search;
_main_search -> writeDocSums;
_processArgsToLogic_search -> _checkRetmax;
_processArgsToLogic_search -> _checkEmailOption;
_main_extract_CDS -> _processArgsToLogic_extract_CDS;
_main_extract_CDS -> _makeParser_extract_CDS;
_main_extract_CDS -> _summarizeRecord;
_summarizeRecord -> _getProteinHashFromCDS;
_summarizeRecord -> _makeSummaryForCDS;
}
}

_CDSinfo(CDS, outfmt, fmtdictCDS=None, fmtdictRecord=None, parentRecord=None, hashConstructor=None)[source]

Get some information about a GenBank CDS (stored as a Bio.SeqFeature.SeqFeature object of type “CDS”).

Parameters:
  • CDS (Bio.SeqFeature.SeqFeature of type “CDS”) – GenBank CDS
  • outfmt (list of str) – List of information keys
  • fmtdictCDS (dict) – Dictionary mapping the information keys to simple functions to retrieve the corresponding information (for CDS attributes). If None, default is _GB_CDS_FMTDICT.
  • fmtdictRecord (dict) – Dictionary mapping the information keys to simple functions to retrieve the corresponding information (for record attributes). If None, default is _GB_RECORD_FMTDICT.
  • parentRecord (Bio.SeqRecord.SeqRecord) – Parent record, needed to extract nucleotide sequence and other record-related information
  • hashConstructor (function) – Hash algorithm to be used (from the hashlib module)
Returns:

Dictionary containing the information corresponding to each key in outfmt

Return type:

dict

_checkEmailOption(args, stderr=None)[source]

Check that an email option was provided and setup Entrez email, produce a message and exit if not.

Parameters:
  • args (namespace) – Output from parser.parse_args()
  • stderr (file) – Writable stderr stream (if None, use sys.stderr)
Returns:

Nothing, but setup Entrez.email or exit the program with a message to stderr if no email was provided

_checkRetmax(retmax, stderr)[source]
_downloadBatch(idBatch, destDir, downloadFullWGS=False)[source]

Download a batch of GenBank records to a destination directory. You should not call this function directly, but rather use downloadRecords() (which itself calls _downloadBatch()) for your downloads.

_downloadBatch() calls _getRecordBatch() to download data from GenBank, and then takes care of separating individual records and writing them to files.

Parameters:
  • idBatch (list of str) – List of GenBank id
  • destDir (str) – Path to the folder where the records will be saved
  • downloadFullWGS (boolean) – If True, also download the full GenBank files corresponding to GenBank records with WGS trace reference
Returns:

Nothing, but saves the GenBank records in the destination folder specified by destDir.

_downloadWGS(WGSurl)[source]

Download a WGS GenBank file. The output is an uncompressed version of the file.

Parameters:WGSurl (str) – Url to download the gzip file, output from _makeWGSurl()
Returns:Uncompressed GenBank file content. Return None if there was a problem during gzip decompression.
Return type:str
_fileLinesToList(filename)[source]

Simple function to get a list of stripped lines from a file.

Parameters:filename (str) – Path to the file to read
Returns:A list containing all non-empty white-stripped lines from the file
Return type:list of str
_getDocSumXML(searchResult, retmax=None)[source]

Fetch the documents summaries in XML format for the entries from an Entrez.esearch.

Parameters:
  • searchResult (Bio.Entrez.Parser.DictionaryElement) – Object containing the result of an Entrez.esearch, typically the output from search() (or at minimum a dictionary with WebEnv and QueryKey entries)
  • retmax (int) – Maximum number of document summaries to get. If no number is given, uses the RetMax element from searchResult.
Returns:

A string containing the summaries in XML format

Return type:

str

_getFullRecord(searchResult, retmax=1)[source]

Fetch the full GenBank records for the entries from an Entrez.esearch.

Parameters:
  • searchResult (Bio.Entrez.Parser.DictionaryElement) – Object containing the result of an Entrez.esearch, typically the output from search() (or at minimum a dictionary with WebEnv and QueryKey entries).
  • retmax (int) – Maximum number of full records to get. Default is 1, which is a safe approach since individual records can sometimes be very large (e.g. chromosomes).
Returns:

A string containing the full records in XML format

Return type:

str

_getProteinHashFromCDS(CDS, hashConstructor)[source]

Extract the protein sequence from a Bio.SeqFeature.SeqFeature CDS object and determine its hash value.

Parameters:
  • CDS (Bio.SeqFeature.SeqFeature) – CDS of interest
  • hashConstructor (function) – Hash algorithm to be used (from the hashlib module)
Returns:

A tuple containing the protein sequence and the hash value. If there is no translation available in the CDS qualifiers, returns “NA” as the protein sequence. If there are several translation available, join them with commas.

Return type:

(str, str)

_getRecordBatch(idList)[source]

Retrieve the GenBank records for a list of GenBank id (GIs). This is a relatively low-level function that only gets data from GenBank but does not manage batches or write files. You should use the higher level wrapper downloadRecords() for your own downloads.

Parameters:idBatch (list of str) – List of GenBank id
Returns:A string with all the data (can be large if many id are provided)
Return type:str
_main_extract_CDS(args=None, stdout=None, stderr=None, gb_record_fmtdict=None, gb_cds_fmtdict=None)[source]

Main function, used by the command line script “-extract-CDS”. This function sends the arguments to _processArgsToLogic_extract_CDS() to determine which actions must be performed, and then performs the actions.

Parameters:
  • args (namespace) – Namespace with script arguments, parse the command line arguments if None
  • stdout (file) – Writable stdout stream (if None, use sys.stdout)
  • stderr (file) – Writable stderr stream (if None, use sys.stderr)
  • gb_record_fmtdict (dict) – Dictionary mapping outfmt specifiers to functions to extract the corresponding information from GenBank records. If None, default is _GB_RECORD_FMTDICT.
  • gb_cds_fmtdict (dict) – Dictionary mapping outfmt specifiers to functions to extract the corresponding information from GenBank CDS. If None, default is _GB_CDS_FMTDICT.
Returns:

None

Main function, used by the command line script “-search”. This function sends the arguments to _processArgsToLogic_search() to determine which actions must be performed, and then performs the actions.

Parameters:
  • args (namespace) – Namespace with script arguments, parse the command line arguments if None
  • stdout (file) – Writable stdout stream (if None, use sys.stdout)
  • stderr (file) – Writable stderr stream (if None, use sys.stderr)
Returns:

None

_makeParser_extract_CDS()[source]

Build the argument parser for the main script “-extract-CDS”.

Returns:An argument parser object ready to be used to parse the command line arguments
Return type:argparse.ArgumentParser() object

Build the argument parser for the main script “-search”.

Returns:An argument parser object ready to be used to parse the command line arguments
Return type:argparse.ArgumentParser() object
_makeSummaryForCDS(record, CDS, hStr, summaryFormat, getAttrFuncs=None)[source]

Make a summary for one CDS feature object.

Parameters:
  • record (Bio.SeqRecord.SeqRecord) – Parent record
  • CDS (Bio.SeqFeature.SeqFeature) – CDS of interest
  • hStr (str) – Hash string for the protein sequence of the CDS
  • summaryFormat (list of str) – List of attribute descriptors determining the columns of the summary table
  • getAttrFuncs (dict of functions) – Dictionary mapping the attribute descriptors to functions of the form: f(CDS, record, hStr). If None (default), use the module-defined GET_ATTR_FUNCS dictionary.
Returns:

Summary string for this CDS

Return type:

str

_makeWGSurl(WGSline)[source]

Prepare the url to download the GenBank records corresponding to one WGS line. The WGS line is the output from _genbankRecirdIsWGS().

Parameters:WGSline (str) – WGS line from a GenBank record, output from _genbankRecirdIsWGS().
Returns:Url to download the record
Return type:str
_parseDocSumXML(xmlContent)[source]

Parse the documents summaries from xml format into a list of dictionaries.

Parameters:xmlContent (string) – Document summaries in XML format (note: this is a string, not a file name). This is typically the output from _getDocSumXML().
Returns:A list of dictionaries containing the document summaries, or an empty list if no entry was found.
Return type:list of dictionaries
_processArgsToLogic_extract_CDS(args, stdout, stderr, gb_record_fmtdict, gb_cds_fmtdict)[source]

Process the command line arguments and determine the action logic for the _main_extract_CDS() function.

Parameters:
  • args (namespace) – Argument namespace
  • stdout (file) – stdout stream
  • stderr (file) – stderr stream
  • gb_record_fmtdict (dict) – Dictionary mapping outfmt specifiers to functions to extract the corresponding information from GenBank records
  • gb_cds_fmtdict (dict) – Dictionary mapping outfmt specifiers to functions to extract the corresponding information from GenBank CDS
Returns:

The argument namespace given input in args with added flags for actions

Return type:

namespace

Process the command line arguments and determine the action logic for the _main_search() function.

Parameters:
  • args (namespace) – Argument namespace
  • stdout (file) – stdout stream
  • stderr (file) – stderr stream
Returns:

The argument namespace given input in args with added flags for actions

Return type:

namespace

_processOutfmtArg(outfmt, stderr, gb_record_fmtdict, gb_cds_fmtdict)[source]

Check that all outfmt specifiers are allowed and return the splitted outfmt specifiers.

Parameters:
  • outfmt (str) – String from the command line argument --outfmt
  • stderr (file) – stderr stream
  • gb_record_fmtdict (dict) – Dictionary mapping outfmt specifiers to functions to extract the corresponding information from GenBank records
  • gb_cds_fmtdict (dict) – Dictionary mapping outfmt specifiers to functions to extract the corresponding information from GenBank CDS
Returns:

List of outfmt specifiers ready to pass to _CDSinfo()

Return type:

list

_recordInfo(record, outfmt, fmtdict=None)[source]

Get some information about a GenBank record (stored as a Bio.SeqRecord.SeqRecord object).

Parameters:
  • record (Bio.SeqRecord.SeqRecord) – GenBank record
  • outfmt (list of str) – List of information keys
  • fmtdict (dict) – Dictionary mapping the information keys to simple functions to retrieve the corresponding information. If None, use the default _GB_RECORD_FMTDICT
Returns:

List containing the information corresponding to each key in outfmt

Return type:

List

_recordIsWGS(recordStr)[source]

Check if a GenBank record is from a whole genome shotgun project. This is done by searching for the “WGS ” string at the beginning of a line

Parameters:recordStr (str) – Content of a GenBank record
Returns:WGS line or False
Return type:str or False
_summarizeRecord(record, summaryFormat, hashConstructor, existingHashes={})[source]

Produce a tabular summary of all the CDS features present in a GenBank record and a dictionary containing hashes of the unique sequences (hash, protein sequence). If a dictionary of pre-existing hashes is given, update this one. Checks for collisions in the hash dictionary.

Parameters:
  • record (Bio.SeqRecord.SeqRecord) – GenBank record (Biopython object)
  • summaryFormat (list of str) – List of attribute descriptors determining the columns of the summary table
  • hashConstructor (function) – Hash algorithm to be used (from the hashlib module)
  • existingHashes (dict) – Dictionary (k, v) = (hash, protein sequences). This is updated with the CDS hashes from the input record and checked for collisions.
Returns:

A tuple containing the following objects

  • string: tabular summary for all CDS features in the GenBank record, ready to be written to an output stream
  • dictionary (hash, protein sequences): dictionary given in input as existingHashes and updated with the hashes from record. This is the dictionary one can pass to another call to _summarizeRecord() in order to progressively build a complete dictionary of all hashes for several GenBank records.
  • dictionary (hash, protein sequences): dictionary containing only the new hashes not already present in existingHashes. This is useful if one wants to update an output stream with the unique hashes after each call to _summarizeRecord() when processing several records.

Return type:

(str, dict, dict)

downloadRecords(idList, destDir, batchSize, delay=30, forceDownload=False, downloadFullWGS=False)[source]

Download the GenBank records for a list of IDs and save them in a destination folder. This is the function to use to download data from GenBank. It applies a waiting delay between each batch download. Each record is saved as a file with name id + ”.gb”.

Note that a record is not downloaded if a file with the expected name already exists, except if forceDownload is True.

The downloading itself is performed by _downloadBatch() and _getRecordBatch().

some GenBank records do not contain actual sequence data but some reference to a WGS (whole genome shotgun sequencing) project. For those, setting downloadFullWGS to True is necessary to download another GenBank file with the actual sequence data. Note that if a GenBank record was first downloaded without this option, and actually contains a WGS reference, then the forceDownload option must be enabled (or the file must be removed) for the WGS file to be also downloaded in a new call of this function.

Parameters:
  • idList (list of str) – List of GenBank id
  • destDir (str) – Path to the folder where the records will be saved
  • forceDownload (boolean) – Should records for which a destination file already exists be downloaded anyway? (default: False)
  • downloadFullWGS (boolean) – If True, also download the full GenBank files corresponding to GenBank records with WGS trace reference
Returns:

Nothing, but saves the GenBank records in the destination folder specified by destDir.

downloadWGS(gbRecord, destDir)[source]

Download and save the WGS GenBank file corresponding to a GenBank record with a WGS reference

Parameters:
  • gbRecord (str) – Text content of a GenBank record with WGS reference
  • destDir (str) – Path to the directory where to save the GenBank file
Returns:

Nothing, but save the complete GenBank file corresponding to the WGS reference into the specified folder. The file name is the GI number plus “WGS”

getDocSum(searchResult, retmax=None)[source]

Fetch the documents summaries for the entries from an Entrez.esearch.

Parameters:
  • searchResult (Bio.Entrez.Parser.DictionaryElement) – Object containing the result of an Entrez.esearch, typically the output from :func: search (or at minimum a dictionary with WebEnv and QueryKey entries)
  • retmax (int) – Maximum number of document summaries to get. If no number is given, uses the RetMax element from searchResult.
Returns:

A list of dictionaries containing the document summaries

Return type:

list of dictionaries

getDocSumFromId(listId, retmax=None)[source]

Fetch the documents summaries from a list of GenBank identifiers.

Parameters:
  • listId (list of str) – A list of GenBank identifiers
  • retmax (int) – Maximum number of document summaries to get. If None (default), returns all the document summaries.
Returns:

A list of dictionaries containing the document summaries

Return type:

list of dictionaries

search(term, retmax=None)[source]

Search GenBank for a given query string. Perform a Bio.Entrez.esearch on db=”nuccore”, using history.

Parameters:
  • term (str) – Query to submit to GenBank
  • retmax (int) – Maximum number of returned ids
Returns:

The result of the search, with detailed information accessible as if it was a Python dictionary. Keys are:

"Count", "RetMax", "IdList", "TranslationStack", "QueryTranslation",
"TranslationSet", "RetStart", "QueryKey", "WebEnv"

This object can be used with other function of this module to get the actual data (getDocSum() and _getFullRecord())

Return type:

Bio.Entrez.Parser.DictionaryElement

writeDocSums(docsums, handle)[source]

Write the documents summaries into a tabular format to any handle with a write method.

Parameters:
  • docsums (list of dictionaries or None) – A list of dictionaries containing the document summaries, typically the output from parseDocSumXML(). If it is an empty list, the function will not write anything.
  • handle (similar to file handle) – Handle object with a write method (e.g. open file, sys.stdout, StringIO object)
Returns:

Nothing but writes the document summaries in a tabular format to the specified file.