genbank module¶

Description¶

genbank provides some light wrappers around Biopython functions to download records from the GenBank server. The module can be used from within Python, or through the command line tool pygenbank.

Tutorial¶

This is a simple tutorial to learn how to use the genbank module.

Setup the environment¶

import genbank as gb

# Setup your email address for Entrez
gb.Entrez.email = "yourname@youraddress"

Search GenBank and retrieve record ids¶

Performing a GenBank search is as simple as:

# Perform a GenBank search
mySearch = gb.search(term = "hemocyanin")

The returned value is a Bio.Entrez.Parser.DictionaryElement which contains information about the returned results:

mySearch.keys() # Available information
mySearch["Count"] # Number of entries found
mySearch["QueryTranslation"] # How the query was understood by GenBank
mySearch["IdList"] # List of GenBank identifiers returned

Any query string that you would be using on the GenBank web page can be used as the term:

mySearch = gb.search(term = "hemocyanin AND lito* [ORGN]")
mySearch["Count"]
mySearch = gb.search(term = "citrate synthase AND mus m* [ORGN]

To get more details about the returned entries, you can fetch the record summaries using the previous search result:

summaries = gb.getDocSum(mySearch)

The search results can be used to get summaries of the results and apply some simple filtering on the record id before proceeding to the actual record downloading:

# Get the summaries from the results
summaries = gb.getDocSum(mySearch)

# Extract the id of interest ("Gi" field)
myId = [x["Gi"] for x in summaries if int(x["Length"]) < 10000]
len(myId)

Download the GenBank records¶

# Download the GenBank records
genbank.downloadRecords(idList = myId, destDir = ".", batchSize = 20)

Functions¶

$digraph G { rankdir=LR; subgraph cluster_1 { node[shape=box,style=filled,fillcolor="#dfaf8f"]; _processOutfmtArg; _fileLinesToList; _makeSummaryForCDS; _checkEmailOption; _getRecordBatch; _recordIsWGS; _downloadBatch; _processArgsToLogic_extract_CDS; _parseDocSumXML; _getProteinHashFromCDS; _getDocSumXML; _makeParser_search; _checkRetmax; _summarizeRecord; _downloadWGS; _makeWGSurl; _makeParser_extract_CDS; _processArgsToLogic_search; node[shape=box,style=filled,fillcolor="#7cb8bb"]; getDocSumFromId; downloadRecords; getDocSum; search; downloadWGS; writeDocSums; node[shape=box,style=filled,fillcolor="#9fc59f"]; _main_search; _main_extract_CDS; getDocSum -> _getDocSumXML; getDocSum -> _parseDocSumXML; getDocSumFromId -> _getDocSumXML; getDocSumFromId -> _parseDocSumXML; downloadRecords -> _downloadBatch; _processArgsToLogic_extract_CDS -> _processOutfmtArg; downloadWGS -> _downloadWGS; downloadWGS -> _makeWGSurl; downloadWGS -> _recordIsWGS; _downloadBatch -> downloadWGS; _downloadBatch -> _getRecordBatch; _downloadBatch -> _recordIsWGS; _main_search -> search; _main_search -> getDocSum; _main_search -> _makeParser_search; _main_search -> downloadRecords; _main_search -> getDocSumFromId; _main_search -> _fileLinesToList; _main_search -> _processArgsToLogic_search; _main_search -> writeDocSums; _processArgsToLogic_search -> _checkRetmax; _processArgsToLogic_search -> _checkEmailOption; _main_extract_CDS -> _processArgsToLogic_extract_CDS; _main_extract_CDS -> _makeParser_extract_CDS; _main_extract_CDS -> _summarizeRecord; _summarizeRecord -> _getProteinHashFromCDS; _summarizeRecord -> _makeSummaryForCDS; } }$

_CDSinfo(CDS, outfmt, fmtdictCDS=None, fmtdictRecord=None, parentRecord=None, hashConstructor=None)[source]¶

Get some information about a GenBank CDS (stored as a Bio.SeqFeature.SeqFeature object of type “CDS”).

Parameters:	CDS (Bio.SeqFeature.SeqFeature of type “CDS”) – GenBank CDS outfmt (list of str) – List of information keys fmtdictCDS (dict) – Dictionary mapping the information keys to simple functions to retrieve the corresponding information (for CDS attributes). If None, default is _GB_CDS_FMTDICT. fmtdictRecord (dict) – Dictionary mapping the information keys to simple functions to retrieve the corresponding information (for record attributes). If None, default is _GB_RECORD_FMTDICT. parentRecord (Bio.SeqRecord.SeqRecord) – Parent record, needed to extract nucleotide sequence and other record-related information hashConstructor (function) – Hash algorithm to be used (from the `hashlib` module)
Returns:	Dictionary containing the information corresponding to each key in outfmt
Return type:	dict

_checkEmailOption(args, stderr=None)[source]¶

Check that an email option was provided and setup Entrez email, produce a message and exit if not.

Parameters:	args (namespace) – Output from parser.parse_args() stderr (file) – Writable stderr stream (if None, use sys.stderr)
Returns:	Nothing, but setup Entrez.email or exit the program with a message to stderr if no email was provided

_checkRetmax(retmax, stderr)[source]¶

_downloadBatch(idBatch, destDir, downloadFullWGS=False)[source]¶

Download a batch of GenBank records to a destination directory. You should not call this function directly, but rather use downloadRecords() (which itself calls _downloadBatch()) for your downloads.

_downloadBatch() calls _getRecordBatch() to download data from GenBank, and then takes care of separating individual records and writing them to files.

Parameters:	idBatch (list of str) – List of GenBank id destDir (str) – Path to the folder where the records will be saved downloadFullWGS (boolean) – If True, also download the full GenBank files corresponding to GenBank records with WGS trace reference
Returns:	Nothing, but saves the GenBank records in the destination folder specified by destDir.

_downloadWGS(WGSurl)[source]¶

Download a WGS GenBank file. The output is an uncompressed version of the file.

Parameters:	WGSurl (str) – Url to download the gzip file, output from `_makeWGSurl()`
Returns:	Uncompressed GenBank file content. Return None if there was a problem during gzip decompression.
Return type:	str

_fileLinesToList(filename)[source]¶

Simple function to get a list of stripped lines from a file.

Parameters:	filename (str) – Path to the file to read
Returns:	A list containing all non-empty white-stripped lines from the file
Return type:	list of str

_getDocSumXML(searchResult, retmax=None)[source]¶

Fetch the documents summaries in XML format for the entries from an Entrez.esearch.

Parameters:	searchResult (Bio.Entrez.Parser.DictionaryElement) – Object containing the result of an Entrez.esearch, typically the output from `search()` (or at minimum a dictionary with WebEnv and QueryKey entries) retmax (int) – Maximum number of document summaries to get. If no number is given, uses the RetMax element from searchResult.
Returns:	A string containing the summaries in XML format
Return type:	str

_getFullRecord(searchResult, retmax=1)[source]¶

Fetch the full GenBank records for the entries from an Entrez.esearch.

Parameters:	searchResult (Bio.Entrez.Parser.DictionaryElement) – Object containing the result of an Entrez.esearch, typically the output from `search()` (or at minimum a dictionary with WebEnv and QueryKey entries). retmax (int) – Maximum number of full records to get. Default is 1, which is a safe approach since individual records can sometimes be very large (e.g. chromosomes).
Returns:	A string containing the full records in XML format
Return type:	str

_getProteinHashFromCDS(CDS, hashConstructor)[source]¶

Extract the protein sequence from a Bio.SeqFeature.SeqFeature CDS object and determine its hash value.

Parameters:	CDS (Bio.SeqFeature.SeqFeature) – CDS of interest hashConstructor (function) – Hash algorithm to be used (from the `hashlib` module)
Returns:	A tuple containing the protein sequence and the hash value. If there is no translation available in the CDS qualifiers, returns “NA” as the protein sequence. If there are several translation available, join them with commas.
Return type:	(str, str)

_getRecordBatch(idList)[source]¶

Retrieve the GenBank records for a list of GenBank id (GIs). This is a relatively low-level function that only gets data from GenBank but does not manage batches or write files. You should use the higher level wrapper downloadRecords() for your own downloads.

Parameters:	idBatch (list of str) – List of GenBank id
Returns:	A string with all the data (can be large if many id are provided)
Return type:	str

_main_extract_CDS(args=None, stdout=None, stderr=None, gb_record_fmtdict=None, gb_cds_fmtdict=None)[source]¶

Main function, used by the command line script “-extract-CDS”. This function sends the arguments to _processArgsToLogic_extract_CDS() to determine which actions must be performed, and then performs the actions.

Parameters:

Parameters:	args (namespace) – Namespace with script arguments, parse the command line arguments if None stdout (file) – Writable stdout stream (if None, use sys.stdout) stderr (file) – Writable stderr stream (if None, use sys.stderr) gb_record_fmtdict (dict) – Dictionary mapping outfmt specifiers to functions to extract the corresponding information from GenBank records. If None, default is _GB_RECORD_FMTDICT. gb_cds_fmtdict (dict) – Dictionary mapping outfmt specifiers to functions to extract the corresponding information from GenBank CDS. If None, default is _GB_CDS_FMTDICT.
Returns:	None

args (namespace) – Namespace with script arguments, parse the command line arguments if None
stdout (file) – Writable stdout stream (if None, use sys.stdout)
stderr (file) – Writable stderr stream (if None, use sys.stderr)
gb_record_fmtdict (dict) – Dictionary mapping outfmt specifiers to functions to extract the corresponding information from GenBank records. If None, default is _GB_RECORD_FMTDICT.
gb_cds_fmtdict (dict) – Dictionary mapping outfmt specifiers to functions to extract the corresponding information from GenBank CDS. If None, default is _GB_CDS_FMTDICT.

Returns:

None

_main_search(args=None, stdout=None, stderr=None)[source]¶

Main function, used by the command line script “-search”. This function sends the arguments to _processArgsToLogic_search() to determine which actions must be performed, and then performs the actions.

Parameters:	args (namespace) – Namespace with script arguments, parse the command line arguments if None stdout (file) – Writable stdout stream (if None, use sys.stdout) stderr (file) – Writable stderr stream (if None, use sys.stderr)
Returns:	None

_makeParser_extract_CDS()[source]¶

Build the argument parser for the main script “-extract-CDS”.

Returns:	An argument parser object ready to be used to parse the command line arguments
Return type:	argparse.ArgumentParser() object

_makeParser_search()[source]¶

Build the argument parser for the main script “-search”.

Returns:	An argument parser object ready to be used to parse the command line arguments
Return type:	argparse.ArgumentParser() object

_makeSummaryForCDS(record, CDS, hStr, summaryFormat, getAttrFuncs=None)[source]¶

Make a summary for one CDS feature object.

Parameters:	record (Bio.SeqRecord.SeqRecord) – Parent record CDS (Bio.SeqFeature.SeqFeature) – CDS of interest hStr (str) – Hash string for the protein sequence of the CDS summaryFormat (list of str) – List of attribute descriptors determining the columns of the summary table getAttrFuncs (dict of functions) – Dictionary mapping the attribute descriptors to functions of the form: f(CDS, record, hStr). If None (default), use the module-defined GET_ATTR_FUNCS dictionary.
Returns:	Summary string for this CDS
Return type:	str

_makeWGSurl(WGSline)[source]¶

Prepare the url to download the GenBank records corresponding to one WGS line. The WGS line is the output from _genbankRecirdIsWGS().

Parameters:	WGSline (str) – WGS line from a GenBank record, output from `_genbankRecirdIsWGS()`.
Returns:	Url to download the record
Return type:	str

_parseDocSumXML(xmlContent)[source]¶

Parse the documents summaries from xml format into a list of dictionaries.

Parameters:	xmlContent (string) – Document summaries in XML format (note: this is a string, not a file name). This is typically the output from `_getDocSumXML()`.
Returns:	A list of dictionaries containing the document summaries, or an empty list if no entry was found.
Return type:	list of dictionaries

_processArgsToLogic_extract_CDS(args, stdout, stderr, gb_record_fmtdict, gb_cds_fmtdict)[source]¶

Process the command line arguments and determine the action logic for the _main_extract_CDS() function.

Parameters:	args (namespace) – Argument namespace stdout (file) – stdout stream stderr (file) – stderr stream gb_record_fmtdict (dict) – Dictionary mapping outfmt specifiers to functions to extract the corresponding information from GenBank records gb_cds_fmtdict (dict) – Dictionary mapping outfmt specifiers to functions to extract the corresponding information from GenBank CDS
Returns:	The argument namespace given input in args with added flags for actions
Return type:	namespace

_processArgsToLogic_search(args, stdout, stderr)[source]¶

Process the command line arguments and determine the action logic for the _main_search() function.

Parameters:	args (namespace) – Argument namespace stdout (file) – stdout stream stderr (file) – stderr stream
Returns:	The argument namespace given input in args with added flags for actions
Return type:	namespace

_processOutfmtArg(outfmt, stderr, gb_record_fmtdict, gb_cds_fmtdict)[source]¶

Check that all outfmt specifiers are allowed and return the splitted outfmt specifiers.

Parameters:	outfmt (str) – String from the command line argument `--outfmt` stderr (file) – stderr stream gb_record_fmtdict (dict) – Dictionary mapping outfmt specifiers to functions to extract the corresponding information from GenBank records gb_cds_fmtdict (dict) – Dictionary mapping outfmt specifiers to functions to extract the corresponding information from GenBank CDS
Returns:	List of outfmt specifiers ready to pass to `_CDSinfo()`
Return type:	list

_recordInfo(record, outfmt, fmtdict=None)[source]¶

Get some information about a GenBank record (stored as a Bio.SeqRecord.SeqRecord object).

Parameters:	record (Bio.SeqRecord.SeqRecord) – GenBank record outfmt (list of str) – List of information keys fmtdict (dict) – Dictionary mapping the information keys to simple functions to retrieve the corresponding information. If None, use the default _GB_RECORD_FMTDICT
Returns:	List containing the information corresponding to each key in outfmt
Return type:	List

_recordIsWGS(recordStr)[source]¶

Check if a GenBank record is from a whole genome shotgun project. This is done by searching for the “WGS ” string at the beginning of a line

Parameters:	recordStr (str) – Content of a GenBank record
Returns:	WGS line or False
Return type:	str or False

_summarizeRecord(record, summaryFormat, hashConstructor, existingHashes={})[source]¶

Produce a tabular summary of all the CDS features present in a GenBank record and a dictionary containing hashes of the unique sequences (hash, protein sequence). If a dictionary of pre-existing hashes is given, update this one. Checks for collisions in the hash dictionary.

Parameters:

Parameters:	record (Bio.SeqRecord.SeqRecord) – GenBank record (Biopython object) summaryFormat (list of str) – List of attribute descriptors determining the columns of the summary table hashConstructor (function) – Hash algorithm to be used (from the `hashlib` module) existingHashes (dict) – Dictionary (k, v) = (hash, protein sequences). This is updated with the CDS hashes from the input record and checked for collisions.
Returns:	A tuple containing the following objects string: tabular summary for all CDS features in the GenBank record, ready to be written to an output stream dictionary (hash, protein sequences): dictionary given in input as existingHashes and updated with the hashes from record. This is the dictionary one can pass to another call to `_summarizeRecord()` in order to progressively build a complete dictionary of all hashes for several GenBank records. dictionary (hash, protein sequences): dictionary containing only the new hashes not already present in existingHashes. This is useful if one wants to update an output stream with the unique hashes after each call to `_summarizeRecord()` when processing several records.
Return type:	(str, dict, dict)

record (Bio.SeqRecord.SeqRecord) – GenBank record (Biopython object)
summaryFormat (list of str) – List of attribute descriptors determining the columns of the summary table
hashConstructor (function) – Hash algorithm to be used (from the hashlib module)
existingHashes (dict) – Dictionary (k, v) = (hash, protein sequences). This is updated with the CDS hashes from the input record and checked for collisions.

Returns:

A tuple containing the following objects

string: tabular summary for all CDS features in the GenBank record, ready to be written to an output stream
dictionary (hash, protein sequences): dictionary given in input as existingHashes and updated with the hashes from record. This is the dictionary one can pass to another call to _summarizeRecord() in order to progressively build a complete dictionary of all hashes for several GenBank records.
dictionary (hash, protein sequences): dictionary containing only the new hashes not already present in existingHashes. This is useful if one wants to update an output stream with the unique hashes after each call to _summarizeRecord() when processing several records.

Return type:

(str, dict, dict)

downloadRecords(idList, destDir, batchSize, delay=30, forceDownload=False, downloadFullWGS=False)[source]¶

Download the GenBank records for a list of IDs and save them in a destination folder. This is the function to use to download data from GenBank. It applies a waiting delay between each batch download. Each record is saved as a file with name id + ”.gb”.

Note that a record is not downloaded if a file with the expected name already exists, except if forceDownload is True.

The downloading itself is performed by _downloadBatch() and _getRecordBatch().

some GenBank records do not contain actual sequence data but some reference to a WGS (whole genome shotgun sequencing) project. For those, setting downloadFullWGS to True is necessary to download another GenBank file with the actual sequence data. Note that if a GenBank record was first downloaded without this option, and actually contains a WGS reference, then the forceDownload option must be enabled (or the file must be removed) for the WGS file to be also downloaded in a new call of this function.

Parameters:	idList (list of str) – List of GenBank id destDir (str) – Path to the folder where the records will be saved forceDownload (boolean) – Should records for which a destination file already exists be downloaded anyway? (default: False) downloadFullWGS (boolean) – If True, also download the full GenBank files corresponding to GenBank records with WGS trace reference
Returns:	Nothing, but saves the GenBank records in the destination folder specified by destDir.

downloadWGS(gbRecord, destDir)[source]¶

Download and save the WGS GenBank file corresponding to a GenBank record with a WGS reference

Parameters:	gbRecord (str) – Text content of a GenBank record with WGS reference destDir (str) – Path to the directory where to save the GenBank file
Returns:	Nothing, but save the complete GenBank file corresponding to the WGS reference into the specified folder. The file name is the GI number plus “WGS”

getDocSum(searchResult, retmax=None)[source]¶

Fetch the documents summaries for the entries from an Entrez.esearch.

Parameters:	searchResult (Bio.Entrez.Parser.DictionaryElement) – Object containing the result of an Entrez.esearch, typically the output from :func: search (or at minimum a dictionary with WebEnv and QueryKey entries) retmax (int) – Maximum number of document summaries to get. If no number is given, uses the RetMax element from searchResult.
Returns:	A list of dictionaries containing the document summaries
Return type:	list of dictionaries

getDocSumFromId(listId, retmax=None)[source]¶

Fetch the documents summaries from a list of GenBank identifiers.

Parameters:	listId (list of str) – A list of GenBank identifiers retmax (int) – Maximum number of document summaries to get. If None (default), returns all the document summaries.
Returns:	A list of dictionaries containing the document summaries
Return type:	list of dictionaries

search(term, retmax=None)[source]¶

Search GenBank for a given query string. Perform a Bio.Entrez.esearch on db=”nuccore”, using history.

Parameters:

Parameters:	term (str) – Query to submit to GenBank retmax (int) – Maximum number of returned ids
Returns:	The result of the search, with detailed information accessible as if it was a Python dictionary. Keys are: "Count", "RetMax", "IdList", "TranslationStack", "QueryTranslation", "TranslationSet", "RetStart", "QueryKey", "WebEnv" This object can be used with other function of this module to get the actual data (`getDocSum()` and `_getFullRecord()`)
Return type:	Bio.Entrez.Parser.DictionaryElement

term (str) – Query to submit to GenBank
retmax (int) – Maximum number of returned ids

Returns:

The result of the search, with detailed information accessible as if it was a Python dictionary. Keys are:

"Count", "RetMax", "IdList", "TranslationStack", "QueryTranslation",
"TranslationSet", "RetStart", "QueryKey", "WebEnv"

This object can be used with other function of this module to get the actual data (getDocSum() and _getFullRecord())

Return type:

Bio.Entrez.Parser.DictionaryElement

writeDocSums(docsums, handle)[source]¶

Write the documents summaries into a tabular format to any handle with a write method.

Parameters:	docsums (list of dictionaries or None) – A list of dictionaries containing the document summaries, typically the output from `parseDocSumXML()`. If it is an empty list, the function will not write anything. handle (similar to file handle) – Handle object with a write method (e.g. open file, sys.stdout, StringIO object)
Returns:	Nothing but writes the document summaries in a tabular format to the specified file.

genbank module¶

Description¶

Tutorial¶

Setup the environment¶

Search GenBank and retrieve record ids¶

Download the GenBank records¶

Functions¶

Table Of Contents

Search