Command-line scripts¶

pygenbank-search¶

pygenbank-search is a tool to perform searches on GenBank and to retrieve GenBank records, either as document summaries or as full records.

The user has to provide an email address for use of the Entrez resource.

Type pygenbank-search --help for detailed usage.

See http://www.ncbi.nlm.nih.gov/books/NBK49540/ for more details on GenBank search queries.

Examples¶

Please use your own email address as the --email argument.

Search GenBank and retrieve document summaries:

pygenbank-search --query "hemoglobin AND mammal" --retmax 10000 --email "name@address" > mySearch
less -S mySearch

Search GenBank and retrieve full records:

mkdir gbResults # Records will be saved here
pygenbank-search -q "myoglobin AND sperm whale" -e "name@address" -d -o gbResults

Specify a length range in the GenBank query:

pygenbank-search -q "carcinus maenas" -e "name@address" -r 10000 > mySearch
pygenbank-search -q "carcinus maenas AND 1000:100000[SLEN]" -e "name@address" -r 10000 > mySearchLength

Specify a taxon in the GenBank query:

MY_QUERY="complete genome AND staphylococcus aureus [PORGN]"
pygenbank-search -q "$MY_QUERY" -e "name@address" -r 10000 > mySearch

More complex query to get all complete Staphylococcus aureus genomes (up to 10000):

MY_QUERY="complete genome AND staphylococcus aureus [PORGN] AND 1000000:10000000 [SLEN]"
echo "$MY_QUERY"
pygenbank-search -q "$MY_QUERY" -e "name@address" -r 10000 > mySearch
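The queries above are plain Entrez search strings: terms joined with AND, optionally followed by field tags such as [PORGN] (primary organism) and [SLEN] (sequence length range). As a minimal sketch of how such a string could be assembled programmatically, the hypothetical helper build_query below (it is not part of pygenbank) rebuilds the Staphylococcus aureus query:

```python
def build_query(terms, organism=None, min_len=None, max_len=None):
    """Join free-text terms and optional field-tagged filters with AND.

    Hypothetical helper for illustration only; pygenbank-search just
    needs the final string passed to -q/--query.
    """
    parts = [terms]
    if organism is not None:
        # Restrict the search to a primary organism.
        parts.append(organism + "[PORGN]")
    if min_len is not None and max_len is not None:
        # Restrict the search to a sequence length range.
        parts.append("{}:{}[SLEN]".format(min_len, max_len))
    return " AND ".join(parts)


query = build_query("complete genome", organism="staphylococcus aureus",
                    min_len=1000000, max_len=10000000)
print(query)
```

The printed string can then be passed as the -q argument of pygenbank-search.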

pygenbank-extract-CDS¶

pygenbank-extract-CDS is a tool to extract CDS summaries from GenBank records and, if needed, to produce a FASTA file of unique amino-acid sequences (e.g. to prepare a clustering analysis).

Type pygenbank-extract-CDS --help for detailed usage.

Examples¶

Get CDS summaries for all GenBank files in the current directory:
```
pygenbank-extract-CDS *.gb > mySummaries
```
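To illustrate what a FASTA file of unique amino-acid sequences means, here is a minimal sketch that keeps one entry per distinct sequence. The function unique_fasta, the identifiers and the choice to keep the first identifier seen are assumptions made for this example; they do not describe how pygenbank-extract-CDS is implemented.

```python
def unique_fasta(records, width=60):
    """Return FASTA text with one entry per distinct sequence.

    `records` is an iterable of (identifier, sequence) pairs. The first
    identifier seen for a given sequence is kept as its header
    (an arbitrary choice made for this sketch).
    """
    first_seen = {}
    for identifier, sequence in records:
        # setdefault keeps the earliest identifier for each sequence.
        first_seen.setdefault(sequence, identifier)
    lines = []
    for sequence, identifier in first_seen.items():
        lines.append(">" + identifier)
        # Wrap the sequence at `width` characters per line.
        lines.extend(sequence[i:i + width]
                     for i in range(0, len(sequence), width))
    return "".join(line + "\n" for line in lines)


cds = [("cds1", "MKVLA"), ("cds2", "MTTQ"), ("cds3", "MKVLA")]
fasta_text = unique_fasta(cds)
print(fasta_text)
```

Here cds3 duplicates the sequence of cds1, so only two entries are written.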

How to profile command-line script execution¶

cprofilev is a convenient tool to visualize the results of a profiling run of a Python script:

sudo pip install cprofilev

cprofilev can be used to profile the execution of a script this way:

python -m cprofilev myScript.py [args]

The output is visible at http://localhost:4000.

Using cprofilev with the command-line scripts¶

pygenbank-search and pygenbank-extract-CDS are installed as entry points into the genbank.py module, so they cannot be passed directly to the Python interpreter together with the cprofilev module (at least I have not found a way to do it so far).

To work around this, a bit of code at the end of the genbank.py module makes it callable from the Python interpreter. The module can then be called with:

python genbank.py search [args]
python genbank.py extract-CDS [args]

where [args] are passed to the corresponding _main_... functions. For example:

python genbank.py search -q "hemocyanin" -e "name@address" > summaries
python genbank.py extract-CDS --help
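The dispatch code itself is not reproduced here. As a rough sketch of how such a block might look, the example below maps the first argument to a handler and forwards the remaining arguments to it. The names _main_search, _main_extract_CDS and dispatch are placeholders chosen for illustration and are not necessarily those used in genbank.py.

```python
import sys


def _main_search(args):
    """Placeholder for the search entry point (hypothetical)."""
    return ("search", list(args))


def _main_extract_CDS(args):
    """Placeholder for the extract-CDS entry point (hypothetical)."""
    return ("extract-CDS", list(args))


# Map each sub-command name to the function that handles it.
_COMMANDS = {"search": _main_search, "extract-CDS": _main_extract_CDS}


def dispatch(argv):
    """Call the handler named by argv[0] with the remaining arguments."""
    if not argv or argv[0] not in _COMMANDS:
        sys.stderr.write("usage: python genbank.py {search|extract-CDS} [args]\n")
        return None
    return _COMMANDS[argv[0]](argv[1:])


# At the end of a module, this would typically be wired up with:
#     if __name__ == "__main__":
#         dispatch(sys.argv[1:])
print(dispatch(["search", "-q", "hemocyanin"]))
```

Because the module is then an ordinary script, it can be handed to python -m cprofilev like any other.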

Note that the path to genbank.py must be given explicitly; the examples here assume that the commands are run from within the module folder.

To perform a profiling run:

python -m cprofilev genbank.py extract-CDS -u toto.fasta *.gb > summaries

and then visit http://localhost:4000 with a web browser.
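When a web browser is not available, similar information can be printed as plain text with the standard-library cProfile and pstats modules. The sketch below profiles a toy function, count_unique, which is only a hypothetical stand-in for the real workload:

```python
import cProfile
import io
import pstats


def count_unique(sequences):
    """Toy workload standing in for the code to be profiled."""
    return len(set(sequences))


# Collect profiling data only around the call of interest.
profiler = cProfile.Profile()
profiler.enable()
result = count_unique(["MKV", "MKV", "MTT"] * 1000)
profiler.disable()

# Format the five most expensive entries, sorted by cumulative time.
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
stats.sort_stats("cumulative").print_stats(5)
report = buffer.getvalue()
print(report)
```

Sorting by cumulative time lists first the functions in which most of the run was spent, including the time spent in the functions they call.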
