Command-line scripts
====================

pygenbank-search
----------------

``pygenbank-search`` is a tool to perform searches on GenBank and to retrieve
GenBank records, either as document summaries or as full records.

The user has to provide an email address for use of the Entrez resource.

Type ``pygenbank-search --help`` for detailed usage.

See http://www.ncbi.nlm.nih.gov/books/NBK49540/ for more details on GenBank
search queries.

Examples
********

Please use your own email address as the ``--email`` argument.

* Search GenBank and retrieve document summaries::

    pygenbank-search --query "hemoglobin AND mammal" --retmax 10000 --email "name@address" > mySearch
    less -S mySearch
  
* Search GenBank and retrieve full records::

    mkdir gbResults # Records will be saved here
    pygenbank-search -q "myoglobin AND sperm whale" -e "name@address" -d -o gbResults

* Specify a length range in the GenBank query::

    pygenbank-search -q "carcinus maenas" -e "name@address" -r 10000 > mySearch
    pygenbank-search -q "carcinus maenas AND 1000:100000[SLEN]" -e "name@address" -r 10000 > mySearchLength

* Specify a taxon in the GenBank query::

    MY_QUERY="complete genome AND staphylococcus aureus [PORGN]"
    pygenbank-search -q "$MY_QUERY" -e "name@address" -r 10000 > mySearch

* More complex query to get all complete *Staphylococcus aureus* genomes (up to
  10000)::

    MY_QUERY="complete genome AND staphylococcus aureus [PORGN] AND 1000000:10000000 [SLEN]"
    echo "$MY_QUERY"
    pygenbank-search -q "$MY_QUERY" -e "name@address" -r 10000 > mySearch
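
The query strings used in these examples follow one pattern: search terms and
field-tagged filters joined with ``AND``. As a minimal sketch, assuming a
hypothetical helper named ``build_query`` (it is not part of pygenbank), such a
string could be assembled in Python::

    def build_query(terms, organism=None, min_len=None, max_len=None):
        """Join search terms and optional field filters with AND.

        Hypothetical helper for illustration only; not part of pygenbank.
        """
        parts = list(terms)
        if organism is not None:
            # [PORGN] restricts the search to an organism
            parts.append("{} [PORGN]".format(organism))
        if min_len is not None and max_len is not None:
            # [SLEN] restricts the search to a sequence length range
            parts.append("{}:{} [SLEN]".format(min_len, max_len))
        return " AND ".join(parts)

    query = build_query(["complete genome"],
                        organism="staphylococcus aureus",
                        min_len=1000000, max_len=10000000)
    print(query)

The printed string is the same as ``MY_QUERY`` in the last example, and can be
passed to ``pygenbank-search`` with the ``-q`` option.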

pygenbank-extract-CDS
---------------------

``pygenbank-extract-CDS`` is a tool to extract CDS summaries from GenBank
records and, if needed, to produce a FASTA file of unique amino-acid sequences
(e.g. to prepare a clustering analysis).

Type ``pygenbank-extract-CDS --help`` for detailed usage.

Examples
********

* Get CDS summaries for all GenBank files in the current directory::

    pygenbank-extract-CDS *.gb > mySummaries

How to profile command-line script execution
--------------------------------------------

``cprofilev`` is a convenient tool to visualize the results of a profiling run
of a Python script. It can be installed with ``pip``::

  sudo pip install cprofilev

``cprofilev`` can be used to profile the execution of a script this way::

  python -m cprofilev myScript.py [args]

The output is visible at the address ``http://localhost:4000``.
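
When a browser view is not needed, the standard library's ``cProfile`` and
``pstats`` modules can print a plain-text report instead. The following is a
minimal sketch of that alternative, where ``work`` is a hypothetical
placeholder workload and not a pygenbank function::

  import cProfile
  import io
  import pstats

  def work(n):
      # Placeholder workload standing in for a real script's main function
      return sum(i * i for i in range(n))

  # Collect profiling data only while the workload runs
  profiler = cProfile.Profile()
  profiler.enable()
  result = work(100000)
  profiler.disable()

  # Write the five entries with the largest cumulative time to a string
  stream = io.StringIO()
  stats = pstats.Stats(profiler, stream=stream)
  stats.sort_stats("cumulative").print_stats(5)
  report = stream.getvalue()
  print(report)

Sorting by cumulative time puts the functions that dominate the run, including
the time spent in the functions they call, at the top of the report.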

Using ``cprofilev`` with the command-line scripts
*************************************************

``pygenbank-search`` and ``pygenbank-extract-CDS`` use entry points in the
``genbank.py`` module, so they cannot be called directly with the Python
interpreter to use the ``cprofilev`` module at the same time (at least, no way
to do so has been found yet).

To work around this, a short piece of code at the end of the ``genbank.py``
module makes it callable from the Python interpreter. The module can then be
called with::

  python genbank.py search [args]
  python genbank.py extract-CDS [args]

where ``[args]`` are passed to the corresponding ``_main_...`` functions. For
example::

  python genbank.py search -q "hemocyanin" > summaries
  python genbank.py extract-CDS --help

Note that the full path to ``genbank.py`` must be provided; the commands shown
here assume that they are run from within the module folder.
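
The dispatch code itself is not reproduced here. As a rough sketch of how such
a block at the end of a module could work, assuming hypothetical stand-in
functions ``_main_search`` and ``_main_extract_CDS`` and a hypothetical helper
``_dispatch`` (the real names and signatures in ``genbank.py`` may differ)::

  import sys

  def _main_search(args):
      # Hypothetical stand-in for the real search entry point
      print("search called with", args)
      return args

  def _main_extract_CDS(args):
      # Hypothetical stand-in for the real CDS extraction entry point
      print("extract-CDS called with", args)
      return args

  # Map each sub-command name to the function that handles it
  _COMMANDS = {"search": _main_search, "extract-CDS": _main_extract_CDS}

  def _dispatch(argv):
      """Run the sub-command named by argv[0] with the remaining arguments."""
      if not argv or argv[0] not in _COMMANDS:
          print("usage: genbank.py {search|extract-CDS} [args]",
                file=sys.stderr)
          return None
      return _COMMANDS[argv[0]](argv[1:])

  if __name__ == "__main__":
      _dispatch(sys.argv[1:])

With a block like this, ``python genbank.py search -q "hemocyanin"`` would
pass ``["-q", "hemocyanin"]`` to the search function, and the same call also
works under ``python -m cprofilev``.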
  
To perform a profiling run::

  python -m cprofilev genbank.py extract-CDS -u toto.fasta *.gb > summaries

and then visit ``http://localhost:4000`` with a web browser.