P3Utils¶
PATRIC Script Utilities¶
This module contains shared utilities for PATRIC 3 scripts.
Constants¶
These constants define the sort-of ER model for PATRIC.
OBJECTS¶
Mapping from user-friendly names to PATRIC names.
FIELDS¶
Mapping from user-friendly object names to default fields.
IDCOL¶
Mapping from user-friendly object names to ID column names.
DERIVED¶
Mapping from objects to derived fields. For each derived field name we have a list reference consisting of the function name followed by a list of the constituent fields.
Methods¶
data_options¶
my @opts = P3Utils::data_options();
This method returns a list of the Getopt::Long::Descriptive specifications for the common data retrieval options. These options include those from delim_options plus the following.
attr
Names of the fields to return. Multiple field names may be specified by coding the option multiple times or separating the field names with commas. Mutually exclusive with --count.
count
If specified, a count of records found will be returned instead of the records themselves. Mutually exclusive with --attr.
equal
Equality constraints of the form field-name,value. If the field is numeric, the constraint will be an exact match. If the field is a string, the constraint will be a substring match. An asterisk in string values is interpreted as a wild card. Multiple equality constraints may be specified by coding the option multiple times.
lt, le, gt, ge, ne
Inequality constraints of the form field-name,value. Multiple constraints of each type may be specified by coding the option multiple times.
in
Multi-valued equality constraints of the form field-name,value1,value2,…,valueN. The constraint is satisfied if the field value matches any one of the specified constraint values. Multiple constraints may be specified by coding the option multiple times.
required
Specifies the name of a field that must have a value for the record to be included in the output. Multiple fields may be specified by coding the option multiple times.
keyword
Specifies a keyword or phrase (in quotes) that should be included in any field of the output. This performs a text search against entire records.
debug
Display debugging information on STDERR.
col_options¶
my @opts = P3Utils::col_options($batchSize);
This method returns a list of the Getopt::Long::Descriptive specifications for the common column specification options. These options are as follows.
col
Index (1-based) of the column containing the key field. If a non-numeric value is specified, it is presumed to be the value of the header in the desired column. The default is 0, which indicates the last column.
batchSize
Maximum number of lines to read in a batch. The default is 100.
nohead
Input file has no headers.
The method takes as a parameter a default batch size to override the normal default of 100.
delim_options¶
my @options = P3Utils::delim_options();
This method returns a list of options related to delimiter specification for multi-valued fields.
delim
The delimiter to use between object names. The default is ::. Specify tab for tab-delimited output, space for space-delimited output, semi for a semicolon followed by a space, or comma for comma-delimited output. Other values might have unexpected results.
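The named values above amount to a small lookup table. The following is a minimal sketch of that idea, not the P3Utils implementation: the names %DELIM_NAMES and resolve_delim are hypothetical, and passing unrecognized values through unchanged is an assumption.

```perl
use strict;
use warnings;

# Hypothetical lookup table for the named delimiters described above.
my %DELIM_NAMES = (
    tab   => "\t",
    space => ' ',
    semi  => '; ',
    comma => ',',
);

# Translate a --delim value into the string placed between the values of
# a multi-valued field. A missing value falls back to the documented
# default of "::"; unrecognized values are passed through unchanged
# (an assumption of this sketch).
sub resolve_delim {
    my ($spec) = @_;
    $spec = '::' if !defined $spec;
    return exists $DELIM_NAMES{$spec} ? $DELIM_NAMES{$spec} : $spec;
}

print join(resolve_delim('semi'), 'alpha', 'beta'), "\n";
```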
delim¶
my $delim = P3Utils::delim($opt);
Return the delimiter to use between the elements of multi-valued fields.
opt
A Getopt::Long::Descriptive::Opts object containing the delimiter specification.
undelim¶
my $undelim = P3Utils::undelim($opt);
Return the pattern to use to split the elements of multi-valued fields.
opt
A Getopt::Long::Descriptive::Opts object containing the delimiter specification.
get_couplets¶
my $couplets = P3Utils::get_couplets($ih, $colNum, $opt);
Read a chunk of data from a tab-delimited input file and return couplets. A couplet is a 2-tuple consisting of a key column followed by a reference to a list containing all the columns. The maximum number of couplets returned is determined by the batch size. If the input file is empty, an undefined value will be returned.
ih
Open input file handle for the tab-delimited input file.
colNum
Index of the key column.
opt
A Getopt::Long::Descriptive::Opts object containing the batch size specification.
RETURN
Returns a reference to a list of couplets.
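To make the couplet structure concrete, here is a minimal sketch of a batch reader. It is an illustration under stated assumptions rather than the actual method: read_couplets is a hypothetical name, and it takes the batch size directly instead of reading it from the options object.

```perl
use strict;
use warnings;

# Read up to $batchSize lines from a tab-delimited handle and return a
# reference to a list of [key, \@row] couplets, or undef at end-of-file.
# (Hypothetical helper; the documented method gets the batch size from $opt.)
sub read_couplets {
    my ($ih, $colNum, $batchSize) = @_;
    return undef if eof($ih);
    my @couplets;
    while (@couplets < $batchSize && defined(my $line = <$ih>)) {
        $line =~ s/[\r\n]+$//;
        # A limit of -1 keeps trailing empty columns.
        my @row = split /\t/, $line, -1;
        push @couplets, [$row[$colNum], \@row];
    }
    return \@couplets;
}

# Usage: an in-memory file with the key in column 0, read two lines at a time.
my $data = "g1\tEscherichia\ng2\tBacillus\ng3\tVibrio\n";
open(my $ih, '<', \$data) or die "Cannot open in-memory file: $!";
while (my $couplets = read_couplets($ih, 0, 2)) {
    print scalar(@$couplets), " couplet(s), first key $couplets->[0][0]\n";
}
```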
get_col¶
my $column = P3Utils::get_col($ih, $colNum);
Read an entire column of data from a tab-delimited input file.
ih
Open input file handle for the tab-delimited input file, positioned after the headers.
colNum
Index of the key column.
RETURN
Returns a reference to a list of column values.
process_headers¶
my ($outHeaders, $keyCol) = P3Utils::process_headers($ih, $opt, $keyless);
Read the header line from a tab-delimited input, format the output headers and compute the index of the key column.
ih
Open input file handle.
opt
Should be a Getopt::Long::Descriptive::Opts object containing the specifications for the key column or a string containing the key column name. At a minimum, it must support the nohead option.
keyless (optional)
If TRUE, then it is presumed there is no key column.
RETURN
Returns a two-element list consisting of a reference to a list of the header values and the 0-based index of the key column. If there is no key column, the second element of the list will be undefined.
find_column¶
my $keyCol = P3Utils::find_column($col, \@headers, $optional);
Determine the correct (0-based) index of the key column in a file from a column specifier and the headers. The column specifier can be a 1-based index or the name of a header.
col
Incoming column specifier.
headers
Reference to a list of column header names.
optional (optional)
If TRUE, then failure to find the header is not an error.
RETURN
Returns the 0-based index of the key column or undef if the header was not found.
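The conversion from a column specifier to a 0-based index can be sketched as follows. This is an illustration, not the module's code: resolve_column is a hypothetical name, exact header matching is an assumption, and treating 0 as the last column follows the --col default described under col_options.

```perl
use strict;
use warnings;

# Resolve a column specifier to a 0-based index. Numeric specifiers are
# 1-based, with 0 taken to mean the last column; anything else is matched
# exactly against the header names (an assumption of this sketch).
sub resolve_column {
    my ($col, $headers, $optional) = @_;
    if ($col =~ /^\d+$/) {
        return $col == 0 ? $#$headers : $col - 1;
    }
    for my $i (0 .. $#$headers) {
        return $i if $headers->[$i] eq $col;
    }
    die "Column \"$col\" not found in headers.\n" unless $optional;
    return undef;
}

my @headers = qw(genome_id genome_name contigs);
print resolve_column('genome_name', \@headers), "\n";
```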
form_filter¶
my $filterList = P3Utils::form_filter($opt);
Compute the filter list for the specified options.
opt
A Getopt::Long::Descriptive::Opts object containing the command-line options that constrain the query (--equal, --in).
RETURN
Returns a reference to a list of filter specifications for a call to the query method of P3DataAPI.
select_clause¶
my ($selectList, $newHeaders) = P3Utils::select_clause($p3, $object, $opt, $idFlag, \@default);
Determine the list of fields to be returned for the current query. If an --attr option is present, its listed fields are used. Otherwise, a default list is used.
p3
The P3DataAPI object used to access PATRIC.
object
Name of the object being retrieved: genome, feature, protein_family, or genome_drug.
opt
Getopt::Long::Descriptive::Opts object for the command-line options, including the --attr option.
idFlag
If TRUE, then only the ID column will be returned when no attributes are explicitly specified; when attributes are explicitly specified, the ID column will be added if it is not already present.
default
If specified, must be a reference to a list of field names. The named fields will be returned if no --attr option is passed in. This overrides the normal default fields.
RETURN
Returns a two-element list consisting of a reference to a list of the names of the fields to retrieve, and a reference to a list of the proposed headers for the new columns. If the user wants a count, the first element will be undefined, and the second will be a singleton list of count.
clean_value¶
my $cleaned = P3Utils::clean_value($value);
Clean up a value for use in a filter specification.
value
Value to clean up. Cleaning involves removing parentheses, illegal characters, and leading and trailing spaces.
RETURN
Returns a usable version of the incoming value.
get_data¶
my $resultList = P3Utils::get_data($p3, $object, \@filter, \@cols, $fieldName, \@couplets);
Return all of the indicated fields for the indicated entity (object) with the specified constraints. This method is simply a less-general interface to the query method of P3DataAPI that handles standard command-line script options for filtering.
p3
P3DataAPI object for accessing the database.
object
User-friendly name of the PATRIC object whose data is desired (e.g. genome, genome_feature).
filter
Reference to a list of filter clauses for the query.
cols
Reference to a list of the names of the fields to return from the object, or undef if a count is desired.
fieldName (optional)
The name of the field in the specified object that is to be used as the key field. If an all-objects query is desired, then this parameter should be omitted.
couplets (optional)
A reference to a list of 2-tuples, each tuple consisting of a key value followed by a reference to a list of the values from the input row containing that key value.
RETURN
Returns a reference to a list of tuples containing the data returned by PATRIC, each output row appended to the appropriate input row from the couplets.
get_data_batch¶
my $resultList = P3Utils::get_data_batch($p3, $object, \@filter, \@cols, \@couplets, $keyField);
Return all of the indicated fields for the indicated entity (object) with the specified constraints. This version differs from get_data in that the couplet keys are matched to a true key field (the matches are exact).
p3
P3DataAPI object for accessing the database.
object
User-friendly name of the PATRIC object whose data is desired (e.g. genome, feature).
filter
Reference to a list of filter clauses for the query.
cols
Reference to a list of the names of the fields to return from the object, or undef if a count is desired.
couplets
A reference to a list of 2-tuples, each tuple consisting of a key value followed by a reference to a list of the values from the input row containing that key value.
keyField (optional)
The key field to use. If omitted, the object’s ID field is used.
RETURN
Returns a reference to a list of tuples containing the data returned by PATRIC, each output row appended to the appropriate input row from the couplets.
get_data_keyed¶
my $resultList = P3Utils::get_data_keyed($p3, $object, \@filter, \@cols, \@keys, $keyField);
Return all of the indicated fields for the indicated entity (object) with the specified constraints. The query is by key, and the keys are split into batches to prevent PATRIC from overloading.
p3
P3DataAPI object for accessing the database.
object
User-friendly name of the PATRIC object whose data is desired (e.g. genome, feature).
filter
Reference to a list of filter clauses for the query.
cols
Reference to a list of the names of the fields to return from the object, or undef if a count is desired.
keys
A reference to a list of key values.
keyField (optional)
The key field to use. If omitted, the object’s ID field is used.
RETURN
Returns a reference to a list of tuples containing the data returned by PATRIC.
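Splitting the keys into batches, as described for get_data_keyed, might look like the sketch below. The helper name key_batches and the explicit batch-size argument are hypothetical; the batch size the module actually uses is not stated here.

```perl
use strict;
use warnings;

# Split a list of keys into batches of at most $batchSize keys each.
# Returns a list of array references, one per batch.
sub key_batches {
    my ($keys, $batchSize) = @_;
    my @pending = @$keys;    # copy so the caller's list is left untouched
    my @batches;
    while (@pending) {
        push @batches, [splice(@pending, 0, $batchSize)];
    }
    return @batches;
}

# Usage: each batch would become one keyed query against the database.
for my $batch (key_batches([qw(g1 g2 g3 g4 g5)], 2)) {
    print "Query for keys: @$batch\n";
}
```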
script_opts¶
my $opt = P3Utils::script_opts($parmComment, @options);
Process the command-line options for a P3 script. This method automatically handles the --help option.
parmComment
A string indicating the command’s signature for the positional parameters. Used for the help display.
options
A list of options such as are expected by Getopt::Long::Descriptive.
RETURN
Returns the options object. Every command-line option’s value may be retrieved using a method on this object.
If invoked in array context, returns the options object, usage object pair so that the calling code may emit detailed usage messages if needed.
print_cols¶
P3Utils::print_cols(\@cols, %options);
Print a tab-delimited output row.
cols
Reference to a list of the values to appear in the output row.
options
A hash of options, including zero or more of the following.
oh
Open file handle for the output stream. The default is \*STDOUT.
opt
A Getopt::Long::Descriptive::Opts object containing the delimiter option, for computing the delimiter in multi-valued fields.
delim
The delimiter to use in multi-valued fields (overrides opt). The default, if neither this nor opt is specified, is a comma (,).
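As an illustration of how a row with multi-valued fields could be rendered, the sketch below joins each list-valued column with the delimiter and then joins the columns with tabs. The name format_row is hypothetical, and representing multi-valued fields as array references and undefined values as empty strings are assumptions of this sketch.

```perl
use strict;
use warnings;

# Format one output row. Columns that are array references are treated as
# multi-valued fields and joined with $delim (default: comma, as documented
# above); undefined columns become empty strings (an assumption).
sub format_row {
    my ($cols, $delim) = @_;
    $delim = ',' if !defined $delim;
    my @cells;
    for my $col (@$cols) {
        if (ref($col) eq 'ARRAY') {
            push @cells, join($delim, @$col);
        } else {
            push @cells, defined $col ? $col : '';
        }
    }
    return join("\t", @cells);
}

print format_row(['g1', ['roleA', 'roleB'], 42], '::'), "\n";
```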
ih¶
my $ih = P3Utils::ih($opt);
Get the input file handle from the options. If no input file is specified in the options, opens the standard input.
opt
Getopt::Long::Descriptive::Opts object for the current command-line options.
RETURN
Returns an open file handle for the script input.
ih_options¶
my @opt_specs = P3Utils::ih_options();
These are the command-line options for specifying a standard input file.
input
Name of the main input file. If omitted and an input file is required, the standard input is used.
oh¶
my $oh = P3Utils::oh($opt);
Get the output file handle from the options. If no output file is specified in the options, opens the standard output.
opt
Getopt::Long::Descriptive::Opts object for the current command-line options.
RETURN
Returns an open file handle for the script output.
oh_options¶
my @opt_specs = P3Utils::oh_options();
These are the command-line options for specifying a standard output file.
output
Name of the main output file. If omitted and an output file is required, the standard output is used.
match¶
my $flag = P3Utils::match($pattern, $key, %options);
Test a match pattern against a key value and return 1 if there is a match and 0 otherwise.
If the key is numeric, a numeric equality match is performed. If the key is non-numeric, then we have a match if any subsequence of the words in the key is equal to the pattern (case-insensitive).
The goal here is to more or less replicate the SOLR eq operator.
pattern
The pattern to be matched. If undef, then any nonblank key matches.
key
The value against which to match the pattern.
options
Zero or more of the following keys, which modify the match.
exact
If TRUE, then non-numeric matches are exact.
RETURN
Returns 1 if there is a match, else 0.
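The matching rules above can be sketched as follows. This is one reading of the description, not the module's code: word_match is a hypothetical name, "subsequence" is taken to mean a contiguous run of words, words are split on non-word characters, and exact mode is assumed to be plain string equality.

```perl
use strict;
use warnings;
use Scalar::Util qw(looks_like_number);

# Return 1 if $key matches $pattern under the rules described above, else 0.
sub word_match {
    my ($pattern, $key, %options) = @_;
    return 0 if !defined $key;
    # An undefined pattern matches any nonblank key.
    return ($key =~ /\S/ ? 1 : 0) if !defined $pattern;
    # Numeric keys use numeric equality.
    if (looks_like_number($key)) {
        return (looks_like_number($pattern) && $pattern == $key) ? 1 : 0;
    }
    # Exact mode: whole-string equality (an assumption of this sketch).
    return ($pattern eq $key ? 1 : 0) if $options{exact};
    # Otherwise look for the pattern's words as a contiguous run in the key.
    my @pWords = grep { length } split /\W+/, lc($pattern);
    my @kWords = grep { length } split /\W+/, lc($key);
    return 0 if !@pWords || @pWords > @kWords;
    my $wanted = join(' ', @pWords);
    for my $start (0 .. @kWords - @pWords) {
        my $run = join(' ', @kWords[$start .. $start + $#pWords]);
        return 1 if $run eq $wanted;
    }
    return 0;
}

print word_match('escherichia coli', 'Escherichia coli K-12'), "\n";
```

Note that under this reading a partial word such as col does not match a key containing coli, because whole words are compared.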
protein_fasta¶
P3Utils::protein_fasta($p3, $genome, $fileName);
Create a FASTA file for the proteins in a genome.
p3
A P3DataAPI object for downloading from PATRIC.
genome
The ID of the genome whose proteins are desired.
fileName
The name of a file to contain the FASTA data, or an open output file handle to which the data should be written.
find_headers¶
my ($headers, $cols) = P3Utils::find_headers($ih, $fileType => @fields);
Search the headers of the specified input file for the named fields and return the list of headers plus a list of the column indices for the named fields.
ih
Open input file handle, or a reference to a list of headers.
fileType
Name to give the input file in error messages.
fields
A list of field names for the desired columns.
RETURN
Returns a two-element list consisting of (0) a reference to a list of the headers from the input file and (1) a reference to a list of column indices for the desired columns of the input, in order.
get_cols¶
my @values = P3Utils::get_cols($ih, $cols);
This method returns all the values in the specified columns of the next line of the input file, in order. It is meant to be used as a companion to find_headers. A list reference can be used in place of an open file handle, in which case the columns will be used to index into the list.
ih
Open input file handle, or alternatively a list reference.
cols
Reference to a list of column indices.
RETURN
Returns a list containing the fields in the specified columns, in order.
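The pairing of find_headers and get_cols follows a common pattern: locate the column indices once from the header line, then use them to slice each data row. The sketch below shows that pattern with plain lists; locate_fields and pick_cols are hypothetical stand-ins, and exact header matching is an assumption.

```perl
use strict;
use warnings;

# Find the 0-based column index of each requested field in a header list.
# Dies with the file type in the message if a field is missing.
sub locate_fields {
    my ($headers, $fileType, @fields) = @_;
    my %index;
    for my $i (0 .. $#$headers) {
        $index{$headers->[$i]} = $i unless exists $index{$headers->[$i]};
    }
    my @cols;
    for my $field (@fields) {
        die "Field \"$field\" not found in $fileType file.\n"
            unless exists $index{$field};
        push @cols, $index{$field};
    }
    return \@cols;
}

# Return the values of a row at the given column indices, in order.
sub pick_cols {
    my ($row, $cols) = @_;
    return @$row[@$cols];
}

my $cols = locate_fields([qw(genome_id name length)], 'input', 'length', 'genome_id');
print join("\t", pick_cols(['g1', 'E. coli', 4600000], $cols)), "\n";
```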
get_fields¶
my @fields = P3Utils::get_fields($line);
Split a tab-delimited line into fields.
line
Input line to split, or an open file handle from which to get the next line.
RETURN
Returns a list of the fields in the line.
list_object_fields¶
my $fieldList = P3Utils::list_object_fields($p3, $object);
Return the list of field names for an object. The database schema is queried directly.
p3
The P3DataAPI object for accessing PATRIC.
object
The name of the object whose field names are desired.
RETURN
Returns a reference to a list of the field names.
Internal Methods¶
_process_entries¶
P3Utils::_process_entries($p3, $object, \@retList, \@entries, \@row, \@cols, $id, $keyField);
Process the specified results from a PATRIC query and store them in the output list.
p3
The P3DataAPI object for querying derived fields.
object
Name of the object queried.
retList
Reference to a list into which the output rows should be pushed.
entries
Reference to a list of query results from PATRIC.
row
Reference to a list of values to be prefixed to every output row.
cols
Reference to a list of the names of the columns to be put in the output row, or undef if the user wants a count.
id (optional)
Name of an ID field that should not be zero or empty. This is used to filter out invalid records.
keyField (optional)
Name of an ID field whose value should be put at the beginning of every output row.
_execute_query¶
P3Utils::_execute_query($p3, $core, $keyField, $dataField, \@keys, \%retHash, $multi);
Execute a query to get the data values associated with a key. The mapping from keys to data values is added to the specified hash.
p3
The P3DataAPI object for accessing the database.
core
The real name of the table containing the data.
keyField
The real name of the table’s key field.
dataField
The real name of the associated data field.
keys
A reference to a list of the keys whose data values are desired.
multi
If TRUE, then the related field will return multiple values.
retHash
A reference to a hash into which results should be placed.
_apply¶
my $result = _apply($function, @values);
Apply a computational function to values to produce a computed field value.
function
Name of the function.
altName
Pass the input value back unmodified.
concatSemi
Concatenate the sub-values using a semi-colon/space separator.
md5
Compute an MD5 for a DNA or protein sequence.
values
List of the input values.
RETURN
Returns the computed result.
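A name-to-function dispatch of this kind can be sketched with a hash of code references. The function names come from the list above, but everything else is an assumption: apply_function and %FUNCTIONS are hypothetical names, sub-values are assumed to arrive as array references, and the md5 entry hashes the sequence exactly as given, without any normalization the real method may perform.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Hypothetical dispatch table keyed by function name.
my %FUNCTIONS = (
    # Pass the input value back unmodified.
    altName    => sub { return $_[0] },
    # Join all sub-values with a semicolon and a space.
    concatSemi => sub { return join('; ', map { ref($_) eq 'ARRAY' ? @$_ : $_ } @_) },
    # Hex MD5 digest of the sequence as given (no normalization).
    md5        => sub { return md5_hex($_[0]) },
);

# Apply the named function to the input values.
sub apply_function {
    my ($function, @values) = @_;
    my $code = $FUNCTIONS{$function} or die "Unknown function \"$function\".\n";
    return $code->(@values);
}

print apply_function('concatSemi', ['roleA', 'roleB']), "\n";
```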
_ec_parse¶
my @ecNums = P3Utils::_ec_parse($product);
Parse the EC numbers out of the functional assignment string of a feature.
product
The functional assignment string containing the EC numbers.
RETURN
Returns a list of EC numbers.
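For illustration, EC numbers are commonly written in functional assignments in the form (EC 1.1.1.1). The sketch below extracts them with a regular expression; parse_ec_numbers is a hypothetical name, and the exact pattern the module accepts is an assumption.

```perl
use strict;
use warnings;

# Extract EC numbers written as "(EC a.b.c.d)" from a functional assignment.
# Dashes are allowed in place of unspecified levels, e.g. "1.2.1.-".
sub parse_ec_numbers {
    my ($product) = @_;
    return () if !defined $product;
    my %seen;
    # In list context, a /g match with one capture group returns every capture.
    return grep { !$seen{$_}++ }
        $product =~ /\(EC\s+(\d+\.[\d-]+\.[\d-]+\.[\dn-]+)\)/g;
}

my $product = 'Alcohol dehydrogenase (EC 1.1.1.1) / Aldehyde dehydrogenase (EC 1.2.1.-)';
print join(', ', parse_ec_numbers($product)), "\n";
```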
_select_list¶
my $fieldList = _select_list($object, $cols);
Compute the list of fields required to retrieve the specified columns. This includes the specified normal fields plus any derived fields.
object
Name of the object being retrieved.
cols
Reference to a list of field names.
RETURN
Returns a reference to a list of field names to retrieve.
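Given a DERIVED-style map, in which each derived field lists a function name followed by its constituent fields, the expansion could be sketched as below. The map contents (%DERIVED_SKETCH and the field names in it) are invented for illustration, as is the helper name fields_to_retrieve; replacing each derived column with its constituent fields and removing duplicates is this sketch's reading of the description.

```perl
use strict;
use warnings;

# Invented example map: derived field => [function name, constituent fields].
my %DERIVED_SKETCH = (
    feature => {
        roles    => ['concatSemi', 'role_list'],
        seq_hash => ['md5', 'sequence'],
    },
);

# Compute the real fields needed to produce the requested columns.
sub fields_to_retrieve {
    my ($object, $cols) = @_;
    my $derivedH = $DERIVED_SKETCH{$object} || {};
    my (@fields, %seen);
    for my $col (@$cols) {
        my $spec = $derivedH->{$col};
        # Derived columns need their constituent fields; others need themselves.
        my @needed = $spec ? @{$spec}[1 .. $#$spec] : ($col);
        push @fields, grep { !$seen{$_}++ } @needed;
    }
    return \@fields;
}

print join(', ', @{ fields_to_retrieve('feature', [qw(patric_id roles sequence seq_hash)]) }), "\n";
```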