p3-identify-clusters

Identify Clusters in Genomes

p3-identify-clusters.pl [options] clusterFile <features.tbl

Given an input file of features and locations, this script find occurrences of functional clusters.

The cluster file should contain in its last column a list of the clustered identifiers (usually roles or protein family IDs), separated by a double colon delimited (::). The first column should contain a cluster ID of some sort. This is the default for the output of p3-generate-clusters.

The input file must include columns for the genome ID, the identifier used for clustering (again, usually roles or protein family IDs), the sequence ID, and the location. Features that are close together on the chromosome and belong to the same cluster will be output along with the sequence ID, start and end locations, and a cluster ID number.

Parameters

The positional parameter is the name of a tab-delimited file containing the clusters. The clusters must be in the first column, and consist of multiple clustered identifiers (roles or family IDs) separated by item delimiters (::).

The standard input can be overriddn using the options in Input Options. The standard input must be a tab-delimited file containing features. By default, the feature ID should be in a column named patric_id, the location in a column named location, the sequence ID in a column named sequence_id, and the clustered identifier (role or family) should be in the last column. The clustered identifier is considered the key column.

Additional command-line options are those given in Delimiter Options (to specify the delimiters between identifiers) and Column Options plus the following.

  • locCol

Index (1-based) or name of the location column in the input. The default is location.

  • idCol

Index (1-based) or name of the feature ID column in the input. The default is patric_id.

  • seqCol

Index (1-based) or name of the sequence ID column in the input. The default is sequence_id.

  • maxGap

The maximum gap between features for them to be considered part of a cluster. The default is 2000.

  • minItems

The minimum number of features for a group to be considered a cluster. The default is 3.

  • showRoles

If specified, the roles found in the cluster will be displayed in a column of the output.

  • showFids

If specified, the features found in the cluster will be displayed in a column of the output.