The Peptide Database Class
Michael:
This is the main class for keeping track of predicted peak data derived from protein sequence digests. In particular, this class should be able to answer the following queries:
1. Report all peaks within a given mass range
2. Report all peaks within a given retention time range
3. Report all peaks within a given mass and retention time range
4. Report all peaks belonging to a given protein
5. Report all proteins that share a given peptide
6. Report all proteins that share a given peptide within a certain mass and retention time range
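Queries of the form (1-3) can be answered in time O(sqrt(n)+k), where n is the number of peptide peaks and k is the size of the answer, by using a kd-tree index structure that supports orthogonal range queries on the two-dimensional data consisting of m/z and retention time for each peptide peak. (A bound of O(log(n)+k) would require a different structure, such as a range tree with fractional cascading, at the cost of O(n log(n)) space.)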
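The range queries described above can be sketched as follows. This is a minimal illustration, not the class's actual implementation: the names Peak, build_kdtree and range_query are hypothetical, and the tree is built by plain median splitting on alternating axes (m/z, then retention time). Queries of type 1 and 2 are expressed by leaving the other dimension unbounded.

```python
import math
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple

Range = Tuple[float, float]


@dataclass(frozen=True)
class Peak:
    """Hypothetical peak record: m/z, retention time, numerical protein id."""
    mz: float
    rt: float
    protein_id: int


class _Node:
    __slots__ = ("peak", "axis", "left", "right")

    def __init__(self, peak, axis, left, right):
        self.peak, self.axis, self.left, self.right = peak, axis, left, right


def _key(peak: Peak, axis: int) -> float:
    # axis 0 splits on m/z, axis 1 splits on retention time
    return peak.mz if axis == 0 else peak.rt


def build_kdtree(peaks: Sequence[Peak], depth: int = 0) -> Optional[_Node]:
    """Build a 2-d tree by splitting at the median of the current axis."""
    if not peaks:
        return None
    axis = depth % 2
    ordered = sorted(peaks, key=lambda p: _key(p, axis))
    mid = len(ordered) // 2
    return _Node(
        ordered[mid],
        axis,
        build_kdtree(ordered[:mid], depth + 1),
        build_kdtree(ordered[mid + 1:], depth + 1),
    )


def range_query(node: Optional[_Node], mz_range: Range, rt_range: Range,
                out: Optional[List[Peak]] = None) -> List[Peak]:
    """Collect all peaks with mz in mz_range and rt in rt_range (inclusive)."""
    if out is None:
        out = []
    if node is None:
        return out
    p = node.peak
    if mz_range[0] <= p.mz <= mz_range[1] and rt_range[0] <= p.rt <= rt_range[1]:
        out.append(p)
    lo, hi = mz_range if node.axis == 0 else rt_range
    split = _key(p, node.axis)
    if lo <= split:   # left subtree holds keys <= split, so it may overlap
        range_query(node.left, mz_range, rt_range, out)
    if split <= hi:   # right subtree holds keys >= split, so it may overlap
        range_query(node.right, mz_range, rt_range, out)
    return out


ANY: Range = (-math.inf, math.inf)  # unbounded range for queries 1 and 2

demo_peaks = [
    Peak(500.25, 12.0, 1),
    Peak(600.30, 15.5, 1),
    Peak(500.26, 30.0, 2),
    Peak(750.40, 12.1, 3),
]
demo_tree = build_kdtree(demo_peaks)
by_mass = range_query(demo_tree, (500.0, 501.0), ANY)            # query 1
by_rt = range_query(demo_tree, ANY, (10.0, 20.0))                # query 2
by_both = range_query(demo_tree, (500.0, 501.0), (10.0, 20.0))   # query 3
```

Rebuilding the tree from scratch whenever peaks are added is the simplest policy for a database that is filled once from a digest and then only queried; that usage pattern is an assumption of this sketch.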
Queries of the form (4) can be answered in expected time O(k), where k is the size of the answer, by using a hash table that maps each numerical protein identifier to its FASTA-ID and list of peptides.
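A minimal sketch of such a lookup table, assuming a Python dict as the hash table. The names ProteinEntry and ProteinIndex, and the example identifiers and peptide strings, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ProteinEntry:
    """Hypothetical record stored per protein: FASTA-ID plus its peptides."""
    fasta_id: str
    peptides: List[str] = field(default_factory=list)


class ProteinIndex:
    """Maps a numerical protein identifier to its ProteinEntry."""

    def __init__(self) -> None:
        self._entries: Dict[int, ProteinEntry] = {}

    def add_protein(self, protein_id: int, fasta_id: str) -> None:
        self._entries[protein_id] = ProteinEntry(fasta_id)

    def add_peptide(self, protein_id: int, peptide: str) -> None:
        # Raises KeyError if the protein was never registered.
        self._entries[protein_id].peptides.append(peptide)

    def fasta_id(self, protein_id: int) -> str:
        return self._entries[protein_id].fasta_id

    def peptides_of(self, protein_id: int) -> List[str]:
        # Dict lookup takes expected constant time; copying the list is O(k).
        return list(self._entries[protein_id].peptides)


index = ProteinIndex()
index.add_protein(1, "example_fasta_id_1")   # illustrative FASTA-ID
index.add_peptide(1, "PEPTIDEK")
index.add_peptide(1, "SAMPLER")
```

Returning a copy of the peptide list keeps callers from modifying the index by accident; returning the stored list directly would avoid the copy if that matters.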
Queries of the form (5-6) can be answered via the kd-tree as well, using the mass of the peptide. Because of floating-point rounding and limited instrument accuracy, it is useful to allow range queries here too, so that we can find all proteins that have a peptide of some mass m within a given tolerance. We should implement a small wrapper method that makes such queries easier by translating a query for proteins into a range query.
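One possible shape for that wrapper is sketched below. Everything here is an assumption made for illustration: the function name proteins_with_peptide_mass, the absolute (not ppm) tolerance, the representation of a peak as an (m/z, retention time, protein id) tuple, and the linear-scan stand-in that takes the place of the kd-tree so the example runs on its own.

```python
import math
from typing import Callable, Iterable, List, Set, Tuple

PeakTuple = Tuple[float, float, int]   # (m/z, retention time, protein id)
Range = Tuple[float, float]
RangeQuery = Callable[[Range, Range], Iterable[PeakTuple]]

UNBOUNDED: Range = (-math.inf, math.inf)


def proteins_with_peptide_mass(range_query: RangeQuery, mass: float,
                               tolerance: float,
                               rt_range: Range = UNBOUNDED) -> Set[int]:
    """Translate 'proteins with a peptide of mass m +/- tolerance' into a
    range query and return the distinct protein ids of the matching peaks.

    With the default rt_range this is query 5; passing a retention time
    window turns it into query 6.
    """
    if tolerance < 0:
        raise ValueError("tolerance must be non-negative")
    mz_range = (mass - tolerance, mass + tolerance)
    return {protein_id for _mz, _rt, protein_id in range_query(mz_range, rt_range)}


def make_linear_scan(peaks: List[PeakTuple]) -> RangeQuery:
    """Stand-in for the kd-tree: same interface, brute-force search."""
    def query(mz_range: Range, rt_range: Range) -> List[PeakTuple]:
        return [p for p in peaks
                if mz_range[0] <= p[0] <= mz_range[1]
                and rt_range[0] <= p[1] <= rt_range[1]]
    return query


scan = make_linear_scan([
    (500.25, 12.0, 1),
    (500.26, 30.0, 2),
    (750.40, 12.1, 3),
])
shared = proteins_with_peptide_mass(scan, 500.255, 0.01)
shared_in_window = proteins_with_peptide_mass(scan, 500.255, 0.01, (10.0, 20.0))
```

Passing the range query in as a callable keeps the wrapper independent of the index structure, so the same code works whether the backing index is a kd-tree or something else.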
