
Export search results

This utility enables Mascot search results to be exported in a variety of "machine readable" formats. When used interactively, the file format is chosen and customised using a web browser form, displayed by choosing Export Search Results in the format controls of a results report and pressing Format As. In addition, the utility can be executed by scripts, with the options specified on the command line.

Custom XML and CSV

The information contained in these two formats is identical. XML is ideal for importing into a relational database. CSV can be opened in spreadsheets such as Microsoft Excel.

For a Peptide Mass Fingerprint, the result information is structured in a very similar way to a Concise Protein Summary report. For search results that include MS/MS data, you can choose whether to structure the protein list and associated peptide matches in a similar way to a Peptide Summary report or a Protein Family report. To create an export that contains information equivalent to a particular Mascot HTML report, the settings of the format controls must match, plus:

| Type of search | HTML Report | Threshold type | Protein Scoring | Same-sets | Sub-sets | Group proteins |
| --- | --- | --- | --- | --- | --- | --- |
| PMF | Concise Protein Summary | N/A | N/A | checked | 1 | N/A |
| MS/MS | Peptide Summary | Identity | As format controls | checked | As format controls | not checked |
| MS/MS | Protein Family Report | Homology | MudPIT | checked | 1 | checked |

Precise details for individual data items, such as the data type and whether it is optional, can be found in the XML schema. The schema introduced with Mascot 2.1 is mascot_search_results_1.xsd (documentation). The need to add additional data structures for Mascot 2.2, including quantitation results, would have broken this schema, so a new schema has been created: mascot_search_results_2.xsd (documentation). For general XML Schema considerations, see the section further down this page. Documentation was auto-generated using xs3p.

The CSV file contains identical data, organised for display as a spreadsheet. The column headers of tables are the same as the XML element names, but the row headers are plain text words and phrases. (If you need to change the delimiter to something other than a comma, edit export_dat_2.pl and change the value of $delimiter, near the top of the script.)
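As a rough illustration of reading such a file in a script, the sketch below pulls one table out of an exported CSV. It assumes, without confirmation from this page, that a table starts at the first row containing a known column header (pep_seq is used here) and ends at the next blank row. The function name read_table is hypothetical, and the delimiter parameter should match the $delimiter value set in export_dat_2.pl.

```python
import csv
import io


def read_table(lines, marker_column, delimiter=","):
    """Yield one dict per data row of the first table whose header row
    contains marker_column. Assumed layout: header row, data rows, blank row."""
    header = None
    for row in csv.reader(lines, delimiter=delimiter):
        if header is None:
            # Skip plain-text rows until the header row is found.
            if marker_column in row:
                header = row
            continue
        if not any(cell.strip() for cell in row):
            break  # a blank row is assumed to close the table
        # zip() drops any cells beyond the header, e.g. unlabelled extras.
        yield dict(zip(header, row))


# Hypothetical miniature export, for illustration only.
sample = "Search title,demo\n\nprot_desc,pep_seq,pep_score\nalbumin,PEPTIDEK,42\n\n"
rows = list(read_table(io.StringIO(sample), "pep_seq"))
```

Values are returned as strings; converting scores and masses to numbers is left to the caller.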

When quantitation information is exported in CSV format, there are no column headers. For the peptide level information, labels precede values, in-row. Protein level information follows, in-row, after the last peptide match of each hit. For each protein ratio, there are four values following each ratio label: protein ratio, number of peptide ratios used, SD(geo) for the peptide ratios, and an asterisk if the protein ratio is significantly different from unity. If the distribution is not normal, SD(geo) is in square brackets.
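The protein-level layout described above (a ratio label followed by four values) might be decoded along these lines. This is a minimal sketch that assumes the five cells are adjacent, that the fourth value is an empty cell when the ratio is not significant, and that SD(geo) is always present. The names ProteinRatio and parse_protein_ratio are hypothetical.

```python
from typing import NamedTuple


class ProteinRatio(NamedTuple):
    label: str
    ratio: float
    n_peptide_ratios: int
    sd_geo: float
    normal: bool        # False when SD(geo) was in square brackets
    significant: bool   # True when the asterisk flag is present


def parse_protein_ratio(cells):
    """Decode five consecutive cells: label, ratio, count, SD(geo), flag."""
    label, ratio, count, sd, flag = (cell.strip() for cell in cells[:5])
    bracketed = sd.startswith("[") and sd.endswith("]")
    sd_text = sd[1:-1] if bracketed else sd
    return ProteinRatio(
        label=label,
        ratio=float(ratio),
        n_peptide_ratios=int(count),
        sd_geo=float(sd_text),
        normal=not bracketed,
        significant=(flag == "*"),
    )


# Hypothetical cell values, for illustration only.
example = parse_protein_ratio(["L/H", "1.25", "4", "[1.10]", "*"])
```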

Protein quantitation information is only available if peptide quantitation has been selected. In addition to the peptide ratios displayed in the HTML report, the export also includes the intensity values for each component. These values are post-normalisation and post-isotope correction.

Usage

For interactive use, the controls are divided into blocks, with the first block corresponding to the format controls of a results report.

The Optional Search Information block controls which ancillary information is exported. Most of the options are self-explanatory.

The data items in the Header section are

Search title
Timestamp (W3C Date and Time format, e.g. 2005-03-12T08:29:11Z)
User
Email
Report URI (URL or relative path, if executed at command line)
MS data path
Search type
Mascot version
Database
Fasta file
Total sequences
Total residues
Sequences after taxonomy filter
Number of entries searched in error tolerant mode (if applicable)
Number of queries
Warning messages from the search (as required)
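The Timestamp item in the list above can be converted to a timezone-aware value with the Python standard library. The sketch assumes the value always has the exact shape of the example given (seconds precision, trailing Z for UTC). The function name parse_header_timestamp is hypothetical.

```python
from datetime import datetime, timezone


def parse_header_timestamp(text):
    """Parse a timestamp such as 2005-03-12T08:29:11Z into an aware datetime."""
    parsed = datetime.strptime(text, "%Y-%m-%dT%H:%M:%SZ")
    # The trailing Z denotes UTC, so attach that timezone explicitly.
    return parsed.replace(tzinfo=timezone.utc)


stamp = parse_header_timestamp("2005-03-12T08:29:11Z")
```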

For speed and efficiency, leave the checkboxes marked with asterisks under Optional Protein Hit Information unchecked. (See Optional Protein Hit Information for further information on the use of these checkboxes.)

pepXML

pepXML is the interchange format for database search results used in the Institute for Systems Biology Trans-Proteomic Pipeline.

The pepXML format is only applicable to MS/MS search results, and represents "raw" peptide match data. Information is exported for all matches to all queries (MS/MS spectra). For each match, extensive information is provided for the first protein in which the peptide is found and more limited information for all the other proteins. This can make the output file very large.

Precise details for individual data items, such as the data type and whether it is optional, can be found in the XML schema. Schema documentation has been generated by xs3p. For general XML Schema considerations, see the section further down this page.

Usage

For speed and efficiency, leave all the checkboxes under Optional Protein Hit Information unchecked. (See Optional Protein Hit Information for further information on the use of these checkboxes.)

Limitations

Where elements and attributes are required by the schema, but the data is not available from Mascot, zero length strings are output. For example, the base_name, raw_data_type and raw_data attributes of an msms_run_summary element.
The schema includes extensive information for the first protein in which a peptide match is found, even though this may not be the preferred or final assignment.
The amino acid residues that bracket a peptide are only available if the result file is from Mascot 2.1 or later.
The num_matched_ions attribute of the search_hit element is the number of mass values used to score the match, not the total number of mass values that could be matched to all the calculated ion series.
In a search_result element, the start_scan and end_scan attributes are always set to 0.
modification_info elements are only exported for variable modifications, not for fixed.

mzIdentML

mzIdentML is the data exchange standard for database search results developed by the PSI Proteomics Informatics Standards Group. Originally, it was to be called analysisXML.

Precise details for individual data items, such as the data type and whether it is optional, can be found in the XML schema. Schema documentation is also available. For general XML Schema considerations, see the section further down this page. A semantic validator for mzIdentML documents has been developed by Andreas Bertsch as part of the OpenMS project and can be found here.

Usage

For speed and efficiency, leave all the checkboxes under Optional Protein Hit Information unchecked. (See Optional Protein Hit Information for further information on the use of these checkboxes.)

Under Query Level Information, check Matched Fragment Ions to output tables of matching experimental and calculated m/z values for each peptide match. This is obviously time consuming and causes a substantial increase in the size of the output file. Check Export data for all Queries to output details for every MS/MS spectrum, including those that got no match to an exported protein and those that got no match at all. Again, this is time consuming and causes a substantial increase in the size of the output file.

DTASelect

DTASelect is an application that was written by David L. Tabb at The Scripps Research Institute. Originally intended for analysing Sequest results, it groups peptide matches into proteins and allows a variety of filters to be applied. Although DTASelect includes built-in support for Mascot result files, the information in the result file is not fully utilised and the interface is prone to break with new Mascot releases. Choosing DTASelect in this export utility creates a DTASelect intermediate file, DTASelect.txt, containing a more complete picture of the search results. This intermediate file is then read by DTASelect to create filtered reports.

The output file is compatible with DTASelect 1.9 only. DTASelect format is only applicable to MS/MS search results.

Usage

For speed and efficiency, it is advisable to choose MudPIT scoring, an ions score cut-off of 10, and leave all the checkboxes under Optional Protein Hit Information unchecked. (See Optional Protein Hit Information for further information on the use of these checkboxes.) Save the exported file to a directory, make this the current directory, and execute DTASelect.

The DTASelect spectrum filters, which can be supplied on the command line or taken from DTASelect.params, should include the following changes to the defaults:

--Mascot
to set Mascot mode
-1 10.0
to set the minimum ions score for 1+ peptides to 10
-2 10.0
to set the minimum ions score for 2+ peptides to 10
-3 10.0
to set the minimum ions score for 3+ peptides to 10
-d 20
to set the minimum for (1 / expectation value) to 20
-p 1
to set the distinct peptide threshold to 1
--mw 100.0
to set the minimum protein mass to 100
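Because the -d filter works on 1 / expectation value, a minimum of 20 corresponds to a maximum expectation value of 0.05. The arithmetic can be sketched as follows. The function names are hypothetical, and whether DTASelect treats the boundary as inclusive is an assumption.

```python
def max_expect_for_min_signif(min_signif):
    """Largest expectation value that still reaches the given -d minimum."""
    return 1.0 / min_signif


def passes_signif_filter(expect, min_signif=20.0):
    """True if 1 / expect reaches the minimum (inclusive boundary assumed)."""
    return (1.0 / expect) >= min_signif


limit = max_expect_for_min_signif(20.0)
```

With the default of 20, a match with an expectation value of 0.01 passes and one with 0.1 does not.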

In a DTASelect report of Mascot results, the following columns are different from those in a DTASelect report of Sequest results:

Filename
Mascot result filename, query number and precursor charge, separated by periods
IonsScore
Mascot ions score
Signif
1 / expectation value
SpR
Peptide match rank, between 1 (highest) and 10 (lowest)
SpScore
Identity threshold score
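The Filename column described above packs three values into one period-separated string. Since the result filename may itself contain a period, a script can split from the right. This sketch assumes the last two fields are always the query number and an integer charge. The function name split_filename_column and the sample value are hypothetical.

```python
def split_filename_column(value):
    """Split 'resultfile.query.charge' into (result file, query, charge)."""
    # Split from the right so periods inside the result filename survive.
    result_file, query, charge = value.rsplit(".", 2)
    return result_file, int(query), int(charge)


parts = split_filename_column("F004651.dat.17.2")
```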

Limitations

The output file is compatible with DTASelect 1.9 only
Hyperlinks to Sequest utilities will not work
The number of tryptic termini for a peptide is not available
The amino acid residues that bracket a peptide are only available if the result file is from Mascot 2.1 or later. For result files from earlier versions, question marks are displayed
DTASelect reports do not display variable terminus modifications

Mascot DAT File

A convenient way to download a copy of the "raw" Mascot result file. For security reasons, this will only succeed for result files in the daily directories under the Mascot data directory.

MGF Peak List

A convenient way to extract the peak list from a search result file. May be useful when you export an mzIdentML file, because the mzIdentML schema does not support inclusion of the peak list.

Optional Protein Hit Information

Only a limited amount of information about a protein hit is saved to a Mascot result file. For example, the protein sequence is not saved because this would make the result files unacceptably large. When missing information is required for a Mascot report, it has to be retrieved from the compressed database files.

Even though a single call for missing information may take only a fraction of a second, and is not noticeable when loading a Mascot report, this can become a problem if creating an export file requires thousands of calls. It is important to be aware of this, and not waste time retrieving information that is not actually required. This is a particular issue for an export format that represents "raw" result information, like pepXML. A list of all the proteins that contain all the peptides that had any matches to any of the spectra can be an extremely long list.

Description
The Fasta description line is saved for all peptide mass fingerprint protein hits. For an MS/MS search, Mascot tries to guess which protein hits will appear in the reports and saves their Fasta description lines to the result file. However, the actual hit list depends on many factors, and some hits may be missed, requiring the descriptions to be retrieved from the compressed database files.
Protein Mass
The protein mass is saved for all peptide mass fingerprint protein hits. For an MS/MS search, Mascot tries to guess which protein hits will appear in the reports and saves their masses to the result file. However, the actual hit list depends on many factors, and some hits may be missed, requiring the masses to be retrieved from the compressed database files.

On the Matrix Science public web site, the description and mass of a protein can only be exported if this information was saved to the result file. The following protein hit information options are not available on the public web site, and attempting to use them will have no effect.

Percent coverage
Percent coverage is never saved to the result file. It is calculated on the fly from the length and the set of peptides assigned to the protein.
Length in residues
Length in residues is never saved to the result file. It must be retrieved from the compressed database files.
pI
pI is never saved to the result file. The protein sequence must be retrieved from the compressed database files and the pI value calculated.
Taxonomy
Taxonomy is never saved to the result file. It must be retrieved from the compressed database files.
Taxonomy ID
Taxonomy ID is never saved to the result file. It must be retrieved from the compressed database files.
Protein sequence
The entire protein sequence is never saved to the result file. It must be retrieved from the compressed database files.
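To illustrate what calculating percent coverage from the length and the assigned peptides involves, here is a minimal sketch. It is not taken from Mascot itself. It assumes 1-based, inclusive start and end positions (as described for pep_start and pep_end) and counts each covered residue once, however many peptides overlap it. The function name percent_coverage is hypothetical.

```python
def percent_coverage(protein_length, peptide_spans):
    """Percentage of residues covered by at least one (start, end) span.

    Positions are 1-based and inclusive.
    """
    if protein_length <= 0:
        raise ValueError("protein_length must be positive")
    covered = set()
    for start, end in peptide_spans:
        # Overlapping peptides add the same positions only once.
        covered.update(range(start, end + 1))
    return 100.0 * len(covered) / protein_length


coverage = percent_coverage(100, [(1, 10), (5, 20)])
```

The length still has to come from the compressed database files, which is why this field is one of the more expensive ones to export.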

Command Line Execution

Result file conversion can be automated by using the export script as a command line utility. It must be executed in the cgi directory on a Mascot server. The command line arguments are URL-style name=value pairs, for example

export_dat_2.pl file=../data/20120223/F004651.dat do_export=1 export_format=CSV ... pep_scan_title=1 > ../data/20120223/F004651.csv

The Mascot 2.1 script, which exports a customisable XML file conforming to mascot_search_results_1.xsd, is called export_dat.pl. For backward compatibility, this script is still installed with later versions of Mascot, but functionality is frozen. The current script, export_dat_2.pl, creates a customisable XML file conforming to mascot_search_results_2.xsd. This script is the one selected when you choose Export search results from a Mascot result report, and should be used by any new applications.

The easiest way to obtain the command-line arguments for a given output is to use the form based interface to adjust the settings then choose "Show command line arguments". The command line can then be copied and pasted as required. To direct the output to a file, add a > symbol followed by the path to the output file, as in the example above.
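When the export is driven from another program rather than a shell, the same name=value arguments can be assembled and run as sketched below. The helper names are hypothetical. Launching the script through a perl executable on the PATH is an assumption, so adjust it to however the script is started on your server. Only arguments documented on this page are used.

```python
import subprocess


def build_export_command(result_file, export_format, **options):
    """Assemble the argument list for export_dat_2.pl."""
    command = [
        "perl", "export_dat_2.pl",      # assumes perl is on the PATH
        f"file={result_file}",
        "do_export=1",
        f"export_format={export_format}",
    ]
    command += [f"{name}={value}" for name, value in options.items()]
    return command


def run_export(cgi_dir, output_path, command):
    """Run the export inside the cgi directory, saving stdout to a file."""
    # output_path is opened by this process, so a relative path is resolved
    # against the caller's working directory, not against cgi_dir.
    with open(output_path, "wb") as out:
        subprocess.run(command, cwd=cgi_dir, stdout=out, check=True)


command = build_export_command(
    "../data/20120223/F004651.dat", "CSV", pep_scan_title=1
)
```

Passing the arguments as a list avoids shell quoting problems, and writing stdout to a file takes the place of the > redirection.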

Required Arguments

do_export
must be 1 to export results
export_format
XML or CSV or pepXML or DTASelect or MascotDAT or mzIdentML or MGF
file
relative or absolute path to result file

Formatting Arguments

Many of these are specific to certain export formats

_ignoreionsscorebelow
MS/MS ions scores below this value are set to zero, default set in mascot.dat (0)
_mudpit (export_dat.pl only)
number of queries at which protein score switches to MudPIT scoring, default set in mascot.dat (1000)
_server_mudpit_switch (not export_dat.pl)
if queries / entries greater than this value, switch to MudPIT scoring, default set in mascot.dat (0.001)
_requireboldred
1 to report protein hits only if they include at least one bold, red peptide match, default set in mascot.dat (0)
_showallfromerrortolerant
1 to display all hits from an error tolerant search, including garbage, default 0
_onlyerrortolerant (not export_dat.pl)
1 to display only error tolerant matches from an automatic error tolerant search, default 0
_noerrortolerant (not export_dat.pl)
1 to suppress error tolerant matches from an automatic error tolerant search, default 0
show_same_sets
1 to display all proteins that match the same set of peptides, default 0
_showsubsets
display protein hits that are missing up to this fraction of the protein score of the main hit, default set in mascot.dat
_sigthreshold
probability significance threshold, default 0.05
report
max number of hits to be reported, 0 = AUTO, default taken from search parameters
unigene
UniGene index species to be used to cluster hits
_show_decoy_report (not export_dat.pl)
1 to display report for the decoy results in an automatic decoy search
use_homology (not export_dat.pl)
1 to use the homology threshold to calculate expect values, 0 to use the identity threshold, default 0
group_family (not export_dat.pl)
1 to emulate the protein grouping in the Protein Family Report, default 0
percolate (not export_dat.pl)
1 to display scores and expect values based on Percolator PEPs, default set in mascot.dat (0)
percolate_rt (not export_dat.pl)
1 to enable use of retention time by Percolator, default set in mascot.dat (0)
_prefertaxonomy (not export_dat.pl)
1-based integer index into the list of taxonomies in the Mascot taxonomy file. 0 means no preference.
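The _server_mudpit_switch rule in the list above is a simple ratio test, which can be sketched as follows. Exactly which entry count Mascot uses (for example, before or after a taxonomy filter) is not stated here, so num_entries is an assumption, as is the function name.

```python
def uses_mudpit_scoring(num_queries, num_entries, switch=0.001):
    """True if queries / entries exceeds the switch value."""
    return (num_queries / num_entries) > switch


# 1000 queries against 500000 entries gives 0.002, above the 0.001 default.
large_search = uses_mudpit_scoring(1000, 500000)
small_search = uses_mudpit_scoring(100, 500000)
```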

Search Level Information

Set these options to 1 to include the corresponding block. For export_dat_2.pl, you must also include search_master=1. If search_master is missing or set to 0, this disables output of all the following except show_unassigned.

show_format
format parameters
show_header
search level information
show_masses
residue and element masses
show_params
search parameters
show_mods (not export_dat.pl)
modifications information
show_decoy (not export_dat.pl)
automatic decoy search statistics
show_queries (export_dat.pl only)
query information
show_unassigned
peptide matches not assigned to protein hits

Protein Hit Fields

Set these options to 1 to include the corresponding field. Only selected fields are available if the format is pepXML or DTASelect. For export_dat_2.pl, you must also include protein_master=1. If protein_master is missing or set to 0, this disables output of the following fields (and also disables output of the peptide fields).

prot_desc
Fasta title / description line
prot_score
protein score
prot_thresh
protein score significance threshold (PMF only)
prot_expect
protein score expectation value (PMF only)
prot_mass
protein mass
prot_matches
number of assigned peptide matches
prot_cover
percentage of protein sequence covered by assigned peptide matches
prot_len
protein sequence length
prot_pi
calculated pI value for protein
prot_tax_str
protein taxonomy description
prot_tax_id
protein taxonomy ID number
prot_seq (not export_dat.pl)
complete protein sequence
prot_empai (not export_dat.pl)
emPAI
prot_quant (not export_dat.pl)
protein ratios from quantitation method

Peptide Match Fields

Set these options to 1 to include the corresponding field. Only selected fields are available if the format is pepXML or DTASelect. For export_dat_2.pl, you must also include peptide_master=1. If peptide_master is missing or set to 0, this disables output of the following fields.

pep_exp_mr
observed relative molecular mass
pep_exp_z
observed charge state
pep_calc_mr
calculated relative molecular mass
pep_delta
(pep_calc_mr - pep_exp_mr)
pep_start
1 based residue number for peptide start in protein
pep_end
1 based residue number for peptide end in protein
pep_miss
number of missed enzyme cleavage sites
pep_score
peptide match score
pep_homol
peptide score homology threshold
pep_ident
peptide score identity threshold
pep_expect
peptide match expectation value
pep_rank (export_dat.pl only)
rank of peptide match (1 - 10)
pep_seq
peptide sequence
pep_frame
peptide frame number (nucleic acid sequence databases only)
pep_var_mod
variable modifications used to get the match as comma separated string
pep_num_match (not export_dat.pl)
number of fragment ion matches used for scoring
pep_scan_title (not export_dat.pl)
scan title from peak list
pep_quant (not export_dat.pl)
peptide ratios from quantitation method

Query Fields (not export_dat.pl)

Set these options to 1 to include the corresponding field. Only selected fields are available if the format is pepXML or DTASelect. For export_dat_2.pl, you must also include query_master=1. If query_master is missing or set to 0, this disables output of the following fields.

query_title
scan title from peak list
query_qualifiers
seq, comp, tag, etc.
query_params
search parameters in local scope of a single query
query_peaks
peak list
query_raw
all peptide matches for this query
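The master switches for export_dat_2.pl are easy to forget when a command line is built by a script. The sketch below adds them automatically, following the rules stated in the sections above: search-level blocks other than show_unassigned need search_master, protein fields need protein_master, peptide fields need both protein_master and peptide_master, and query fields need query_master. The function and constant names are hypothetical.

```python
# Search-level options that are gated by search_master in export_dat_2.pl.
SEARCH_LEVEL = {
    "show_format", "show_header", "show_masses",
    "show_params", "show_mods", "show_decoy",
}

# Field-name prefixes and the master switches each one depends on.
PREFIX_MASTERS = (
    ("prot_", ("protein_master",)),
    ("pep_", ("protein_master", "peptide_master")),
    ("query_", ("query_master",)),
)


def add_master_switches(options):
    """Return a copy of options with the required *_master switches set to 1."""
    result = dict(options)
    for name, value in options.items():
        if str(value) != "1" or name.endswith("_master"):
            continue  # only enabled fields trigger a master switch
        if name in SEARCH_LEVEL:
            result["search_master"] = 1
        for prefix, masters in PREFIX_MASTERS:
            if name.startswith(prefix):
                for master in masters:
                    result[master] = 1
    return result


completed = add_master_switches(
    {"show_header": 1, "pep_seq": 1, "show_unassigned": 1}
)
```

Formatting arguments such as show_same_sets are deliberately left out of SEARCH_LEVEL, because they are not search-level blocks.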

XML Schema

Versioning

The Mascot Search Results XML schema uses versioning to avoid applications breaking when the schema is updated. The schema definition is identified by a major version number and a minor version number.

When a change is made to the schema, and any instance document that was valid against the previous schema could become invalid, the major version number will be incremented. An example of such a change would be that a new type or element is added to the schema that is not optional. If a change is made to a schema that cannot break the validity of any existing document, such as adding a new type or element that is optional, then the minor version will be incremented.

There will be a separate schema file and namespace for each major version and the file name contains the major version number. The schema also includes the major and minor version numbers as attributes of the top level element. An application that parses an instance document should compare the major and minor version attribute values against those which it was coded to support. It should not rely on an XML parser to verify the version numbers against the schema encoded restrictions, since the schema definition file used by the parser may be newer than when the application was written.
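An application-side version check might look like the sketch below. The attribute names majorVersion and minorVersion are assumptions that should be confirmed against the schema documentation. The policy shown (reject a different major version, accept but flag a newer minor version) is one reasonable reading of the versioning rules above, not a documented requirement. The function name check_schema_version is hypothetical.

```python
import xml.etree.ElementTree as ET


def check_schema_version(xml_text, supported_major, supported_minor):
    """Compare the top level element's version attributes with supported values.

    Returns "ok", "newer-minor" or "unsupported-major".
    """
    root = ET.fromstring(xml_text)
    major_text = root.get("majorVersion")   # assumed attribute name
    minor_text = root.get("minorVersion")   # assumed attribute name
    if major_text is None or minor_text is None:
        raise ValueError("version attributes not found on top level element")
    if int(major_text) != supported_major:
        return "unsupported-major"
    if int(minor_text) > supported_minor:
        # Minor changes only add optional items, so parsing can continue,
        # but unknown elements should be ignored rather than rejected.
        return "newer-minor"
    return "ok"
```

For large exports, reading only the start of the file (for example with ET.iterparse and stopping at the first start event) would avoid loading the whole document just to check the version.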

Validation

The instance documents created by this export utility have been validated against the corresponding schema definitions using XMLSpy. The following web tools can also be used:

No complex software is ever completely free of bugs. If you find an XML file created by the Export Search Results utility that fails to validate against the corresponding schema definition, please email full details to support@matrixscience.com and we will try to fix the problem as rapidly as possible.

On the other hand, if the XML file validates, but an error is reported by the application reading the file, then this is a bug in the application. In the first instance, please report this to the authors of the application.

Useful Resources

Standards and design:

Programming:

Schema support in Xerces-C++ - includes SAX2 and DOM examples for overriding xsi:schemaLocation. Xerces 2.6 adds support for grammar caching which could be used to do the same thing by preloading the known schema files.
Properties supported by Xerces2-J - it should be possible to use the external-schemaLocation property in much the same way as the C++ version to override the schema locations. Alternatively, you could use grammar caching instead.
MSXML provides an XMLSchemaCache object for preloading schemas. There is a DOM example and it should be fairly similar in SAX2 except you would use the schemas property of ISAXXMLReader.