Enhance/normalize "Picard Normalize Fasta" default settings and better link-in custom genome formatting #433

jennaj · 2017-01-14T01:15:52Z

Tool version https://toolshed.g2.bx.psu.edu/view/devteam/picard/00fe2ff64467

The default line length should be 80 (instead of 100).

Some tools are picky about line length, and 40-80 works with most. A length of 100 is rejected by some tools (BLAST is one). Adopting a common standard as the default would help keep this from becoming a routine user-reported problem. (Example: NCBI's 80-character line length standard.)
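To make the request concrete, here is a minimal Python sketch of rewrapping sequence lines to 80 columns. It is not how Picard implements wrapping; `wrap_fasta` and its `width` parameter are hypothetical names used only for illustration.

```python
def wrap_fasta(lines, width=80):
    """Yield FASTA lines with every sequence rewrapped to `width` columns.

    Hypothetical helper for illustration only, not Picard's implementation.
    """
    seq_parts = []

    def flush():
        # Emit the sequence collected so far in chunks of `width` characters.
        seq = "".join(seq_parts)
        seq_parts.clear()
        for i in range(0, len(seq), width):
            yield seq[i:i + width]

    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            yield from flush()  # finish the previous record first
            yield line          # title lines pass through unchanged
        else:
            seq_parts.append(line)
    yield from flush()
```

Title lines are left untouched here, so the sketch only covers the line-length part of the request.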

Address utility for formatting custom genomes better

Custom genomes with description content in the title line work with most mapping tools but then fail later with many downstream tools. This is a routine and frustrating problem for users. If the tool form made it more obvious that this tool will reformat a fasta file to meet the expected custom genome format, it would be a big win. This won't help everyone, but ideally it will help most.

Add "custom genome" to the tool name/description/text so it will be found by the tool search
Add note that to be in custom genome format, choose to split the title line (right under option)
Add note that to be in custom genome format, do not choose a line wrapping length outside of 40-80 bases (right under option)
Add in tool help about using the tool to achieve proper custom genome format - items from fasta help and the cg support wiki can be distilled and linked to tool form options to instruct how to properly do this specific reformatting task.
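As a rough illustration of what the "split the title line" option is meant to achieve, the sketch below keeps only the identifier from a title line. It assumes that "splitting" means dropping everything after the first whitespace; `split_title` is a hypothetical name, not part of the tool.

```python
def split_title(header):
    """Return a FASTA title line reduced to its identifier.

    Assumption: the identifier is the text between '>' and the first
    whitespace; any description content after it is dropped.
    """
    if not header.startswith(">"):
        raise ValueError("not a FASTA title line: %r" % header)
    fields = header[1:].split(None, 1)
    if not fields:
        raise ValueError("title line has no identifier")
    return ">" + fields[0]
```

For example, `split_title(">chr1 Homo sapiens chromosome 1")` returns `">chr1"`, which is the form without description content that the downstream tools mentioned above expect.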

Error trapping

The tool should probably capture the content error case of more than one sequence having the same identifier (before OR after splitting the title line). This is useful for anyone. The tool doesn't necessarily need to fail when this occurs - it could be a checkbox option on the form: "Ignore duplicated sequence identifiers", with a default of NO, then underneath "Note: Use default NO if the fasta file is a custom genome"
The tool mostly ignores line content. Nearly any alphanumeric content is allowed in the "fasta" sequence lines. If the content is not in spec, downstream tools will fail and the user is not sure what is wrong, since they ran this tool (thinking that it confirmed the format). This could be an opportunity to include fasta QC within a tool. Options could be to check for/report about/warn: ATCG only, ATCGN, ATCGN + nucleotide IUPAC + ./-, protein AAs, protein AAs + protein IUPAC, etc. - any flavor.
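Both checks could look something like the following Python sketch. It is only an outline of the idea, not a proposed implementation: the function names and the `ALPHABETS` table are hypothetical, identifiers are compared after splitting the title line at the first whitespace, and only the nucleotide flavors are covered.

```python
from collections import Counter

# Hypothetical alphabet table, ordered from strictest to most permissive.
ALPHABETS = {
    "ATCG": set("ACGT"),
    "ATCGN": set("ACGTN"),
    "IUPAC nucleotide": set("ACGTUNRYSWKMBDHV.-"),
}


def duplicate_ids(lines):
    """Return identifiers that occur on more than one title line."""
    counts = Counter(
        line[1:].split(None, 1)[0]
        for line in lines
        if line.startswith(">") and line[1:].strip()
    )
    return sorted(name for name, n in counts.items() if n > 1)


def strictest_alphabet(lines):
    """Name the strictest alphabet covering all sequence characters.

    Returns None when no alphabet in ALPHABETS fits. Case is ignored.
    """
    seen = set()
    for line in lines:
        if not line.startswith(">"):
            seen.update(line.strip().upper())
    for name, allowed in ALPHABETS.items():
        if seen <= allowed:
            return name
    return None
```

A form option such as "Ignore duplicated sequence identifiers" would then decide whether a non-empty result from `duplicate_ids` is reported as a warning or treated as a failure.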

jennaj added the enhancement label Jan 14, 2017
