The default line length should be 80 (instead of 100).
Some tools are picky about line length, and 40-80 works with most. A length of 100 is not accepted by some tools (BLAST is one). We can help avoid this becoming a routine user-reported problem by adopting a common standard as the default (example: NCBI's 80-character line length standard).
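To make the request concrete, here is a minimal sketch of re-wrapping sequence lines to a default of 80 characters. It is not the tool's actual code; the function name `wrap_fasta` and its `width` parameter are hypothetical, and it assumes the input is an iterable of text lines.

```python
def wrap_fasta(lines, width=80):
    """Yield FASTA lines with each sequence re-wrapped to `width` characters.

    Hypothetical helper for illustration only; title lines pass through unchanged.
    """
    seq_parts = []

    def flush():
        # Join the buffered sequence and emit it in fixed-width chunks.
        seq = "".join(seq_parts)
        seq_parts.clear()
        for i in range(0, len(seq), width):
            yield seq[i:i + width]

    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            # A new record starts: write out the previous sequence first.
            yield from flush()
            yield line
        elif line.strip():
            # Buffer sequence text; blank lines are dropped.
            seq_parts.append(line.strip())
    # Write out the last record.
    yield from flush()
```

With `width=80`, a record whose sequence arrives as two 100-character lines comes out as two 80-character lines followed by one 40-character line.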
Address utility for formatting custom genomes better
Custom genomes with description content in the title line work with most mapping tools but then fail later in many other downstream tools. This is a routine and frustrating problem for users. If it were made more obvious that this tool will reformat a fasta file to match the expected custom genome format, it would be a big win. This won't help everyone, but ideally most.
Add "custom genome" to the tool name/description/text so it will be found by the tool search
Add note that to be in custom genome format, choose to split the title line (right under option)
Add note that to be in custom genome format, do not choose a line wrapping length outside of 40-80 bases (right under option)
Add a section to the tool help about using the tool to achieve proper custom genome format - items from the fasta help and the cg support wiki can be distilled and linked to the tool form options to explain how to do this specific reformatting task.
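For the help text, the "split the title line" step could be illustrated along these lines. This is a sketch under the assumption that splitting means keeping only the text up to the first whitespace; the name `strip_description` is hypothetical and is not the tool's own function.

```python
def strip_description(header):
    """Return a title line that keeps only the identifier.

    Assumes the identifier is everything between '>' and the first whitespace.
    """
    body = header[1:].strip()
    # Keep the first whitespace-delimited token; an empty title stays empty.
    identifier = body.split(None, 1)[0] if body else ""
    return ">" + identifier
```

For example, `strip_description(">chr1 Homo sapiens chromosome 1")` gives `">chr1"`, which is the kind of title line that downstream tools tend to expect for a custom genome.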
Error trapping
The tool should probably catch the content error where more than one sequence has the same identifier (before OR after splitting the title line). This is useful for anyone. The tool doesn't necessarily need to fail when this occurs - it could be a checkbox option on the form: "Ignore duplicated sequence identifiers", with a default of NO, then underneath "Note: Use default NO if the fasta file is a custom genome".
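A rough sketch of the duplicate check, to show how small it could be. The function name `duplicate_ids` and the `split_title` flag are hypothetical; the flag stands in for checking identifiers before or after the title line is split.

```python
from collections import Counter


def duplicate_ids(lines, split_title=True):
    """Return a sorted list of titles that occur on more than one record.

    With split_title=True only the text up to the first whitespace is compared
    (an assumption about how the title line is split).
    """
    counts = Counter()
    for line in lines:
        if line.startswith(">"):
            title = line[1:].strip()
            if split_title:
                title = title.split(None, 1)[0] if title else ""
            counts[title] += 1
    return sorted(title for title, n in counts.items() if n > 1)
```

Two records titled `>chr1 a` and `>chr1 b` are distinct before splitting but collide afterwards, so the check matters most when the title line is split. The form option could then decide whether a non-empty result is a failure or only a warning.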
The tool mostly ignores line content. Nearly any alphanumeric content is allowed in the "fasta" sequence lines. If the content is not in spec, downstream tools will fail and users are not sure what is wrong, since they ran this tool and assumed it confirmed the format. This could be an opportunity to include fasta QC within the tool. Options could be to check for, report on, or warn about: ATCG only, ATCGN, ATCGN + nucleotide IUPAC + ./-, protein AAs, protein AAs + protein IUPAC, etc - any flavor.
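The QC idea could look something like the sketch below, shown for the nucleotide flavors only. The names `ALPHABETS` and `invalid_characters` are hypothetical, and the case-insensitive comparison is an assumption; protein alphabets could be added to the same table.

```python
# Allowed characters per QC flavor (nucleotide flavors only in this sketch).
ALPHABETS = {
    "dna": set("ACGT"),
    "dna_n": set("ACGTN"),
    "dna_iupac": set("ACGTNRYSWKMBDHV.-"),
}


def invalid_characters(lines, alphabet="dna_n"):
    """Return (line_number, offending_characters) for each bad sequence line.

    Line numbers are 1-based; title lines are skipped; case is ignored
    (an assumption, since lowercase often marks soft-masked sequence).
    """
    allowed = ALPHABETS[alphabet]
    problems = []
    for number, line in enumerate(lines, start=1):
        if line.startswith(">"):
            continue
        bad = sorted(set(line.strip().upper()) - allowed)
        if bad:
            problems.append((number, "".join(bad)))
    return problems
```

Reporting the line number and the offending characters, rather than just failing, would give users something specific to fix before moving on to downstream tools.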
Tool version https://toolshed.g2.bx.psu.edu/view/devteam/picard/00fe2ff64467