Enhance/normalize "Picard Normalize Fasta" default settings and better link-in custom genome formatting #433

Open
jennaj opened this issue Jan 14, 2017 · 0 comments

jennaj commented Jan 14, 2017

Tool version https://toolshed.g2.bx.psu.edu/view/devteam/picard/00fe2ff64467

The default line length should be 80 (instead of 100).

Some tools are picky about line length, and 40-80 works with most. 100 is not accepted by some tools (BLAST is one). We can help avoid this becoming a routine user-reported problem by adopting a common standard as the default (example: NCBI's 80 char line length standard).
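For reference, the rewrapping being discussed can be sketched in a few lines of Python. This is only an illustration, not how Picard implements it: the function name `wrap_fasta` and its `width` parameter are hypothetical, and the default of 80 simply mirrors the proposal above.

```python
def wrap_fasta(lines, width=80):
    """Yield FASTA lines with sequence data rewrapped to `width` characters.

    Hypothetical helper for illustration; not part of Picard or Galaxy.
    Title lines (starting with ">") are passed through unchanged.
    """
    if width < 1:
        raise ValueError("width must be positive")
    chunks = []  # sequence pieces collected for the current record

    def flush():
        # Join the collected pieces and emit them in fixed-width slices.
        seq = "".join(chunks)
        chunks.clear()
        for i in range(0, len(seq), width):
            yield seq[i:i + width]

    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            yield from flush()  # finish the previous record first
            yield line
        else:
            chunks.append(line)
    yield from flush()  # emit the last record
```

Usage would be along the lines of `for line in wrap_fasta(open("genome.fa")): print(line)`, where `genome.fa` is a placeholder file name.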

Address utility for formatting custom genomes better

Custom genomes with description content work with most mapping tools but then fail later with many other downstream tools. This is a routine and frustrating problem users encounter. If it could be made more obvious that this tool will reformat a fasta file to conform to the expected custom genome format, it would be a big win. This won't help everyone, but ideally most.

  • Add "custom genome" to the tool name/description/text so it will be found by the tool search
  • Add note that to be in custom genome format, choose to split the title line (right under option)
  • Add note that to be in custom genome format, do not choose a line wrapping length outside of 40-80 bases (right under option)
  • Add in tool help about using the tool to achieve proper custom genome format - items from fasta help and the cg support wiki can be distilled and linked to tool form options to instruct how to properly do this specific reformatting task.
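To make the "split the title line" step concrete, here is a rough Python sketch of the transformation, under the assumption that the custom genome format wants only the identifier (the text before the first whitespace) on each `>` line. The function name `strip_descriptions` is made up for this example and does not correspond to any Picard option.

```python
def strip_descriptions(lines):
    """Yield FASTA lines with each title line cut at the first whitespace.

    Hypothetical helper for illustration; assumes the identifier is the
    first whitespace-delimited token after ">".
    """
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            # Split once on whitespace; everything after the first token
            # is treated as description content and dropped.
            fields = line[1:].split(None, 1)
            if not fields:
                raise ValueError("title line has no identifier: %r" % line)
            yield ">" + fields[0]
        else:
            yield line  # sequence lines are left untouched
```

With this, a title line such as `>chr1 Homo sapiens chromosome 1` becomes `>chr1`, while the sequence lines are passed through as they are.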

Error trapping

  1. The tool should probably capture the content error case of more than one sequence having the same identifier (before OR after splitting the title line). This is useful for anyone. The tool doesn't necessarily need to fail when this occurs - it could be a checkbox option on the form: "Ignore duplicated sequence identifiers", with a default of NO, then underneath "Note: Use default NO if the fasta file is a custom genome"

  2. The tool mostly ignores line content. Nearly any alphanumeric content is allowed in the "fasta" sequence lines. If the content is not in spec, downstream tools will fail and the user is not sure what is wrong, since they ran this tool (thinking that it confirmed the format). This could be an opportunity to include fasta QC within a tool. Options could be to check for/report about/warn: ATCG only, ATCGN, ATCGN + nuc IUPAC + ./-, protein AAs, pro AAs + pro IUPAC, etc - any flavor.
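Both checks could look roughly like the Python sketch below. It is only meant to show the idea, not a proposed implementation: `check_fasta`, `ALPHABETS`, and the `split_title` flag are hypothetical names, only the three nucleotide flavors are covered, and the choice to report problems rather than fail is an assumption.

```python
from collections import Counter

# Hypothetical alphabet presets; protein flavors are left out of this sketch.
ALPHABETS = {
    "ATCG": set("ACGT"),
    "ATCGN": set("ACGTN"),
    # ATCGN plus the IUPAC nucleotide ambiguity codes plus "." and "-"
    "nuc_iupac": set("ACGTNRYSWKMBDHV.-"),
}


def check_fasta(lines, alphabet="ATCGN", split_title=True):
    """Return (duplicate_ids, bad_chars) for FASTA content given as lines.

    duplicate_ids: sorted list of identifiers that occur more than once.
    bad_chars: dict mapping identifier -> set of characters outside the alphabet.
    """
    allowed = ALPHABETS[alphabet]
    ids = Counter()
    bad = {}
    current = None  # identifier of the record being read
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            title = line[1:]
            if split_title and title.strip():
                # Compare identifiers as they would look after splitting.
                current = title.split(None, 1)[0]
            else:
                current = title
            ids[current] += 1
        else:
            # Case-insensitive comparison, so soft-masked sequence passes.
            unexpected = set(line.upper()) - allowed
            if unexpected:
                bad.setdefault(current, set()).update(unexpected)
    duplicates = sorted(name for name, count in ids.items() if count > 1)
    return duplicates, bad
```

Running it once with `split_title=True` and once with `split_title=False` covers the "before OR after splitting the title line" case from item 1; a form checkbox such as "Ignore duplicated sequence identifiers" would then only decide whether a non-empty `duplicate_ids` list is treated as an error.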
