LOVD Scripts

Reference Sequence Parser

The Reference Sequence Parser allows you to create coding DNA reference sequences for the genes in your database. For it to work, the refseq directory must be writable. If you want the script to generate a GenBank file as well, the genbank directory must also be writable.

Input formats for the Reference Sequence Parser
The Reference Sequence Parser accepts specific input formats for each step. Step 1 requires a GenBank file, and will generate the formats for steps 2 and 3 automatically, so starting at this step is recommended.
Step 2 requires the genomic sequence formatted to provide the positions of the upstream sequence, exons, introns, downstream sequence and the start of the translation.
Step 3 requires the coding DNA sequence formatted to provide the positions of the exon borders and the start of the translation.
All formats are case-insensitive: upper-case and lower-case nucleotides are treated the same.

Input format for step 1
Step 1 requires a valid GenBank file, uploaded into the 'genbank' directory. You can upload GenBank files with the GenBank File Uploader. Preferably, the GenBank file contains only your gene of interest and only one transcript is defined. If more than one transcript is defined, you will have to fill in the appropriate transcript_id and protein_id. If you do not fill in both fields, the first mRNA and CDS fields appearing in the file and associated with this gene will be selected.

Step 1 will create the correct format for step 2, so this is the recommended step to start out with.

Input format for step 2
UPSTREAM<EXON>INTRON<EXON>INTRON<EXON>INTRON (...etc...) <EXON>DOWNSTREAM
Make sure you include the starting point of the translation by putting a '|' in front of the 'a' of the 'atg' starting codon. If you started at step 1, this is done for you automatically.

Example:
cccccccc<gggggg|at>tttttttt<gggggggg>aaaaaaaa
will parse:
- 'cccccccc' as the upstream sequence.
- 'ggggggat' as exon 1, with the translation starting at the 'a'.
- 'tttttttt' as intron 1.
- 'gggggggg' as exon 2.
- 'aaaaaaaa' as the downstream sequence.
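The parsing rules above can be sketched in a few lines of Python. This is only an illustration of the step 2 format, not the parser LOVD itself uses; the function name parse_step2, the returned dictionary keys and the rule that the '|' must lie inside an exon are assumptions made for this example.

```python
import re


def parse_step2(text):
    """Split a step 2 string into upstream, exons, introns and downstream.

    Hypothetical helper for illustration; not part of LOVD.
    """
    # Formats are case-insensitive, so normalise to lower case and drop whitespace.
    seq = "".join(text.split()).lower()
    if seq.count("|") != 1:
        raise ValueError("expected exactly one '|' marking the translation start")
    # The string must look like UPSTREAM<EXON>INTRON<EXON>...<EXON>DOWNSTREAM.
    if not re.fullmatch(r"[^<>]*(<[^<>]+>[^<>]*)+", seq):
        raise ValueError("unbalanced or empty '<...>' exon markers")
    # After splitting on the brackets, odd-indexed pieces are exons and
    # even-indexed pieces lie outside the exons.
    parts = re.split(r"[<>]", seq)
    outside, raw_exons = parts[0::2], parts[1::2]
    if any("|" in piece for piece in outside):
        raise ValueError("assumption: the '|' must be placed inside an exon")
    exons, start = [], None
    for number, raw in enumerate(raw_exons, start=1):
        if "|" in raw:
            # (exon number, 0-based offset of the 'a' within that exon)
            start = (number, raw.index("|"))
        exons.append(raw.replace("|", ""))
    return {
        "upstream": outside[0],
        "exons": exons,
        "introns": outside[1:-1],
        "downstream": outside[-1],
        "translation_start": start,
    }


result = parse_step2("cccccccc<gggggg|at>tttttttt<gggggggg>aaaaaaaa")
print(result["exons"])              # ['ggggggat', 'gggggggg']
print(result["introns"])            # ['tttttttt']
print(result["translation_start"])  # (1, 6)
```

Note that in this example the start codon spans the exon border: the 'a' and 't' come from exon 1 and the 'g' from exon 2.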

Step 2 will create the upstream sequence, the intronic sequences and the downstream sequence, as well as exon length tables in HTML and text format. These tables list the exon start and end positions (in c. and g. numbering) and the exon and intron lengths. All of these files are saved in the refseq directory. Step 2 will also create the input for step 3 for you. Optionally, you can have it create a file in GenBank format as well. You will then have to fill in the appropriate transcript_id, protein_id and db_xref numbers. The created file will meet the minimum requirements for uploading to Mutalyzer.
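To illustrate how the step 3 input relates to the step 2 format, the following minimal Python sketch extracts the exons from a step 2 string, joins them with ';' and reports the exon lengths. It is not the code LOVD runs; the function names step2_to_step3 and exon_lengths are hypothetical, and the c. and g. positions from the exon length tables are deliberately left out.

```python
import re


def step2_to_step3(step2_text):
    """Turn a step 2 string into a step 3 string (hypothetical helper)."""
    seq = "".join(step2_text.split())
    # Every '<...>' block is one exon; the '|' marker is kept as-is.
    exons = re.findall(r"<([^<>]+)>", seq)
    if not exons:
        raise ValueError("no '<EXON>' blocks found")
    return ";".join(exons)


def exon_lengths(step3_text):
    """Length of each exon in nucleotides, ignoring the '|' marker."""
    return [len(exon.replace("|", "")) for exon in step3_text.split(";")]


step3 = step2_to_step3("cccccccc<gggggg|at>tttttttt<gggggggg>aaaaaaaa")
print(step3)                # gggggg|at;gggggggg
print(exon_lengths(step3))  # [8, 8]
```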

Input format for step 3
EXON;EXON;EXON; (...etc...) ;EXON
Make sure you include the starting point of the translation by putting a '|' in front of the 'a' of the 'atg' start codon. If you started at step 1 or 2, the start codon marked in that step is reused.

Example:
gggggg|at;gggggggg
will parse:
- 'ggggggat' as exon 1, with the translation starting at the 'a'.
- 'gggggggg' as exon 2.
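The same rules for step 3 can be sketched in Python. As before, this is an illustration under stated assumptions rather than LOVD's own implementation: the function name parse_step3 and its return values are invented for this example.

```python
def parse_step3(text):
    """Split a step 3 string into exons and locate the translation start.

    Hypothetical helper for illustration; not part of LOVD.
    Returns (exons, coding_dna, utr5_length), where utr5_length is the
    number of nucleotides in front of the 'a' of the start codon.
    """
    # Formats are case-insensitive, so normalise to lower case and drop whitespace.
    seq = "".join(text.split()).lower()
    if seq.count("|") != 1:
        raise ValueError("expected exactly one '|' marking the translation start")
    exons = [part.replace("|", "") for part in seq.split(";")]
    if any(exon == "" for exon in exons):
        raise ValueError("empty exon between ';' separators")
    # Everything before the '|' (minus exon separators) precedes the start codon.
    utr5_length = len(seq.split("|")[0].replace(";", ""))
    coding_dna = "".join(exons)
    return exons, coding_dna, utr5_length


exons, coding_dna, utr5_length = parse_step3("gggggg|at;gggggggg")
print(exons)                                     # ['ggggggat', 'gggggggg']
print(utr5_length)                               # 6
print(coding_dna[utr5_length:utr5_length + 3])   # atg
```

The last line shows that once the exons are joined, the three nucleotides following the marker form the 'atg' start codon, even though it spans the border between exon 1 and exon 2.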

Step 3 will create the coding DNA sequence including the translation and save it in the refseq directory.
If your gene does not have a reference sequence configured yet, this script will automatically add the created coding DNA reference sequence to the gene homepage.

For examples of what these reference sequences look like, see the reference sequences at www.DMD.nl, such as the CAPN3 reference sequence.
