Sigenae Public

About Sigenae

What is SIGENAE?

SIGENAE is a group of INRA bioinformaticians providing services to biologists working on six species (cattle, chicken, pig, rabbit, sheep, trout). The members of the group are located at three sites (Rennes, Toulouse, Tours) and belong to four INRA departments. SIGENAE is also the name of the Information System that this team develops and maintains.

Who are we?

Have a look at the team page.

What tools or services do we provide?

Services / Users categories             | Everybody | Genanimal | Other projects
Sequence cleaning                       | Yes*      | Yes       | Yes
Sequence clustering                     | Yes*      | Yes       | Yes
Library Statistics                      | Yes*      | Yes       | Yes
Redundancy calculation                  | Yes*      | Yes       | Yes
MicroArray data storage and processing  | No        | Yes       | Yes
Specific data processing                | No        | Yes       | Yes
Training                                | No        | Yes       | Yes

(*) only on public sequences found in dbEST and Genbank.

Which projects are we working on?

We work mainly on AGENAE.

We are also part of EADGENE and AQUAFIRST.

How to contact us?

You can use the mail address and phone numbers found on the team page, or use the contact page.

Citing us

Depending on the help provided, you can cite us in the acknowledgements, in the references, or in both.

Examples:

Acknowledgements
We wish to thank the SIGENAE group for ....

References
X. SIGENAE [http://www.sigenae.org/]

Citations

Papers

Posters

Others

Some Definitions

What does SIGENAE mean?

Système d'Information d'AGENAE (Information System of AGENAE).

What is an EST?

EST means Expressed Sequence Tag. An EST is a short nucleotide sequence derived from cDNA, corresponding to a coding STS (Sequence Tagged Site) and specific to one end of an mRNA.

What are hits?

For SIGENAE, hits are the sequences most homologous to our consensus sequences, found by a BlastX (nucleotide against protein) on SWISSPROT. Only the 20 best hits are kept, and each must have a score of at least 100 and an E-value of at most 1e-5.
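The selection rule above can be sketched as follows. This is only an illustration, not the actual SIGENAE pipeline: the `Hit` record and its field names are hypothetical, the E-value figure is read as an upper bound, and ranking by score is an assumption about what "best" means.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    # Hypothetical record; field names are not taken from SIGENAE.
    accession: str
    score: float
    evalue: float


MIN_SCORE = 100    # minimum score quoted in the FAQ
MAX_EVALUE = 1e-5  # E-value threshold quoted in the FAQ, read as an upper bound
MAX_HITS = 20      # number of best hits kept


def select_hits(hits):
    """Keep hits passing both thresholds, best score first, at most MAX_HITS."""
    passing = [h for h in hits if h.score >= MIN_SCORE and h.evalue <= MAX_EVALUE]
    passing.sort(key=lambda h: h.score, reverse=True)
    return passing[:MAX_HITS]
```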

What are interspersed repeats?

Interspersed repeats are repeated DNA sequences spread over the whole vertebrate genome. There are two main kinds: SINE (Short Interspersed Elements) and LINE (Long Interspersed Elements). For example, Alu elements are a frequent family of SINE sequences, found in primate genomes.

What is a score?

  • The Phred score is a quality rating assigned to each base by Phred after analysis of the sequencing trace. It indicates the probability of error in the determination of the nucleotide type (A, T, G or C). A Phred score of 20 means there is a 1 in 100 chance that the base is not the nucleotide determined by Phred.
  • A Blast score (used by LASSAP to recover hits, for example) quantifies homology. A Blast comparison can find more than one local alignment (= homologous zone = HSP, High-scoring Segment Pair) between two sequences. The score of an HSP is the sum of the elementary scores calculated at each position of the two sequences in their optimal matching, so it rates the quality of the optimal alignment found for that HSP. The score is weighted by gaps, matches and other parameters. Basically, it is the total number of matches penalized by the number of mismatches. On the Sigenae interface, the score indicated for a hit is that of the best-scoring HSP among the HSPs resulting from the Blast between the hit and the Sigenae sequence.
  • The Identity score is the percentage of identical bases between two sequences in their best alignment.
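The Phred example above (a score of 20 for a 1 in 100 error chance) follows from the usual definition Q = -10 * log10(P), where P is the error probability. A minimal sketch of the conversion in both directions; the function names are ours, not part of Phred:

```python
import math


def phred_to_error_probability(q):
    """Error probability P for a Phred quality score Q: P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p):
    """Phred quality score Q for an error probability P: Q = -10 * log10(P)."""
    return -10 * math.log10(p)
```

With this definition, a score of 30 corresponds to a 1 in 1000 chance of error.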

What is an E-value?

E-value stands for Expect value (E = a*e^(-x)). It describes the number of hits that you can expect to find by chance for your sequence in a database of a given size, and so indicates the risk of a random match between your sequence and those of the database. In the formula, the factor "a" accounts for the length of the sequences and the factor "x" for the alignment score. Basically, the lower the E-value (that is, the closer to zero), the more significant the hit. On the Sigenae interface, the E-value indicated for a hit is that of the best-scoring HSP among the HSPs resulting from the Blast between the hit and the Sigenae sequence.
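In standard BLAST statistics (the Karlin-Altschul formula), the expect value is E = K * m * n * e^(-lambda * S), where m and n are the query and database lengths and S is the raw score. Assuming this is the formula meant above, "a" would correspond to K * m * n and "x" to lambda * S. The sketch below uses placeholder values for K and lambda, since these depend on the scoring matrix and gap penalties, and it ignores the effective-length corrections that BLAST applies.

```python
import math


def expect_value(score, query_len, db_len, k=0.041, lam=0.267):
    """Karlin-Altschul expect value E = K * m * n * exp(-lambda * S).

    k and lam are illustrative placeholders, not SIGENAE settings.
    """
    a = k * query_len * db_len  # length-dependent factor ("a" in the FAQ)
    x = lam * score             # score-dependent exponent ("x" in the FAQ)
    return a * math.exp(-x)
```

A higher score lowers E exponentially, while a database twice as large doubles it.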

Sigenae Data

What are the conditions for putative SNP detection?

In contigs with a global depth of 7 or greater, putative SNPs are detected by comparing the nucleotides of each sequence with the contig consensus at each position. To pass the filter and be considered a putative SNP, a dissimilarity observed at position P has to fulfil the following conditions:

  • the local depth at position P has to be at least 7
  • the flanking regions of 4 bases around position P have to be exactly conserved
  • the least represented base has to be present in at least 3 sequences
  • gaps in the consensus sequence are ignored
  • Ns or gaps in the sequences are ignored
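The conditions above can be sketched as a filter over an alignment. This is an illustrative reading, not the SIGENAE implementation, and it rests on several assumptions: each read is a string padded to the consensus length ('.' where the read does not cover, '-' for a gap, 'N' for an unknown base); global depth is taken as the number of reads in the contig; a flank is "conserved" when every informative base in it equals the consensus; and positions closer than 4 bases to a contig end are skipped.

```python
from collections import Counter

MIN_DEPTH = 7   # minimum global and local depth
FLANK = 4       # conserved bases required on each side of P
MIN_MINOR = 3   # minimum count of the least represented base
BASES = set("ACGT")


def column(reads, pos):
    """Informative bases (A, C, G, T) at one column; N, gaps and padding are ignored."""
    return [r[pos] for r in reads if r[pos] in BASES]


def is_conserved(reads, consensus, pos):
    """True if every informative base in the column equals the consensus base."""
    return all(b == consensus[pos] for b in column(reads, pos))


def putative_snps(consensus, reads):
    """Return (position, base counts) for each position passing all filters."""
    if len(reads) < MIN_DEPTH:
        return []
    snps = []
    for pos in range(FLANK, len(consensus) - FLANK):
        if consensus[pos] not in BASES:  # gap on the consensus: ignored
            continue
        counts = Counter(column(reads, pos))
        if sum(counts.values()) < MIN_DEPTH:
            continue
        if len(counts) < 2 or min(counts.values()) < MIN_MINOR:
            continue
        flanks = [p for p in range(pos - FLANK, pos + FLANK + 1) if p != pos]
        if all(is_conserved(reads, consensus, p) for p in flanks):
            snps.append((pos, dict(counts)))
    return snps
```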

Which E-value thresholds are used for contig annotation?

The E-value thresholds used to filter alignments between Sigenae contigs and other databanks are:

  • UniProt/RefSeq proteins: 1e-05
  • Pfam domains: 1e-05
  • Other Sigenae contigs: 1e-30
  • RefSeq RNA: 1e-05
  • UniGene/TIGR contigs: 1e-02
  • Ensembl transcripts: 1e-10
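As an illustration, the thresholds listed above can be held in a lookup table and applied to each alignment. The dictionary keys and the function name are ours, and treating each threshold as an inclusive upper bound is an assumption.

```python
# E-value thresholds from the list above, keyed by databank.
EVALUE_THRESHOLDS = {
    "UniProt/RefSeq proteins": 1e-05,
    "Pfam domains": 1e-05,
    "Other Sigenae contigs": 1e-30,
    "RefSeq RNA": 1e-05,
    "UniGene/TIGR contigs": 1e-02,
    "Ensembl transcripts": 1e-10,
}


def keep_alignment(databank, evalue):
    """True if the alignment's E-value does not exceed the databank threshold."""
    return evalue <= EVALUE_THRESHOLDS[databank]
```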

How do we name ESTs, batches and libraries?

  1. Library naming:
    -  For private data, four letters are used. For example, in the library bcai: b is for bovine (species name), c for cDNA (type of sequence) and ai is the "bank name" chosen by the sequence provider.
    -  For public data, the library identifier from the public database (EMBL, dbEST, ...) is used. This identifier can contain only digits or a mix of digits and letters, for example: AJ272372 or 12006.
  2. Batch naming: the numbers of the first and last plates of the batch are added to the library name, for example: bcai01_50.1. There are no public batches.
  3. EST naming:
    -  For private data, for example bcai0001a.h.06_5.1:
      1. bcai: the library name.
      2. 0001a: 0001 is the plate number and a is the copy version.
      3. h.06: the position on the plate, with h for the column and 06 for the line.
      4. 5.1: 5 is the sequencing direction and 1 is the version number given by the sequencing provider.
    -  For public data, for example 15089217.1:
      1. 15089217: the Sigenae name, which is also the EST identifier in the public database (dbEST in this case).
      2. .1: the 1 indicates the version number.
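The EST naming scheme above can be parsed with regular expressions. The sketch below is inferred from the two examples only, so the field widths (four library letters, four plate digits, one copy letter, two line digits, one direction digit) are assumptions, as are the function and field names.

```python
import re

# Private EST name, e.g. bcai0001a.h.06_5.1 (field widths inferred from the example).
PRIVATE_EST = re.compile(
    r"^(?P<library>[a-z]{4})(?P<plate>\d{4})(?P<copy>[a-z])"
    r"\.(?P<column>[a-z])\.(?P<line>\d{2})"
    r"_(?P<direction>\d)\.(?P<version>\d+)$"
)
# Public EST name, e.g. 15089217.1: public identifier, then version number.
PUBLIC_EST = re.compile(r"^(?P<identifier>[A-Za-z0-9]+)\.(?P<version>\d+)$")


def parse_est_name(name):
    """Split an EST name into its fields; the private pattern is tried first."""
    for kind, pattern in (("private", PRIVATE_EST), ("public", PUBLIC_EST)):
        match = pattern.match(name)
        if match:
            return {"kind": kind, **match.groupdict()}
    raise ValueError(f"unrecognised EST name: {name!r}")
```

For example, `parse_est_name("bcai0001a.h.06_5.1")` yields the library `bcai`, plate `0001`, copy `a`, column `h`, line `06`, direction `5` and version `1`.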


How are the contigs named?

Contig names are built as follows.

Example:

Two contigs: BU347886.1.p.gg.3 and gcag0004c.h.06_5.2.s.gg.8

The first part is the name of the oldest EST in the contig:

  • BU347886.1 - public EST
  • gcag0004c.h.06_5.2 - Agenae EST

A suffix of the form p.sp.v is then added, where:

  • p is the project type (p = public, s = sigenae)
  • sp is the species (bt Bos taurus, sc Sus scrofa, …)
  • v is the assembly version
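Since the suffix always has three dot-separated fields, a contig name can be split from the right. A minimal sketch, with function and key names of our own choosing; the species code is returned as-is rather than expanded:

```python
PROJECT_TYPES = {"p": "public", "s": "sigenae"}


def parse_contig_name(name):
    """Split a contig name into the oldest EST name and the p.sp.v suffix."""
    est, project, species, version = name.rsplit(".", 3)
    return {
        "oldest_est": est,
        "project": PROJECT_TYPES[project],
        "species": species,
        "assembly_version": int(version),
    }
```

Splitting from the right matters because the EST name itself contains dots, as in gcag0004c.h.06_5.2.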

What does "Cleaning of Sequences" mean?

This expression refers to the process to which sequences are submitted before being accepted for clustering. In this process, the global quality and the length of the sequences are checked. The different possible "contaminations" are then masked.

The process masks vector and adaptor, nucleotides of low Phred quality, other vectors (using the Univec bank of vectors), yeast or E. coli DNA, ribosomal RNA, mitochondrial DNA, low complexity regions (PolyA ...) and repeats (such as mini/micro-satellites, SINE and LINE sequences, ...). The surviving sequences then feed the clustering process.

What does "Low complexity regions" mean?

Low complexity regions are stretches of DNA with low information content; they are non-coding regions. Examples are simple tandem repeats, polypurine and AT-rich regions, and PolyA tails.

Which parameters establish sequence validity?

Two types of validity must be distinguished:

  • in the invoicing sense: a sequence is valid if it has at least 450 bp with a Phred score over 20, excluding vector and adaptor. The 450 bp need not be consecutive. A batch is valid if more than 79% of its sequences are valid.
  • in the SIGENAE sense: according to the SIGENAE criteria, a sequence is valid if it has at least 100 bp with a Phred score over 20, excluding vector and adaptor. The 100 bp need not be consecutive.
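The two validity rules above differ only in the minimum number of good bases. The sketch below assumes each sequence is given as a list of per-base Phred scores plus a parallel list of booleans marking bases masked as vector or adaptor; this representation and all names are hypothetical.

```python
INVOICING_MIN_BASES = 450  # invoicing criterion
SIGENAE_MIN_BASES = 100    # SIGENAE criterion


def count_good_bases(qualities, masked, min_quality=20):
    """Count bases with a Phred score over min_quality outside vector/adaptor."""
    return sum(1 for q, m in zip(qualities, masked) if q > min_quality and not m)


def is_valid_sequence(qualities, masked, min_bases):
    """The good bases need not be consecutive, so a simple count is enough."""
    return count_good_bases(qualities, masked) >= min_bases


def is_valid_batch(sequences):
    """A non-empty batch is valid if more than 79% of its sequences pass the invoicing rule."""
    valid = sum(is_valid_sequence(q, m, INVOICING_MIN_BASES) for q, m in sequences)
    return valid / len(sequences) > 0.79
```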

Sigenae Tools

How to search for data in the databases?

The tool to use depends on the type of data you are looking for:

  • ESTs: use the SURF web interface, available on the Public Data Access page through the Sequences buttons, or the Sequence/Contig Search bar under the Sigenae logo;
  • Contigs: use the Sigenae Contig Browser search module, available at the top right-hand corner of each Species Page, or the Sequence/Contig Search bar under the Sigenae logo;
  • General information: use the web site Search Engine under the login box.


What are the rules for using the Search Engine?

The Search Engine, located under the login box, close to the Sequence/Contig Search bar, allows you to search for information on our website.

  • Only words with 2 or more characters are accepted.
  • A query is limited to 200 characters in total.
  • Spaces separate words; double quotes ("") can be used to search for a whole string (the search is then not indexed).
  • AND, OR and NOT are prefix words that override the default operator.
  • The symbols +, | and - are equivalent to the operators AND, OR and NOT respectively.
  • All search words are converted to lowercase.

How to use the Sequence/Contig Search bar?

The Sequence/Contig Search bar, located under the Sigenae logo, provides a simple and rapid way to find either a sequence or a contig in the different Sigenae databases. You can use the special character '*' if you know only part of the sequence or contig name. Remember that your search depends on your login status: logged-in users can query the specific databases linked to their user group, while users who are not logged in can only query public databases.

How to run a BLAST against a species-specific database?

Use NCBI BLAST.
In the Choose Search Set section, use the "Organism" field to type or select your species. This field is available only for generic sets such as "nr".
If you are looking for BACs, use the gss database.

Where can I find a summary of the clustering and contig assembly results?

A summary of all the clustering and assembly results is available here.