i5K Pilot Project Summary

About the Project

The BCM-HGSC is sequencing 28 arthropod genomes as a pilot project to kickstart the i5K.

Collaboration Status Pages

The i5K is an initiative to sequence the genomes of 5,000 arthropod species. This pilot project builds on our extensive experience sequencing arthropods over the years, including D. melanogaster, D. pseudoobscura, the honeybee, the red flour beetle, the pea aphid, the Hessian fly, the centipede, and many others.

The i5K was first announced in March 2011 in a letter to Science magazine and in press releases, for example from the Entomological Society of America. Its aim is to provide a base reference for understanding the molecular nature of arthropods. It is our hope that this information will be of medical, agricultural, ecological and scientific benefit to the world.

More information about the i5K can be found at the i5K project site, where you can also sign up for various roles and become involved in the larger project's goal of generating the genomes of 5,000 arthropods.

Because of the relatively large number of species, we will use a table format to publicly track our progress, and we will release raw sequence data, assembled sequence data, transcriptome data and annotation data as soon as possible.

Species Selection

To select species, we have worked with the i5K species selection committee, a group of more than 20 entomologists, biologists, systematists and genomics researchers. They have had multiple goals in the selection of species, including medical importance, agricultural importance, filling phylogenetic gaps in genomic coverage, and addressing specific biological problems.

Sequence Generation and Genome Assembly

The large number of species in the i5K pilot requires a low-cost method of sequence generation and a high-quality whole genome assembly method.

Many people have asked us about the best way to perform such assemblies, and about the data, computer hardware and software needed. New sequencing methods, most notably the Illumina HiSeq technology, generate sequence at a cost low enough to make this pilot project possible.

Material for DNA Isolation

For arthropod genome sequencing, we recommend that the input DNA be as non-polymorphic as possible. The ideal is a large haploid individual (for example, a large male hymenopteran), allowing the generation of up to 50 µg of genomic DNA. The next best is an inbred line, with 12-20 generations of sib-sib inbreeding. 12 generations theoretically makes about 90% of the genome homozygous, 20 generations is theoretically close to 99%, and any additional sib-sib inbreeding beyond this does not significantly increase homozygosity in the sample. If inbreeding cannot be performed, the next best is a single large individual, so the assembler will only have to deal with a single diploid sequence. If the individual is so small that multiple individuals are required for sufficient DNA, we recommend that the main library be made from a single individual using a low input DNA protocol, and that DNA isolated from pooled individuals be used for the libraries of larger insert sizes, which require more DNA for gel cuts.
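The inbreeding figures above can be checked with a short calculation. The sketch below is not part of the project's pipeline; it assumes the textbook recurrence for the inbreeding coefficient F under repeated full-sib mating, F_n = (1 + 2 F_(n-1) + F_(n-2)) / 4, starting from unrelated, non-inbred founders. The function name is hypothetical, and F is only an approximation of "fraction of the genome homozygous", since it ignores selection and sites that were already homozygous in the founders.

```python
def sib_mating_inbreeding(generations: int) -> float:
    """Expected inbreeding coefficient F after repeated full-sib mating.

    Assumes the textbook recurrence F_n = (1 + 2*F_(n-1) + F_(n-2)) / 4,
    starting from unrelated, non-inbred founders (F = 0).
    """
    if generations < 0:
        raise ValueError("generations must be non-negative")
    # F two generations back and F one generation back, both 0 for the founders.
    f_prev2, f_prev1 = 0.0, 0.0
    for _ in range(generations):
        f_prev2, f_prev1 = f_prev1, (1.0 + 2.0 * f_prev1 + f_prev2) / 4.0
    return f_prev1


if __name__ == "__main__":
    for n in (1, 5, 12, 20, 30):
        print(f"{n:2d} generations: F = {sib_mating_inbreeding(n):.3f}")
```

Under these assumptions F passes 0.9 by generation 12 and approaches 0.99 around generation 20, with each further generation removing only about a fifth of the remaining heterozygosity.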

DNA Isolation

We currently recommend the Qiagen DNeasy Blood and Tissue Kit. Use the Animal Tissue (Spin-Column) extraction protocol, making sure to complete the RNase step; otherwise there will be RNA contamination. This has worked well for DNA isolation from single Nasonia, but we still need to formally collect experiences with other DNA isolation protocols.

Sequence Generation for Assembly

For this project we are generating fairly high coverage from libraries of several different insert sizes. The assembly strategy is based on a seed assembly produced with the Broad Allpaths assembler, followed by improvement of that seed assembly using our homegrown tools, Atlas-Link and Atlas-GapFill, which can significantly improve the results.

Thus we generate sequence data to enable the Allpaths assembly. As of Nov 2011 this is:

- 40X genome coverage from a 180bp insert library (100bp forward and reverse reads)
- 40X genome coverage from a 3kb insert library

To enable better scaffolding and local gap filling, we additionally generate 500bp and 8kb insert libraries at > 20X coverage.
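To get a feel for what these coverage targets mean in practice, the sketch below converts a coverage target into a number of read pairs using coverage = (reads x read length) / genome size. It is an illustration only: the function name, the 300 Mb genome size, and the use of 100bp reads for every library are assumptions (the text above gives the read length only for the 180bp library), and 20X is used as the lower bound for the libraries listed as > 20X.

```python
import math


def read_pairs_for_coverage(genome_size_bp: int, coverage: float,
                            read_length_bp: int = 100) -> int:
    """Read pairs needed so that total sequenced bases = coverage * genome size."""
    if genome_size_bp <= 0 or read_length_bp <= 0 or coverage < 0:
        raise ValueError("genome size and read length must be positive, coverage non-negative")
    bases_needed = genome_size_bp * coverage
    bases_per_pair = 2 * read_length_bp  # one forward and one reverse read
    return math.ceil(bases_needed / bases_per_pair)


if __name__ == "__main__":
    genome_size_bp = 300_000_000  # hypothetical 300 Mb arthropod genome
    # Coverage targets from the text; 20 is the stated minimum for 500bp and 8kb.
    library_coverage = {"180bp": 40, "3kb": 40, "500bp": 20, "8kb": 20}
    for insert_size, cov in library_coverage.items():
        pairs = read_pairs_for_coverage(genome_size_bp, cov)
        print(f"{insert_size:>5} insert library, {cov}X: {pairs:,} read pairs")
```

The same arithmetic scales linearly with genome size, so a species with a genome twice as large needs twice as many read pairs per library for the same coverage.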

Genome Annotation

In addition to genome sequencing, we are also performing a modest amount of RNA-seq to generate data for automated annotation. For each species we will generate RNA-seq data for 3 samples: usually whole adult males, whole adult females, and a mix of other life stages. These data will be used together with protein homology data for MAKER automated annotation of the new genomes.

Additional Analysis and Annotation by the i5K Analysis Groups

Each of the sequenced species will have a community analysis and publication group, led by the researchers who provided the DNA, to enable full analysis of each genome. These groups will also have help from the i5K's multiple working groups.

Access to the Data

All data will be downloadable from the individual species pages or from the BCM-HGSC FTP site as soon as it is generated.

Genomic Resources