The Personal Analytical Pipeline Approach to Data Mining and Phylogenetics
David C. Plachetzki
The growth of biochemical sequence data has often outstripped our ability to analyze very large datasets effectively without informatic approaches. Even assembling a comprehensive single-locus dataset for phylogenetic analysis can be too large a task to accomplish manually, given the amount of data that could potentially be included. For these reasons, automated data assembly “pipelines” have become a popular solution. The pipeline approach can prove indispensable even for smaller datasets, because the analytical structure of a pipeline can be adapted to serve a variety of questions and can produce results in a matter of seconds. Data assembly and analysis pipelines are often constructed by linking together existing applications with a scripting language such as Perl, using toolkits such as BioPerl, and they take advantage of some useful properties of the Linux operating system. Yet, irrespective of the tractability of such methods, computational approaches remain elusive to many biologists.
The purpose of this workshop is to introduce interested students, postdocs and faculty to some of the more useful and powerful aspects of these computational approaches, using the construction of a portable, personalized analytical pipeline as an example. Importantly, no previous experience with BioPerl or Linux will be required of workshop participants. We will begin by focusing on the creation of local searchable databases and on gaining access to external curated databases. Next, some useful BioPerl scripts will be discussed that allow a range of automated manipulations of mined data, including formatting changes, annotation and extraction of sequences, and the use of user-defined similarity thresholds to remove or retain sequences. The algorithms used for alignment and phylogenetic analysis will vary with the given application. I will describe how alternative alignment and phylogenetic methods can be passed in and out of the data assembly and analysis pipeline with little effort, and how alternative procedures can be compared in an automated fashion. Finally, I will explore applications and other benefits of maintaining a customizable analytical pipeline and show how a range of phylogenetic questions can be addressed with greater ease by taking advantage of certain elements of the pipeline once it is in place. This workshop should be of interest to anyone working with phylogenetic or computational approaches involving biochemical sequence data.
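As a rough illustration of the kind of step described above (retaining or removing sequences by a user-defined similarity threshold, followed by a formatting change), the following is a minimal sketch only. It is written in plain Python rather than BioPerl, and it is not the workshop's own script. All function names are hypothetical, and it assumes that the sequences are already aligned to equal length and that similarity is measured as percent identity to a single chosen reference sequence.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Read FASTA-formatted lines into an ordered {identifier: sequence} mapping."""
    records: Dict[str, str] = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-delimited token of the header as the identifier.
            name = line[1:].split()[0]
            records[name] = ""
        elif name is not None:
            records[name] += line
    return records


def percent_identity(a: str, b: str) -> float:
    """Percent identity over aligned columns where neither sequence has a gap."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    matches = 0
    for x, y in zip(a.upper(), b.upper()):
        if x == "-" or y == "-":
            continue
        compared += 1
        if x == y:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def filter_by_identity(
    records: Dict[str, str], reference_id: str, min_identity: float
) -> Dict[str, str]:
    """Keep the reference and every sequence at least min_identity percent identical to it."""
    reference = records[reference_id]
    return {
        name: seq
        for name, seq in records.items()
        if name == reference_id or percent_identity(seq, reference) >= min_identity
    }


def to_fasta(records: Dict[str, str], width: int = 60) -> str:
    """Write records back out as FASTA text, wrapping sequences at a fixed width."""
    lines = []
    for name, seq in records.items():
        lines.append(">" + name)
        for i in range(0, len(seq), width):
            lines.append(seq[i:i + width])
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Toy, made-up alignment used only to exercise the functions.
    example = [">ref", "ACGTACGT", ">close", "ACGTACGA", ">far", "TTTTACGA"]
    kept = filter_by_identity(parse_fasta(example), "ref", min_identity=80.0)
    print(to_fasta(kept), end="")
```

In a pipeline of the sort described, a step like this would typically sit between the database search and the alignment or tree-building stage, with the threshold exposed as a user-set parameter so that the same script can serve different questions.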