CIS 890: Machine Learning for Bioinformatics (3 credits)

 

Primary Instructor: Dr. Doina Caragea (dcaragea@ksu.edu)

Guest Instructor: Dr. Susan J. Brown (sjbrown@ksu.edu)

Meeting Time: T U 9:30 - 10:45pm

 

Prerequisites: Basic knowledge of probability and statistics, data structures, and algorithms. Background in biology or machine learning is NOT required.

 

Target audience: Graduate students from Computer Science and related areas, as well as students from Biology and related areas.

 

Why bioinformatics?

“In barely half a century computer science has grown from infancy to maturity. Employment in Computer Science was assured until a few years ago. Today, however, like in the 1960s, when demand for physicists waned, computer scientists are eager to explore new possibilities in promising fields. Biology and its related disciplines like bioinformatics are at the top of the list.” (Jacques Cohen, Computer Science and Bioinformatics)

 

Course Objectives:

·  Understand biological concepts and bioinformatics problems

·  Understand machine learning algorithms for bioinformatics

·  Learn how to apply machine learning algorithms to challenging bioinformatics problems

·  Learn how to compare different machine learning algorithms and how to select the best method for a particular problem

·  Learn how to use existing bioinformatics tools and how to design and develop new tools

 

Course Overview

Advances in high-throughput experiments and sequencing technologies have produced large amounts of data in the biological sciences. This has created unprecedented opportunities for large-scale knowledge discovery in a number of areas, including the characterization of macromolecular sequence-structure-function relationships and the discovery of complex genetic regulatory networks.

Machine learning algorithms offer some of the most cost-effective approaches to automated knowledge discovery in emerging data-rich disciplines. In this course, some of the most important machine learning algorithms and their applications to bioinformatics are discussed. No background in biology or machine learning is assumed.

Dr. Susan J. Brown, from Biology, will introduce the background for the biological problems discussed in this course. Dr. Doina Caragea will describe the machine learning algorithms that can be used to address these problems. The application of the algorithms to the biological problems will be discussed using recent bioinformatics research papers. Students are expected to lead the discussion of some of these papers.

 

Topics

The planned bioinformatics topics cover applications in sequence analysis, structural genomics, functional genomics, and proteomics. Specific examples include gene finding, transcription factor binding site prediction, prediction of protein function, localization, and structure, protein-protein interaction prediction, microarray data analysis, gene selection, gene interaction networks, and biological text mining.

Machine learning topics include probabilistic and graphical models (in particular, Bayesian networks and Hidden Markov Models, HMMs), network models, unsupervised learning and clustering, supervised learning and classification, kernel methods and Support Vector Machines (SVMs), and neural networks.

Weka, an open-source machine learning toolkit written in Java, will be used for lab assignments, which will provide hands-on experience with the methods discussed and their application to bioinformatics problems.

 

Recommended Textbooks:

Bioinformatics: The Machine Learning Approach (Second Edition). Pierre Baldi and Søren Brunak, MIT Press, 2001.

http://www.cbs.dtu.dk/mitbook/mitbook2nd.php

Kernel Methods in Computational Biology. Bernhard Schölkopf, Koji Tsuda, and Jean-Philippe Vert, editors, MIT Press, Cambridge, MA, 2004.

http://mitpress.mit.edu/catalog/item/default.asp?tid=10338&ttype=2

Part of the course will draw on material from other books and recent research papers.

 

Course work and evaluation:

There will be several individual assignments, several reading assignments and paper presentations, and a project (individual or group). There are no exams. Students will be required to write summaries of the assigned readings and to present some of the application papers, in particular papers related to their project topic. Individual assignments and the project will be evaluated on design and analysis, writing, and critical thinking. They will involve the use of existing tools and programming in the Weka environment. Students from Biology are expected to form teams with students from Computer Science when programming is required. Projects will be evaluated by fellow students and by the instructors, based on a final project presentation and a project report. Students are encouraged to attend every lecture and to participate in class discussion.

 

Grading:

Assignments: 30% (individual)

Presentations: 20% (individual or team)

Project: 40% (individual or team)

Participation: 10%