Indiana University Bloomington

Statistics S675
Statistical Learning And High-dimensional Data Analysis

Contact: Michael Trosset October 19, 2012
Offered: Fall, 2015
Class Time: 1:25-2:15 WH 106
Class Days: M, W, F
Sequence: S675 is not yet part of a sequence. The Department of Statistics and the School of Informatics & Computing hope to coordinate and organize their various courses on machine learning into a 4-course sequence.
Pre-Requisites: A course in linear algebra is essential. S675 makes extensive use of matrix notation and several matrix factorizations. Some familiarity with vector calculus is also assumed. STAT-S 710 provides more than sufficient background.
Algebra Required: Used extensively throughout the course, including proofs and homework assignments.
Calculus Required: Used primarily for concepts and derivations.
Contact Person for Authorization: Permission of the instructor.
Instructor: Michael Trosset
Recommended follow-up classes: The Department of Statistics sometimes offers more advanced courses on related topics, e.g., machine learning and model selection.
Syllabus: No syllabus available.
Keywords: machine learning, multivariate structure, dimension reduction, cluster analysis, classification.
Description: S675 explores a variety of methods for detecting structure in multivariate data sets. Major topics include dimension reduction (principal component analysis, multidimensional scaling, manifold learning), unsupervised learning (k-means clustering, spectral clustering), and supervised learning (linear discriminant analysis, support vector machines, nearest neighbor classification).
Books: Lecture notes and journal articles provided by the instructor. See the S675 web page.
Substantive Orientation: Any discipline that is concerned with high-dimensional data. Such data can arise in various ways, often as multiple measurements on each of several objects/subjects, as in text mining or microarray experiments, but also as measurements of pairwise proximi
Statistical Orientation: Most of the methods studied in S675 do not assume probability models.
Applied/Theoretical: Intermediate. S675 is closely related to, but somewhat more theoretical than, SOIC-B 565 (Data Mining) and SOIC-B 555 (Machine Learning).
Software: R
How Software is Used: Students write programs that implement the methods studied in S675. They use these programs and/or programs written by others to analyze data.
Problem Sets: Weekly.
Data Analysis: Yes, but primary emphasis is on understanding how the methods work. Small, synthetic data sets often serve this objective better than large, real data sets.
Presentation: Each student writes a paper on a topic related to the topics discussed in class.
Exams: No.