This course equips students with the theory and practical knowledge needed to start analyzing the complex data generated in modern life. Students will learn how to perform exploratory data analysis, create effective graphs that help them understand the data and communicate the extracted information to others, and build sophisticated predictive models. Along the way they will become familiar with R, a powerful software tool for statistical analysis. The objectives below are grouped by category.
Knowledge and understanding.
Know the main kinds of problems that can be tackled with DA and SL techniques.
Know the main DA and SL techniques.
Know the design, development, and assessment phases of an SL procedure.
Applying knowledge and understanding.
Formulate a formal problem statement for simple practical problems in order to tackle them with DA and SL techniques.
Develop simple end-to-end DA and SL procedures.
Experimentally assess a simple end-to-end DA and SL procedure.
Judge the technical soundness of a DA and SL procedure.
Judge the technical soundness of the assessment of a DA and SL procedure.
Describe the motivations behind the choices made in the design, development, and assessment of a DA and SL procedure, possibly with the aid of simple plots.
Retrieve information from scientific publications about DA and SL techniques not explicitly presented in this course.
Basics of statistics: basic graphical tools of data exploration; summary measures of variable distribution (mean, variance, quantiles); fundamentals of probability and of univariate and multivariate distribution of random variables; basics of linear regression analysis.
Basics of linear algebra: vectors, matrices, matrix operations; diagonalization and singular value decomposition.
Basics of programming and data structures: algorithms, data types, loops, recursion, parallel execution, trees.
Introduction to data science; data analytics, machine learning, and statistical learning approaches: common and distinctive aspects (increasingly different in name only).
Recap of the main concepts and tools of probability and statistical inference (pre-course).
Elements of data exploration and visualization with R.
Elements of statistical learning; regression function; assessing model accuracy and the bias-variance trade-off; cross-validation methods.
Supervised learning and linear models; model validation and selection; an outline of regularization and extensions.
Supervised learning for classification.
Training and test error rate; the Bayes classifier.
Linear and quadratic discriminant analysis.
The K-nearest neighbors classifier.
Dimensionality reduction methods: principal component analysis; biplot.
Cluster analysis: hierarchical methods, partitional methods (k-means algorithm).
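As an illustration of the classification topics above (the K-nearest neighbors classifier, training and test error rate), the following is a minimal sketch of an end-to-end procedure. It is written in Python rather than R, the tool used in the course, and relies only on the standard library. The synthetic two-class data set, the 150/50 train/test split, the values of K, and all function names (`make_data`, `knn_predict`, `error_rate`) are assumptions made for this example and are not part of the course material.

```python
import math
import random
from collections import Counter


def make_data(n_per_class, rng):
    """Synthetic data (an assumption): two Gaussian classes in the plane."""
    data = []
    for label, centre in ((0, (0.0, 0.0)), (1, (2.0, 2.0))):
        for _ in range(n_per_class):
            x = (rng.gauss(centre[0], 1.0), rng.gauss(centre[1], 1.0))
            data.append((x, label))
    return data


def knn_predict(train, x, k):
    """Majority vote among the k training points closest to x (Euclidean)."""
    neighbours = sorted(train, key=lambda pair: math.dist(pair[0], x))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


def error_rate(train, points, k):
    """Fraction of points whose predicted label differs from the true one."""
    wrong = sum(knn_predict(train, x, k) != y for x, y in points)
    return wrong / len(points)


rng = random.Random(0)  # fixed seed so that the run is repeatable
data = make_data(100, rng)
rng.shuffle(data)
train, test = data[:150], data[150:]  # hold out 50 points for testing

# Odd values of k avoid tied votes between the two classes.
for k in (1, 5, 15):
    print(k, error_rate(train, train, k), error_rate(train, test, k))
```

With K = 1 every training point is its own nearest neighbor, so the training error rate is zero, while the test error rate generally is not. This gap is why a procedure is assessed on data that was not used to fit it.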
Lectures with blackboard and slide projection; supervised exercises in solving simple problems with DA and SL techniques.
Final exam according to one of the following two options (student’s choice):
Written test + project (the final mark is the average of the two marks).
Written test with questions on theory and application with short open answers.
Project (home assignment) in which the student chooses a problem from a closed, teacher-defined set of problems and proposes a solution based on DA and SL techniques. The expected outcome is a written document (a few pages) including: the problem statement; a description of the proposed solution; the results and a discussion of the experimental assessment of the solution with, if applicable, information about the data used. Students may form groups for the project. Clarity is one of the criteria used to evaluate the project.
Written test only.
Written test with questions on theory and application with medium- and short-length open answers.
Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Series in Statistics. Springer, 2009.
Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani. An Introduction to Statistical Learning: with Applications in R. Springer Texts in Statistics. Springer, 2014.