
DATA ANALYTICS AND STATISTICAL LEARNING (469SM)

A.Y. 2020 / 2021

Period 
Second semester
Credits 
6
Duration/Length 
48
Type of Learning Activity 
Related/additional subjects
Study Path 
[PDS0-2018 - Ord. 2018] common
Shared with (mutuazione)
MUTP: SM35 - 450SM - STATISTICAL LEARNING FOR DATA SCIENCE
Syllabus 
Teaching language 

English

Learning objectives 

This course equips students with the theory and practical knowledge needed to start analyzing the complex data generated in modern life. Students will learn how to perform exploratory data analysis, create effective graphs that help them understand the data, build sophisticated predictive models, and communicate the extracted information to others. Along the way, they will become familiar with R, a powerful software tool for statistical analysis. The objectives are broken down below.

Knowledge and understanding: know the main kinds of problems that can be tackled with statistical learning (SL) techniques; know the main SL techniques; know the design, development, and assessment phases of an SL procedure.

Applying knowledge and understanding: formulate a formal problem statement for simple practical problems in order to tackle them with SL techniques; develop simple end-to-end SL procedures; experimentally assess a simple end-to-end SL procedure.

Making judgements: judge the technical soundness of an SL procedure and of its assessment.

Communication skills: describe the motivations behind choices in the design, development, and assessment of an SL procedure, possibly with the help of simple plots; communicate the results to experts and non-experts.

Learning skills: retrieve information about SL techniques from scientific publications (including techniques not explicitly presented in the course), possibly combining them to solve complex problems.

Prerequisites 

Statistical methods and models as from the course Statistical Methods for Data Science.
Basics of linear algebra: vectors, matrices, matrix operations; diagonalization and singular value decomposition.
Basics of R programming.

Contents 

Introduction to statistical learning (SL) approach to data science.
Elements of data cleaning, exploration and visualization with R.
Elements of statistical learning; regression function; assessing model accuracy and the bias-variance trade-off; cross-validation methods.
Supervised learning for regression; extensions to the linear model by fitting procedure (subset selection, shrinkage and dimension reduction methods) and by advanced specifications (mixed effects and nonparametric regression models).
Supervised learning for classification; the Bayes classifier; logistic regression; linear and quadratic discriminant analysis; the K-nearest neighbors classifier.
Unsupervised learning; dimensionality reduction methods: principal component analysis and biplot; cluster analysis: hierarchical, partitional and density-based methods.
Textual data analysis: SL techniques for content mapping and text classification.
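
As an illustration of two topics listed above (cross-validation and the K-nearest neighbors classifier), here is a minimal sketch of using 5-fold cross-validation to compare values of K. It is not course material: the course works in R, while this sketch uses dependency-free Python only for compactness, and the toy data and the function names (`knn_predict`, `k_fold_accuracy`, `make_toy_data`) are assumptions made for the example.

```python
import random
from collections import Counter


def knn_predict(train, query, k):
    """Majority vote among the k training points closest to `query`."""
    nearest = sorted(
        train,
        key=lambda p: (p[0][0] - query[0]) ** 2 + (p[0][1] - query[1]) ** 2,
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def k_fold_accuracy(data, k, n_folds=5):
    """Average held-out accuracy of k-NN across n_folds folds."""
    # Interleaved folds; the data are assumed to be shuffled already.
    folds = [data[i::n_folds] for i in range(n_folds)]
    scores = []
    for i, held_out in enumerate(folds):
        train = [p for j, fold in enumerate(folds) if j != i for p in fold]
        correct = sum(knn_predict(train, x, k) == y for x, y in held_out)
        scores.append(correct / len(held_out))
    return sum(scores) / n_folds


def make_toy_data(n_per_class=50, seed=0):
    """Hypothetical data: two Gaussian clouds in the plane, one per class."""
    rng = random.Random(seed)
    data = []
    for label, center in ((0, (0.0, 0.0)), (1, (2.0, 2.0))):
        for _ in range(n_per_class):
            x = (rng.gauss(center[0], 1.0), rng.gauss(center[1], 1.0))
            data.append((x, label))
    rng.shuffle(data)
    return data


data = make_toy_data()
# Odd values of K avoid tied votes in a two-class problem.
cv_scores = {k: k_fold_accuracy(data, k) for k in (1, 5, 15)}
best_k = max(cv_scores, key=cv_scores.get)
print(cv_scores, best_k)
```

Small K gives a flexible, high-variance classifier and large K a smoother, more biased one; the cross-validated accuracy is one way to choose between them without touching a separate test set.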

Teaching format 

Lectures with blackboard and slides; exercises, under teacher supervision, on tackling simple problems with SL techniques.

End-of-course test 

Final exam according to one of the following two options (student’s choice):
- Written test + project (the final mark is the average of the two marks).
- Written test only with questions on theory and applications with medium- and short-length open answers.
Project (home assignment): the student chooses a problem from a closed, teacher-defined set and proposes a solution based on SL techniques. The expected outcome is a written document (a few pages) including: the problem statement; a description of the proposed solution; the results and a discussion of the experimental assessment of the solution, with information about the data used, if applicable. Students may form groups for the project. The project will also be evaluated on clarity.

Other information 

Bring your own laptop.

Texts/Books 

Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani. An Introduction to Statistical Learning, with Applications in R. Springer, Berlin: Springer Series in Statistics, 2014.
Jerome Friedman, Trevor Hastie, and Robert Tibshirani. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, Berlin: Springer Series in Statistics, 2009.

