Data Analytics in Environmental Health and Engineering
3.0 credits
Data analytics is a field of study that draws on computational statistics, data mining and machine learning to explore data sets, explain phenomena and build predictive models. The course begins with an overview of some traditional analysis approaches, including ordinary least squares regression and related topics, notably diagnostic testing, detection of outliers and methods to impute missing data. More recent developments are then presented, including ridge regression. Generalized linear models follow, emphasizing logistic regression and including models for polytomous data. Variable subsetting is addressed through stepwise procedures and the LASSO. Supervised machine learning topics include the basic concepts of boosting and bagging and several techniques: Decision Trees, Classification and Regression Trees, Random Forests, Conditional Random Forests, Adaptive Boosting, Support Vector Machines and Neural Networks. Unsupervised machine learning approaches are addressed through applications using k-means Clustering, Partitioning Around Medoids and Association Rule Mining. Methods for assessing model predictive performance are introduced, including Confusion Matrices, k-fold Cross-Validation and Receiver Operating Characteristic Curves. Public health and environmental applications are emphasized, with modeling techniques and analysis tools implemented in R. EN.570.616 meets with EN.570.416. Undergraduate students (usually seniors) should register for EN.570.416, with permission of the instructor only.