Core Courses
Programming in R and Python
The Georgetown Data Science & Analytics program offers an online programming preparation course covering R, Python, and command-line use in the summer prior to matriculation. The course is equivalent to three credits, is designed for matriculating MS Data Science & Analytics students, and is offered free of charge. It runs during the Georgetown Summer Session (May through August). The course is required for all incoming students, who must complete it to matriculate in the fall.
DSAN-5000: Data Science and Analytics
This course introduces students to several core data science concepts. It teaches students how to synthesize disparate, possibly unstructured data to better understand and characterize the world and, in some cases, to draw meaningful inferences. Topics covered include the history of data science, successes and failures in data analytics, the data analytics life cycle, web scraping and APIs, data wrangling, data characterization (correlations, identifying clusters and associations), data inference and basic machine learning, network analysis, data ethics, and visual analytics. Students will work on a semester-long data science project that starts with question formulation and data collection and goes through all the stages of the life cycle, culminating in data storytelling. The course also maps data science case studies to topics presented throughout the semester. Prerequisites: Intermediate coding experience in Python 3 and knowledge of introductory statistics. 3 credits.
DSAN-5100: Probabilistic Modeling and Statistical Computing
Probabilistic models are essential for understanding data that are affected by uncertainty. This course introduces students to the fundamentals of probabilistic modeling and then covers computational techniques for the analysis of such data. After introducing basic concepts and approaches such as probability distributions, random variables, and conditioning, the course covers probability distributions that are frequently used in practice, along with related results such as the laws of large numbers. In the second half, students will learn about computational techniques for the use of probabilistic models. These include methods for faithful simulation of random variables (Monte Carlo), the extraction of condensed models from observed data (maximum likelihood, Bayesian models), methods for models with hidden or partially observed variables (latent variables, expectation-maximization, hidden Markov models), and some general data science techniques that incorporate probabilistic models (graphical models, stochastic optimization). Prerequisites: Introductory statistics and some coding experience (e.g., R). 3 credits.
DSAN-5200: Advanced Data Visualization
Presenting quantitative information in visual form is an essential communication skill for data professionals. This course introduces representation methods and visualization techniques for complex data, drawing on insights from cognitive science and graphic design. Students will obtain an overview of the human visual system, learn to use models for data and for images, and acquire good design practices, such as those based on the “grammar of graphics.” Students will use common tools for statistical graphics, including graphic methods in Python 3, interactive tools such as Bokeh, Leaflet, and NetworkD3, the R package ggplot2, and Tableau. Prerequisites: DSAN-5000. 3 credits.
DSAN-5300: Statistical Learning
Statistical learning is concerned with algorithms that use statistical techniques to find structure or patterns in given data (unsupervised learning) or use given instances of data to predict outcomes in new cases (supervised learning). A well-known method of this type is linear regression, which will be covered early in the course. Statistical methods for making discrete predictions (classification), such as logistic regression, will also be covered. Special emphasis will be placed on techniques for handling high-dimensional data (i.e., instances with many attributes), including variable selection and dimension reduction. The course will also cover ensemble methods such as bagging and boosting, which are often used to improve the results of a given classification method. Unsupervised methods covered in this course include model-based and hierarchical clustering. Prerequisites: DSAN-5100. 3 credits.
DSAN-6000: Big Data and Cloud Computing
Today’s data scientists commonly face huge data sets (Big Data) that may arrive at very high rates and in a broad variety of formats. This core course addresses the resulting challenges. The course will introduce students to the advantages and limitations of distributed computing and to methods of assessing its impact. Techniques for parallel processing (MapReduce) and their implementation (Hadoop) will be covered, as well as techniques for accessing unstructured data and for handling streaming data. These techniques will be applied to real-world examples, using clusters of computational cores and cloud computing. Prerequisites: Working knowledge of Python and the Unix command line, some knowledge of data structures, and DSAN-5000. 3 credits.