Exploratory Data Analysis with Python (10 cr)
Code: TT00EU30-3003
General information
Timing
01.01.2022 - 31.12.2022
Number of ECTS credits allocated
10 ECTS
Virtual portion
6 ECTS
Mode of delivery
40 % Contact teaching, 60 % Distance learning
Unit
ICT and Industrial Management
Campus
Karaportti 2
Teaching languages
- English
Seats
0 - 1000
Degree programmes
- Degree Programme in Information and Communication Technology
Teachers
- Virve Prami
Groups
- ATX22TVNonStop virtual studies, year 2022
Objective
Exploratory Data Analysis (EDA) is a set of techniques for extracting insights and meaningful information from data. Its main aim is to investigate datasets and reveal their underlying structures, challenges, and opportunities without attempting to apply any machine learning model. This course introduces the practical knowledge and main pillars of EDA: data exploration, data preparation, data visualization, data relationships, and data clustering, all using the Python programming language. Beyond the underlying intuitions, the student will learn how the EDA steps are carried out with Python libraries such as NumPy, Pandas, and Matplotlib. After passing this course, the student will be prepared for further work in data analysis and for data-related positions in industry.
This course is 100% virtual and relies on figures and content prepared specifically for it.
Content
1. Introduction:
Introduction to Data Science – Data Science Workflow – Data – Sources of Data – What is Exploratory Data Analysis? – Python Libraries for EDA
2. Describing Data:
Introduction – Observations and Variables – Categorical Variables – Continuous Variables – Central Tendency – Data Variability – Data Distributions
3. Importing Data:
Introduction – Vector and Matrix – NumPy Arrays – Working with NumPy Arrays – Loading Data with NumPy – Pandas Series – Working with Series – Pandas DataFrame – Working with DataFrame – Loading Data with Pandas
4. Data Exploration:
Extracting Descriptive Statistics – Extracting Descriptive Statistics: Preliminaries – Extracting Descriptive Statistics: Implementation – Mathematical Operations on DataFrame – Applying Functions to DataFrame – Querying a DataFrame – Filtering Data – Groupby – Identifying Unique and Missing Values – Cross Tabulation
5. Data Visualization:
Univariate Analysis – Histogram – Frequency Polygons – Boxplot – Bar Chart – Pie Chart – Multivariate Analysis – Plot – Subplot – Scatter Plot – Bubble Chart
6. Data Preparation:
Introduction – Incorrect Values and Categories – Feature Engineering: Creating New Features – Outlier Detection: Univariate – Outlier Detection: Multivariate – Removing Missing Values – Imputing Missing Values: Constant Imputation – Imputing Missing Values: K-NN Imputation – Feature Encoding: Label Encoding – Feature Encoding: One-Hot Encoding – Feature Scaling: Normalization – Feature Scaling: Standardization
7. Data Relationships:
Introduction – Covariance Matrix – Heatmap of Covariance Matrix – Correlation – Non-linear Relationship – Hypothesis Testing
8. Identifying and Understanding Groups:
Introduction – Clustering – Association Rules – Hierarchical Clustering – K-Means Clustering
9. Next Steps:
What’s More? – EDA for Text Data – Model Development and Evaluation
10. Final Tasks:
Self-study Essay – Project
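As a rough illustration of how a few of the listed topics fit together, the following minimal sketch walks through parts of topics 4, 6, and 7 with Pandas and NumPy. It is not taken from the course materials: the dataset, its column names, and the choice of the median as the imputation constant are all assumptions made for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataset; column names and values are invented for illustration.
df = pd.DataFrame({
    "city": ["Espoo", "Helsinki", "Espoo", "Vantaa", "Helsinki", "Vantaa"],
    "rooms": [2, 3, 1, 4, 2, 3],
    "area_m2": [48.0, 75.0, np.nan, 110.0, 52.0, 80.0],
    "rent_eur": [950.0, 1500.0, 700.0, 1900.0, 1200.0, 1400.0],
})

# Topic 4 (Data Exploration): descriptive statistics, missing values, groupby.
summary = df.describe()
missing_per_column = df.isna().sum()
mean_rent_by_city = df.groupby("city")["rent_eur"].mean()

# Topic 6 (Data Preparation): fill missing values with one constant
# (the column median, an arbitrary choice for this sketch).
prepared = df.copy()
prepared["area_m2"] = prepared["area_m2"].fillna(prepared["area_m2"].median())

# One-hot encoding of the categorical column.
encoded = pd.get_dummies(prepared, columns=["city"])

# Standardization: rescale each numeric column to zero mean and unit variance.
numeric_cols = ["rooms", "area_m2", "rent_eur"]
numeric = prepared[numeric_cols]
standardized = (numeric - numeric.mean()) / numeric.std(ddof=0)

# Topic 7 (Data Relationships): Pearson correlation matrix of the numeric columns.
correlation = numeric.corr()

print(missing_per_column)
print(mean_rent_by_city)
print(correlation.round(2))
```

The same steps could be done with other tools (for example, K-NN imputation or library-provided scalers); plain Pandas arithmetic is used here only to keep the sketch self-contained.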
Location and time
Online TechClass portal.
Materials
Lecture slides, quizzes, exercises
Teaching methods
- Exercise
- Quiz
- Project
- Self-study
Employer connections
N/A
Exam schedules
N/A
International connections
N/A
Completion alternatives
N/A
Student workload
Lectures = 70h
Exercises = 20h
Self-study = 40h
Quizzes = 15h
Project = 45h
Total = 190 hours
Evaluation scale
Approved/Failed
Assessment criteria, approved/failed
The student will pass this course after submitting the required quizzes, assignments, and the final project.
Assessment methods and criteria
Exercise 20%
Quiz 30%
Project 30%
Essay 20%
Qualifications
Introduction to Python for Data Science