Study Guide, Metropolia

Exploratory Data Analysis with Python (10 cr)

Code: TT00EU30-3003

General information

Timing

01.01.2022 - 31.12.2022

Number of ECTS credits allocated

10 ECTS

Virtual portion

6 ECTS

Mode of delivery

40 % Contact teaching, 60 % Distance learning

Unit

ICT and Industrial Management

Campus

Karaportti 2

Teaching languages

English

Seats

0 - 1000

Degree programmes

Degree Programme in Information and Communication Technology

Teachers

Virve Prami

Groups

ATX22TV
NonStop virtual studies, year 2022

Objective

Exploratory Data Analysis (EDA) is a set of techniques for extracting insights and meaningful information from data. Its main aim is to investigate datasets and reveal their underlying structure, challenges, and opportunities before any machine learning model is applied. This course introduces the practical knowledge and main pillars of EDA: data exploration, data preparation, data visualization, data relationships, and data clustering, all using the Python programming language. Beyond the underlying intuition, the student will learn how each EDA step is carried out with Python libraries such as NumPy, Pandas, and Matplotlib. After passing this course, the student will be prepared for further study of data analysis and for data-related positions in industry.

This course is 100% virtual, built on figures and content prepared specifically for it.
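As a rough illustration of the kind of EDA steps named in the objective, the following minimal sketch uses Pandas and NumPy to explore, prepare, and group a small table. The dataset, column names, and values are hypothetical and invented for this example only; they are not taken from the course materials.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, invented for illustration only.
df = pd.DataFrame({
    "city": ["Espoo", "Helsinki", "Espoo", "Vantaa"],
    "temperature": [3.0, 4.0, np.nan, 2.0],
})

# Data exploration: count the missing values in each column.
missing = df.isna().sum()

# Data preparation: impute missing temperatures with the column mean.
mean_temp = df["temperature"].mean()  # NaN values are skipped by default
df["temperature"] = df["temperature"].fillna(mean_temp)

# Grouping: mean temperature per city after imputation.
per_city = df.groupby("city")["temperature"].mean()

print(missing)
print(per_city)
```

Mean imputation is only one possible choice here; whether it suits a given dataset depends on how the values came to be missing.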

Content

1. Introduction:
Introduction to Data Science – Data Science Workflow – Data – Sources of Data – What is Exploratory Data Analysis? – Python Libraries for EDA

2. Describing Data:
Introduction – Observations and Variables – Categorical Variables – Continuous Variables – Central Tendency – Data Variability – Data Distributions

3. Importing Data:
Introduction – Vector and Matrix – NumPy Arrays – Working with NumPy Arrays – Loading Data with NumPy – Pandas Series – Working with Series – Pandas DataFrame – Working with DataFrame – Loading Data with Pandas

4. Data Exploration:
Extracting Descriptive Statistics – Extracting Descriptive Statistics: Preliminaries – Extracting Descriptive Statistics: Implementation – Mathematical Operations on DataFrame – Applying Functions to DataFrame – Querying a DataFrame – Filtering Data – Groupby – Identifying Unique and Missing Values – Cross Tabulation

5. Data Visualization:
Univariate Analysis – Histogram – Frequency Polygons – Boxplot – Bar Chart – Pie Chart – Multivariate Analysis – Plot – Subplot – Scatter Plot – Bubble Chart

6. Data Preparation:
Introduction – Incorrect Values and Categories – Feature Engineering: Creating New Features – Outlier Detection: Univariate – Outlier Detection: Multivariate – Removing Missing Values – Imputing Missing Values: Constant Imputation – Imputing Missing Values: K-NN Imputation – Feature Encoding: Label Encoding – Feature Encoding: One-Hot Encoding – Feature Scaling: Normalization – Feature Scaling: Standardization

7. Data Relationships:
Introduction – Covariance Matrix – Heatmap of Covariance Matrix – Correlation – Non-linear Relationship – Hypothesis Testing

8. Identifying and Understanding Groups:
Introduction – Clustering – Association Rules – Hierarchical Clustering – K-Means Clustering

9. Next Steps:
What’s More? – EDA for Text Data – Model Development and Evaluation

10. Final Tasks:
Self-study Essay – Project
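Several of the topics listed above (descriptive statistics, cross tabulation, one-hot encoding, feature scaling, and correlation) can be sketched in a few lines of Pandas and NumPy. The block below is a minimal sketch under stated assumptions, not the course's own material: the dataset and all column names (`group`, `passed`, `hours`, `score`) are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataset, invented for illustration only.
df = pd.DataFrame({
    "group": ["a", "b", "a", "b"],
    "passed": ["yes", "no", "yes", "yes"],
    "hours": [2.0, 4.0, 6.0, 8.0],
    "score": [20.0, 40.0, 60.0, 80.0],
})

# Topic 4: descriptive statistics and cross tabulation.
summary = df[["hours", "score"]].describe()
table = pd.crosstab(df["group"], df["passed"])

# Topic 6: one-hot encoding of a categorical column.
encoded = pd.get_dummies(df["group"], prefix="group")

# Topic 6: feature scaling on a NumPy array.
hours = df["hours"].to_numpy()
# Normalization (min-max): rescale values into the range [0, 1].
normalized = (hours - hours.min()) / (hours.max() - hours.min())
# Standardization (z-score): zero mean, unit population standard deviation.
standardized = (hours - hours.mean()) / hours.std()

# Topic 7: Pearson correlation between two continuous variables.
corr = df["hours"].corr(df["score"])

print(summary)
print(table)
print(encoded)
print(normalized, standardized, corr)
```

Note that NumPy's `std` uses the population formula by default, while Pandas' `Series.std` uses the sample formula, so the two give slightly different standardized values on small datasets.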

Location and time

Online TechClass portal.

Materials

Lecture slides, quizzes, exercises

Teaching methods

- Exercise
- Quiz
- Project
- Self-study

Employer connections

N/A

Exam schedules

N/A

International connections

N/A

Completion alternatives

N/A

Student workload

Lectures = 70h
Exercises = 20h
Self-study = 40h
Quizzes = 15h
Project = 45h
Total = 190 hours

Evaluation scale

Pass/Fail

Assessment criteria, approved/failed

The student will pass this course after submitting the required quizzes, assignments, and the final project.

Assessment methods and criteria

Exercise 20%
Quiz 30%
Project 30%
Essay 20%

Qualifications

Introduction to Python for Data Science