Computer Science
Machine Learning
Dr. Waheed Anwar
Machine learning Life cycle
• The machine learning life cycle is a cyclic process for building an efficient machine learning project. Its main purpose is to find a solution to the problem at hand.
• The life cycle involves seven major steps, listed below:
1. Gathering Data
2. Data preparation
3. Data Wrangling
4. Analyse Data
5. Train the model
6. Test the model
7. Deployment
• The most important part of the whole process is to understand the problem and its purpose. Therefore, before starting the life cycle, we need to understand the problem, because a good result depends on a good understanding of it.
• 1. Data Gathering
• Data gathering is the first step of the machine learning life cycle. The goal of this step is to identify and obtain all the data related to the problem.
• Data can be collected from various sources such as files, databases, the internet, or mobile devices. This step includes the following tasks:
➢Identify various data sources
➢Collect data
➢Integrate the data obtained from different sources
• By performing the above tasks, we get a coherent set of data, also called a dataset, which is used in the following steps.
• The quantity and quality of the collected data determine the quality of the output.
• In general, more data of good quality leads to more accurate predictions.
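• Example: the three tasks above (identify sources, collect, integrate) can be sketched in Python as follows. This is only a minimal illustration. The two sources, the field names, and the helper functions are hypothetical, and an in-memory string stands in for a real file.

```python
import csv
import io

# Hypothetical source 1: a CSV export (an in-memory string stands in for a file).
CSV_SOURCE = """id,area_sqft,price
1,850,120000
2,1200,185000
"""

# Hypothetical source 2: records already fetched from a database or the internet.
API_SOURCE = [
    {"id": 3, "area_sqft": 990, "price": 150000},
    {"id": 4, "area_sqft": 1500, "price": 240000},
]


def load_csv_records(text):
    """Parse CSV text into a list of dicts with integer fields."""
    reader = csv.DictReader(io.StringIO(text))
    return [{key: int(value) for key, value in row.items()} for row in reader]


def gather_dataset():
    """Collect records from every source and integrate them into one dataset."""
    dataset = []
    dataset.extend(load_csv_records(CSV_SOURCE))
    dataset.extend(API_SOURCE)
    return dataset


dataset = gather_dataset()
print(len(dataset), "records gathered")
```

Converting every source to the same record layout before merging is what makes the result a single coherent dataset.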
2. Data preparation
• Data preparation is the step where we put our data in a suitable place and prepare it for use in machine learning training.
• In this step, we first put all the data together and then randomize its order.
• This step can be divided into two processes:
• Data exploration:
It is used to understand the nature of the data we have to work with. We need to understand the characteristics, format, and quality of the data.
A better understanding of the data leads to a more effective outcome. Here we look for correlations, general trends, and outliers.
• Data pre-processing:
The next step is to pre-process the data for analysis.
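• Example: a small sketch of randomizing the order of the data and exploring one column. The toy records are hypothetical, and the outlier rule (more than 1.5 standard deviations from the mean) is an assumption chosen for this tiny sample, not a standard threshold.

```python
import random
import statistics

# Hypothetical toy dataset: (area in square feet, price in thousands) pairs.
records = [(850, 120), (1200, 185), (990, 150), (1500, 240), (1100, 900)]

# Randomize the ordering; a fixed seed keeps the run repeatable.
rng = random.Random(42)
shuffled = records[:]
rng.shuffle(shuffled)

# Explore one column: central tendency and spread.
prices = [price for _, price in shuffled]
mean_price = statistics.mean(prices)
stdev_price = statistics.stdev(prices)

# Assumed outlier rule: more than 1.5 standard deviations away from the mean.
outliers = [p for p in prices if abs(p - mean_price) > 1.5 * stdev_price]

print("mean price:", mean_price)
print("outliers:", outliers)
```

Shuffling a copy leaves the original list untouched, so the exploration can be repeated with a different seed.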
• 3. Data Wrangling
• Data wrangling is the process of cleaning raw data and converting it into a usable format. It involves cleaning the data, selecting the variables to use, and transforming the data into a proper format to make it more suitable for analysis in the next step. It is one of the most important steps of the whole process, because cleaning is required to address data quality issues.
• Not all of the data we have collected is necessarily useful. In real-world applications, collected data may have various issues, including:
• Missing Values
• Duplicate data
• Invalid data
• Noise
• So, we use various filtering techniques to clean the data.
• It is important to detect and fix the above issues because they can negatively affect the quality of the outcome.
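• Example: a minimal sketch of filtering out missing, duplicate, and invalid records. The records and the validity rule (age between 0 and 120) are hypothetical. Noise is not handled here, since it usually needs a technique specific to the data.

```python
# Hypothetical raw records showing some of the issues listed above.
raw = [
    {"id": 1, "age": 34, "income": 52000},
    {"id": 2, "age": None, "income": 48000},  # missing value
    {"id": 1, "age": 34, "income": 52000},    # duplicate of the first record
    {"id": 3, "age": -5, "income": 61000},    # invalid age
    {"id": 4, "age": 29, "income": 45000},
]


def is_valid(record):
    """Assumed validity rule: no missing fields and a plausible age."""
    if any(value is None for value in record.values()):
        return False
    return 0 <= record["age"] <= 120


def clean(records):
    """Drop missing, invalid, and duplicate records, keeping first occurrences."""
    seen = set()
    cleaned = []
    for record in records:
        if not is_valid(record):
            continue
        key = tuple(sorted(record.items()))
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(record)
    return cleaned


cleaned = clean(raw)
print([record["id"] for record in cleaned])
```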
• 4. Data Analysis
• Now the cleaned and prepared data is passed on to the analysis step. This step involves:
1. Selecting analytical techniques
2. Building models
3. Reviewing the results
• The aim of this step is to build a machine learning model that analyzes the data using various analytical techniques, and to review the outcome. It starts with determining the type of problem, where we select a machine learning technique such as classification, regression, cluster analysis, or association. We then build the model using the prepared data and evaluate it.
• Hence, in this step, we take the data and use machine learning algorithms to build the model.
• 5. Train Model
• In this step, we train our model to improve its performance and obtain a better outcome for the problem.
• We use datasets to train the model with various machine learning algorithms. Training is required so that the model can learn the patterns, rules, and features in the data.
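• Example: as one possible illustration of training, the sketch below fits a straight line to data with ordinary least squares. The slides do not name a specific algorithm, so simple linear regression is chosen here only as an example, and the training data is hypothetical.

```python
def train_linear_model(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical training data that follows y = 2x + 1 exactly.
train_x = [1.0, 2.0, 3.0, 4.0]
train_y = [3.0, 5.0, 7.0, 9.0]

slope, intercept = train_linear_model(train_x, train_y)
print(f"learned model: y = {slope:.2f} * x + {intercept:.2f}")
```

Here "learning the pattern" means recovering the slope and intercept from the examples alone.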
• 6. Test Model
• In this step, we check the accuracy of our model by providing it with a test dataset.
• Testing determines the model's percentage accuracy, which is compared against the requirements of the project or problem.
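• Example: a minimal sketch of measuring percentage accuracy on a test dataset. The `predict` function is a hypothetical stand-in for a trained classifier, and the test examples are made up for illustration.

```python
def predict(hours_studied):
    """Stand-in for a trained classifier: an assumed threshold rule."""
    return "pass" if hours_studied >= 5 else "fail"


# Hypothetical test dataset of (input, true label) pairs not used in training.
test_set = [
    (2, "fail"), (4, "fail"), (5, "pass"), (6, "fail"),
    (7, "pass"), (9, "pass"), (3, "fail"), (8, "pass"),
]


def accuracy(model, examples):
    """Return the percentage of examples the model labels correctly."""
    correct = sum(1 for x, label in examples if model(x) == label)
    return 100.0 * correct / len(examples)


test_accuracy = accuracy(predict, test_set)
print(f"accuracy: {test_accuracy:.1f}%")
```

The model gets 7 of the 8 examples right, so the reported accuracy is 87.5%.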
• 7. Deployment
• The last step of the machine learning life cycle is deployment, where we deploy the model in a real-world system.
• If the prepared model produces accurate results that meet our requirements at an acceptable speed, we deploy it in the real system. Before deploying, we check whether its performance improves with the available data. The deployment phase is similar to making the final report for a project.
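• Example: a rough sketch of the check described above, followed by exporting the model so that another system can load it. The thresholds, the measured values, and the function names are all assumptions made for illustration. JSON is used only because the example model is a pair of numbers.

```python
import json

REQUIRED_ACCURACY = 85.0    # assumed project requirement, in percent
MAX_LATENCY_SECONDS = 0.05  # assumed speed requirement per prediction


def ready_to_deploy(accuracy, latency_seconds):
    """Return True only if both the accuracy and speed requirements are met."""
    return accuracy >= REQUIRED_ACCURACY and latency_seconds <= MAX_LATENCY_SECONDS


def export_model(parameters):
    """Serialize model parameters so the real system can load them."""
    return json.dumps(parameters)


def load_model(serialized):
    """Restore model parameters inside the deployed system."""
    return json.loads(serialized)


# Hypothetical results from the training and testing steps.
model_parameters = {"slope": 2.0, "intercept": 1.0}
measured_accuracy = 87.5
measured_latency = 0.002

if ready_to_deploy(measured_accuracy, measured_latency):
    deployed = load_model(export_model(model_parameters))
    print("deployed model:", deployed)
else:
    print("model does not meet the requirements yet")
```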
Dataset for Machine Learning
• A dataset is a collection of data arranged in some order. A dataset can contain anything from a simple array to a database table.
• Types of data in datasets
• Numerical data: such as house price, temperature, etc.
• Categorical data: such as Yes/No, True/False, Blue/Green, etc.
• Ordinal data: similar to categorical data, but the categories have a meaningful order and can be compared.
• Note: A real-world dataset is often huge, which makes it difficult to manage and process at the initial level. Therefore, to practice machine learning algorithms, we can use any dummy dataset.
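• Example: the three types of data are usually turned into numbers in different ways before training. The sketch below shows one common approach. The records, field names, and ranking are hypothetical.

```python
# Hypothetical records mixing the three kinds of data.
records = [
    {"price": 120000.0, "has_garage": "Yes", "condition": "poor"},
    {"price": 185000.0, "has_garage": "No", "condition": "good"},
    {"price": 150000.0, "has_garage": "Yes", "condition": "fair"},
]

# Ordinal categories have a natural order, so map them to ranked integers.
CONDITION_RANK = {"poor": 0, "fair": 1, "good": 2}


def encode(record):
    """Turn one record into a list of numeric features."""
    return [
        record["price"],                            # numerical: used as-is
        1 if record["has_garage"] == "Yes" else 0,  # categorical (binary): 0/1 flag
        CONDITION_RANK[record["condition"]],        # ordinal: rank preserves the order
    ]


features = [encode(record) for record in records]
print(features)
```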
• The key to success in machine learning, and to becoming a great data scientist, is to practice with different types of datasets. But discovering a suitable dataset for each kind of machine learning project is a difficult task. So, in this class, I shall describe the sources from which you can easily get a dataset suited to your project.
• During the development of an ML project, developers rely completely on datasets. When building ML applications, a dataset is divided into two parts:
• Training dataset: used to train the model.
• Test dataset: used to check the accuracy of the trained model.
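• Example: a minimal sketch of dividing a dataset into a training part and a test part. The 80/20 ratio and the function name `split_dataset` are assumptions, since the slides do not fix a ratio.

```python
import random


def split_dataset(examples, test_fraction=0.2, seed=0):
    """Shuffle a copy of the examples and split it into (train, test) lists."""
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    test_size = int(len(shuffled) * test_fraction)
    return shuffled[test_size:], shuffled[:test_size]


# Hypothetical dataset of ten examples.
examples = list(range(10))
train_set, test_set = split_dataset(examples)

print(len(train_set), "training examples,", len(test_set), "test examples")
```

Shuffling before the split keeps both parts representative when the original data is sorted in some way.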
Popular sources for Machine Learning datasets
1. UCI Machine Learning Repository
The UCI Machine Learning Repository is one of the great sources of machine learning datasets. It contains databases, domain theories, and data generators that are widely used by the machine learning community for the analysis of ML algorithms.
The link for the UCI Machine Learning Repository is
https://archive.ics.uci.edu/ml/index.php
2. Google's Dataset Search Engine
Google Dataset Search is a search engine launched by Google on September 5, 2018. It helps researchers find online datasets that are freely available for use.
The link for Google Dataset Search is
https://toolbox.google.com/datasetsearch
3. Microsoft Datasets
Microsoft has launched the "Microsoft Research Open Data" repository, a collection of free datasets in areas such as natural language processing, computer vision, and domain-specific sciences.
The link to download or use datasets from this resource is
https://msropendata.com/.
4. Scikit-learn dataset
Scikit-learn is a great source for machine learning enthusiasts. It provides both toy and real-world datasets, which can be obtained through the sklearn.datasets package using its general dataset API.
The link to the datasets from this source is
https://scikit-learn.org/stable/datasets/index.html.
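Example: a short sketch of loading one of the toy datasets bundled with scikit-learn. It assumes scikit-learn is installed. The Iris dataset is chosen here only as an illustration and is not named in the slides.

```python
from sklearn.datasets import load_iris

# Load the bundled Iris toy dataset (no download needed).
iris = load_iris()
X, y = iris.data, iris.target

print(X.shape)             # number of samples and features
print(iris.feature_names)  # names of the feature columns
print(iris.target_names)   # names of the classes
```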
5. Kaggle Datasets
Kaggle provides high-quality datasets in different formats that we can easily find and download.
The link for Kaggle datasets
is https://www.kaggle.com/datasets.
6. Computer Vision Datasets
VisualData provides a large number of datasets specific to computer vision tasks such as image classification, video classification, and image segmentation.
The link for downloading datasets from this source is
https://www.visualdata.io/.
7. Awesome Public Dataset Collection
The link to download datasets from the Awesome Public Datasets
collection is https://github.com/awesomedata/awesome-public-datasets.