INTRODUCTION
When it comes to data science initiatives, hardly anyone seems able to give a clear explanation of how the entire process works, from data collection through data analysis to presentation.
In this write-up, I dissect the data science framework, walking you through each stage of the project lifecycle.
First, let's start with what DATA SCIENCE VS DATA ANALYSIS is all about.
Data science is a concept that combines data cleansing, preparation, and analysis and is used to deal with large datasets. A data scientist collects data from a variety of sources and uses machine learning, predictive analytics, and sentiment analysis to extract useful information from it. They can provide accurate predictions and insights that may be used to support crucial business decisions since they understand data from a business perspective.
Data analysis is the process of cleaning, transforming, and processing raw data, and extracting actionable, relevant information that helps businesses make informed decisions. Data analysts are individuals who perform basic descriptive statistics, visualize data, and communicate data points in order to provide clarity and support informed decisions. They must have a fundamental understanding of statistics, a thorough understanding of databases, the capacity to design new views, and the ability to visualize data. Data analytics is often thought of as the first step in data science.
IN DETAIL
Data analysts extract useful insights from numerous data sources and look for answers to questions already asked by the board of directors or management of the organization, whereas data scientists are expected to forecast the future based on historical patterns in a dataset.
PROCESS OF DATA SCIENCE
The data science life cycle essentially comprises:
Data collection
Data cleaning
Exploratory data analysis (EDA)
Model building and model deployment.
Result Interpretation
1. Obtain Data: Data collection, also known as obtaining data, is the very first step in data science: gathering the data you need from the available data sources. There are various sources from which to acquire a dataset, for example:
- Experiment observations
- Questionnaire surveys
- Interviews
- Documents (PDF)
- Test results
- Databases
Most datasets come as Microsoft Excel files or as the results of database queries. Other file formats for datasets include CSV (Comma-Separated Values), TSV (Tab-Separated Values), HTML, SQL, etc.
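As a minimal sketch of loading these formats, assuming pandas is installed, the snippet below reads the same small table from CSV and TSV text. The in-memory strings and column names are made up for illustration and stand in for real files on disk.

```python
import io

import pandas as pd

# Hypothetical in-memory data standing in for real files on disk.
csv_text = "name,age,city\nAda,36,Lagos\nGrace,45,Abuja\n"
tsv_text = "name\tage\tcity\nAda\t36\tLagos\nGrace\t45\tAbuja\n"

# read_csv handles both formats; only the separator changes.
csv_df = pd.read_csv(io.StringIO(csv_text))
tsv_df = pd.read_csv(io.StringIO(tsv_text), sep="\t")

print(csv_df)
print("Same content in both formats:", csv_df.equals(tsv_df))
```

With a real file, you would pass its path to `pd.read_csv` instead of the `io.StringIO` wrapper.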
The types of databases you may encounter include relational databases such as PostgreSQL and Oracle, and non-relational (NoSQL) databases such as MongoDB. Another way to obtain data is to scrape it from websites using web scraping tools such as Beautiful Soup.
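To illustrate querying a database, here is a rough sketch that uses Python's built-in sqlite3 module as a stand-in for a server such as PostgreSQL or Oracle. The `patients` table, its columns, and its values are hypothetical; against a real database only the connection setup would differ.

```python
import sqlite3

import pandas as pd

# In-memory SQLite database as a stand-in for PostgreSQL, Oracle, etc.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE patients (id INTEGER PRIMARY KEY, age INTEGER, glucose REAL)"
)
conn.executemany(
    "INSERT INTO patients (age, glucose) VALUES (?, ?)",
    [(34, 5.4), (51, 7.9), (46, 6.1)],  # made-up rows for illustration
)
conn.commit()

# Pull only the rows we need straight into a DataFrame.
query = "SELECT id, age, glucose FROM patients WHERE age > ? ORDER BY id"
patients_df = pd.read_sql_query(query, conn, params=(40,))
conn.close()

print(patients_df)
```

Passing the filter value through `params` rather than formatting it into the SQL string keeps the query safe from injection.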
Another popular option for gathering data is connecting to Web APIs. Websites such as Facebook and Twitter allow users to connect to their web servers and access their data. All you need to do is use their Web API to retrieve the data.
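Web APIs usually return JSON. The sketch below assumes a response body has already been downloaded and shows one way to turn it into a table with pandas. The payload shape and field names are invented for illustration and do not reflect any real Facebook or Twitter endpoint.

```python
import json

import pandas as pd

# Hypothetical JSON body, shaped like what a web API might return.
response_text = """
{"posts": [
  {"id": 1, "user": {"name": "ada"}, "likes": 12},
  {"id": 2, "user": {"name": "grace"}, "likes": 30}
]}
"""

payload = json.loads(response_text)

# json_normalize flattens nested objects into dotted column names.
posts_df = pd.json_normalize(payload["posts"])

print(posts_df)
```

In a real project the text would come from an HTTP request to the provider's documented endpoint, usually with an access token.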
NOTE: One of the most popular websites for data science is Kaggle.
2. Scrub Data: Data cleaning is also called scrubbing data. After obtaining the data, the next thing to do is scrub it. This process involves “cleaning” and filtering the data. Remember the term “garbage in, garbage out” (GIGO): if the data is unfiltered and irrelevant, the results of the analysis will be meaningless.
Think of this process as organizing and tidying up the data, removing what is no longer needed, replacing what is missing and standardizing the format across all the data collected.
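The tidying steps described here could look roughly like this in pandas. The column names and values are made up, and filling missing ages with the median is just one possible choice, not the only valid one.

```python
import numpy as np
import pandas as pd

# Hypothetical messy data: stray spaces, mixed case, a duplicate row,
# a missing value, and a column we do not need.
raw = pd.DataFrame({
    "Name": [" Ada ", "grace", "grace", "Linus"],
    "Age": [36, 45, 45, np.nan],
    "Unused Notes": ["x", "y", "y", "z"],
})

clean = (
    raw.drop(columns=["Unused Notes"])   # remove what is no longer needed
       .rename(columns=str.lower)        # standardize the column names
       .assign(name=lambda d: d["name"].str.strip().str.title())  # standardize text
       .drop_duplicates()                # drop repeated rows
       .reset_index(drop=True)
)

# Replace what is missing (assumption: the median is an acceptable fill value).
clean["age"] = clean["age"].fillna(clean["age"].median())

print(clean)
```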
3. Explore Data: Once our data is properly cleaned and ready to be used, and before we jump into AI and machine learning, we have to examine the data.
To achieve that, we need to explore the data. First of all, we inspect the data and its properties, such as its data types: numerical data, categorical data, ordinal and nominal data, etc.
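A small sketch of inspecting and declaring data types with pandas follows. The columns are hypothetical: `age` is numerical, `blood_group` is nominal (no natural order), and `education` is ordinal (ordered categories).

```python
import pandas as pd

df = pd.DataFrame({
    "age": [25, 32, 47],                                # numerical
    "blood_group": ["A", "O", "B"],                     # nominal
    "education": ["primary", "tertiary", "secondary"],  # ordinal
})

# Nominal data: categories without an order.
df["blood_group"] = df["blood_group"].astype("category")

# Ordinal data: categories with an explicit order.
df["education"] = pd.Categorical(
    df["education"],
    categories=["primary", "secondary", "tertiary"],
    ordered=True,
)

print(df.dtypes)
print("Highest education level:", df["education"].max())
```

Declaring the order explicitly lets comparisons and sorting on `education` follow the real-world ranking instead of alphabetical order.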
The next step is to compute descriptive statistics to extract features and test for significant variables. Testing for significant variables is often done with correlation. For example, consider exploring the risk of someone developing diabetes in relation to their hand weight and foot size. Note that these variables (hand weight and foot size) have no relevance to the objective of the dataset, which is the kind of thing a correlation check can help reveal.
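As a rough illustration of descriptive statistics and a correlation check, the sketch below uses a tiny made-up table. The columns `glucose`, `bmi`, and `foot_size` and all their values are hypothetical, chosen so that one variable tracks the target and the other does not.

```python
import pandas as pd

# Made-up measurements for five people.
df = pd.DataFrame({
    "glucose": [5.1, 6.3, 7.8, 9.0, 10.2],
    "bmi": [21.0, 24.5, 27.9, 31.2, 34.8],
    "foot_size": [42, 38, 44, 39, 41],
})

# Descriptive statistics: count, mean, std, min, quartiles, max.
summary = df.describe()
print(summary)

# Pearson correlation of each variable with the target column.
corr_with_glucose = df.corr()["glucose"].drop("glucose")
print(corr_with_glucose)
```

In this toy data `bmi` correlates strongly with `glucose` while `foot_size` barely does, so `foot_size` would be a candidate to drop. Correlation alone does not prove relevance or causation, so treat it as a first filter.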
4. Model Data: This is the stage most people consider the most interesting. Many people call it “where the magic happens”.
Once again, before reaching this stage, we need to bear in mind that the scrubbing and exploring stages are equally crucial to building useful models. So we must take time on those stages instead of jumping right to this one.
One of the first things to do when modelling data is to reduce the dimensionality of the dataset. Not all of your features or values are essential for the model's predictions. What you need to do is select the relevant ones that contribute to the prediction of results.
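One possible way to select relevant features is scikit-learn's `SelectKBest`, sketched below on synthetic data. The feature names, the random data, and the choice of `f_classif` with `k=1` are assumptions for illustration; other selection or dimensionality reduction methods may suit a real dataset better.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
n = 200

# Synthetic binary target and three candidate features.
y = rng.integers(0, 2, size=n)
informative = y + rng.normal(0, 0.3, size=n)  # strongly related to y
noise_a = rng.normal(size=n)                  # unrelated to y
noise_b = rng.normal(size=n)                  # unrelated to y

X = np.column_stack([noise_a, informative, noise_b])
feature_names = ["noise_a", "informative", "noise_b"]

# Keep the single feature with the highest ANOVA F-score.
selector = SelectKBest(score_func=f_classif, k=1).fit(X, y)
selected = [
    name for name, keep in zip(feature_names, selector.get_support()) if keep
]
X_reduced = selector.transform(X)

print("Selected features:", selected)
print("Shape before and after:", X.shape, X_reduced.shape)
```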
NOTE: There are two things to note before implementing machine learning models:
- Machine learning models rely on a dataset that is 99.9% clean (no missing values).
- Machine learning models rely on medium to large datasets (you can't feed a model a small quantity of data). NB: The more data there is, the more hidden patterns the ML model can identify.
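The point about missing values can be seen in a short sketch: many scikit-learn estimators, including `LinearRegression`, refuse to fit when the input contains NaN. The tiny dataset here is made up, and dropping incomplete rows is only one way to handle them.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data with one missing feature value.
X = np.array([[1.0], [2.0], [np.nan], [4.0]])
y = np.array([2.0, 4.0, 6.0, 8.0])

# Fitting on data that contains NaN raises a ValueError.
try:
    LinearRegression().fit(X, y)
    fit_failed = False
except ValueError:
    fit_failed = True
print("Fit on raw data failed:", fit_failed)

# Keep only the complete rows, then fit again.
complete_rows = ~np.isnan(X).any(axis=1)
model = LinearRegression().fit(X[complete_rows], y[complete_rows])

print("Slope:", model.coef_[0])
print("Intercept:", model.intercept_)
```

Some estimators do accept missing values natively, so this is a common constraint rather than a universal one.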
5. Result Interpretation: This is the stage most people consider vital. As many people put it, “your soft skills are the key to a good presentation”. Soft skills are what make our work as data scientists or analysts acceptable to management or the board of directors. These skills include good PowerPoint slides, communication, writing, critical thinking, and domain knowledge.
CAREER APPLICATION OF DATA SCIENCE
HOW TO BREAK IN
- Step 0: Figure out what you need to learn to get into the labor market
- Step 1: Get comfortable with Python or R.
- Step 2: Learn data analysis, manipulation, and visualization with pandas.
- Step 3: Learn machine learning with scikit-learn.
- Step 4: Understand machine learning in more depth.
- Step 5: Keep learning and practicing.
TOOLS USED BY DATA SCIENTISTS
Read more from the list of resources
List of Resources
- 5 steps of a data science project lifecycle
- Tools Every Data Analyst Should Know
- Check my YouTube channel for more explanation