Chapter 1 Data Preprocessing/Cleaning

Loading the Dataset

  • Load the dataset from CSV into a data frame named bitcoin.
  • Check the data types of the features.
  • Assign appropriate data types to the features.
  • Check the structure of the data frame.
  • Check for missing values. If any are found, treat them with an appropriate method.

Load the dataset from CSV into a data frame named bitcoin.

# install.packages("tidyverse")  # run once if the package is not yet installed
# Load the dataset from CSV into a data frame named bitcoin.
library(tidyverse)
# Load the CSV from GitHub (uploaded for Assignment_02_Group_04)
bitcoin <- read.csv("https://raw.githubusercontent.com/shakibed/Final_Project_Group_04/main/BTC-Monthly.csv", stringsAsFactors = FALSE)
view(bitcoin)
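
As an alternative to converting types after loading, the column classes can be requested at read time. The following is a minimal sketch, assuming base R's read.csv() and a small inline CSV whose three rows are copied from the glimpse() output in this chapter; the names csv_text and btc_sample are hypothetical and stand in for the real file and data frame.

```r
# Toy CSV text standing in for BTC-Monthly.csv (hypothetical sample of three rows).
csv_text <- "Date,Close
2015-01-01,217.464
2015-02-01,254.263
2015-03-01,244.224"

# colClasses asks read.csv() to parse each named column as the given class on load.
btc_sample <- read.csv(
  text = csv_text,
  colClasses = c(Date = "Date", Close = "numeric")
)

# Inspect the result: Date should already be of class Date.
str(btc_sample)
```

With the real file, the same colClasses argument could be passed alongside the URL, which would make the later as.Date() step unnecessary.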

Check the data types of the features.

#Check the data types of the features.
glimpse(bitcoin)
## Rows: 107
## Columns: 2
## $ Date  <chr> "2015-01-01", "2015-02-01", "2015-03-01", "2015-04-01", "2015-05…
## $ Close <dbl> 217.464, 254.263, 244.224, 236.145, 230.190, 263.072, 284.650, 2…

Code explanation

glimpse(bitcoin) shows that the data set has 107 rows and two columns: Date, stored as chr (character), and Close, stored as dbl (double). We need to convert Date from character to the Date type because we intend to build a time series object or fit a linear regression, and R can only order dates and do arithmetic on them once they are stored as Date.
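
To illustrate why the character type is limiting, here is a small sketch on three date strings taken from the output above. The vector names date_chr and date_parsed are hypothetical and used only for this example.

```r
# Dates as read from the CSV: plain character strings.
date_chr <- c("2015-01-01", "2015-02-01", "2015-03-01")
class(date_chr)

# After conversion the values are real dates.
date_parsed <- as.Date(date_chr)
class(date_parsed)

# Date arithmetic and formatting only work on the converted values.
diff(date_parsed)              # gaps between consecutive observations, in days
format(date_parsed, "%Y-%m")   # year-month labels
```

Calling diff() on the character vector instead would fail, because subtraction is not defined for strings.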

Assign appropriate data types to the features.

bitcoin$Date <- as.Date(bitcoin$Date)
glimpse(bitcoin)
## Rows: 107
## Columns: 2
## $ Date  <date> 2015-01-01, 2015-02-01, 2015-03-01, 2015-04-01, 2015-05-01, 201…
## $ Close <dbl> 217.464, 254.263, 244.224, 236.145, 230.190, 263.072, 284.650, 2…

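Code explanation

  • as.Date() converts the Date column from character to the Date type, and the second glimpse() call confirms the new <date> type.
  • This lets us use the Date column properly in time series analysis or linear regression models.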
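
The conversion above relies on the dates already being in ISO format (YYYY-MM-DD). A hedged sketch of how one might make the format explicit and check that every value parsed is shown below; the vector raw_dates is hypothetical, and its last element is deliberately written in a different format to trigger a parsing failure.

```r
# Hypothetical input: the last value does not follow YYYY-MM-DD.
raw_dates <- c("2015-01-01", "2015-02-01", "01/03/2015")

# State the expected format explicitly.
parsed <- as.Date(raw_dates, format = "%Y-%m-%d")

# Values that do not match the format come back as NA instead of raising an error,
# so counting NAs after conversion is a cheap sanity check.
sum(is.na(parsed))
raw_dates[is.na(parsed)]
```

For the bitcoin data this check is not strictly needed, since the later missing-value check also covers the Date column, but it can help when the source format is less certain.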

Check the structure of the data frame.

# Check the structure of the data frame.
str(bitcoin)
## 'data.frame':    107 obs. of  2 variables:
##  $ Date : Date, format: "2015-01-01" "2015-02-01" ...
##  $ Close: num  217 254 244 236 230 ...

Code explanation

  • str(bitcoin) confirms that the data frame has 107 observations of 2 variables.
  • After the conversion above, the Date column is of class Date, and the Close column is of type num (numeric).
  • With Date correctly typed, the data frame is ready for time series analysis or linear regression models without further modification.
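
As a rough sketch of the two uses mentioned above, the following builds a monthly time series object and fits a simple linear trend on a six-row sample. The values are the first six observations shown in the glimpse() output; the names btc_sample, btc_ts, and trend_fit are hypothetical, and the choice of start and frequency assumes one observation per month beginning in January 2015.

```r
# Hypothetical six-row sample of the cleaned data.
btc_sample <- data.frame(
  Date  = as.Date(c("2015-01-01", "2015-02-01", "2015-03-01",
                    "2015-04-01", "2015-05-01", "2015-06-01")),
  Close = c(217.464, 254.263, 244.224, 236.145, 230.190, 263.072)
)

# Monthly series: 12 observations per year, starting in January 2015.
btc_ts <- ts(btc_sample$Close, start = c(2015, 1), frequency = 12)
btc_ts

# Linear trend: Date is converted explicitly to days since 1970-01-01.
trend_fit <- lm(Close ~ as.numeric(Date), data = btc_sample)
coef(trend_fit)
```

The slope from this fit is expressed in price units per day, because that is the unit of the numeric date.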

Check for missing values. If any are found, treat them with an appropriate method.

# Check for missing values. If any are found, treat them with an appropriate method.
colSums(is.na(bitcoin))
##  Date Close 
##     0     0
# Both counts are 0, so no missing values were found.

Code explanation

  • We checked for missing values in the bitcoin data frame using colSums(is.na(bitcoin)), which returned 0 for both Date and Close.
  • There are therefore no missing values in either column, so no treatment is needed.
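
Although no treatment was needed here, the sketch below shows one way a missing Close value could be handled if it did occur. It assumes linear interpolation over time with base R's approx(), which is only one of several reasonable methods; the data frame btc_gap is hypothetical, and its NA is inserted on purpose. The last step also looks for implicit gaps, that is, months that are absent from the data altogether.

```r
# Hypothetical sample with one missing price, inserted for illustration.
btc_gap <- data.frame(
  Date  = as.Date(c("2015-01-01", "2015-02-01", "2015-03-01", "2015-04-01")),
  Close = c(217.464, NA, 244.224, 236.145)
)

# Same check as in the chapter: count NAs per column.
colSums(is.na(btc_gap))

# Linear interpolation over time: approx() ignores the NA pair and
# evaluates the interpolated line at every date in xout.
days <- as.numeric(btc_gap$Date)
btc_gap$Close <- approx(x = days, y = btc_gap$Close, xout = days)$y

# Implicit gaps: months in the expected monthly sequence that have no row.
expected <- seq(min(btc_gap$Date), max(btc_gap$Date), by = "month")
missing_months <- setdiff(as.character(expected), as.character(btc_gap$Date))
missing_months
```

Interpolation suits a price series better than a column mean because it respects the time ordering, though the right choice depends on the analysis that follows.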