In data analysis, you may need to address missing values, negative values, or inaccurate values in a dataset. These problems can often be addressed by replacing the offending values with 0, NA, or the mean.
In this article, you will explore how to use the replace() and is.na() functions in R.
To complete this tutorial, you will need a working installation of R.
Using the replace() Function
This section will show how to replace a value in a vector. The syntax of replace() in R takes the target vector, an index vector, and the replacement values:
replace(target, index, replacement)
First, create a vector:
df <- c('apple', 'orange', 'grape', 'banana')
df
This will create a vector with apple, orange, grape, and banana:
Output
"apple" "orange" "grape" "banana"
Now, let’s replace the second item in the list:
dy <- replace(df, 2, 'blueberry')
dy
This will replace orange with blueberry:
Output
"apple" "blueberry" "grape" "banana"
Now, we’ll replace the fourth item in the list:
dx <- replace(dy, 4, 'cranberry')
dx
This will replace banana with cranberry:
Output
"apple" "blueberry" "grape" "cranberry"
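replace() also accepts a vector of indices, so several positions can be swapped in one call. A small sketch (the fruits vector mirrors the example above):

```r
# replacing items 2 and 4 in a single call
fruits <- c('apple', 'orange', 'grape', 'banana')
fruits_new <- replace(fruits, c(2, 4), c('blueberry', 'cranberry'))
fruits_new
# "apple" "blueberry" "grape" "cranberry"
```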
Replacing NA Values with 0 in R
Consider a scenario where you have a data frame containing measurements:
Ozone Solar.R Wind Temp Month Day
1 41 190 7.4 67 5 1
2 36 118 8.0 72 5 2
3 12 149 12.6 74 5 3
4 18 313 11.5 62 5 4
5 NA NA 14.3 56 5 5
6 28 NA 14.9 66 5 6
7 23 299 8.6 65 5 7
8 19 99 13.8 59 5 8
9 8 19 20.1 61 5 9
10 NA 194 8.6 69 5 10
11 7 NA 6.9 74 5 11
12 16 256 9.7 69 5 12
Here is the data in CSV format:
Ozone,Solar.R,Wind,Temp,Month,Day
41,190,7.4,67,5,1
36,118,8.0,72,5,2
12,149,12.6,74,5,3
18,313,11.5,62,5,4
NA,NA,14.3,56,5,5
28,NA,14.9,66,5,6
23,299,8.6,65,5,7
19,99,13.8,59,5,8
8,19,20.1,61,5,9
NA,194,8.6,69,5,10
7,NA,6.9,74,5,11
16,256,9.7,69,5,12
This contains the string NA, for "Not Available", in places where data is missing. You can replace the NA values with 0.
First, define the data frame:
df <- read.csv('air_quality.csv')
Use is.na() to check whether a value is NA, then replace the NA values with 0:
df[is.na(df)] <- 0
df
The data frame is now:
Output
   Ozone Solar.R Wind Temp Month Day
1 41 190 7.4 67 5 1
2 36 118 8.0 72 5 2
3 12 149 12.6 74 5 3
4 18 313 11.5 62 5 4
5 0 0 14.3 56 5 5
6 28 0 14.9 66 5 6
7 23 299 8.6 65 5 7
8 19 99 13.8 59 5 8
9 8 19 20.1 61 5 9
10 0 194 8.6 69 5 10
11 7 0 6.9 74 5 11
12 16 256 9.7 69 5 12
All occurrences of NA in the data frame have been replaced.
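If you want to try this without creating the CSV file, the same logical-indexing assignment works on a data frame built in memory. A minimal sketch (the column names simply mirror the example above):

```r
# a small in-memory data frame with the same column names as above
df <- data.frame(Ozone   = c(41, NA, 23),
                 Solar.R = c(190, NA, 299))
df[is.na(df)] <- 0   # is.na(df) is a logical matrix marking every NA cell
df
```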
Replacing NA Values with the Mean of the Values in R
In the data analysis process, accuracy is often improved by replacing NA values with the mean of the remaining values, which the mean() function calculates. This method helps produce good accuracy without discarding any data.
Consider the following input data set with NA values:
Ozone Solar.R Wind Temp Month Day
1 41 190 7.4 67 5 1
2 36 118 8.0 72 5 2
3 12 149 12.6 74 5 3
4 18 313 11.5 62 5 4
5 NA NA 14.3 56 5 5
6 28 NA 14.9 66 5 6
7 23 299 8.6 65 5 7
8 19 99 13.8 59 5 8
9 8 19 20.1 61 5 9
10 NA 194 8.6 69 5 10
11 7 NA 6.9 74 5 11
12 16 256 9.7 69 5 12
df <- read.csv('air_quality.csv')
Use is.na() and mean() to replace the NA values:
df$Ozone[is.na(df$Ozone)] <- mean(df$Ozone, na.rm = TRUE)
First, this code finds all the occurrences of NA in the Ozone column. Next, it calculates the mean of all the values in the Ozone column, excluding the NA values via the na.rm argument. Then each instance of NA is replaced with the calculated mean.
Then round() the values to whole numbers:
df$Ozone <- round(df$Ozone, digits = 0)
The data frame is now:
Output
   Ozone Solar.R Wind Temp Month Day
1 41 190 7.4 67 5 1
2 36 118 8.0 72 5 2
3 12 149 12.6 74 5 3
4 18 313 11.5 62 5 4
5 21 NA 14.3 56 5 5
6 28 NA 14.9 66 5 6
7 23 299 8.6 65 5 7
8 19 99 13.8 59 5 8
9 8 19 20.1 61 5 9
10 21 194 8.6 69 5 10
11 7 NA 6.9 74 5 11
12 16 256 9.7 69 5 12
The NA values in the Ozone column are now replaced by the rounded mean of the values in the Ozone column (21).
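The same idea extends to every column at once. A sketch, assuming an in-memory frame with the same column names; the helper name impute_mean is an illustration, not part of base R:

```r
# replace NA in each column with that column's mean
impute_mean <- function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}
df <- data.frame(Ozone = c(41, NA, 23), Solar.R = c(190, NA, 299))
df[] <- lapply(df, impute_mean)   # df[] preserves the data-frame structure
df
```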
Replacing Negative Values with 0 or NA in R
In the data analysis process, you will sometimes want to replace the negative values in a data frame with 0 or NA. Negative values that should not be present in a dataset can mislead the analysis and produce inaccurate results.
Consider the following input data set with negative values:
count entry1 entry2 entry3
1 1 345 -234 345
2 2 65 654 867
3 3 23 345 3456
4     4    87    867    9
5 5 2345 34 867
6 6 876 98 76
7 7 35 -456 123
8 8 87 98 345
9 9 -765 67 765
10 10 4567 -87 234
Here is the data in CSV format:
count,entry1,entry2,entry3
1,345,-234,345
2,65,654,867
3,23,345,3456
4,87,867,9
5,2345,34,867
6,876,98,76
7,35,-456,123
8,87,98,345
9,-765,67,765
10,4567,-87,234
Read the CSV file:
df <- read.csv('negative_values.csv')
Replacing Negative Values with 0
Use replace() to change the negative values in the entry2 column to 0:
data_zero <- df
data_zero$entry2 <- replace(df$entry2, df$entry2 < 0, 0)
data_zero
The data frame is now:
Output
   count entry1 entry2 entry3
1 1 345 0 345
2 2 65 654 867
3 3 23 345 3456
4 4 87 867 9
5 5 2345 34 867
6 6 876 98 76
7 7 35 0 123
8 8 87 98 345
9 9 -765 67 765
10 10 4567 0 234
The negative values in the entry2 column have been replaced with 0.
Replacing Negative Values with NA
Use replace() to change the negative values in the entry2 column to NA:
data_na <- df
data_na$entry2 <- replace(df$entry2, df$entry2 < 0, NA)
data_na
The data frame is now:
Output
   count entry1 entry2 entry3
1 1 345 NA 345
2 2 65 654 867
3 3 23 345 3456
4 4 87 867 9
5 5 2345 34 867
6 6 876 98 76
7 7 35 NA 123
8 8 87 98 345
9 9 -765 67 765
10 10 4567 NA 234
The negative values in the entry2 column have been replaced with NA.
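To clear negative values in every numeric column at once, you can index the whole data frame with a logical comparison. A sketch on a small in-memory frame:

```r
df <- data.frame(entry1 = c(345, -765, 23),
                 entry2 = c(-234, 67, 345))
df[df < 0] <- NA    # df < 0 marks every negative cell across all columns
df
```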
Replacing values in a data frame is a convenient option available in R for data analysis. Using replace() in R, you can substitute NA, 0, and negative values when appropriate to clean up large datasets for analysis.
Continue your learning with How To Use sub() and gsub() in R.
The sub() and gsub() functions in R substitute a string or characters in a vector or a data frame with a specified string. These functions are useful when making changes to large datasets.
In this article, you will explore how to use the sub() and gsub() functions in R.
To complete this tutorial, you will need a working installation of R.
The Syntax of sub() and gsub()
The basic syntax for sub() is:
sub(pattern, replacement, x)
The basic syntax for gsub() is:
gsub(pattern, replacement, x)
Both functions require a pattern to match, a replacement string, and the vector or data frame x to operate on.
The pattern can also be in the form of a regular expression (regex).
Now that you are familiar with the syntax, you can move on to implementation.
The sub() Function in R
The sub() function in R replaces a matched string in a vector or a data frame with the specified replacement string. However, the limitation of sub() is that it only substitutes the first occurrence of the pattern.
Using the sub() Function
In this example, learn how to substitute a string pattern with a replacement string using the sub() function.
# the input vector
df<-"R is an open-source programming language widely used for data analysis and statistical computing."
# the replacement
sub('R','The R language',df)
Running this command generates the following output:
Output
"The R language is an open-source programming language widely used for data analysis and statistical computing."
The sub() function replaces the string 'R' in the vector with the string 'The R language'.
In this example, there was a single occurrence of pattern matching. Consider what happens if there are multiple occurrences of pattern matches.
# the input vector
df<-"In this tutorial, we will install R and show how to add packages from the official Comprehensive R Archive Network (CRAN)."
# the replacement
sub('R','The R language',df)
Running this command generates the following output:
Output
"In this tutorial, we will install The R language and show how to add packages from the official Comprehensive R Archive Network (CRAN)."
In this example, you can observe that the sub() function replaced only the first occurrence of the string 'R' with 'The R language'. The next occurrence in the string remains the same.
Using the sub() Function with a Data Frame
The sub() function also works with data frames.
# creating a data frame
df<-data.frame(Creature=c('Starfish','Blue Crab','Bluefin Tuna','Blue Shark','Blue Whale'),Population=c(5,6,4,2,2))
# data frame
df
This will create the following data frame:
Creature Population
1 Starfish 5
2 Blue Crab 6
3 Bluefin Tuna 4
4 Blue Shark 2
5 Blue Whale 2
Then replace the characters 'Blue' with the characters 'Green':
# substituting the values
sub('Blue','Green',df)
Running this command generates the following output:
Output
"c(\"Starfish\", \"Green Crab\", \"Bluefin Tuna\", \"Blue Shark\", \"Blue Whale\")"
"c(5, 6, 4, 2, 2)"
You can also specify a particular column to replace the occurrences of 'Blue' with 'Green':
# substituting the values
sub('Blue','Green',df$Creature)
Running this command generates the following output:
Output
"Starfish"
"Green Crab"
"Greenfin Tuna"
"Green Shark"
"Green Whale"
All instances of the characters 'Blue' have been replaced with 'Green'.
The gsub() Function in R
The gsub() function in R is used for replacement operations: it takes a pattern and substitutes every match with the specified replacement. Unlike the sub() function, gsub() applies a global substitution to all matches.
Using the gsub() Function
In this example, learn how to substitute a string pattern with a replacement string using the gsub() function.
# the input vector
df<-"In this tutorial, we will install R and show how to add packages from the official Comprehensive R Archive Network (CRAN)."
This is a string that contains 'R' multiple times.
# substituting the values using gsub()
gsub('R','The R language',df)
Output
"In this tutorial, we will install The R language and show how to add packages from the official Comprehensive The R language Archive Network (CThe R languageAN)."
All instances of 'R' have been replaced (including those in "Comprehensive R Archive Network" and "CRAN"). The gsub() function finds every match for the pattern and replaces it with the replacement string.
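A related option worth knowing: when the pattern contains regex metacharacters, passing fixed = TRUE makes gsub() treat it literally. A sketch with a hypothetical address string:

```r
ip <- '192.168.0.1'
gsub('.', '_', ip)                 # '.' is a regex wildcard, so every character matches
gsub('.', '_', ip, fixed = TRUE)   # literal dots only
# "192_168_0_1"
```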
Using the gsub() Function with Data Frames
The gsub() function also works with data frames.
# creating a data frame
df<-data.frame(Creature=c('Starfish','Blue Crab','Bluefin Tuna','Blue Shark','Blue Whale'),Population=c(5,6,4,2,2))
Let’s prefix the values in the Creature column with 'Deep Sea ':
# substituting the values
gsub('.*^','Deep Sea ',df$Creature)
Running this command generates the following output:
Output
"Deep Sea Starfish"
"Deep Sea Blue Crab"
"Deep Sea Bluefin Tuna"
"Deep Sea Blue Shark"
"Deep Sea Blue Whale"
In this example, the gsub() function uses the regular expression .*^. This pattern matches the (empty) position at the start of the string, so the replacement text is inserted as a prefix.
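The anchor ^ alone matches the empty position at the start of each string, so a simpler pattern gives the same prefixing. A sketch on two of the creature names:

```r
gsub('^', 'Deep Sea ', c('Starfish', 'Blue Crab'))
# "Deep Sea Starfish" "Deep Sea Blue Crab"
```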
In this article, you explored how to use the sub() and gsub() functions in R. These functions substitute a string or characters in a vector or a data frame with a specified string. The sub() function applies to the first match only; the gsub() function applies to all matches.
Continue your learning with How To Use replace() in R.
The predict() function in R is used to predict values based on input data. Every modeling facility in R uses predict() in its own way, but the core behavior of the function remains the same regardless of the model.
In this article, you will explore how to use the predict() function in R.
To complete this tutorial, you will need a working installation of R.
Using the predict() Function in R
The predict() function in R is used to predict values based on input data. Its basic syntax is:
predict(object, newdata, interval)
We will need data to predict values from. For this example, we can use the built-in cars dataset in R.
df <- datasets::cars
This will assign to the data frame a collection of speed and distance (dist) values:
speed dist
1 4 2
2 4 10
3 7 4
4 7 22
5 8 16
6 9 10
7 10 18
8 10 26
9 10 34
10 11 17
Next, we will use predict() to determine future values using this data.
First, we need to compute a linear model for this data frame:
# Creates a linear model
my_linear_model <- lm(dist~speed, data = df)
# Prints the model results
my_linear_model
Executing this code will calculate the linear model results:
Call:
lm(formula = dist ~ speed, data = df)
Coefficients:
(Intercept) speed
-17.579 3.932
The linear model expresses stopping distance as a function of speed for our input data. Now that we have a model, we can apply predict().
# Creating a data frame
variable_speed <- data.frame(speed = c(11,11,12,12,12,12,13,13,13,13))
# Fitting the linear model
linear_model <- lm(dist~speed, data = df)
# Predicts the future values
predict(linear_model, newdata = variable_speed)
This code generates the following output:
1 2 3 4 5
25.67740 25.67740 29.60981 29.60981 29.60981
6 7 8 9 10
29.60981 33.54222 33.54222 33.54222 33.54222
We have successfully predicted future distance values based on the previous data with the help of the linear model.
Now, we should check the confidence level of our predicted values to see how reliable the prediction is. The confidence interval returned by predict() helps us gauge the uncertainty in the predictions.
# Input data
variable_speed <- data.frame(speed = c(11,11,12,12,12,12,13,13,13,13))
# Fits the model
linear_model <- lm(dist~speed, data = df)
# Predicts the values with confidence interval
predict(linear_model, newdata = variable_speed, interval = 'confidence')
This code generates the following output:
fit lwr upr
1 25.67740 19.96453 31.39028
2 25.67740 19.96453 31.39028
3 29.60981 24.39514 34.82448
4 29.60981 24.39514 34.82448
5 29.60981 24.39514 34.82448
6 29.60981 24.39514 34.82448
7 33.54222 28.73134 38.35310
8 33.54222 28.73134 38.35310
9 33.54222 28.73134 38.35310
10 33.54222 28.73134 38.35310
You can see the confidence interval in our predicted values in the above output.
From this output, we can estimate that for cars traveling at speeds of 11-13 mph, the mean stopping distance lies roughly between 20 and 38 feet (in the cars dataset, speed is measured in mph and dist in feet).
The predict() function predicts values based on previous data behavior by fitting that data to a model. You can also use confidence intervals to check the reliability of the predictions.
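Note that predict() also supports interval = 'prediction', which reflects the spread of individual new observations rather than the uncertainty of the mean, so its bounds are always wider than the confidence bounds. A sketch using the same cars data:

```r
linear_model   <- lm(dist ~ speed, data = cars)
variable_speed <- data.frame(speed = c(11, 12, 13))
conf <- predict(linear_model, newdata = variable_speed, interval = 'confidence')
pred <- predict(linear_model, newdata = variable_speed, interval = 'prediction')
conf
pred   # same fitted values, but wider lwr/upr bounds
```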
We often come across situations where we need to build customized, user-defined functions to carry out a certain operation. With the R with() function, we can evaluate an R expression against a data frame in a single call!
That is, the with() function enables us to pass an R expression as an argument and evaluate it inside the function. It works on data frames only: the outcome of the R expression is evaluated with respect to the data frame passed as an argument.
Syntax:
with(data-frame, R expression)
Example:
rm(list = ls())
Num <- c(100,100,100,100,100)
Cost <- c(1200,1300,1400,1500,1600)
data_A <- data.frame(Num,Cost,stringsAsFactors = FALSE)
with(data_A, Num*Cost)
with(data_A, Cost/Num)
In the above example, we have calculated the expression ‘Num*Cost’ for the data frame data_A directly in the with() function.
After which, we have calculated the expression ‘Cost/Num’ within the function as well.
The reason for running these two statements one after another is to highlight that the with() function never alters the original data frame. It returns the output separately for every value in the relevant columns of the data frame.
Output:
> with(data_A, Num*Cost)
[1] 120000 130000 140000 150000 160000
> with(data_A, Cost/Num)
[1] 12 13 14 15 16
Having read about the with() function, now let us focus on its twin! Haha! Just joking! Though the names of the functions sound similar, they differ in how they work.
The R within() function
also calculates the outcome of the expression, but with a slight difference: it returns a copy of the data frame with a new column added that stores the result of the R expression.
Syntax:
within(data frame, new-column <- R expression)
Example:
rm(list = ls())
Num <- c(100,100,100,100,100)
Cost <- c(1200,1300,1400,1500,1600)
data_A <- data.frame(Num,Cost,stringsAsFactors = FALSE)
within(data_A, Product <- Num*Cost)
within(data_A, Q <- Cost/Num)
Here, we have performed the evaluation of the same expressions that we had used for the with() function. But, here we have created a new column to store the outcome of the expression.
> within(data_A, Product <- Num*Cost)
Num Cost Product
1 100 1200 120000
2 100 1300 130000
3 100 1400 140000
4 100 1500 150000
5 100 1600 160000
> within(data_A, Q <- Cost/Num)
Num Cost Q
1 100 1200 12
2 100 1300 13
3 100 1400 14
4 100 1500 15
5 100 1600 16
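within() also accepts a braced block of statements, so several derived columns can be added in one call. A sketch reusing a smaller data_A:

```r
Num  <- c(100, 100, 100)
Cost <- c(1200, 1300, 1400)
data_A <- data.frame(Num, Cost)
result <- within(data_A, {
  Product <- Num * Cost   # first derived column
  Q       <- Cost / Num   # second derived column
})
result
```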
By this, we have come to the end of this topic. Feel free to comment below, if you come across any question.
For more such posts related to R, stay tuned and till then, Happy Learning!! :)
R is one of the most popular scripting languages for statistical programming today. The demand for R programmers has been rising constantly since the early 2010s, and R still enjoys its status as a go-to programming language among data scientists.
R has also been adapted for deep learning in recent years, which has helped several statisticians take up deep learning in their respective fields easily, making R an indispensable part of the current burgeoning AI scene.
Recommended Read: Python Data Science Libraries
R has a precursor named S (S stands for statistics) language, developed by AT&T for statistical computation. AT&T began its work on S in 1976, as a part of its internal statistical analysis environment, which was earlier implemented as FORTRAN libraries.
The man behind S was John Chambers. The single-letter name S was inspired by the ubiquitous C language for programming at the time.
R was developed by Ross Ihaka and Robert Gentleman in a project that was conceived in 1992 at the University of Auckland, New Zealand. The first version was released in 1995 and the first stable version came out in the year 2000.
R initially differed from S as it added lexical scoping semantics on top of the existing S functionalities. The mono-letter name R was inspired by S again, taking the first letter of both the authors’ first names.
R was developed under the GNU General Public License and is openly distributable.
The S programming language was later commercialized as S-PLUS, which added advanced analytical abilities and OOP capabilities and is now owned by TIBCO Software.
R remains the more dominantly used statistical programming language compared to S and S-PLUS, and rightly so, owing to many of its virtues.
R is thought to be the least disliked programming language. Despite all its advantages, R is far from perfect, like any other language. Before plunging into learning R, it will be useful to keep the shortcomings in mind.
R is available as a command-line environment from the CRAN project (short for Comprehensive R Archive Network). However, as a beginner you will learn faster with the help of an IDE, of which there are quite a few for R.
Hello folks, hope you are doing good. Today let’s focus on the substring functions in R.
Substring: we can perform multiple operations, like extracting values, replacing values, and more. For this we use functions like substr() and substring().
substr(x,start,stop)
substring(x,first,last=1000000L)
Here, x is the input character vector, start/first is the position of the first character to extract, and stop/last is the position of the last character.
Well, I hope that you are pretty much clear about the syntax. Now, let’s extract some characters from a string using our substring() function in R.
#returns the characters from 1,11
df<-("Journal_dev_private_limited")
substring(df,1,11)
Output = “Journal_dev”
#returns the characters from 1-7
df<-("Journal_dev")
substring(df,1,7)
Output = “Journal”
Congratulations, you just extracted data from the given string. As you can observe, the substring() function in R takes the start/first and stop/last values as arguments, indexes the string, and returns the substring between those positions.
With the help of the substring() function, you can also replace values in a string with your desired values. Seems interesting, right? Then let’s see how it works.
#returns the string by replacing the _ by space
df<-("We are_developers")
substring(df,7,7)=" "
df
Output = “We are developers”
#string replacement
df<-("R=is a language made for statistical analysis")
substring(df,2,2)=" "
df
Output = “R is a language made for statistical analysis”
Great, you did it! In this way, you can replace the values in a string with your desired value.
In the above case, you have replaced the ‘_’ (underscore) and ‘=’ (equal sign) with a " " (space). I hope that makes it clearer.
Till now, everything is good! But what if you are required to replace some values, and the change should reflect in all the strings present?
Don’t worry! We can replace the values so that the change reflects across all the strings present.
Let’s see how it works!
#replaces the 4th letter of each string by $
df<-c("Alok","Joseph","Hayato","Kelly","Paloma","Moca")
substring(df,4,4)<-c("$")
df
Output = “Alo$” “Jos$ph” “Hay$to” “Kel$y” “Pal$ma” “Moc$”
Oh, what happened? The 4th letter in every string has been replaced by the ‘$’ sign!
Well, that is substring() for you. It can replace the marked positions with our given value.
In the above case, the 4th letter in every input string was replaced by the ‘$’ sign by the substring() function. It’s incredible, right? I say yes. What about you?
We’ve already focused on plain strings. Now, we will be looking into the extraction of characters from data frame columns as well.
Let’s see how it works!
We can create a data frame with sample data having two columns, namely Technologies and Popularity. Let’s extract some specific characters out of this data. It will be fun.
#creates the data frame
df<-data.frame(Technologies=c("Datascience","machinelearning","Deeplearning","Artificalintelligence"),Popularity=c("70%","85%","90%","95%"))
df
Technologies Popularity
1 Datascience 70%
2 machinelearning 85%
3 Deeplearning 90%
4 Artificalintelligence 95%
Yes, we have now created a data frame. Let’s extract some text. To do so, run the below code to extract characters from 8-10 in all the strings in Technologies column using substr() function in R.
#creates new column with extracted values
df$Extracted_Technologies=substr(df$Technologies,8,10)
df
Output =
           Technologies Popularity Extracted_Technologies
1           Datascience        70%                    enc
2       machinelearning        85%                    lea
3          Deeplearning        90%                    rni
4 Artificalintelligence        95%                    ali
Now, you can see that we have created a new column with extracted data. Like this, you can extract the data by specifying the index values.
We saw the substr() function in action. Now, as I mentioned before, we will be looking into the str_sub() function, from the stringr package, and its way of extraction.
Let’s roll!
Again we are going to create the same data frame including the data of Technologies and its popularity as well.
df<-data.frame(Technologies=c("Datascience","machinelearning","Deeplearning","Artificalintelligence"),Popularity=c("70%","85%","90%","95%"))
df
Technologies Popularity
1 Datascience 70%
2 machinelearning 85%
3 Deeplearning 90%
4 Artificalintelligence 95%
Well, let’s make use of the str_sub() function, which will return the indexed characters as output. Taking/generating a substring in R can be done in many ways and this is one of them.
#using the str_sub function from the stringr package
library(stringr)
df$Extracted_Technologies=str_sub(df$Technologies,10,15)
df
As you can see, the str_sub() function extracted the indexed values and returns the output as shown below.
Technologies Popularity Extracted_Technologies
1 Datascience 70% ce
2 machinelearning 85% arning
3 Deeplearning 90% ing
4 Artificalintelligence 95% intell
Yes, taking or generating a substring of a given string is quite an easy task, thanks to functions like substr(), substring(), and str_sub(), which make substringing interesting and exciting.
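One more convenience of str_sub() worth noting: it accepts negative positions, counting back from the end of the string, which substr() cannot do. A small sketch (requires the stringr package):

```r
library(stringr)
str_sub('Datascience', -7)       # last seven characters
# "science"
str_sub('Datascience', 1, -8)    # everything except the last seven
# "Data"
```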
That’s all for now. Don’t forget to make use of this amazing function in your computation. Happy sub-stringing!!!
More study: R documentation
Hello folks, today we will be looking into the applications of the sink() function in R. We are going to make connections in multiple formats, such as text and CSV file types.
Using the sink() function, you can either print the data or you can export the data or the R output to text or CSV file types.
Let’s see how it works!
sink(): the sink() function is used to drive the output obtained in R to an external connection.
sink(file = NULL, type = c("output", "message"),split = FALSE)
Here, file is the file (or connection) to write output to, type selects whether normal output or messages are diverted, and split controls whether the output also continues to appear on the console.
With the help of sink() function, you can easily print the output to the text file as a connection. We can start this process by setting up the working directory.
To check the current working directory:
#returns the current working directory
getwd()
"C:/Users/Dell/Desktop/rfiles"
Fine. We got the working directory now. And you can also change the working directory using,
#sets the new working directory
setwd("The directory path here")
Paste the path into the setwd() function to set the new working directory. After that, don’t forget to confirm the change using the getwd() command as shown above.
I hope you are ready with your working path now. Now we are going to create a file connection and print some data into it.
Let’s see how it works.
#sinks the data into connection as text file
sink("my_first_sink.txt")
#prints numbers from 1 to 20
for (i in 1:20)
print(i)
sink()
Now you can see how neatly our R data is printed into the text file. Awesome right?
In the previous section, we have printed the data or the output to the text file. In this section, we are going to export the entire data set which is available in R by default.
Let’s see how it works.
#exports the data as text file
sink('export_dataframe.txt')
airquality
sink()
You can see that the data of the airquality dataset is driven to the text file as an external connection.
This is how you can easily drive the data in R to connections. You can also export as a csv file as shown below.
In this section, we are going to drive or export the data into a CSV file using the sink() function in R.
Let’s see how it works.
#export the data as csv file
sink('export_dataframe_1.csv')
iris
sink()
Well, this is a CSV file that includes the exported data from the R console. The sink() function in R offers an easy way to drive data to external connections such as files.
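One argument from the syntax above worth trying is split = TRUE, which sends the output to the file and to the console at the same time. A sketch (the file name both_outputs.txt is just an illustration):

```r
# with split = TRUE the output is both written to the file and shown on screen
sink('both_outputs.txt', split = TRUE)
print(summary(1:10))
sink()                       # close the connection
contents <- readLines('both_outputs.txt')
unlink('both_outputs.txt')   # tidy up the file afterwards
contents
```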
So far, so good. Now, let’s try to apply what we have learnt in the above sections all together.
The problem statement is simple.
=> Read a data set of your choice and get a summary of the data using the function summary(). Upon doing that, drive the result into the text file as connection.
Let’s rock!!!
#reads the data
df<-datasets::airquality
df
View(df)
The first step in the problem statement is done: the airquality dataset has been read into df.
The summary of the data using the function summary() can be seen below.
#returns the key insights of data
summary(airquality)
Ozone Solar.R Wind Temp Month
Min. : 1.00 Min. : 7.0 Min. : 1.700 Min. :56.00 Min. :5.000
1st Qu.: 18.00 1st Qu.:115.8 1st Qu.: 7.400 1st Qu.:72.00 1st Qu.:6.000
Median : 31.50 Median :205.0 Median : 9.700 Median :79.00 Median :7.000
Mean : 42.13 Mean :185.9 Mean : 9.958 Mean :77.88 Mean :6.993
3rd Qu.: 63.25 3rd Qu.:258.8 3rd Qu.:11.500 3rd Qu.:85.00 3rd Qu.:8.000
Max. :168.00 Max. :334.0 Max. :20.700 Max. :97.00 Max. :9.000
NA's :37 NA's :7
Day
Min. : 1.0
1st Qu.: 8.0
Median :16.0
Mean :15.8
3rd Qu.:23.0
Max. :31.0
This is the summary of the data which shows the minimum and maximum values, quartiles, median, mean and more insights.
Now, all you need to do is export it to a text file as an external connection.
#drive the output data to txt file
sink('problem-solution.txt')
summary(airquality)
sink()
You have got all the steps right and successfully driven the data into the text file as an external connection.
Now it’s time to end the connection.
#deletes the file
unlink('problem-solution.txt')
The above command deletes the file itself; the sink connection was already closed by the final sink() call.
To sum up all the steps,
The sink() function in R drives the R output to the external connection. You can export the data in multiple forms such as text and CSV files. You can either print the data into the connection or directly export the entire data to it.
After the data transfer, you can unlink the connection to terminate the file.
The sink() function in R is useful in many ways as it offers temporary connections to work with data.
More read: R documentation
R offers the standard function sample() to take a sample from datasets. Many business and data analysis problems require taking samples from the data. The random data is generated in this process with or without replacement, which is illustrated in the sections below.
Let’s roll into the topic!!!
sample(x, size, replace = FALSE, prob = NULL)
You may wonder: what is taking samples with replacement?
Well, while you are taking samples from a list or a dataset, if you specify replace=TRUE (or T), then the function will allow repetition of values.
Follow the below example which clearly explains the case.
#sample range lies between 1 to 5
x<- sample(1:5)
#prints the samples
x
Output -> 3 2 1 5 4
#samples range is 1 to 5 and number of samples is 3
x<- sample(1:5, 3)
#prints the samples (3 samples)
x
Output -> 2 4 5
#sample range is 1 to 5 and the number of samples is 6
x<- sample(1:5, 6)
x
#shows error as the range should include only 5 numbers (1:5)
Error in sample.int(length(x), size, replace, prob) :
cannot take a sample larger than the population when 'replace = FALSE'
#specifying replace=TRUE or T will allow repetition of values so that the function can generate 6 samples in the range 1 to 5. Here 2 is repeated.
x<- sample(1:5, 6, replace=T)
Output -> 2 4 2 2 4 3
In this case, we are going to take samples without replacement. The whole concept is shown below.
When sampling without replacement, the argument replace=F (the default) is used, and repetition of values is not allowed.
#samples without replacement
x<-sample(1:8, 7, replace=F)
x
Output -> 4 1 6 5 3 2 7
x<-sample(1:8, 9, replace=F)
Error in sample.int(length(x), size, replace, prob) :
cannot take a sample larger than the population when 'replace = FALSE'
#here the size of the sample is equal to range 'x'.
x<- sample(1:5, 5, replace=F)
x
Output -> 5 4 1 3 2
As you may have experienced, the samples are random and change each time you take them. To avoid that, or if you don’t want different samples each time, you can make use of the set.seed() function.
set.seed(): the set.seed() function makes the random number generator produce the same sequence each time you run the code.
This case is illustrated below; execute the code to get the same random samples each time.
#set the index
set.seed(5)
#takes the random samples with replacement
sample(1:5, 4, replace=T)
2 3 1 3
set.seed(5)
sample(1:5, 4, replace=T)
2 3 1 3
set.seed(5)
sample(1:5, 4, replace=T)
2 3 1 3
In this section, we are going to generate samples from a dataset in Rstudio.
This code will take the 10 rows as a sample from the ‘ToothGrowth’ dataset and display it. In this way, you can take the samples of the required size from the dataset.
#reads the dataset 'ToothGrowth' and takes 10 rows as a sample
df<- sample(1:nrow(ToothGrowth), 10)
df
--> 53 12 16 26 37 27 9 22 28 10
#sample 10 rows
ToothGrowth[df,]
len supp dose
53 22.4 OJ 2.0
12 16.5 VC 1.0
16 17.3 VC 1.0
26 32.5 VC 2.0
37 8.2 OJ 0.5
27 26.7 VC 2.0
9 5.2 VC 0.5
22 18.5 VC 2.0
28 21.5 VC 2.0
10 7.0 VC 0.5
In this section, we are going to use the set.seed() function to take the samples from the dataset.
Execute the below code to generate the samples from the data set using set.seed().
#set.seed function
set.seed(10)
#taking sample of 10 rows from the iris dataset.
x<- sample(1:nrow(iris), 10)
x
--> 137 74 112 72 88 15 143 149 24 13
#displays the 10 rows
iris[x, ]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
137 6.3 3.4 5.6 2.4 virginica
74 6.1 2.8 4.7 1.2 versicolor
112 6.4 2.7 5.3 1.9 virginica
72 6.1 2.8 4.0 1.3 versicolor
88 6.3 2.3 4.4 1.3 versicolor
15 5.8 4.0 1.2 0.2 setosa
143 5.8 2.7 5.1 1.9 virginica
149 6.2 3.4 5.4 2.3 virginica
24 5.1 3.3 1.7 0.5 setosa
13 4.8 3.0 1.4 0.1 setosa
You will get the same rows when you execute the code multiple times. The values won’t change as we have used the set.seed() function.
Well, we will understand this concept with the help of a problem.
Problem: A gift shop has decided to give a surprise gift to one of its customers. For this purpose, they have collected some names. The task is to choose a random name from the list.
Hint: use the sample() function to generate random samples.
As you can see below, every time you run this code, it generates a random sample of participant names.
#creates a list of names and generates one sample from this list
sample(c('jack','Rossie','Kyle','Edwards','Joseph','Paloma','Kelly','Alok','Jolie'),1)
--> "Rossie"
sample(c('jack','Rossie','Kyle','Edwards','Joseph','Paloma','Kelly','Alok','Jolie'),1)
--> "Jolie"
sample(c('jack','Rossie','Kyle','Edwards','Joseph','Paloma','Kelly','Alok','Jolie'),1)
--> "jack"
sample(c('jack','Rossie','Kyle','Edwards','Joseph','Paloma','Kelly','Alok','Jolie'),1)
--> "Edwards"
sample(c('jack','Rossie','Kyle','Edwards','Joseph','Paloma','Kelly','Alok','Jolie'),1)
--> "Kyle"
With the help of the above examples and concepts, you have understood how you can generate random samples and extract specific data from a dataset.
It may come as a relief that R also allows you to set the probabilities of the outcomes, which solves many practical problems. Let’s see how it works with the help of a simple example.
Let’s think of a company that manufactures 10 watches, among which 20% are found defective. Let’s illustrate this with the help of the below code.
#samples with an 80% probability of good watches and a 20% probability of defective watches
sample (c('Good','Defective'), size=10, replace=T, prob=c(.80,.20))
"Good" "Good" "Good" "Defective" "Good" "Good"
"Good" "Good" "Defective" "Good"
You can also try different probability adjustments, as shown below.
sample (c('Good','Defective'), size=10, replace=T, prob=c(.60,.40))
--> "Good" "Defective" "Good" "Defective" "Defective" "Good"
"Good" "Good" "Defective" "Good"
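If you want to convince yourself that the prob argument behaves as specified, you can take a much larger sample and tabulate the observed proportions; they should land close to the probabilities passed in. This is just a sanity-check sketch, not part of the original example:

```r
#draws a large sample and checks the observed proportions
set.seed(99)
draws <- sample(c('Good','Defective'), size=10000, replace=T, prob=c(.80,.20))
prop.table(table(draws))
#the 'Good' share comes out close to 0.80 and 'Defective' close to 0.20
```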
In this tutorial, you have learned how to generate samples from a dataset, a vector, or a list, with or without replacement. The set.seed() function is helpful when you want to generate the same sequence of samples each time.
Try taking samples from the various datasets available in R; you can also import CSV files and take samples with probability adjustments as shown.
More study: R documentation
So, let us begin!!
Error metrics enable us to evaluate and justify the functioning of the model on a particular dataset.
ROC plot is one such error metric.
ROC plot, also known as the ROC AUC curve, is a classification error metric. That is, it measures the functioning and results of classification machine learning algorithms.
To be precise, the ROC curve represents the probability curve of the values, whereas the AUC is the measure of separability of the different groups of values/labels. With the ROC AUC curve, one can analyze what share of values has been distinguished and classified correctly by the model according to the labels.
The higher the AUC score, the better the classification of the predicted values.
For example, consider a model to predict and classify whether the outcome of a toss is ‘Heads’ or ‘Tails’.
So, if the AUC score is high, it indicates that the model is capable of classifying ‘Heads’ as ‘Heads’ and ‘Tails’ as ‘Tails’ more efficiently.
In technical terms, the ROC curve is plotted between the True Positive Rate and the False Positive Rate of a model.
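Before moving on to the implementation, here is a minimal sketch of how these two rates are computed from a confusion matrix; the counts below are made up purely for illustration:

```r
#hypothetical confusion matrix counts
TP <- 40; FN <- 10   #actual positives: predicted right / wrong
FP <- 5;  TN <- 45   #actual negatives: predicted wrong / right

TPR <- TP / (TP + FN)   #True Positive Rate (sensitivity)
FPR <- FP / (FP + TN)   #False Positive Rate (1 - specificity)
TPR   # 0.8
FPR   # 0.1
```

Varying the classification threshold traces out pairs of (FPR, TPR) values, and the ROC curve is the plot of those pairs.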
Let us now try to implement the concept of ROC curve in the upcoming section!
We can use ROC plots to evaluate the Machine learning models as well as discussed earlier. So, let us try implementing the concept of ROC curve against the Logistic Regression model.
Let us begin!! :)
In this example, we will use the Bank Loan defaulter dataset for modelling through Logistic Regression. We will plot the ROC curve using the plot() function from the ‘pROC’ library. You can find the dataset here!
1. We split the dataset into training and testing data using the createDataPartition() function from the ‘caret’ library.
2. We apply Logistic Regression to our dataset using the glm() function. Further, we test the model on the testing data using the predict() function and get the values for the error metrics.
3. We calculate the AUC score using the roc() method and plot the ROC curve using the plot() function available in the ‘pROC’ library.
#Removed all the existing objects
rm(list = ls())
#Setting the working directory
setwd("D:/Edwisor_Project - Loan_Defaulter/")
getwd()
#Load the dataset
dta = read.csv("bank-loan.csv",header=TRUE)
### Data SAMPLING ####
library(caret)
set.seed(101)
split = createDataPartition(dta$default, p = 0.80, list = FALSE)
train_data = dta[split,]
test_data = dta[-split,]
#error metrics -- Confusion Matrix
err_metric=function(CM)
{
TN =CM[1,1]
TP =CM[2,2]
FP =CM[1,2]
FN =CM[2,1]
precision =(TP)/(TP+FP)
recall_score =(TP)/(TP+FN)
f1_score=2*((precision*recall_score)/(precision+recall_score))
accuracy_model =(TP+TN)/(TP+TN+FP+FN)
False_positive_rate =(FP)/(FP+TN)
False_negative_rate =(FN)/(FN+TP)
print(paste("Precision value of the model: ",round(precision,2)))
print(paste("Accuracy of the model: ",round(accuracy_model,2)))
print(paste("Recall value of the model: ",round(recall_score,2)))
print(paste("False Positive rate of the model: ",round(False_positive_rate,2)))
print(paste("False Negative rate of the model: ",round(False_negative_rate,2)))
print(paste("f1 score of the model: ",round(f1_score,2)))
}
# 1. Logistic regression
logit_m =glm(formula = default~. ,data =train_data ,family='binomial')
summary(logit_m)
logit_P = predict(logit_m , newdata = test_data[-13] ,type = 'response' )
logit_P <- ifelse(logit_P > 0.5,1,0) # classify using a 0.5 probability threshold
CM= table(test_data[,13] , logit_P)
print(CM)
err_metric(CM)
#ROC-curve using pROC library
library(pROC)
roc_score=roc(test_data[,13], logit_P) #AUC score
plot(roc_score ,main ="ROC curve -- Logistic Regression ")
Output:
R programming provides us with another library named ‘verification’ to plot the ROC-AUC curve for a model.
In order to make use of the function, we need to install and import the 'verification' library into our environment.
Having done this, we plot the data using roc.plot() function
for a clear evaluation between the ‘Sensitivity’ and ‘Specificity’ of the data values as shown below.
install.packages("verification")
library(verification)
x<- c(0,0,0,1,1,1)
y<- c(.7, .7, 0, 1, .5, .6)
data<-data.frame(x,y)
names(data)<-c("yes","no")
roc.plot(data$yes, data$no)
Output:
By this, we have come to the end of this topic. Feel free to comment below, in case you come across any question.
Try implementing the concept of ROC plots with other Machine Learning models and do let us know about your understanding in the comment section.
Till then, Stay tuned and Happy Learning!! :)
So, let us begin!!
Before diving deep into the concept of outliers, let us focus on the pre-processing of data values.
In the domain of data science and machine learning, pre-processing of data values is a crucial step. By pre-processing, we mean removing all the errors and noise from the data prior to modeling.
In our last post, we had understood about missing value analysis in R programming.
Today, we will focus on an advanced step of the same process - outlier detection and removal in R.
Outliers, as the name suggests, are the data points that lie away from the other points of the dataset. That is, they are data values that appear far from the other data values and hence disturb the overall distribution of the dataset.
This usually indicates an abnormal distribution of the data values.
Effect of Outliers on the model -
Having understood the effect of outliers, it is now the time to work on its implementation.
At first, it is very important for us to detect the presence of outliers in the dataset.
So, let us begin. We have made use of the Bike Rental Count Prediction dataset. You can find the dataset here!
Initially, we have loaded the dataset into the R environment using the read.csv() function.
Prior to outlier detection, we have performed missing value analysis just to check for the presence of any NULL or missing values. For the same, we have made use of sum(is.na(data)).
#Removed all the existing objects
rm(list = ls())
#Setting the working directory
setwd("D:/Ediwsor_Project - Bike_Rental_Count/")
getwd()
#Load the dataset
bike_data = read.csv("day.csv",header=TRUE)
### Missing Value Analysis ###
sum(is.na(bike_data))
summary(is.na(bike_data))
#From the above result, it is clear that the dataset contains NO Missing Values.
The data here contains NO missing values
Having said this, now is the time to detect the presence of outliers in the dataset. To achieve this, we have saved the numeric data columns into a separate data structure/variable using the c() function.
Further, we have made use of the boxplot() function to detect the presence of outliers in the numeric variables.
BoxPlot:
From the visuals, it is clear that the variables ‘hum’ and ‘windspeed’ contain outliers in their data values.
Now, after performing outlier analysis in R, we replace the outliers identified by the boxplot() method with NA values to operate over them as shown below.
##############################Outlier Analysis -- DETECTION###########################
# 1. Outliers in the data values exists only in continuous/numeric form of data variables. Thus, we need to store all the numeric and categorical independent variables into a separate array structure.
col = c('temp','cnt','hum','windspeed')
categorical_col = c("season","yr","mnth","holiday","weekday","workingday","weathersit")
# 2. Using BoxPlot to detect the presence of outliers in the numeric/continuous data columns.
boxplot(bike_data[,c('temp','atemp','hum','windspeed')])
# From the above visualization, it is clear that the data variables 'hum' and 'windspeed' contains outliers in the data values.
#OUTLIER ANALYSIS -- Removal of Outliers
# 1. From the boxplot, we have identified the presence of outliers. That is, the data values that are present above the upper quartile and below the lower quartile can be considered as the outlier data values.
# 2. Now, we will replace the outlier data values with NA.
for (x in c('hum','windspeed'))
{
value = bike_data[,x][bike_data[,x] %in% boxplot.stats(bike_data[,x])$out]
bike_data[,x][bike_data[,x] %in% value] = NA
}
#Checking whether the outliers in the above defined columns are replaced by NA or not
sum(is.na(bike_data$hum))
sum(is.na(bike_data$windspeed))
as.data.frame(colSums(is.na(bike_data)))
Now, we check for the presence of missing data i.e. whether the outlier values have been converted to missing values properly using the sum(is.na()) function.
Output:
> sum(is.na(bike_data$hum))
[1] 2
> sum(is.na(bike_data$windspeed))
[1] 13
> as.data.frame(colSums(is.na(bike_data)))
colSums(is.na(bike_data))
instant 0
dteday 0
season 0
yr 0
mnth 0
holiday 0
weekday 0
workingday 0
weathersit 0
temp 0
atemp 0
hum 2
windspeed 13
casual 0
registered 0
cnt 0
As a result, we have converted the 2 outlier points from the ‘hum’ column and 13 outlier points from the ‘windspeed’ column into missing(NA) values.
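As an aside, dropping these rows (as this article does next) is not the only option; you could instead impute the newly created NA values, for example with the column mean. Here is a minimal sketch on a toy data frame whose column names simply mirror the ones above:

```r
#toy stand-in for the two columns with outlier-induced NAs
toy <- data.frame(hum = c(0.6, NA, 0.7), windspeed = c(10, 12, NA))
#replaces each NA with the mean of the non-missing values in its column
for (x in c('hum','windspeed'))
{
  toy[,x][is.na(toy[,x])] = mean(toy[,x], na.rm = TRUE)
}
sum(is.na(toy))   #0 -- every NA replaced by its column mean
```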
At last, we treat the missing values by dropping the rows containing NA values using the drop_na() function from the ‘tidyr’ library.
#Removing the rows with NA values
library(tidyr)
bike_data = drop_na(bike_data)
as.data.frame(colSums(is.na(bike_data)))
Output:
As a result, all the outliers have been effectively removed now!
> as.data.frame(colSums(is.na(bike_data)))
colSums(is.na(bike_data))
instant 0
dteday 0
season 0
yr 0
mnth 0
holiday 0
weekday 0
workingday 0
weathersit 0
temp 0
atemp 0
hum 0
windspeed 0
casual 0
registered 0
cnt 0
By this, we have come to the end of this topic. Feel free to comment below, in case you come across any question. For more such posts related to R programming, stay tuned!!
Till then, Happy Learning!!:)
You know that we have functions like mean(), median(), and sd() to calculate the average, middle value, and spread of values. But have you ever thought of a function which gives you the min and max values in a vector or a data frame?
If so, congratulations: you have the functions min() and max(), which return the minimum and maximum values respectively.
Sounds interesting right? Let’s see how it works!
The syntax of the min() function is given below.
min(x, na.rm = FALSE)
The syntax of the max() function is given below.
max(x, na.rm = FALSE)
In this section, we are going to find the max values present in the vector. For this, we first create a vector and then apply the max() function, which returns the max value in the vector.
#creates a vector
vector<-c(45.6,78.8,65.0,78.9,456.7,345.89,87.6,988.3)
#returns the max values present in the vector
max(vector)
Output= 988.3
Here, we are going to find the minimum value in a vector using function min(). You can create a vector and then apply min() to the vector which returns the minimum value as shown below.
#creates a vector
vector<-c(45.6,78.8,65.0,78.9,456.7,345.89,87.6,988.3)
#returns the minimum value present in the vector
min(vector)
Output = 45.6
Sometimes in data analysis, you may encounter NA values in a data frame or a vector. You then need to bypass the NA values in order to get the desired result.
The max() function won’t return a meaningful value if it encounters NA values in the process; it simply returns NA. Hence you have to remove the NA values from the vector or data frame to get the max value.
#creates a vector having NA values
df<- c(134,555,NA,567,876,543,NA,456)
#max function won't return any value because of the presence of NA.
max(df)
#results in NA instead of max value
Output = NA
So to avoid this and get the max value, we use the na.rm argument to remove NA values from the vector. Now you can see that the max() function returns the maximum value.
#max function with remove NA values is set as TRUE.
max(df, na.rm = T)
Output = 876
Just like we applied the max function in the above section, here we are going to find the minimum value in a vector having NA values.
#creates a vector having NA values
df<- c(134,555,NA,567,876,543,NA,456)
#returns NA instead of minimum value
min(df)
Output = NA
To overcome this, we use the na.rm argument to remove NA values from the vector. Now you can see that the min() function returns the min value.
#creates a vector having NA values
df<- c(134,555,NA,567,876,543,NA,456)
#removes NA values and returns the minimum value in the vector
min(df, na.rm = T)
Output = 134
Till now we have dealt with numerical minimum and maximum values. But you can also find the min and max values for a character vector. Yes, you heard it right!
Let’s see how it works!
In the case of character vectors, the min and max functions will consider alphabetical order and return the min and max values accordingly, as shown below.
#creates a character vector with some names
character_vector<- c('John','Angelina','Smuts','Garena','Lucifer')
#returns the max value
max(character_vector)
Output = “Smuts”
Similarly, we can find the minimum values in the character vector as well using min() function which is shown below.
#creates a character vector with some names
character_vector<- c('John','Angelina','Smuts','Garena','Lucifer')
#returns the minimum values in the vector
min(character_vector)
Output = “Angelina”
Let’s find the minimum and maximum values in a real data frame. The min and max values in a dataset give a fair idea about the data distribution.
This is the air quality dataset that comes built into R. Note that the dataset includes NA values. With the knowledge of removing NA values using the na.rm argument, let’s find the min and max values of the Ozone column.
min(airquality$Ozone, na.rm=T)
Output = 1
max(airquality$Ozone, na.rm = T)
Output = 168
Let’s find the min and max values of the Temperature values in the airquality dataset.
min(airquality$Temp, na.rm = T)
Output = 56
max(airquality$Temp, na.rm = T)
Output = 97
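Two related base R helpers are worth mentioning here. range() returns the min and max together in one call, and which.min()/which.max() return the positions of those values rather than the values themselves:

```r
#returns the min and max in a single call
range(airquality$Temp)
#Output = 56 97

#position (row index) of the maximum temperature
idx <- which.max(airquality$Temp)
airquality$Temp[idx]
#Output = 97
```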
Well, in this tutorial we have focused on finding the minimum and maximum values in a vector, a data frame, and a character vector as well.
The min and max functions can be used on both numerical and character vectors. You can also remove the NA values using the na.rm argument to get the desired results.
That’s all for now. I hope you will practice finding min and max values as shown in the above sections. Don’t forget to hit the comments section for any queries.
Happy learning!!!
More study: R documentation
You have heard of Google Sheets. It is like Excel: it allows you to organize, edit, and analyze different types of data. But, unlike Excel, Google Sheets is a web-based spreadsheet program, which encourages collaboration.
It is automatically synced with your Google account, Google Drive, and its fellow services such as Google Docs and Slides. In Google Sheets, you need not save every time; it offers an autosave feature, which updates the sheet after each activity. Isn’t it cool?
If we talk about the interface, Google Sheets follows Excel with reasonable changes. You are free to share the sheets for any collaboration. Most of the time, this makes our lives easy, as multiple people can work on the sheets in real-time.
I think that’s enough information about Google Sheets; let’s dive into something exciting!
You can read Google Sheets data in R using the package ‘googlesheets4’. This package allows you to access sheets from R.
First you need to install the ‘googlesheets4’ package in R, and then you have to load the library to proceed further.
#Install the required package
install.packages('googlesheets4')
#Load the required library
library(googlesheets4)
That’s good. Our ‘googlesheets4’ library is now ready to pull the data from Google Sheets.
You cannot read the data from Google Sheets right away. As Google Sheets are web-based spreadsheets, they are associated with your Google account. So, you have to allow R to access your Google Sheets.
You may have used functions like read.csv or read.table to read data into R. But here you don’t need to mention the file type. All you need to do is copy the Google Sheets link from the browser, paste it into read_sheet(), and run the code.
Once you run the below code, you can see an interface for the further process.
#Read google sheets data into R
x <- read_sheet('https://docs.google.com/spreadsheets/d/1J9-ZpmQT_oxLZ4kfe5gRvBs7vZhEGhSCIpNS78XOQUE/edit?usp=sharing')
Is it OK to cache OAuth access credentials in the folder
1: Yes
2: No
You have to select option 1: Yes to continue the authorization process.
As a first step, if you have multiple Google accounts logged in, it will ask you to choose the account to continue with, as shown below.
It’s great that you have completed the authorization process and it went successfully. Now let’s see how we can read the data into R from Google sheets.
#Reads data into R
df <- read_sheet('https://docs.google.com/spreadsheets/d/1J9-ZpmQT_oxLZ4kfe5gRvBs7vZhEGhSCIpNS78XOQUE/edit?usp=sharing')
#Prints the data
df
# A tibble: 1,000 x 20
months_loan_dura~ credit_history purpose amount savings_balance employment_leng~
<chr> <dbl> <chr> <chr> <dbl> <chr>
1 < 0 DM 6 critic~ radio~ 1169 unknown
2 1 - 200 DM 48 repaid radio~ 5951 < 100 DM
3 unknown 12 critic~ educa~ 2096 < 100 DM
4 < 0 DM 42 repaid furni~ 7882 < 100 DM
5 < 0 DM 24 delayed car (~ 4870 < 100 DM
6 unknown 36 repaid educa~ 9055 unknown
7 unknown 24 repaid furni~ 2835 501 - 1000 DM
8 1 - 200 DM 36 repaid car (~ 6948 < 100 DM
9 unknown 12 repaid radio~ 3059 > 1000 DM
10 1 - 200 DM 30 critic~ car (~ 5234 < 100 DM
# ... with 990 more rows, and 14 more variables: installment_rate <chr>,
# personal_status <dbl>, other_debtors <chr>, residence_history <chr>,
# property <dbl>, age <chr>, installment_plan <dbl>, housing <chr>,
# existing_credits <chr>, default <dbl>, dependents <dbl>, telephone <dbl>,
# foreign_worker <chr>, job <chr>
Here you can see, how R can read the data from Google sheets using the function ‘read_sheet’ function.
I am also adding the dataframe here for your reference / understanding.
You don’t need to copy the whole sheet link to read the data. You can copy only the sheet ID and use that with the read_sheet function. It will read the data as usual.
If you are not aware of the sheet ID, I have added a sheet link below with the Sheet ID highlighted. You can copy this ID and follow the same process.
https://docs.google.com/spreadsheets/d/**1J9-ZpmQT\_oxLZ4kfe5gRvBs7vZhEGhSCIpNS78XOQUE**/edit#gid=0
You can find the discussed code below.
#Reads the data with Sheet ID into R
df <- read_sheet('1J9-ZpmQT_oxLZ4kfe5gRvBs7vZhEGhSCIpNS78XOQUE')
#Prints the data
df
This code will give the same output, i.e. the data. I have used credit data for the whole illustration; you can use any data for this purpose. I hope that from now on, reading Google Sheets into R is not an issue for you.
Almost all organizations use Google Sheets for business operations and data work. As an analyst or an R user, it is good to know how to work with Google Sheets and R. It is a very simple method that you can practice on your own data and sheet ID/link. I hope you learned something that will save you time in your work. That’s all for now and Happy R!
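One more tip before you go: if the sheet is shared publicly (“anyone with the link can view”), you can skip the authorization flow altogether. The googlesheets4 package provides gs4_deauth() for exactly this; the sheet ID below is the same one used throughout this article.

```r
#skip OAuth for publicly shared sheets
library(googlesheets4)
gs4_deauth()
#read_sheet() now makes unauthenticated requests, so no account prompt appears
df <- read_sheet('1J9-ZpmQT_oxLZ4kfe5gRvBs7vZhEGhSCIpNS78XOQUE')
```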
More read: R documentation
So, let us begin!! :)
Be it a matrix or a data frame, we deal with the data in terms of rows and columns. In the data analysis field, especially for statistical analysis, it is necessary for us to know the details of the object i.e. the count of the rows and columns which represent the data values.
R programming provides us with some easy functions to get the related information at ease! So, let us have a look at it one by one.
R programming helps us with the ncol() function, by which we can get the count of the columns of an object.
That is, the ncol() function returns the total number of columns present in the object.
Syntax:
ncol(object)
We need to pass the object that contains the data. Here, the object can be a data frame or even a matrix or a data set.
Example: 01
In the below example, we have created a matrix as shown below. Further, using ncol() function, we try to get the value of the number of columns present in the matrix.
rm(list = ls())
data = matrix(c(10,20,30,40),2,6)
print(data)
print('Number of columns of the matrix: ')
print(ncol(data))
Output:
> print(data)
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] 10 30 10 30 10 30
[2,] 20 40 20 40 20 40
> print('Number of columns of the matrix: ')
[1] "Number of columns of the matrix: "
> print(ncol(data))
[1] 6
Example 02:
Here, we have imported the Bank Loan Defaulter prediction dataset into the R environment using read.csv() function. You can find the dataset here!
Using ncol() function, we detect and extract the count of columns in the dataset.
rm(list = ls())
getwd()
#Load the dataset
dta = read.csv("bank-loan.csv",header=TRUE)
print('Number of columns: ')
print(ncol(dta))
Output:
Number of columns:
9
Having understood about columns, it’s now the time to discuss about the rows for an object.
R provides us the nrow() function to get the number of rows of an object. That is, with the nrow() function, we can easily detect and extract the number of rows present in an object, be it a matrix, a data frame, or even a dataset.
Syntax:
nrow(object)
Example 01:
In this example, we have created a matrix using the matrix() function in R. Further, we have used the nrow() function to get the number of rows present in the matrix as shown below.
rm(list = ls())
data = matrix(c(10,20,30,40),2,6)
print(data)
print('Number of rows of the matrix: ')
print(nrow(data))
Output:
> print(data)
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] 10 30 10 30 10 30
[2,] 20 40 20 40 20 40
"Number of rows of the matrix: "
[1] 2
Example 02:
Now, in this example, we have made use of the same Bank Loan Defaulter dataset as mentioned in the ncol() function section above!
Having loaded the dataset into the R environment, we make use of the nrow() function to extract the number of rows present in the dataset.
rm(list = ls())
getwd()
#Load the dataset
dta = read.csv("bank-loan.csv",header=TRUE)
print('Number of rows: ')
print(nrow(dta))
Output:
"Number of rows: "
850
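When you need both counts at once, base R also offers dim(), which returns the number of rows and columns together as a length-two vector. A quick sketch on the same matrix used earlier:

```r
data = matrix(c(10,20,30,40),2,6)
print(dim(data))
#[1] 2 6  -- i.e. c(nrow(data), ncol(data))
```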
By this, we have come to the end of this topic. Feel free to comment below, in case you come across any question.
For more such posts related to R programming, Stay tuned with us.
Till then, Happy Learning!! :)
Let’s dive !!!
seq(): The seq() function in R can generate general or regular sequences from the given inputs.
seq(from, to, by, length.out, along.with)
Where:
from - the starting value of the sequence.
to - the end (maximum) value of the sequence.
by - the increment (step) between consecutive values.
length.out - the desired total length of the sequence.
along.with - a vector whose length determines the length of the output sequence.
Well, I know you are super excited to generate a sequence using seq() in R. Without much delay, let’s see how it works.
In this sample, the first number represents the ‘from’ argument and the last number represents the ‘to’ argument.
Serial Numbers:
seq(from=1,to=10)
Output:
1 2 3 4 5 6 7 8 9 10
Decimal Numbers:
seq(1.0,10.0)
Output:
1 2 3 4 5 6 7 8 9 10
Negative Numbers:
seq(-1,-10)
Output:
-1 -2 -3 -4 -5 -6 -7 -8 -9 -10
In this section, along with the from and to arguments, we are using the ‘by’ argument as well.
The by argument increments the sequence by the given number, as shown below.
Here, I am illustrating the sample using the argument keywords as well, for clarity.
seq(from=1,to=10,by=2)
The Output:
1 3 5 7 9
In the above output, you can observe that the argument ‘by’ increments the sequence by 2 i.e. The beginning number of the sequence 1 gets incremented by 2 each time till the sequence ends at 10.
seq(from=3,to=30,by=3)
Result:
3 6 9 12 15 18 21 24 27 30
You can also do this without keywords if you know the syntax well. You will get the same output without keywords. But it’s always recommended to use the keywords for proper documentation and readability.
seq(3,30,3)
The Result:
3 6 9 12 15 18 21 24 27 30
length.out is the argument that decides the total length of the sequence.
Let’s see how it works with some illustrations:
seq.int(from=3,to=30,length.out=10)
Output:
3 6 9 12 15 18 21 24 27 30
As you can observe in the above output, the length.out argument structures the sequence with the specified length.
Let’s use this argument to generate a negative sequence.
seq(from=-3,to=-30,length.out= 10)
Output =
-3 -6 -9 -12 -15 -18 -21 -24 -27 -30
The along.with argument takes an input vector and outputs a new sequence of the same length as the input vector, within the specified range of numbers.
Don’t worry about the above lines too much. I will illustrate this with simple examples.
y<-c(5,10,15,20)
seq(1,25,along.with = y)
Output:
1 9 17 25
df<-c(-1,-5,-10,-2,-4)
seq(-5,10,along.with = df)
Output:
-5.00 -1.25 2.50 6.25 10.00
As the headline says, you can also use shortcut variants of seq() with minimal arguments. Yes, you heard it right!
If you wonder how to pass the arguments directly, don’t worry: seq_len(n) generates the sequence from 1 to n, and seq.int() works like seq() but returns integer sequences where possible. Follow the below illustrations to understand this easily.
seq_len(5)
Output =
1 2 3 4 5
seq_len(10)
Output =
1 2 3 4 5 6 7 8 9 10
seq_len(-10)
Output =
Error in seq_len(-10):
argument must be coercible to non-negative integer
seq.int(-5,5)
-5 -4 -3 -2 -1 0 1 2 3 4 5
seq.int(2,10)
2 3 4 5 6 7 8 9 10
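One more companion worth knowing is seq_along(), which generates 1 up to the length of a given vector. Unlike 1:length(x), it behaves safely when the vector is empty:

```r
fruits <- c('apple','orange','grape')
seq_along(fruits)
#Output = 1 2 3

#on an empty vector it returns integer(0), whereas 1:length(x) would give 1 0
seq_along(character(0))
```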
The seq() function in R is a valuable addition to the list of functions present in R. Using the function, you can generate the regular sequences by passing various arguments as well.
This article is concentrated on the seq() function and it’s various arguments which are illustrated in the above sections. Hope you got some good understating on this topic. Happy sequencing!!!
More Study: R documentation
There are many ways to handle missing data in R. You can drop those records, but keep in mind that you are dropping information when you do so and may lose a potential edge in modeling. On the other hand, you can impute the missing data with the mean or median of the data. In this article, we will look at filling missing values in R using the tidyr package.
tidyr is an R package which offers many functions to assist you in tidying the data. The greater the data quality, the better the model!
To install the Tidyr package in R, run the below code in R.
#Install tidyr package
install.packages('tidyr')
#Load the library
library(tidyr)
package ‘tidyr’ successfully unpacked and MD5 sums checked
You will get the confirmation message after successful loading of the tidyr as shown above.
Yes, we have to create a simple sample data frame that has missing values. This will help us use the fill function of tidyr to fill in the missing data.
#Create a dataframe
a <- c('A','B','C','D','E','F','G','H','I','J')
b <- c('Roger','Carlo','Durn','Jessy','Mounica','Rack','Rony','Saly','Kelly','Joseph')
c <- c(86,NA,NA,NA,88,NA,NA,86,NA,NA)
df <- data.frame(a,b,c)
df
a b c
1 A Roger 86
2 B Carlo NA
3 C Durn NA
4 D Jessy NA
5 E Mounica 88
6 F Rack NA
7 G Rony NA
8 H Saly 86
9 I Kelly NA
10 J Joseph NA
Well, we got our data frame, but with a lot of missing values. So, in cases like this where your data has many missing values, you can make use of the fill function in R to fill them with neighboring values.
Yes, you can fill in the data as I said earlier. This process includes two approaches - filling upwards (each missing value takes the next non-missing value below it) and filling downwards (each missing value takes the last non-missing value above it).
Didn’t get it?
Don’t worry. We will be going through some examples to illustrate the same and you will get to know how things work.
In this process, we have a data frame with 3 columns and 10 data records in it. Before using the fill function to handle the missing data, you have to make sure of a few things:
Sometimes when the data is collected, people may enter 1 value as a representation of some values, because they were the same.
Ex: When collecting the age, if there were 10 people whose age is 25, you can mention 25 against the last person indicating that all 10 people’s age is 25.
Please note that it is not the most common situation you face. But, the intention of this is to make sure, when you are in this kind of space, you can use the fill function to deal with this.
#Dataframe
a b c
1 A Roger 86
2 B Carlo NA
3 C Durn NA
4 D Jessy NA
5 E Mounica 88
6 F Rack NA
7 G Rony NA
8 H Saly 86
9 I Kelly NA
10 J Joseph NA
#Create a new dataframe by filling missing values upwards (bottom-up)
df1 <- df %>% fill(c, .direction = 'up')
df1
a b c
1 A Roger 86
2 B Carlo 88
3 C Durn 88
4 D Jessy 88
5 E Mounica 88
6 F Rack 86
7 G Rony 86
8 H Saly 86
9 I Kelly NA
10 J Joseph NA
You can observe that the fill function filled the missing values using the up direction (bottom-up). Note that the last two values remain NA, because there is no non-missing value below them to fill from.
Well, here we will use the ‘down’ method to fill the missing values in the data. Always keep in mind the assumptions I mentioned in the earlier section, so you understand what you are doing and what the outcome will be.
#Data
a b c
1 A Roger 86
2 B Carlo NA
3 C Durn NA
4 D Jessy NA
5 E Mounica 88
6 F Rack NA
7 G Rony NA
8 H Saly 86
9 I Kelly NA
10 J Joseph NA
#Creates new dataframe by filling missing values (Down) - (Top-Down approach)
df1 <- df %>% fill(c, .direction = 'down')
df1
a b c
1 A Roger 86
2 B Carlo 86
3 C Durn 86
4 D Jessy 86
5 E Mounica 88
6 F Rack 88
7 G Rony 88
8 H Saly 86
9 I Kelly 86
10 J Joseph 86
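tidyr also offers replace_na() for when you want a fixed constant instead of a neighboring value. Here is a small sketch reusing a data frame of the same shape as above:

```r
library(tidyr)
df <- data.frame(a = c('A','B','C'), c = c(86, NA, 88))
#replaces every NA in column c with 0
df1 <- replace_na(df, list(c = 0))
df1$c
#Output = 86 0 88
```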
Filling missing values in R is an important process when you are analyzing any data which has null values. Things may seem a bit hard at first, but make sure you go through the article once or twice to understand it concisely. It’s not a hard cake to digest!
I hope this method will come to your assistance in your future assignments. That’s all for now. Happy R!!! :)
More read: Fill function in R
So, let us begin!!
In Statistics, Covariance is the measure of the relation between two variables of a dataset. That is, it depicts the way two variables are related to each other.
For instance, when two variables are highly positively correlated, the variables move in the same direction.
Covariance is useful in data pre-processing prior to modelling in the domain of data science and machine learning.
In R programming, we make use of cov() function
to calculate the covariance between two data frames or vectors.
Example:
We provide three parameters to the cov() function: the two vectors x and y, and the method to be used ('pearson', 'kendall', or 'spearman'):
a <- c(2,4,6,8,10)
b <- c(1,11,3,33,5)
print(cov(a, b, method = "spearman"))
Output:
> print(cov(a, b, method = "spearman"))
[1] 1.25
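The default method is 'pearson'. As a quick sanity check, the Pearson covariance returned by cov() matches the textbook formula, the sum of the products of deviations divided by n - 1:

```r
a <- c(2,4,6,8,10)
b <- c(1,11,3,33,5)

# manual Pearson covariance: sum of co-deviations over (n - 1)
manual <- sum((a - mean(a)) * (b - mean(b))) / (length(a) - 1)
print(manual)      # 15
print(cov(a, b))   # 15, identical to the manual computation
```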
Correlation on a statistical basis is the method of finding the relationship between the variables in terms of the movement of the data. That is, it helps us analyze the effect of changes made in one variable over the other variable of the dataset.
When two variables are highly (positively) correlated, we say that the variables depict the same information and have the same effect on the other data variables of the dataset.
The cor() function
in R enables us to calculate the correlation between the variables of the data set or vector.
Example:
a <- c(2,4,6,8,10)
b <- c(1,11,3,33,5)
corr = cor(a,b)
print(corr)
print(cor(a, b, method = "spearman"))
Output:
> print(corr)
[1] 0.3629504
> print(cor(a, b, method = "spearman"))
[1] 0.5
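The Pearson correlation is simply the covariance rescaled by the two standard deviations, which you can verify directly with the same vectors:

```r
a <- c(2,4,6,8,10)
b <- c(1,11,3,33,5)

# cor(a, b) is cov(a, b) divided by sd(a) * sd(b)
print(cov(a, b) / (sd(a) * sd(b)))   # 0.3629504, same as cor(a, b)
```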
R provides us with cov2cor() function
to convert the covariance value to correlation. It converts the covariance matrix into a correlation matrix of values.
Note: The value passed to cov2cor() needs to be a square covariance matrix; a plain scalar covariance will not work!
Example:
Here, we bind the two vectors a and b into a matrix with cbind(), so that cov() returns a proper covariance matrix. Further, using the cov2cor() function, we obtain the corresponding correlation matrix for the pair of variables.
a <- c(2,4,6,8)
b <- c(1,11,3,33)
covar = cov(cbind(a, b))
print(covar)
res = cov2cor(covar)
print(res)
Output:
> print(covar)
          a         b
a  6.666667  29.33333
b 29.333333 214.66667
> print(res)
          a         b
a 1.0000000 0.7753986
b 0.7753986 1.0000000
By this, we have come to the end of this topic. Here, we have understood the in-built functions to calculate correlation and covariance in R. Moreover, we have even seen the cov2cor() function, which translates a covariance matrix into a correlation matrix.
Feel free to comment below, in case you come across any question. For more such posts related to R, Stay tuned.
Till then, Happy Learning!! :)
strsplit() is an exceptionally useful R function that splits an input string vector into substrings. Let's see how this function works and all the ways to split strings in R using strsplit().
strsplit(): an R function used to split strings into substrings based on a split argument.
strsplit(x,split,fixed=T)
Where x is the input character vector, split is the delimiter (or regular expression) to split on, and fixed controls whether split is matched literally (TRUE) or as a regular expression (FALSE, the default).
In this section, let’s see a simple example that shows the use case of the strsplit() function. In this case, the strsplit() function will split the given input into a list of strings or values.
Let’s see how it works.
df<-("R is the statistical analysis language")
strsplit(df, split = " ")
Output =
"R" "is" "the" "statistical" "analysis" "language"
We have done it! In this way, we can easily split the strings present in the data. One of the best use cases of strsplit() function is in plotting the word clouds. In that, we need tons of word strings to plot the most popular or repeated word. So, in order to get the strings from the data we use this function which returns the list of strings.
A delimiter in general is a simple symbol, character, or value that separates the words or text in the data. In this section, we will be looking into the use of various symbols as delimiters.
df<-"get%better%every%day"
strsplit(df,split = '%')
Output =
"get" "better" "every" "day"
In this case, the input text has the % as a delimiter. Now, our concern is to remove the delimiter and get the text as a list of strings. The strsplit() function has done the same here. It removed the delimiter and returned the strings as a list.
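The split argument is treated as a regular expression by default. When the delimiter is itself a regex metacharacter, such as a dot, set fixed = TRUE (the third argument shown in the syntax above) so it is matched literally:

```r
df <- "version.2.1.0"

# '.' is a regex metacharacter that matches any character, so the
# default split would return only empty strings; fixed = TRUE
# treats the dot as a plain literal delimiter
strsplit(df, split = ".", fixed = TRUE)
# "version" "2" "1" "0"
```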
In this section, we will be looking into the splitting of text using regular expressions. Sounds interesting? Let’s do it.
df<-"all16i5need6is4a9long8vacation"
strsplit(df,split = "[0-9]+")
Output =
"all" "i" "need" "is" "a" "long" "vacation"
In this example, the input contains numbers between 0-9, hence we used the regular expression [0-9]+ to split the data by removing the numbers. The strsplit() function returns a list of strings as output, as shown above.
Till now, we have come across various ways of splitting a given string. Now, what if we want to split out each and every character of the string? Well, we use the strsplit() function with an empty split argument to extract each character.
Let's see how it works.
df<-"You can type q() in Rstudio to quit R"
strsplit(df,split="")
Output =
"Y" "o" "u" " " "c" "a" "n" " " "t" "y" "p" "e" " " "q" "(" ")" " " "i"
"n" " " "R" "s" "t" "u" "d" "i" "o" " " "t" "o" " " "q" "u" "i" "t" " "
"R"
Another great application of the strsplit() function is splitting dates. This use case is handy and well worth doing. In this section, let's see how it works.
test_dates<-c("24-07-2020","25-07-2020","26-07-2020","27-07-2020","28-07-2020")
test_mat<-strsplit(test_dates,split = "-")
test_mat
Output =
"24" "07" "2020"
"25" "07" "2020"
"26" "07" "2020"
"27" "07" "2020"
"28" "07" "2020"
You can see a good looking output right? Using this function, we can create numerous splits from the input strings or data as well. You can also convert the dates into matrix format.
matrix(unlist(test_mat),ncol=3,byrow=T)
Output =
[,1] [,2] [,3]
[1,] "24" "07" "2020"
[2,] "25" "07" "2020"
[3,] "26" "07" "2020"
[4,] "27" "07" "2020"
[5,] "28" "07" "2020"
You can see the above results, where we have created a matrix from the split data. Organising the split data matters, because merely splitting the text doesn't make much sense unless it is transformed into a reliable form like the matrix above.
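The split result is an ordinary list, so individual components can be pulled out directly as well. A small sketch (using the test_mat list built above) that extracts the year from every date:

```r
# the third piece of each split date is the year
years <- sapply(test_mat, function(d) d[3])
years
# "2020" "2020" "2020" "2020" "2020"
```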
Well, we are at the end of the article, and I hope you now have a better understanding of the working and use cases of the strsplit() function in R. This function is widely used and popular for splitting strings. That's all for now. Will be back with another function another day.
More study: R documentation
The R melt() and cast() functions help us reshape the data within a data frame into any customized shape.
Let’s understand both the functions in detail. Here we go!
The melt() function
in R programming is an in-built function. It enables us to reshape and elongate the data frames in a user-defined manner. It organizes the data values in a long data frame format.
Have a look at the below syntax!
Syntax:
melt(data-frame, na.rm = FALSE, value.name = "name", id = 'columns')
We pass the data frame to be reshaped to the function, along with na.rm = FALSE as the default value, which means the NA values won't be ignored.
Further, we pass the new variable/column name to value.name parameter to store the elongated values obtained from the function into it.
The id parameter is set to the column names of the data frame with respect to which the reshaping should happen.
Example:
In this example, we make use of the 'MASS', 'reshape2', and 'reshape' libraries. Having created the data frame, we apply the melt() function to it with respect to the columns A and B.
rm(list = ls())
install.packages("MASS")
install.packages("reshape2")
install.packages("reshape")
library(MASS)
library(reshape2)
library(reshape)
A <- c(1,2,3,4,2,3,4,1)
B <- c(1,2,3,4,2,3,4,1)
a <- c(10,20,30,40,50,60,70,80)
b <- c(100,200,300,400,500,600,700,800)
data <- data.frame(A,B,a,b)
print("Original data frame:\n")
print(data)
melt_data <- melt(data, id = c("A","B"))
print("Reshaped data frame:\n")
print(melt_data)
Output:
[1] "Original data frame:\n"
A B a b
1 1 1 10 100
2 2 2 20 200
3 3 3 30 300
4 4 4 40 400
5 2 2 50 500
6 3 3 60 600
7 4 4 70 700
8 1 1 80 800
[1] "Reshaped data frame:\n"
> print(melt_data)
A B variable value
1 1 1 a 10
2 2 2 a 20
3 3 3 a 30
4 4 4 a 40
5 2 2 a 50
6 3 3 a 60
7 4 4 a 70
8 1 1 a 80
9 1 1 b 100
10 2 2 b 200
11 3 3 b 300
12 4 4 b 400
13 2 2 b 500
14 3 3 b 600
15 4 4 b 700
16 1 1 b 800
As seen above, after applying the melt() function, the data frame gets converted to an elongated data frame. In order to regain nearly the original shape of the data frame, the R cast() function is used.
The cast() function accepts a formula and an aggregate function as parameters (here, the formula describes the manner in which the data is to be represented after reshaping) and casts the elongated or melted data frame back into an aggregated form.
Syntax:
cast(data, formula, aggregate function)
We can provide the cast() function with any aggregate function available such as mean, sum, etc.
Example:
rm(list = ls())
library(MASS)
library(reshape2)
library(reshape)
A <- c(1,2,3,4,2,3,4,1)
B <- c(1,2,3,4,2,3,4,1)
a <- c(10,20,30,40,50,60,70,80)
b <- c(100,200,300,400,500,600,700,800)
data <- data.frame(A,B,a,b)
print("Original data frame:\n")
print(data)
melt_data <- melt(data, id = c("A"))
print("Reshaped data frame after melting:\n")
print(melt_data)
cast_data = cast(melt_data, A~variable, mean)
print("Reshaped data frame after casting:\n")
print(cast_data)
As seen above, we have passed mean as the aggregate function to cast() and used the formula A~variable as the format of representation.
Output:
[1] "Original data frame:\n"
A B a b
1 1 1 10 100
2 2 2 20 200
3 3 3 30 300
4 4 4 40 400
5 2 2 50 500
6 3 3 60 600
7 4 4 70 700
8 1 1 80 800
[1] "Reshaped data frame after melting:\n"
A variable value
1 1 B 1
2 2 B 2
3 3 B 3
4 4 B 4
5 2 B 2
6 3 B 3
7 4 B 4
8 1 B 1
9 1 a 10
10 2 a 20
11 3 a 30
12 4 a 40
13 2 a 50
14 3 a 60
15 4 a 70
16 1 a 80
17 1 b 100
18 2 b 200
19 3 b 300
20 4 b 400
21 2 b 500
22 3 b 600
23 4 b 700
24 1 b 800
[1] "Reshaped data frame after casting:\n"
A B a b
1 1 1 45 450
2 2 2 35 350
3 3 3 45 450
4 4 4 55 550
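For reference, the reshape2 package offers dcast() with the same formula interface; a hedged equivalent of the cast() call above would be:

```r
library(reshape2)

# fun.aggregate plays the same role as the aggregate function in cast()
cast_data2 <- dcast(melt_data, A ~ variable, fun.aggregate = mean)
print(cast_data2)
```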
By this, we have come to the end of this topic. Feel free to comment below, in case you come across any question.
Till then, Happy Learning!! :)
Reference:
The plot() function in R isn't a single defined function but a placeholder for a family of related functions. The exact function being called will depend upon the parameters used. At its simplest, the plot() function simply plots two vectors against each other.
plot(c(1,2,3,4,5),c(1,4,9,16,25))
This gives a simple plot for y = x^2.
The plot() function in R can be customized in multiple ways to create more complex and eye-catching plots as we will see.
Colors can be chosen by name, and the full list of recognized color names is returned by the colors() function.
Let us now try constructing a few graphs with what we learned so far.
We will begin by generating a sine wave plot. Let x be a sequence vector of values from -pi to pi with 0.1 intervals and y contains the respective sine values of x. Now try plotting y against x.
x=seq(-pi,pi,0.1)
y=sin(x)
plot(x,y)
Let us now try changing the symbols and colors.
plot(x,y,pch=c(4,5,6),col=c('red','blue','violet','green'))
We are now letting R cycle through 3 different symbols and 4 different colors for marking the points. Let us see how it turned out.
R also allows combining multiple graphs into a single image for our viewing convenience using the par() function. We only need to set the space before calling the plot function in our graph.
#Set a plotting window with one row and two columns.
par(mfrow=c(1,2))
plot(x,y,type='l')
plot(x,y,pch=c(4,5,6),col=c('red','blue','violet','green'))
A few more graphs using various options from above are illustrated below.
#Set space for 2 rows and 3 columns.
par(mfrow=c(2,3))
#Plot out the graphs using various options.
plot(x,cos(x),col=c('blue','orange'),type='o',pch=19,lwd=2,cex=1.5)
plot(x,x*2,col='red',type='l')
plot(x,x^2/3+4.2, col='violet',type='o',lwd=2,lty=1)
plot(c(1,3,5,7,9,11),c(2,7,5,10,8,10),type='o',lty=3,col='pink',lwd=4)
plot(x<-seq(1,10,0.5),50*x/(x+2),col=c('green','dark green'),type='h')
plot(x,log(x),col='orange',type='s')
The resulting graph looks like this.
Graphs look more complete when there are notes and information that explain them. These include a title for the chart and axes, a legend of the graph. Sometimes even labeling the data points will be necessary. Let us look at how we add these to the graphs in R.
Let us look at examples illustrating these.
#Displaying the title with color
plot(c(1,3,5,7,9,11),c(2,7,5,10,8,10),type='o',lty=3, col='pink',lwd=4,main="This is a graph",col.main='blue')
#Same graph with xlabel and ylabel added.
plot(c(1,3,5,7,9,11),c(2,7,5,10,8,10),type='o',lty=3,col='pink',lwd=4,main="This is a graph",col.main='blue',xlab="Time",ylab="Performance")
Let us add a label to each of the data points in the graph using a text attribute.
labelset <-c('one','three','five','seven','nine','eleven')
x1<- c(1,3,5,7,9,11)
y1 <- c(2,7,5,10,8,10)
plot(x1,y1,type='o',lty=3,col='pink',lwd=4,main="This is a graph",col.main='blue',xlab="Time",ylab="Performance")
text(x1+0.5,y1,labelset,col='red')
Finally, let us add a legend to the above graph using the legend() function.
legend('topleft',inset=0.05,"Performance",lty=3,col='pink',lwd=4)
The position can be specified by either x and y coordinates or using a position like ‘topleft’ or ‘bottomright’. Inset refers to moving the legend box a little to the inside of the graph. The resulting graph now has a legend.
R also allows two graphs to be displayed on top of each other instead of creating a new window for every graph. This is done by calling a lines() function for the second graph rather than plot() again. These are most useful when performing comparisons of metrics or among different sets of values. Let us look at an example.
x=seq(2,10,0.1)
y1=x^2
y2=x^3
plot(x,y1,type='l',col='red')
lines(x,y2,col='green')
legend('bottomright',inset=0.05,c("Squares","Cubes"),lty=1,col=c("red","green"),title="Graph type")
Straight lines can be added to an existing plot using the simple abline() function. The abline() function takes four main arguments: a, b, h, and v. The arguments a and b represent the intercept and slope of a straight line. h gives the y-values for horizontal lines and v gives the x-values for vertical lines.
Let us look at an example to make this clear. Try executing these three statements after building the above graph for squares and cubes.
abline(a=4,b=5,col='blue')
abline(h=c(4,6,8),col="dark green",lty=2)
abline(v=c(4,6,8),col="dark green",lty=2)
The first blue line is built with the intercept and slope specified. The next sets of three horizontal and vertical lines are drawn at the specified y and x values in the dotted line style, as set by lty=2.
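abline() also accepts a fitted linear model, a common way to overlay a least-squares regression line. A sketch assuming the x and y1 vectors from the squares-and-cubes example above:

```r
# fit a straight line to the squares data and draw it on the plot
model <- lm(y1 ~ x)
abline(model, col = 'purple', lwd = 2)
```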
This covers the basics of plot function in R. When combined with other packages like ggplot2, R builds the most presentable and dynamic graphics as we will see in further tutorials.
The syntax of the sum() function is: sum(x, na.rm = FALSE/TRUE)
Vector is the easiest method to store multiple elements in R. Look at the below examples which show the various types of vectors.
Example vectors:
V<- c(2,4,6,8,10) #This is a numerical vector
V<-c('red', 'blue', 'orange') #This is a character or string vector
V<-c(TRUE, FALSE,TRUE) #This is a logical vector
In this section, we are finding the sum of the given values. Execute the below code to find the sum of the values.
#list of values or a vector having numerical values
df<- c(23,44,66,34,56,78,97,53,24,57,34,678,643,1344)
#calculates the sum of the values
sum(df)
Output —> 3231
Sometimes your dataset may contain 'NA' values, i.e. 'Not Available'. If you add up values including NA, the sum() function returns NA instead of the numerical sum.
Let’s learn how to deal with such datasets.
In this section, we are finding the sum of a vector having numeric values along with NA. The syntax of the sum() function shows that,
sum(x,na.rm=FALSE/TRUE)
x-> it is the vector having the numeric values.
na.rm-> controls whether NA values are removed. If you set it to TRUE, the NA values in the vector are skipped; otherwise the whole sum evaluates to NA.
The below code will illustrate the action.
#creates a vector having numerical values
x<-c(123,54,23,876,NA,134,2346,NA)
#calculates the sum and removes the NA values from the summation
sum(x,na.rm = TRUE)
Output —> 3556
#if you mention FALSE, the sum function returns the value NA
sum(x,na.rm = FALSE)
----> NA
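Because TRUE is coerced to 1 and FALSE to 0, sum() combined with is.na() is also a quick way to count the missing values before deciding how to treat them:

```r
x <- c(123,54,23,876,NA,134,2346,NA)

# logicals are coerced to 0/1, so this counts the NA entries
sum(is.na(x))   # 2
```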
Summing the values present in the particular column is very easy in R. The below code will illustrate the same.
This dataset contains NA values, so we handle them by passing na.rm = TRUE, as shown in the code.
#read the data
datasets::airquality
#sample data, just a few samples
Ozone Solar.R Wind Temp Month Day
1 41 190 7.4 67 5 1
2 36 118 8.0 72 5 2
3 12 149 12.6 74 5 3
4 18 313 11.5 62 5 4
5 NA NA 14.3 56 5 5
6 28 NA 14.9 66 5 6
7 23 299 8.6 65 5 7
8 19 99 13.8 59 5 8
9 8 19 20.1 61 5 9
10 NA 194 8.6 69 5 10 continues.....
#calculates the summation of the values in column 'Ozone'.
sum(airquality$Ozone, na.rm = TRUE)
Output —> 4887
This section focuses on summing each row present in the dataset. Execute the below code to get the summed values of each row.
Here we are removing the NA values by passing na.rm = TRUE.
datasets::airquality
rowSums(airquality, na.rm = TRUE)
Output: You can see the summation of all values present in each row.
[1] 311.4 241.0 255.6 413.5 80.3 119.9 407.6 203.8 122.1 286.6 103.9 367.7
[13] 394.2 385.9 174.2 444.5 441.0 182.4 455.5 151.7 103.7 447.6 127.7 226.0
[25] 169.6 369.9 97.0 148.0 426.9 457.7 435.4 379.6 378.7 334.1 289.2 324.6
[37] 369.3 260.7 380.9 480.8 476.5 379.9 369.2 280.0 445.8 433.5 325.9 436.7
[49] 155.2 241.5 262.3 260.3 164.7 200.6 362.3 249.0 245.0 163.3 223.5 157.9
[61] 265.0 500.1 400.2 368.2 206.9 338.6 460.9 460.1 477.3 482.7 373.4 247.6
[73] 380.3 317.9 417.9 171.3 418.9 425.3 461.3 384.1 406.5 131.9 377.7 418.5
[85] 499.6 456.0 224.6 266.0 425.4 454.4 444.4 441.2 218.9 137.8 193.4 182.9
[97] 140.4 171.6 485.0 434.3 432.0 340.6 253.5 353.5 415.5 333.7 177.5 204.3
[109] 220.3 247.4 390.9 350.3 401.5 161.3 373.6 377.7 523.4 416.0 281.7 421.7
[121] 476.3 461.3 412.3 370.9 383.1 363.8 390.6 250.4 238.5 378.9 348.3 354.9
[133] 384.7 395.9 392.5 371.3 137.9 231.5 392.9 348.8 153.3 368.3 336.0 357.6
[145] 148.2 298.3 168.3 147.6 334.9 271.2 331.3 271.0 361.5
Let’s find the sum of each column present in the dataset. Execute the below code to find the sum of each column.
datasets::airquality
colSums(airquality, na.rm = TRUE)
Output:
Ozone Solar.R Wind Temp Month Day
4887.0 27146.0 1523.5 11916.0 1070.0 2418.0
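colSums() is a fast shortcut for the more general apply() function; the following line produces the same column totals by applying sum() over margin 2 (the columns):

```r
# equivalent to colSums(airquality, na.rm = TRUE), just more general
apply(airquality, 2, sum, na.rm = TRUE)
```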
The sum() function in R finds the sum of the values in a vector. This tutorial showed how to find the sum of values, the sum of a particular column, and how to get the summed value of each row and column in the dataset.
The important thing is to consider the NA values or not. If you want to eliminate it, mention TRUE otherwise it should be FALSE as shown above. That’s all for now, keep going!!!
rbind() stands for row binding: in simpler terms, joining multiple rows to form a single batch. It may include joining two data frames, vectors, and more.
This article will talk about the uses and applications of the rbind() function in R programming.
Without wasting much time, let's roll into the topic!!!
rbind(): The rbind or row bind function is used to bind or combine multiple groups of rows together.
rbind(x,x1)
Where x and x1 are the data frames or vectors whose rows are to be combined.
The idea of binding or combing the rows of multiple data frames is highly beneficial in data manipulation.
The below diagram will definitely get you the idea of working the rbind() function.
You can see that how rows of different data frames will bound/combined by the rbind() function.
As you know, the rbind() function in R is used to bind the rows of different groups of data.
In this section, let's construct two simple data frames and bind them using the rbind() function.
#creating a data frame
Student_details<-c("Mark","John","Fredrick","Floyd","George")
Student_class<-c("High school","College","High school","High school","College")
df1<-data.frame(Student_class,Student_details)
df1
The above code will construct a simple data frame presenting student details and names.
Student_class Student_details
1 High school Mark
2 College John
3 High school Fredrick
4 High school Floyd
5 College George
Well, now we have a dataframe of 5 rows. Let’s create another data frame.
#creating a dataframe
Student_details<-c("Bracy","Evin")
Student_class<-c("High school","College")
df2<-data.frame(Student_class,Student_details)
df2
df2
Student_class Student_details
1 High school Bracy
2 College Evin
Well, now we have 2 data frames of different row counts (df1 and df2). Let’s use the rbind() function to bind the above 2 data frames into a single data frame.
Let’s see how it works.
You won’t believe that the whole binding process will require just a line of code.
#binds rows of 2 input data frames
rbind(df1,df2)
Student_class Student_details
1 High school Mark
2 College John
3 High school Fredrick
4 High school Floyd
5 College George
6 High school Bracy
7 College Evin
The resultant data frame will be a bonded version of both data frames as shown in the above output.
Well, in the previous section, we have combined the two row groups together.
In this section, we are going to combine two data sets together using the rbind function in R.
#creates the data frame
Student_details<-c("Mark","John","Fredrick","Floyd","George")
Student_class<-c("High school","College","High school","High school","College")
df1<-data.frame(Student_class,Student_details)
df1
Student_class Student_details
1 High school Mark
2 College John
3 High school Fredrick
4 High school Floyd
5 College George
#creates the data frame
Student_details<-c("Bracy","Evin")
Student_class<-c("High school","College")
Student_rank<-c("A","A+")
df2<-data.frame(Student_class,Student_details,Student_rank)
df2
Student_class Student_details Student_rank
1 High school Bracy A
2 College Evin A+
rbind(df1,df2)
Error in rbind(deparse.level, ...) :
numbers of columns of arguments do not match
Oh wait, what happened? Why is the function throwing an error?
If you read the error message, it states that the numbers of columns do not match.
We have 2 columns in ‘df1’ and 3 columns in ‘df2’.
Worry not! We have got the bind_rows() function, which will assist us in these scenarios.
bind_rows() is a function that is part of the dplyr package. We need to import the dplyr package first to execute this function.
We are using the same data frames present in the above section i.e df1 and df2. Let’s see how it works.
#install required packages
install.packages('dplyr')
#import libraries
library(dplyr)
#bind the rows
bind_rows(df1,df2)
Student_class Student_details Student_rank
1 High school Mark <NA>
2 College John <NA>
3 High school Fredrick <NA>
4 High school Floyd <NA>
5 College George <NA>
6 High school Bracy A
7 College Evin A+
You can now see that the bind_rows() function has combined these two datasets despite the unequal column counts. The empty spaces are marked as <NA>.
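If you prefer to stay in base R, you can also align the column counts yourself before calling rbind(). A sketch, assuming the df1 and df2 from above: add the missing column to df1, filled with NA, and then bind:

```r
# base-R alternative to bind_rows(): pad df1 with the missing column
df1$Student_rank <- NA
rbind(df1, df2)
```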
In this section, we will be looking into the binding of two entire data sets in R.
Let’s see how it works.
We are going to use the BOD data set as it has only 6 rows and also you can easily observe the bound rows.
#binds two data sets
rbind(BOD,BOD)
Time demand
1 1 8.3
2 2 10.3
3 3 19.0
4 4 16.0
5 5 15.6
6 7 19.8
7 1 8.3
8 2 10.3
9 3 19.0
10 4 16.0
11 5 15.6
12 7 19.8
BOD dataset has 6 rows. As we are passing the data twice, the rbind() function will bind the same rows as shown above.
And also don’t forget that you have bind_rows() function as well.
#binds two different datasets
bind_rows(BOD,women)
Time demand height weight
1 1 8.3 NA NA
2 2 10.3 NA NA
3 3 19.0 NA NA
4 4 16.0 NA NA
5 5 15.6 NA NA
6 7 19.8 NA NA
7 NA NA 58 115
8 NA NA 59 117
9 NA NA 60 120
10 NA NA 61 123
11 NA NA 62 126
12 NA NA 63 129
13 NA NA 64 132
14 NA NA 65 135
15 NA NA 66 139
16 NA NA 67 142
17 NA NA 68 146
18 NA NA 69 150
19 NA NA 70 154
20 NA NA 71 159
21 NA NA 72 164
These examples clearly show the working and applications of the rbind() and bind_rows() functions.
I hope these illustrations helped you in understanding them.
In this section, we will be focusing on binding multiple (more than 2) row groups using the bind_rows() function. Let's see how it works.
#binds rows of 3 data sets
bind_rows(BOD,women,ToothGrowth)
Time demand height weight len supp dose
1 1 8.3 NA NA NA <NA> NA
2 2 10.3 NA NA NA <NA> NA
3 3 19.0 NA NA NA <NA> NA
4 4 16.0 NA NA NA <NA> NA
5 5 15.6 NA NA NA <NA> NA
6 7 19.8 NA NA NA <NA> NA
7 NA NA 58 115 NA <NA> NA
8 NA NA 59 117 NA <NA> NA
9 NA NA 60 120 NA <NA> NA
10 NA NA 61 123 NA <NA> NA
11 NA NA 62 126 NA <NA> NA
12 NA NA 63 129 NA <NA> NA
13 NA NA 64 132 NA <NA> NA
14 NA NA 65 135 NA <NA> NA
15 NA NA 66 139 NA <NA> NA
16 NA NA 67 142 NA <NA> NA
17 NA NA 68 146 NA <NA> NA
18 NA NA 69 150 NA <NA> NA
19 NA NA 70 154 NA <NA> NA
20 NA NA 71 159 NA <NA> NA
Observe how all three datasets were bound together by the bind_rows() function in R. This is the beauty of the bind_rows() function.
These 2 functions have endless applications in data manipulation in R programming.
The rbind() function in R and the bind_rows() function are the most useful functions when it comes to data manipulation.
You can easily bind two data frames of the same column count using rbind() function.
In the same way, if the data frames have unequal column counts, you can use the bind_rows() function along with dplyr package.
Well, That’s all for now, Happy binding!!!
More read: R documentation
Hello people, today we will be looking at how to find the quantiles of values using the quantile() function.
Quantile: In layman's terms, a quantile is a cut point dividing a sorted sample into equal-sized groups. Due to this nature, quantiles are also called fractiles. Among the quantiles, the 25th percentile is called the lower quartile, the 50th percentile is the median, and the 75th percentile is the upper quartile.
In the below sections, let’s see how this quantile() function works in R.
The syntax of the quantile() function in R is,
quantile(x, probs = , na.rm = FALSE)
Where x is the numeric vector, probs gives the probabilities at which quantiles are required, and na.rm controls whether missing values are removed before computation.
Well, hope you are good with the definition and explanations about quantile function. Now, let’s see how quantile function works in R with the help of a simple example which returns the quantiles for the input data.
#creates a vector having some values and the quantile function will return the percentiles for the data.
df<-c(12,3,4,56,78,18,46,78,100)
quantile(df)
Output:
0% 25% 50% 75% 100%
3 12 46 78 100
In the above sample, you can observe that the quantile function first arranges the input values in the ascending order and then returns the required percentiles of the values.
Note: The quantile function divides the data into equal halves, in which the median acts as the middle; below it lies the lower quartile and above it the upper quartile.
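You can confirm the middle cut point directly: the 50% quantile is exactly the median of the data.

```r
df <- c(12,3,4,56,78,18,46,78,100)

# the 50% quantile and the median agree
median(df)                  # 46
unname(quantile(df, 0.5))   # 46
```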
Missing values are everywhere. In this data-driven digital world, you will encounter NA ('not available') values frequently. If your data has these missing values, you can end up getting NA in the output, or outright errors.
So, in order to handle these missing values, we are going to use the na.rm argument. Setting it removes the NA values from our data so that the true quantiles are returned.
Let’s see how this works.
#creates a vector having values along with NA's
df<-c(12,3,4,56,78,18,NA,46,78,100,NA)
quantile(df)
Output:
Error in quantile.default(df) :
missing values and NaN's not allowed if 'na.rm' is FALSE
Oh, we got an error. If your guess is that the NA values are the problem, you are absolutely right. If NA values are present in our data, the majority of functions will either return NA itself or raise an error message as above.
Well, let's remove these missing values using the na.rm argument.
#creates a vector having values along with NA's
df<-c(12,3,4,56,78,18,NA,46,78,100,NA)
#removes the NA values and returns the percentiles
quantile(df,na.rm = TRUE)
Output:
0% 25% 50% 75% 100%
3 12 46 78 100
In the above sample, you can see the na.rm argument and its impact on the output. It removes the NA's to avoid a false output.
Seeing the probs argument in the syntax showcased in the very first section of the article, you may wonder what it means and how it works. Well, the probs argument is passed to the quantile function to get specific or custom percentiles.
Seems complicated? Don't worry, I will break it down into simple terms.
Well, whenever you use the quantile function, it returns the standard percentiles, like the 25th, 50th, and 75th. But what if you want the 47th percentile, or maybe the 88th?
There comes the argument ‘probs’, in which you can specify the required percentiles to get those.
Before going to the example, you should know a few things about the probs argument.
Probs: The probs or the probabilities argument should lie between 0 and 1.
Here is a sample which illustrates the above statement.
#creates the vector of values
df<-c(12,3,4,56,78,18,NA,46,78,100,NA)
#tries to return the 22nd and 77th percentiles
quantile(df,na.rm = T,probs = c(22,77))
Output:
Error in quantile.default(df, na.rm = T, probs = c(22, 77)) :
'probs' outside [0,1]
Oh, it’s an error!
Did you get the idea, what happened?
Well, here comes the probs constraint. Even though we passed plausible percentile values, they violate the 0-1 condition. The probs argument should only include values that lie between 0 and 1.
So, we have to convert the probs 22 and 77 to 0.22 and 0.77. Now the input values are between 0 and 1, right? I hope this makes sense.
#creates a vector of values
df<-c(12,3,4,56,78,18,NA,46,78,100,NA)
#returns the 22 and 77th percentiles of the input values
quantile(df,na.rm = T,probs = c(0.22,0.77))
Output:
22% 77%
10.08 78.00
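The probs argument accepts any vector of probabilities, so a whole grid can be requested in one call; for example, all nine deciles built with seq():

```r
df <- c(12,3,4,56,78,18,NA,46,78,100,NA)

# seq() builds the probabilities 0.1, 0.2, ..., 0.9
quantile(df, na.rm = TRUE, probs = seq(0.1, 0.9, by = 0.1))
```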
Suppose you want your code to return only the percentile values and avoid the names. In these situations, you can make use of the unname() function.
The unname() function removes the names (0%, 25%, 50%, 75%, 100%) and returns only the percentile values.
Let’s see how it works!
#creates a vector of values
df<-c(12,3,4,56,78,18,NA,46,78,100,NA)
quantile(df,na.rm = T,probs = c(0.22,0.77))
#avoids the cut-points and returns only the percentiles.
unname(quantile(df,na.rm = T,probs = c(0.22,0.77)))
Output:
10.08 78.00
Now, you can observe that the names are removed by the unname() function, and only the percentile values are returned.
We have discussed the round function in R in detail in the past article. Now, we are going to use the round function to round off the values.
Let’s see how it works!
#creates a vector of values
df<-c(12,3,4,56,78,18,NA,46,78,100,NA)
quantile(df,na.rm = T,probs = c(0.22,0.77))
#returns the round off values
unname(round(quantile(df,na.rm = T,probs = c(0.22,0.77))))
Output:
10 78
As you can see that our output values are rounded off to zero decimal points.
Till now, we have discussed the quantile function, its uses and applications as well as its arguments and how to use them properly.
In this section, we are going to get the quantiles for the multiple columns in a data set. Sounds interesting? follow me!
I am going to use the 'mtcars' data set for this purpose, along with the 'dplyr' library.
#reads the data
data("mtcars")
#returns the top few rows of the data
head(mtcars)
#install required packages
install.packages('dplyr')
library(dplyr)
#using tapply, we can apply the function to multiple groups
do.call("rbind",tapply(mtcars$mpg, mtcars$gear, quantile))
Output:
0% 25% 50% 75% 100%
3 10.4 14.5 15.5 18.400 21.5
4 17.8 21.0 22.8 28.075 33.9
5 15.0 15.8 19.7 26.000 30.4
In the above process, we install the 'dplyr' package, and then make use of the base-R tapply() and rbind() functions (via do.call) to compute the per-group quantiles.
In the above section, we used two columns of the mtcars data set: the quantiles of 'mpg' are computed separately for each value of 'gear'. Like this, we can compute the quantiles for multiple groups in a data set.
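Since dplyr is loaded anyway, a group_by()/summarise() pipeline is a tidier way to get the same per-group figures; a sketch computing the three quartiles of mpg for each gear value:

```r
library(dplyr)

# per-gear quartiles of mpg; summarise() names the new columns
mtcars %>%
  group_by(gear) %>%
  summarise(q25 = quantile(mpg, 0.25),
            q50 = quantile(mpg, 0.50),
            q75 = quantile(mpg, 0.75))
```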
My answer is a big YES! The best plot for this is a box plot. Let me take the iris dataset and try to visualize a box plot that showcases the percentiles as well.
Let’s roll!
data(iris)
head(iris)
This is the iris data set with top 6 values.
Let’s explore the data with the function named - ‘Summary’.
summary(iris)
In the summary output above, you can see the mean, the median, the 25th percentile (1st quartile), the 75th percentile (3rd quartile), and the min and max values as well. Let's plot this information through a box plot.
Let’s do it!
#plots a boxplot with labels
boxplot(iris$Sepal.Length, main = 'The boxplot showing the percentiles', col = 'Orange', ylab = 'Values', xlab = 'Sepal Length', border = 'brown', horizontal = T)
A box plot can show many aspects of the data: the box spans the 25th to 75th percentiles, the line inside it marks the median, and the whiskers extend toward the minimum and maximum values. This should save you some time and facilitate your understanding in the best way possible.
Well, it's a longer article, I reckon. I tried my best to explain and explore the quantile() function in R across multiple dimensions through various examples and illustrations. The quantile function is one of the most useful functions in data analysis, as it efficiently reveals more information about the given data.
I hope you got a good understanding of the buzz around the quantile() function in R. That’s all for now. We will be back with more and more beautiful functions and topics in R programming. Till then take care and happy data analyzing!!!
More study: R documentation.
So, let us begin!! :)
Feature scaling is an essential step prior to modeling when solving prediction problems in data science. Machine learning algorithms work well with data that sits on a smaller, standard scale.
This is where normalization comes into the picture. Normalization techniques enable us to reduce the scale of the variables, which affects the statistical distribution of the data in a positive manner.
In the subsequent sections, we will be having a look at some of the techniques to perform Normalization on the data values.
In real-world scenarios, we often come across datasets that are unevenly distributed; that is, they are skewed or do not follow a normal distribution.
In such cases, the easiest way to bring the values onto a proper scale is to replace them with their individual log values.
In the below example, we have scaled the huge data values present in the data frame ‘data’ using log() function from the R documentation.
Example:
rm(list = ls())
data = c(1200,34567,3456,12,3456,0985,1211)
summary(data)
log_scale = log(as.data.frame(data))
Output:
data
1 7.090077
2 10.450655
3 8.147867
4 2.484907
5 8.147867
6 6.892642
7 7.099202
Another efficient way of Normalizing values is through the Min-Max Scaling method.
With Min-Max Scaling, we scale the data values between a range of 0 to 1 only. Due to this, the effect of outliers on the data values suppresses to a certain extent. Moreover, it helps us have a smaller value of the standard deviation of the data scale.
In the below example, we have used the 'caret' library to pre-process and scale the data. The preProcess() function enables us to scale the values to a range of 0 to 1 using method = c('range') as an argument. The predict() method then applies the actions of the preProcess() object to the entire data frame, as shown below.
Example:
rm(list = ls())
data = c(1200,34567,3456,12,3456,0985,1211)
summary(data)
library(caret)
process <- preProcess(as.data.frame(data), method=c("range"))
norm_scale <- predict(process, as.data.frame(data))
Output:
data
1 0.03437997
2 1.00000000
3 0.09966720
4 0.00000000
5 0.09966720
6 0.02815801
7 0.03469831
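If you prefer to avoid the caret dependency, the Min-Max formula (x - min) / (max - min) is easy to write directly in base R. This is a minimal sketch using the same data; the helper name min_max is my own choice:

```r
data <- c(1200, 34567, 3456, 12, 3456, 0985, 1211)

# Min-Max scaling: the smallest value maps to 0, the largest to 1
min_max <- function(x) (x - min(x)) / (max(x) - min(x))

min_max(data)
```

The smallest value (12) becomes 0 and the largest (34567) becomes 1, matching the caret output above.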
In standard scaling, also known as standardization of values, we scale the data values such that the overall statistical summary of every variable has a mean value of zero and a unit variance.
The scale() function enables us to apply standardization to the data values, as it centers and scales them.
Example:
rm(list = ls())
data = c(1200,34567,3456,12,3456,0985,1211)
summary(data)
scale_data <- as.data.frame(scale(data))
Output:
As seen below, the mean value of the data frame before scaling is 6412. Whereas, after performing scaling of values, the mean has reduced to Zero.
Min. 1st Qu. Median Mean 3rd Qu. Max.
12 1092 1211 6412 3456 34567
V1
1 -0.4175944
2 2.2556070
3 -0.2368546
4 -0.5127711
5 -0.2368546
6 -0.4348191
7 -0.4167131
V1
Min. :-0.5128
1st Qu.:-0.4262
Median :-0.4167
Mean : 0.0000
3rd Qu.:-0.2369
Max. : 2.2556
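The result of scale() can be verified by hand, since with its default arguments it subtracts the mean and divides by the standard deviation; a quick sketch:

```r
data <- c(1200, 34567, 3456, 12, 3456, 0985, 1211)

# Manual z-scores: identical to scale(data) with the defaults
manual <- (data - mean(data)) / sd(data)

all.equal(as.numeric(scale(data)), manual)
# TRUE
```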
By this, we have come to the end of this topic. Feel free to comment below, in case you come across any question. For more such posts related to R programming, stay tuned with us!
Till then, Happy Learning!! :)
which(): The which() function in R returns the positions of the TRUE values in a logical vector.
which(x,arr.ind = F,useNames = F)
Where,
Well, you got the definition of which function along with its working nature. Now, let’s apply our learned things practically.
Let’s see how it works.
which(letters=="p")
16
which(letters=="n")
14
which(letters=="l")
12
"letters" is a built-in constant with all 26 letters of the English alphabet arranged in order.
The outputs that you see above represent the position of each letter in that vector. As you can see, the letter "p" is 16th in the alphabet, while "n" and "l" are 14th and 12th respectively.
Let’s now learn how to work with the which function.
Now, let's create a vector in the R language and then, using the which() function, do some position tracking.
#creating a vector
df<- c(5,4,3,2,1)
#Position of 5
which(df==5)
1
#Position of 1
which(df==1)
5
#Position of values greater than 2
which(df>2)
1 2 3
Great! The which() function in R returns the positions of the values in the given input. You can also pass specific conditions to the function and get the positions of the entries that match those conditions, as we saw in the last example.
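You can also combine several conditions with logical operators inside which(); a small sketch using the same vector:

```r
df <- c(5, 4, 3, 2, 1)

# Positions of values greater than 1 AND less than 4
which(df > 1 & df < 4)
# 3 4
```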
Now, let's see how we can apply the which() function to a data frame in the R language.
df<-datasets::BOD
df
Time demand
1 1 8.3
2 2 10.3
3 3 19.0
4 4 16.0
5 5 15.6
6 7 19.8
For this purpose, we are using the BOD dataset which includes 2 columns namely, Time and Demand. It’s yet another built-in dataset.
Let’s use the which function and try to find the position of the values in the data.
which(df$demand==10.3)
2
You can also look up several values at once. Use the %in% operator rather than == here, since == would recycle the comparison vector element by element. Look at the example below where I find the positions of two values from the data frame.
which(df$demand %in% c(8.3,16.0))
1 4
You can even use the which() function to find the names of the columns in a dataset that contain numerical data.
Let’s see how it works in R language. For this, we are using the “Iris” dataset.
df<-datasets::iris
df
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
7 4.6 3.4 1.4 0.3 setosa
8 5.0 3.4 1.5 0.2 setosa
9 4.4 2.9 1.4 0.2 setosa
10 4.9 3.1 1.5 0.1 setosa
11 5.4 3.7 1.5 0.2 setosa
test<-which(sapply(df,is.numeric))
colnames(df)[test]
"Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"
The output shows that the iris dataset has 5 columns in it. Among them, 4 are numerical columns and 1 is categorical (Species).
We’ve used the sapply function in R along with the which() method here.
The which() function has returned only the names of the numerical columns, as per the input condition. If you are a data analyst, the which() function will be invaluable for you.
Finally, we have arrived at matrices in R. You can use the which() function in the R language to get the positions of values in a matrix. You will also get to know the arr.ind parameter in this section.
First things first - Create a matrix
df<-matrix(rep(c(1,0,1),4),nrow = 4)
df
[,1] [,2] [,3]
[1,] 1 0 1
[2,] 0 1 1
[3,] 1 1 0
[4,] 1 0 1
Fantastic!!! You have just created a good looking matrix. Kudos. Let’s use the which() to get the position of the value ‘0’ in our matrix.
which(df==0,arr.ind = T)
row col
[1,] 2 1
[2,] 1 2
[3,] 4 2
[4,] 3 3
Well, the which function has returned the positions of the value ‘0’ in the matrix.
The first occurrence of “0” is in the second row first column. Then the next occurrence is in the first row, second column. Then we have 4th row, second column. And finally, the third row, third column.
The which() function in R is one of the most widely used functions in data analysis and mining.
It gives the positions of values in the data. If you are working with tons of data, it is hard to find the position of specific values by hand, and that is where which() in R comes in.
That’s all for now. Happy positioning!!!
More read: R documentation
The unique() function finds its importance in EDA (Exploratory Data Analysis), as it directly identifies and eliminates the duplicate values in the data.
In this article, we are going to unleash the various application of the unique() function in R programming. Let’s roll!!!
Well, before going into the topic, it's good to know the idea behind it. In this case, it is unique values. The unique function will return the unique values by eliminating the duplicate counts.
The diagram tells you that the unique function will look for duplicates and eliminates that to return the unique values. There are many illustrations coming your way in the following sections to teach something good.
Unique: The unique() function is used to identify and eliminate the duplicate counts present in the data.
unique(x)
Where:
x: a vector, a data frame, or a matrix.
If you have a vector that has duplicate values, then with the help of the unique() function you can easily eliminate those using a single line of code.
Let’s see how it works…
#An input vector having duplicate values
df<-c(1,2,3,2,4,5,1,6,8,9,8,6)
#eliminates the duplicate values in the vector
unique(df)
Output = 1 2 3 4 5 6 8 9
In the above illustration, you may observe that the input vector has many duplicate values.
After we passed that vector to the unique() function, it eliminated all the duplicate values and returned only the unique values, as shown above.
Now, we are going to find duplicate values present in a matrix and eliminate them using the unique function.
For this, we have to first create a matrix of ‘n’ rows and columns having the duplicate values.
To create a matrix, run the below code.
#creates a 6 x 4 matrix having 24 elements
df<-matrix(rep(1:20,length.out=24),nrow = 6,ncol=4,byrow = T)
[,1] [,2] [,3] [,4]
[1,] 1 2 3 4
[2,] 5 6 7 8
[3,] 9 10 11 12
[4,] 13 14 15 16
[5,] 17 18 19 20
[6,] 1 2 3 4
As you can easily notice, the last row is entirely duplicated. All you need to do is eliminate these duplicate values using the unique() function.
#removes the duplicate values
unique(df)
[,1] [,2] [,3] [,4]
[1,] 1 2 3 4
[2,] 5 6 7 8
[3,] 9 10 11 12
[4,] 13 14 15 16
[5,] 17 18 19 20
YaY! You did it! All the duplicate values present in the matrix were removed by the unique() function, and it returned a matrix containing only unique rows.
Till now, we worked on the vectors and the matrices to extract the unique values by eliminating the duplicate counts.
In this section, let’s focus on getting the unique values present in the data frame.
To create a data frame run the below code.
#creates a data frame
Class_data<-data.frame(Student=c('Naman','Megh','Mark','Naman','Megh','Mark'),Age=c(22,23,24,22,23,24),Gender=c('Male','Female','Male','Male','Female','Male'))
#dataframe
Class_data
Student Age Gender
1 Naman 22 Male
2 Megh 23 Female
3 Mark 24 Male
4 Naman 22 Male
5 Megh 23 Female
6 Mark 24 Male
This is the data frame which has the duplicate counts as shown above. Let’s apply the unique function to get rid of the duplicate value present here.
unique(Class_data)
Student Age Gender
1 Naman 22 Male
2 Megh 23 Female
3 Mark 24 Male
Wow! The unique function returned all the unique values present in the dataframe by eliminating the duplicate values.
Just like this, by using the unique() function in R, you can easily get the unique values present in the data.
Yes, but what if you are required to get the unique values of a specific column instead of the whole data set?
Worry not: using the unique() function we can also get the unique values of a particular column, as shown below.
#creates a data frame
Class_data<-data.frame(Student=c('Naman','Megh','Mark','Naman','Megh','Mark'),Age=c(22,23,24,22,23,24),Gender=c('Male','Female','Male','Male','Female','Male'))
#dataframe
Class_data
Student Age Gender
1 Naman 22 Male
2 Megh 23 Female
3 Mark 24 Male
4 Naman 22 Male
5 Megh 23 Female
6 Mark 24 Male
Okay, I am taking the same data frame that we used in the last sections for easy understanding.
Let’s use unique function to get rid of duplicate values.
unique(Class_data$Student)
Output = "Naman" "Megh" "Mark"
In the same way, we can also get the unique values in the Age or Gender columns as well.
unique(Class_data$Gender)
"Male" "Female"
In this section, we are going to get the count of the unique values in the data. This application is more useful to know your data better and get it ready for further analysis.
#importing the dataset
datasets::BOD
Time demand
1 1 8.3
2 2 10.3
3 3 19.0
4 4 16.0
5 5 15.6
6 7 19.8
Well, we are using the BOD dataset here. Let's find the unique values first, followed by their count.
#returns the unique value
unique(BOD$demand)
Output = 8.3 10.3 19.0 16.0 15.6 19.8
Okay, now we have the unique values present in the demand column in the BOD dataset.
Now, we are good to go to find the count of the unique values.
#returns the length of unique values
length(unique(BOD$demand))
Output = 6
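If you also want to know how often each distinct value occurs, rather than just how many distinct values there are, base R's table() function is a natural companion to unique(); a quick sketch with the earlier vector:

```r
# The vector with duplicates from the earlier example
df <- c(1, 2, 3, 2, 4, 5, 1, 6, 8, 9, 8, 6)

unique(df)   # the distinct values: 1 2 3 4 5 6 8 9
table(df)    # how many times each distinct value occurs
```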
Well, the unique() function in R is a very valuable one when it comes to EDA (Exploratory Data Analysis).
It helps you to get a better understanding of your data along with particular counts.
This article tells you about the multiple applications and use cases of the unique() function. Happy analyzing!!!
More read: R documentation
So, let us begin!!
Let us first understand the importance of error metrics in the domain of Data Science and Machine Learning!!
Error metrics enable us to evaluate the performance of a machine learning model on a particular dataset.
There are various error metric models depending upon the class of algorithm.
We have the confusion matrix to evaluate classification algorithms, while R square is an important error metric to evaluate the predictions made by a regression algorithm.
R squared (R2) is a regression error metric that justifies the performance of the model. It represents how well the independent variables are able to describe the value of the response/target variable.
Thus, an R-squared model describes how well the target variable is explained by the combination of the independent variables as a single unit.
The R squared value ranges between 0 and 1 and is given by the formula below:
R2= 1- SSres / SStot
Here,
Always remember: the higher the R square value, the better the predicted model!
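The formula above translates directly into a small helper function. This is an illustrative sketch; the worked example later in the article instead uses cor(y_actual, y_predict)^2, which gives the same value for an ordinary least-squares model with an intercept:

```r
# R squared from the definition: 1 - SSres / SStot
r_squared <- function(y_actual, y_predict) {
  ss_res <- sum((y_actual - y_predict)^2)        # residual sum of squares
  ss_tot <- sum((y_actual - mean(y_actual))^2)   # total sum of squares
  1 - ss_res / ss_tot
}
```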
In this example, we have implemented the concept of the R square error metric on a linear regression model. We split the data into training and test sets using the createDataPartition() method, built the model with the lm() function, and then called a user-defined R square function to evaluate the performance of the model.
Example:
#Removed all the existing objects
rm(list = ls())
#Setting the working directory
setwd("D:/Ediwsor_Project - Bike_Rental_Count/")
getwd()
#Load the dataset
bike_data = read.csv("day.csv",header=TRUE)
### SAMPLING OF DATA -- Splitting of Data columns into Training and Test dataset ###
categorical_col_updated = c('season','yr','mnth','weathersit','holiday')
library(dummies)
bike = bike_data
bike = dummy.data.frame(bike,categorical_col_updated)
dim(bike)
#Separating the dependent and independent data variables into two dataframes.
library(caret)
set.seed(101)
split_val = createDataPartition(bike$cnt, p = 0.80, list = FALSE)
train_data = bike[split_val,]
test_data = bike[-split_val,]
### MODELLING OF DATA USING MACHINE LEARNING ALGORITHMS ###
#Defining error metrics to check the error rate and accuracy of the Regression ML algorithms
#1. MEAN ABSOLUTE PERCENTAGE ERROR (MAPE)
MAPE = function(y_actual,y_predict){
mean(abs((y_actual-y_predict)/y_actual))*100
}
#2. R SQUARED error metric -- Coefficient of Determination
RSQUARE = function(y_actual,y_predict){
cor(y_actual,y_predict)^2
}
##MODEL 1: LINEAR REGRESSION
linear_model = lm(cnt~., train_data) #Building the Linear Regression Model on our dataset
summary(linear_model)
linear_predict=predict(linear_model,test_data[-27]) #Predictions on Testing data
LR_MAPE = MAPE(test_data[,27],linear_predict) # Using MAPE error metrics to check for the error rate and accuracy level
LR_R = RSQUARE(test_data[,27],linear_predict) # Using R-SQUARE error metrics to check for the error rate and accuracy level
Accuracy_Linear = 100 - LR_MAPE
print("MAPE: ")
print(LR_MAPE)
print("R-Square: ")
print(LR_R)
print('Accuracy of Linear Regression: ')
print(Accuracy_Linear)
Output:
As seen below, the R square value is 0.82 i.e. the model has worked well for our data.
> print("MAPE: ")
[1] "MAPE: "
> print(LR_MAPE)
[1] 17.61674
> print("R-Square: ")
[1] "R-Square: "
> print(LR_R)
[1] 0.8278258
> print('Accuracy of Linear Regression: ')
[1] "Accuracy of Linear Regression: "
> print(Accuracy_Linear)
[1] 82.38326
We can even make use of the summary() function in R to extract the R square value after modelling.
In the below example, we have applied the linear regression model to our data frame and then used summary()$r.squared to get the R square value.
Example:
rm(list = ls())
A <- c(1,2,3,4,2,3,4,1)
B <- c(1,2,3,4,2,3,4,1)
a <- c(10,20,30,40,50,60,70,80)
b <- c(100,200,300,400,500,600,700,800)
data <- data.frame(A,B,a,b)
print("Original data frame:\n")
print(data)
ml = lm(A~a, data = data)
# Extracting R-squared parameter from summary
summary(ml)$r.squared
Output:
[1] "Original data frame:\n"
A B a b
1 1 1 10 100
2 2 2 20 200
3 3 3 30 300
4 4 4 40 400
5 2 2 50 500
6 3 3 60 600
7 4 4 70 700
8 1 1 80 800
[1] 0.03809524
By this, we have come to the end of this topic. Feel free to comment below, in case you come across any question.
Till then, Happy Learning!! :)
CSV stands for Comma-Separated Values. In this file format, the stored values are separated by commas, which makes storing the data much easier.
Storing the data in an Excel sheet is the most common practice in many companies. In the majority of firms, people store data as comma-separated values (CSV), as the process is easier than creating normal spreadsheets. Later they can use R's built-in functions to read and analyze the data.
Being the most popular and powerful statistical analysis programming language, R offers specific functions to read data into organized data frames from a CSV file.
In this short example, we will see how we can read a CSV file into organized data frames.
The first step in this process is getting and setting up the working directory: you need to choose the working path of the CSV file.
Here you can check the default working directory using the getwd() function, and you can also change the directory using the setwd() function.
> getwd() #Shows the default working directory
---> "C:/Users/Dell/Documents"
> setwd("C:/Users/Dell/Documents/R-test data") #sets the new working directory (use forward slashes or doubled backslashes in R paths)
> getwd() #you can see the updated working directory
---> "C:/Users/Dell/Documents/R-test data"
After the setting of the working path, you need to import the data set or a CSV file as shown below.
> readfile <- read.csv("testdata.txt")
Execute the above line of code in R studio to get the data frame as shown below.
To check the class of the variable ‘readfile’, execute the below code.
> class(readfile)
---> "data.frame"
In this data frame you can see information about student names, their IDs, departments, gender, and marks.
After getting the data frame, you can now analyse the data. You can extract particular information from the data frame.
To extract the highest marks scored by students:
> data <- read.csv("traindata.csv")
> marks <- max(data$Marks.Scored) #this will give you the highest marks
#To extract the details of the student who scored the highest marks,
> retval <- subset(data, Marks.Scored == max(Marks.Scored)) #extracts the details of the student who secured the highest marks
> View(retval)
To extract the details of the students who are studying in the 'chemistry' department:
> data <- read.csv("traindata.csv")
> retval <- subset(data, Department == "chemistry") #extracts the details of the students in the chemistry department
> View(retval)
Through this process, you can read CSV files in R with the read.csv() function. This tutorial covered how to import a CSV file, read it, and extract specific information from the resulting data frame.
I used RStudio for this project. RStudio offers great features like a console, an editor, and an environment pane. You are of course free to use other editors like Tinn-R, Crimson Editor, etc. I hope this tutorial helps you understand reading CSV files in R and extracting information from the data frame.
For more read: https://cran.r-project.org/manuals.html
paste(): Takes multiple elements from multiple vectors and concatenates them into a single string.
Along with paste() function, R has another function named paste0(). Yes, you heard it right.
paste0(): The paste0() function uses an empty string as its separator, joining its arguments with nothing in between, which limits your options for the output.
The syntax of the paste() function is,
paste(x,sep=" ", collapse=NULL)
Here:
The syntax of the paste0() function is,
paste0(x,collapse=NULL)
Where,
A simple paste() will take multiple elements as inputs and concatenate those inputs into a single string. The elements will be separated by a space as the default option. But you can also change the separator value using the ‘sep’ parameter.
paste(1,'two',3,'four',5,'six')
Output = “1 two 3 four 5 six”
The separator parameter in the paste() function will deal with the value or the symbols which are used to separate the elements, which is taken as input by the paste() function.
paste(1,'two',3,'four',5,'six',sep = "_")
Output = “1_two_3_four_5_six”
paste(1,'two',3,'four',5,'six',sep = "&")
Output = “1&two&3&four&5&six”
When you pass a single vector to paste(), the separator parameter has nothing to separate, so it has no effect. This is where the collapse parameter comes in, which is highly useful when you are dealing with vectors. It represents the symbol or value that separates the elements of the vector in the resulting string.
paste(c(1,2,3,4,5,6,7,8),collapse = "_")
Output = “1_2_3_4_5_6_7_8”
paste(c('Rita','Sam','John','Jat','Cook','Reaper'),collapse = ' and ')
Output = “Rita and Sam and John and Jat and Cook and Reaper”
Let's see how the separator and collapse arguments work together. The separator deals with the values placed between the corresponding elements of the input vectors, while the collapse argument uses a specific value to concatenate the results into a single string.
paste(c('a','b'),1:10,sep = '_',collapse = ' and ')
Output = "a_1 and b_2 and a_3 and b_4 and a_5 and b_6 and a_7 and b_8 and a_9 and b_1
paste(c('John','Ray'),1:5,sep = '=',collapse = ' and ')
Output = “John=1 and Ray=2 and John=3 and Ray=4 and John=5”
The paste0() function acts just like the paste() function but with an empty string as the separator.
Let’s see how paste0() function works.
paste0('df',1:6)
Output = “df1” “df2” “df3” “df4” “df5” “df6”
You can see that the paste0() function joined the pieces with no separator in between. Now let's see how the paste0() function works with the collapse parameter.
The collapse argument in the paste0() function is the character, symbol, or a value used to separate the elements.
paste0('df',1:5,collapse = '_')
Output = “df1_df2_df3_df4_df5”
paste0('df',1:5,collapse = ' > ')
Output = “df1 > df2 > df3 > df4 > df5”
As you can observe from the above results, the paste0() function returns a string with no separator between the pasted pieces and the specified collapse value between elements.
You can also use the paste() function to paste the values or elements present in a data frame.
Let’s see how it works with the ‘BOD’ data set.
datasets::BOD
paste(BOD$Time,sep = ',',collapse = '_')
Output = “1_2_3_4_5_7”
datasets::BOD
paste(BOD$demand,sep = ',',collapse = '_')
Output = “8.3_10.3_19_16_15.6_19.8”
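paste() can also combine several columns element-wise, which is where the sep argument does matter; a quick sketch with the same BOD data:

```r
# One string per row, joining Time and demand with a colon
paste(BOD$Time, BOD$demand, sep = ":")
# "1:8.3" "2:10.3" "3:19" "4:16" "5:15.6" "7:19.8"
```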
R offers numerous functions to make your analysis simpler yet efficient. Among them, the paste() function is very useful for concatenating strings and elements into a single string.
In this tutorial we have gone through various aspects of the paste() and paste0() functions. Both these will be really helpful in data analysis.
That’s all for now. Stay tuned for more R tutorials. Happy pasting!!!
More study:
You may be a working professional, a programmer, or a novice learner, but at some point you will be required to read large datasets and analyze them.
It is really hard to digest a huge dataset which has 20+ columns or even more and thousands of rows.
This article will address the head() and tail() functions in R, which returns the first and last n rows respectively.
Let’s quickly see what the head() and tail() methods look like
head(): a function which returns the first n rows of the dataset.
head(x,n=number)
tail(): a function which returns the last n rows of the dataset.
tail(x,n=number)
Where,
x = input dataset / dataframe.
n = number of rows that the function should display.
The head() function in R is used to display the first n rows present in the input data frame.
In this section, we are going to get the first n rows using head() function.
For this process, we are going to import a dataset ‘iris’ which is available in R studio by default.
#importing the dataset
df<-datasets::iris
#returns first n rows of the data
head(df)
You can see that the head() function returned the first 6 rows present in the iris dataset.
By default, the head() function returns the first 6 rows.
But what if you want to see the first 10 or 15 rows of a dataset?
Well, you may have observed in the syntax that you can pass the n argument to the head() function to display a specific number of rows.
Let’s see how it works.
#importing the data
df<-datasets::airquality
#returns first 10 rows
head(df,n=10)
You can now see the head() function returned the first 10 rows as specified by us in the input. You can also write the same query as head(df,10) and get the same results.
This is how head() function works.
Well, in the above sections, the head() function returned the whole set of values present in the first n rows of a dataset.
But did you know that the head() function is capable of returning the values of a particular column?
Yes, you read that right!
With a single line of code, you can get the first n values of a specified column.
#importing the data
df<-datasets::mtcars
#returns first 10 values in column 'mpg'
head(df$mpg,10)
Output = 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2
Just like in the above sample, you can mention the required column name along with the required row count. That's it.
The head() function will dig into the data and return the required values.
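One more option worth knowing: head() also accepts a negative n, in which case it returns everything except the last rows; a quick sketch:

```r
df <- datasets::airquality   # 153 rows in total

# A negative n drops rows from the end instead of selecting from the start
nrow(head(df, n = -5))
# 148
```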
The tail() function in R is used to display the last n rows of the dataset, in contrast to the head() function.
This section will illustrate the tail() function and its usage in R.
For this purpose, we are using ‘airquality’ dataset.
#importing the dataset
df<-datasets::airquality
#returns last n rows of the data
tail(df)
Well, in this output, you can see the last 6 rows of the airquality dataset. This is what the tail() function does in R.
Similar to the head() function, the tail() function can return the last n rows of the specified count.
#importing the data
df<-datasets::airquality
#returns the last 10 values
tail(df,10)
Here you can see, that the tail() function has returned the last 10 rows as specified by us in the code.
The head() and tail() functions do the same kind of job in opposite directions.
You can use tail function to get last n values of a particular column as well.
Let’s see how it works!
#importing the data
df<-datasets::mtcars
#returns the last 10 values of column 'mpg'
tail(df$mpg,10)
Output = 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7 15.0 21.4
If you are able to get this output, congratulations! You have done it.
Just like this sample, you can specify the column name along with row count to get the required values.
The head() and tail() functions in R are among the most useful functions when it comes to reading and analyzing data.
You can get customized values through these functions as illustrated above. Simple syntax, effective results! - head() and tail() function in R.
That’s all for now, Happy analyzing!!!
More study: R documentation
Standard deviation is very popular in statistics, but why? The reasons for its popularity and its importance are listed below.
Before we roll into the topic, keep this definition in your mind!
Variance - It is defined as the average of the squared differences between the observed values and the expected (mean) value.
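In R, this relationship is easy to check, because sd() is the square root of the sample variance returned by var(); a quick sketch:

```r
x <- c(34, 56, 87, 65, 34, 56, 89)

# The standard deviation is the square root of the variance
all.equal(sd(x), sqrt(var(x)))
# TRUE
```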
In this method, we will create a list ‘x’ and add some value to it. Then we can find the standard deviation of those values in the list.
x <- c(34,56,87,65,34,56,89) #creates list 'x' with some values in it.
sd(x) #calculates the standard deviation of the values in the list 'x'
Output —> 22.28175
Now we can try to extract specific values from the list ‘y’ to find the standard deviation.
y <- c(34,65,78,96,56,78,54,57,89) #creates a list 'y' having some values
data1 <- y[1:5] #extract specific values using its Index
sd(data1) #calculates the standard deviation for Indexed or extracted values from the list.
Output —> 23.28519
In this method, we are importing a CSV file to find the standard deviation in R for the values which are stored in that file.
readfile <- read.csv('testdata1.csv') #reading a csv file
data2 <- readfile$Values #getting values stored in the header 'Values'
sd(data2) #calculates the standard deviation
Output —> 17.88624
In general, The values will be so close to the average value in low standard deviation and the values will be far spread from the average value in the high standard deviation.
We can illustrate this with an example.
x <- c(79,82,84,96,98)
mean(x)
---> 87.8
sd(x)
---> 8.613942
To plot these values in a bar graph using in R, run the below code.
To install the ggplot2 package, run this code in RStudio:
---> install.packages("ggplot2")
library(ggplot2)
values <- data.frame(marks=c(79,82,84,96,98), students=c(0,1,2,3,4))
head(values) #displays the values
marks students
1 79 0
2 82 1
3 84 2
4 96 3
5 98 4
x <- ggplot(values, aes(x=marks, y=students))+geom_bar(stat='identity')
x #displays the plot
In the above results, you can observe that most of the data (79, 82, 84) clusters around the mean value, which shows that this is a low standard deviation.
Illustration for high standard deviation.
y <- c(23,27,30,35,55,76,79,82,84,94,96)
mean(y)
---> 61.90909
sd(y)
---> 28.45507
To plot these values using a bar graph in ggplot in R, run the below code.
library(ggplot2)
values <- data.frame(marks=c(23,27,30,35,55,76,79,82,84,94,96), students=c(0,1,2,3,4,5,6,7,8,9,10))
head(values) #displays the values
marks students
1 23 0
2 27 1
3 30 2
4 35 3
5 55 4
6 76 5
x <- ggplot(values, aes(x=marks, y=students))+geom_bar(stat='identity')
x #displays the plot
In the above results, you can see the widespread data. The lowest score of 23 is very far from the average score of about 62. This is called a high standard deviation.
By now, you have a fair understanding of using the sd() function to calculate the standard deviation in the R language. Let's wrap up this tutorial by solving some simple problems.
Find the standard deviation of the even numbers between 1 and 20, excluding 20.
Solution: The even numbers between 1 and 20, excluding 20, are:
2, 4, 6, 8, 10, 12, 14, 16, 18
Let's find the standard deviation of these values.
x <- c(2,4,6,8,10,12,14,16,18) #list of even numbers from 1 to 20
sd(x) #calculates the standard deviation of the even numbers from 1 to 20
Output —> 5.477226
Find the standard deviation of the state-wise population in the USA.
For this, import the CSV file and read the values to find the standard deviation and plot the result in a histogram in R.
df<-read.csv("population.csv") #reads csv file
data<-df$X2018.Population #extracts the data from the population column
mean(data) #calculates the mean
View(df) #displays the data
sd(data) #calculates the standard deviation
Output ----> mean = 6432008, Sd = 7376752
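The exercise also asks for a histogram, which base R's hist() function can draw. Here is a minimal sketch; the vector below is a placeholder standing in for the imported X2018.Population column, not real population figures:

```r
# Placeholder values standing in for the imported population column
data <- c(1e6, 2e6, 3e6, 5e6, 8e6, 13e6, 21e6)

# Draw a histogram of the values
hist(data,
     main = "State-wise population",
     xlab = "Population")
```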
Finding the standard deviation of values in R is easy. R offers the standard function sd() to find the standard deviation. You can create a vector of values or import a CSV file to find the standard deviation.
Important: You can also calculate the standard deviation of a subset of values extracted from a file or a vector through indexing, as shown above.
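As a quick illustration of indexing (the vector and indices here are just examples):

```r
x <- c(2, 4, 6, 8, 10, 12, 14, 16, 18)
sd(x[1:5])    # standard deviation of the first five values only
sd(x[x > 10]) # standard deviation of the values greater than 10
```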
Use the comment box to post any doubts regarding the sd() function in R. Happy learning!!!
Each row in the confusion matrix represents the predicted values and each column represents the actual values (this can also be vice versa). Even though the matrix itself is simple, the terminology behind it can seem complex, and there is always a chance of getting confused about the classes. Hence the term: confusion matrix.
In most resources, you will see a 2x2 matrix in R, but note that you can create a matrix for any number of class values. You can see confusion matrices for two-class and three-class models below.
This two-class model shows the distribution of predicted and actual values.
This three-class model shows the distribution of predicted and actual values of the data.
In the confusion matrix in R, the class of interest or our target class will be a positive class and the rest will be negative.
You can express the relationship between the positive and negative classes with the help of the 2x2 confusion matrix. It includes four categories: True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN).
In this section, we will use demo number data that we create here. Our interest/target class will be 0.
Let’s see how we can compute this using the confusion matrix. You can set the target class as 0 and observe the results.
It can be a bit confusing at first, but take your time and dig into it to understand it better. Let's do this using the caret library.
#Install required packages
install.packages('caret')
#Import required library
library(caret)
#Creates vectors having data points
expected_value <- factor(c(1,0,1,0,1,1,1,0,0,1))
predicted_value <- factor(c(1,0,0,1,1,1,0,0,0,1))
#Creating confusion matrix
example <- confusionMatrix(data=predicted_value, reference = expected_value)
#Display results
example
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 3 2
1 1 4
Accuracy : 0.7
95% CI : (0.3475, 0.9333)
No Information Rate : 0.6
P-Value [Acc > NIR] : 0.3823
Kappa : 0.4
Mcnemar's Test P-Value : 1.0000
Sensitivity : 0.7500
Specificity : 0.6667
Pos Pred Value : 0.6000
Neg Pred Value : 0.8000
Prevalence : 0.4000
Detection Rate : 0.3000
Detection Prevalence : 0.5000
Balanced Accuracy : 0.7083
'Positive' Class : 0
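If you need individual statistics from this output programmatically, the object returned by confusionMatrix() stores them in named components. A sketch, assuming the caret package is installed, using the same vectors as above:

```r
library(caret)  # assumes caret is installed

expected_value  <- factor(c(1,0,1,0,1,1,1,0,0,1))
predicted_value <- factor(c(1,0,0,1,1,1,0,0,0,1))
example <- confusionMatrix(data = predicted_value, reference = expected_value)

# Individual statistics are stored in named components
example$table                   # the 2x2 count table
example$overall["Accuracy"]     # overall statistics, by name
example$byClass["Sensitivity"]  # per-class statistics, by name
```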
Woo!!! That's cool. Things should be pretty clear at your end now. This output alone can answer tons of questions that may be rolling in your mind right now!
The success rate, or accuracy, of the model can be easily calculated using the 2x2 confusion matrix. The formula for calculating accuracy is:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Here, TP, TN, FP, and FN represent the counts of values that belong to each category. The accuracy is calculated by summing and dividing the values as per the formula.
After this, you are encouraged to find the error rate, the fraction of values that the model predicted wrongly. The formula for the error rate is:
Error rate = (FP + FN) / (TP + TN + FP + FN)
The error rate calculation is simple and to the point: if a model performs at 90% accuracy, then the error rate is 10%. As simple as that.
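These two formulas can be checked in base R with the same vectors as before (a sketch; the diagonal of the table holds the correct predictions):

```r
expected  <- factor(c(1,0,1,0,1,1,1,0,0,1))
predicted <- factor(c(1,0,0,1,1,1,0,0,0,1))

cm <- table(expected, predicted)       # confusion matrix as a base-R table
accuracy   <- sum(diag(cm)) / sum(cm)  # correct predictions / all predictions
error_rate <- 1 - accuracy

accuracy    # 0.7
error_rate  # 0.3
```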
The simple way to get the confusion matrix in R is by using the table() function. Let’s see how it works.
table(expected_value,predicted_value)
predicted_value
expected_value 0 1
0 3 1
1 2 4
With 0 as the positive class, you can observe the following points in this table: 3 true positives (expected 0, predicted 0), 1 false negative (expected 0, predicted 1), 2 false positives (expected 1, predicted 0), and 4 true negatives (expected 1, predicted 1).
If you want to get more insights into the confusion matrix, you can use the gmodels package in R.
Let's install the package and see how it works. The gmodels package offers a customizable cross-tabulation of the values.
#install required packages
install.packages('gmodels')
#import required library
library(gmodels)
#Computes the crosstable calculations
CrossTable(expected_value,predicted_value)
Cell Contents
|-------------------------|
| N |
| Chi-square contribution |
| N / Row Total |
| N / Col Total |
| N / Table Total |
|-------------------------|
Total Observations in Table: 10
| predicted_value
expected_value | 0 | 1 | Row Total |
---------------|-----------|-----------|-----------|
0 | 3 | 1 | 4 |
| 0.500 | 0.500 | |
| 0.750 | 0.250 | 0.400 |
| 0.600 | 0.200 | |
| 0.300 | 0.100 | |
---------------|-----------|-----------|-----------|
1 | 2 | 4 | 6 |
| 0.333 | 0.333 | |
| 0.333 | 0.667 | 0.600 |
| 0.400 | 0.800 | |
| 0.200 | 0.400 | |
---------------|-----------|-----------|-----------|
Column Total | 5 | 5 | 10 |
| 0.500 | 0.500 | |
---------------|-----------|-----------|-----------|
That’s amazing! You can see that the gmodels library has returned plenty of information based on the given data.
Finally, it’s time for some serious calculations using our confusion matrix. We have defined the formulas for achieving the accuracy and error rate.
Go for it!
Accuracy = (3 + 4) / (3 + 2 + 1 + 4)
         = 0.7 = 70%
The accuracy score reads as 70% for the given data and observations. Now, it’s straightforward that the error rate will be 30%, got it?
If not, we can go through our formula.
Error rate = (2 + 1) / (3 + 2 + 1 + 4)
           = 0.30 = 30%
Cool! The model has wrongly predicted 30% of the values. The error rate is 30%.
This is also equal to the formula -
error rate = 1 - accuracy
1 - 0.70 = 0.30 = 30%
You can simply subtract the accuracy from 1 to get the error rate. Things are going pretty easily so far!
A confusion matrix is a table of values that represents the predicted and actual values of the data points. You can make use of useful R libraries such as caret and gmodels, and functions such as table() and CrossTable(), to get more insights into your data.
A confusion matrix in R is a key tool for classification problems. Try to apply all of the techniques illustrated above to your preferred dataset and observe the results.
That’s all for now. Happy R!!!
Further reading: R documentation