Part I – R in Excel – Descriptive Statistics | R-bloggers


[This article was first published on Adam’s Software Lab, and kindly contributed to R-bloggers].



Introduction

The purpose of this series of posts is to demonstrate some use cases for R in Excel using the ExcelRAddIn component (disclaimer: I am the developer of this add-in). The fundamental rationale for the add-in is that it makes the extensive R ecosystem accessible from within an Excel worksheet. Excel offers many excellent facilities for data wrangling and analysis. For certain types of statistical data analysis, however, the built-in functions are not sufficient, even with the Analysis ToolPak, and R offers superior facilities (for example, for performing LDA, PCA, forecasting and time series analysis, to name just a few).

This series of posts covers four main areas where R is useful in Excel: descriptive statistics, linear regression, forecasting, and access to Python. Along the way we will see that using R in Excel is no more difficult than writing a formula and calling the ExcelRAddIn to evaluate it. The ‘trick’, if there is one, is unpacking the results in a form that Excel understands and that can be used in a worksheet. We will see different examples of how to do this.

Installing and configuring the ExcelRAddIn is described here. Each part of the series is accompanied by an Excel workbook containing the R scripts. The workbooks depend on ExcelRAddIn-AddIn64.dll, so this must be loaded first, and the “R-Addin” menu must appear on the right side of the menu bar.

The task pane is empty until the first script has been evaluated. This initializes R with the folders specified in the “Settings”. The default packages that I use are tidyverse, dplyr, forecast, ggplot2 and ggthemes, as shown below:

R environment settings

The workbook for this part of the series is: Part I – R in Excel – Descriptive Statistics.xlsx. The workbooks all have a similar structure to keep things organized. The ‘References’ worksheet contains all links to external references. The worksheet ‘Libraries’ loads extra (non-default) packages. The ‘Datasets’ worksheet contains all the data that is referred to in the worksheets.

Descriptive statistics

Load data

The first step is to load some data. This data set comes from “Linear Models with R” by Julian Faraway. In this example I loaded the data into Excel from a CSV file (GalapagosData.csv) using Power Query. The data has been cleaned up and I created a table (tableGalapagosData) that can be referred to in the workbook.

For the descriptive statistics, we first create a data frame using the CreateDataFrame function.

=RScript.CreateDataFrame("galapagos", tableGalapagosData, tableGalapagosData[#Headers])

This function is part of the add-in and simplifies the creation of data frames. There are also functions to create vectors and matrices. We pass in a name (which appears in the R environment), the data and the headers. The final parameter (‘type’ => character, complex, integer, logical, numeric) is optional; the R type is now determined from the data where possible. This makes it somewhat easier to create objects to pass from Excel to R.

Data frame

This copies the data to the R environment. There are a number of alternatives to this approach. We could have loaded the CSV file directly in R using:

# Forward slashes avoid R treating "\D" and similar sequences as string escapes
galapagos <- read.csv("D:/Development/.../GalapagosData.csv")

By loading the data in Excel and then copying it to R, we can use Excel's import via Power Query, so we automatically get grouping, filtering and so on, and we can immediately create pivot tables. The disadvantage is that we have to make a copy to R, which means that the data types may not survive the transfer intact. This is especially important with dates.
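To illustrate the point about dates, here is a minimal sketch in plain R. It assumes that a date column has arrived in R as Excel serial numbers, which is one common way dates get flattened in transit; whether the add-in actually passes dates this way is an assumption, and the serial values below are purely illustrative.

```r
# Hypothetical example: dates that arrived from Excel as serial numbers.
excel_serials <- c(45292, 45323, 45352)

# Excel's 1900 date system effectively counts days from 1899-12-30.
# That origin absorbs Excel's fictitious 29 Feb 1900, so the conversion
# is valid for dates from March 1900 onwards.
as_r_dates <- as.Date(excel_serials, origin = "1899-12-30")

print(as_r_dates)
```

If the dates arrive as text instead, as.Date() with an explicit format string would be the analogous fix.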

Obtaining statistics

Now that we have the data frame in R (and in Excel), we can obtain some descriptive statistics. If we did this exclusively in Excel, we could use individual formulas (=COUNT(), =AVERAGE(), =STDEV.S() and so forth). With the help of R we can achieve the same.

Basic statistics

As expected, this returns the average and the standard deviation.
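For readers following along in text only, a minimal sketch of the R counterparts to those Excel formulas is given here. The small `demo` data frame is an invented stand-in (it is not the Galapagos data), and the variable names are my own.

```r
# Invented stand-in values; NOT the actual Galapagos data.
demo <- data.frame(
  Species   = c(10, 20, 30, 40, 50),
  Area      = c(1.5, 2.5, 0.5, 4.0, 3.0),
  Elevation = c(100, 250, 75, 300, 180)
)

n_obs   <- length(demo$Species)  # Excel: =COUNT()
average <- mean(demo$Species)    # Excel: =AVERAGE()
std_dev <- sd(demo$Species)      # Excel: =STDEV.S(), sample std. dev. (n - 1)

print(c(count = n_obs, mean = average, std.dev = std_dev))
```

Note that sd() uses the n - 1 denominator, so it lines up with =STDEV.S() rather than =STDEV.P().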

We can improve on this using some extra R functions: sapply along with fivenum (which returns Tukey’s five-number summary (minimum, lower hinge, median, upper hinge, maximum) for the input data).

as.data.frame(sapply(galapagos[,2:8], fivenum))

When evaluated, this applies the fivenum function to columns 2 to 8 of the galapagos data set (R indexes columns from 1) and returns the result as a data frame. If we don’t convert to a data frame, we don’t get the column headers back (which is not very useful). You may also have noticed that we have to look up the labels of the returned values in the documentation. There seems to be no way to retrieve this metadata from the function itself.
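Since fivenum returns an unnamed vector, one option is to attach the row labels by hand. The sketch below does this on an invented stand-in data frame (not the Galapagos data); the label strings are taken from the order documented for fivenum, and the wording of the labels is my own choice.

```r
# Invented stand-in values; NOT the actual Galapagos data.
demo <- data.frame(
  Species   = c(10, 20, 30, 40, 50),
  Area      = c(1.5, 2.5, 0.5, 4.0, 3.0),
  Elevation = c(100, 250, 75, 300, 180)
)

# fivenum() returns five unnamed values per column, in this documented order.
five_labels <- c("minimum", "lower hinge", "median", "upper hinge", "maximum")

five <- as.data.frame(sapply(demo, fivenum))
rownames(five) <- five_labels

print(five)
```

Whether row names make it back into the worksheet depends on how the add-in unpacks a data frame, so an explicit label column may be the safer choice.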

An alternative to the five-number summary is the built-in summary(...) function. Unfortunately, its output does not work well with Excel, so we have to massage the results to obtain a decent table that shows the summary with labels.

Summary function

In short, we get the column labels of the galapagos data set using names(galapagos), and we get the labels for the summary using names(summary(galapagos$Species)). We then request the summary for each column of the data in which we are interested. For example: summary(galapagos$Elevation).
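The same massaging can also be done in a single step on the R side. The following is a sketch only, using an invented stand-in data frame (not the Galapagos data); the names `stat_labels` and `summary_table` are my own.

```r
# Invented stand-in values; NOT the actual Galapagos data.
demo <- data.frame(
  Species   = c(10, 20, 30, 40, 50),
  Area      = c(1.5, 2.5, 0.5, 4.0, 3.0),
  Elevation = c(100, 250, 75, 300, 180)
)

# Labels of the statistics that summary() reports for a numeric column.
stat_labels <- names(summary(demo$Species))

# One column of summary values per variable, stripped to plain numbers.
stat_values <- sapply(demo, function(column) as.numeric(summary(column)))

# A flat table: one label column plus one numeric column per variable.
summary_table <- data.frame(statistic = stat_labels, stat_values)

print(summary_table)
```

A plain data frame like this has a label column and numeric columns only, which is the shape that tends to unpack most cleanly into worksheet cells.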

Since we are massaging the results anyway, we might as well consider using a custom function. For example, we can define a function that returns a data frame consisting of a label and the corresponding statistic:

custom_summary <- function(data) {
  label <- c("count", "mean", "std.dev")
  value <- c(length(data), mean(data), sd(data))
  data.frame(label, value)
}

Evaluating the function with the script:

custom_summary(galapagos$Area)

produces a small table as follows:

Custom function
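The custom function can also be applied to several columns at once. The sketch below repeats the function definition so that it runs on its own, and uses an invented stand-in data frame (not the Galapagos data); the names `per_column` and `all_stats` are my own.

```r
# Same custom summary as above, repeated so this block is self-contained.
custom_summary <- function(data) {
  label <- c("count", "mean", "std.dev")
  value <- c(length(data), mean(data), sd(data))
  data.frame(label, value)
}

# Invented stand-in values; NOT the actual Galapagos data.
demo <- data.frame(
  Species   = c(10, 20, 30, 40, 50),
  Area      = c(1.5, 2.5, 0.5, 4.0, 3.0),
  Elevation = c(100, 250, 75, 300, 180)
)

# Summarise each column, tagging every row with the column it came from.
per_column <- lapply(names(demo), function(column_name) {
  data.frame(column = column_name, custom_summary(demo[[column_name]]))
})

# Stack the per-column tables into one long table.
all_stats <- do.call(rbind, per_column)

print(all_stats)
```

The long format (column, label, value) is convenient in Excel because it can be fed straight into a pivot table.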

If all this seems like a lot of hard work, or if you are looking for a more advanced approach (perhaps a summary that tests the normality of the data distribution), then you might prefer to use a summary function from another package. There are several to choose from. Here we use pastecs and summarytools. Both give good results with minimal effort:

stat.desc function of pastecs
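As a rough sketch of what such a call can look like in plain R, the block below runs pastecs::stat.desc on an invented stand-in data frame (not the Galapagos data). It assumes the pastecs package is installed and skips the call otherwise; the argument norm = TRUE asks for the normality statistics in addition to the basic and descriptive ones.

```r
# Invented stand-in values; NOT the actual Galapagos data.
demo <- data.frame(
  Species   = c(10, 20, 30, 40, 50),
  Area      = c(1.5, 2.5, 0.5, 4.0, 3.0),
  Elevation = c(100, 250, 75, 300, 180)
)

pastecs_stats <- NULL

# Only run if pastecs is available, so the sketch never fails outright.
if (requireNamespace("pastecs", quietly = TRUE)) {
  # One column of statistics per variable; the rows are the statistic names.
  pastecs_stats <- pastecs::stat.desc(demo, norm = TRUE)
  print(round(pastecs_stats, 3))
} else {
  message("pastecs is not installed; try install.packages('pastecs')")
}
```

The result is a data frame whose row names carry the statistic labels, so, unlike fivenum, no lookup in the documentation is needed to label the output.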

Wrap-up

In this post we have seen various approaches to obtaining descriptive statistics using R in Excel via the ExcelRAddIn. We have introduced some basic approaches (similar to what Excel offers natively). But we have also seen some more advanced uses that show how useful it can be to have access to R functionality in Excel. The next post deals with linear regression with R in Excel.

