ReinforcementLearning: A package for replicating human behavior in R

Nicolas Proellochs and Stefan Feuerriegel



Reinforcement learning has recently gained a great deal of traction in studies that call for human-like learning. In settings where an explicit teacher is not available, this method teaches an agent via interaction with its environment without any supervision other than its own decision-making policy. In many cases, this approach appears quite natural by mimicking the fundamental way humans learn. However, implementing reinforcement learning is programmatically challenging, since it relies on continuous interactions between an agent and its environment. In fact, there is currently no package available that performs model-free reinforcement learning in R. As a remedy, we introduce the ReinforcementLearning R package, which allows an agent to learn optimal behavior based on sample experience consisting of states, actions and rewards. The result of the learning process is a highly interpretable reinforcement learning policy that defines the best possible action in each state.

In the following sections, we present multiple step-by-step examples to illustrate how to take advantage of the capabilities of the ReinforcementLearning package. Moreover, we present methods to customize the learning and action selection behavior of the agent. Main features of ReinforcementLearning include, but are not limited to,

  • Learning an optimal policy from a fixed set of a priori known transition samples
  • Predefined learning rules and action selection modes
  • A highly customizable framework for model-free reinforcement learning tasks

Reinforcement learning

Reinforcement learning refers to the problem of an agent that aims to learn optimal behavior through trial-and-error interactions with a dynamic environment. All algorithms for reinforcement learning share the property that the feedback of the agent is restricted to a reward signal that indicates how well the agent is behaving. In contrast to supervised machine learning methods, any instruction concerning how to improve its behavior is absent. Thus, the goal and challenge of reinforcement learning is to improve the behavior of an agent given only this limited type of feedback.

The reinforcement learning problem

In reinforcement learning, the decision-maker, i.e. the agent, interacts with an environment over a sequence of observations and seeks a reward to be maximized over time. Formally, the model consists of a finite set of environment states S, a finite set of agent actions A, and a set of scalar reinforcement signals (i.e. rewards) R. At each iteration i, the agent observes some representation of the environment’s state s_i ∈ S. On that basis, the agent selects an action a_i ∈ A(s_i), where A(s_i) ⊆ A denotes the set of actions available in state s_i. After each iteration, the agent receives a numerical reward r_{i+1} ∈ R and observes a new state s_{i+1}.

Policy learning

In order to store current knowledge, the reinforcement learning method introduces a so-called state-action function Q(s_i,a_i), which defines the expected value of each possible action a_i in each state s_i. If Q(s_i,a_i) is known, then the optimal policy π*(s_i,a_i) is given by the action a_i that maximizes Q(s_i,a_i) given the state s_i. Consequently, the learning problem of the agent is to maximize the expected reward by learning an optimal policy function π*(s_i,a_i).
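
As a small illustration of how a policy follows from a known state-action function, the sketch below picks the highest-valued action for each state. The Q-values are illustrative numbers, and the helper name greedy_policy is hypothetical and not part of the package.

```r
# Illustrative state-action table: rows are states, columns are actions
Q <- matrix(c(-0.66, -0.67,  0.75, -0.66,
               3.58, -0.69,  0.78,  0.74,
               3.57,  9.15,  3.58,  0.68),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("s1", "s2", "s3"),
                            c("right", "up", "down", "left")))

# Hypothetical helper: for every state, return the action with the largest Q-value
greedy_policy <- function(Q) {
  best <- apply(Q, 1, which.max)
  setNames(colnames(Q)[best], rownames(Q))
}

policy <- greedy_policy(Q)
print(policy)
```

The result is a named character vector with one action per state, which is the same shape as the policy printed by the package in the examples of this document.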

Experience replay

Experience replay allows reinforcement learning agents to remember and reuse experiences from the past. The underlying idea is to speed up convergence by replaying observed state transitions repeatedly to the agent, as if they were new observations collected while interacting with a system. Hence, experience replay only requires input data in the form of sample sequences consisting of states, actions and rewards. These data points can be, for example, collected from a running system without the need for direct interaction. The stored training examples then allow the agent to learn a state-action function and an optimal policy for every state transition in the input data. In a next step, the policy can be applied to the system for validation purposes or to collect new data points (e.g. in order to iteratively improve the current policy). As its main advantage, experience replay can speed up convergence by allowing for the back-propagation of information from updated states to preceding states without further interaction.
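
The idea can be sketched in a few lines of base R. The code below assumes the standard Q-learning update rule and replays a handful of made-up transitions several times; it is meant as an illustration of the principle, not as the package's internal implementation.

```r
# Stored transitions (s, a, r, s_new); values are made up for illustration
experience <- data.frame(
  State     = c("s1", "s2", "s3"),
  Action    = c("down", "right", "up"),
  Reward    = c(-1, -1, 10),
  NextState = c("s2", "s3", "s4"),
  stringsAsFactors = FALSE
)

states  <- c("s1", "s2", "s3", "s4")
actions <- c("up", "down", "left", "right")
Q <- matrix(0, nrow = length(states), ncol = length(actions),
            dimnames = list(states, actions))

alpha <- 0.1  # learning rate
gamma <- 0.5  # discount factor

# Replay the same stored transitions several times
for (pass in 1:50) {
  for (i in seq_len(nrow(experience))) {
    s     <- experience$State[i]
    a     <- experience$Action[i]
    r     <- experience$Reward[i]
    s_new <- experience$NextState[i]
    # Q-learning update toward r + gamma * max over actions of Q(s_new, .)
    target  <- r + gamma * max(Q[s_new, ])
    Q[s, a] <- Q[s, a] + alpha * (target - Q[s, a])
  }
}
round(Q, 2)
```

Because the transitions are replayed, the reward of 10 observed in s3 propagates backwards to s2 and s1 without any further interaction with the environment.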

Setup of the ReinforcementLearning package

Even though reinforcement learning has recently gained a great deal of traction in studies that perform human-like learning, the available tools are not living up to the needs of researchers and practitioners. The ReinforcementLearning package is intended to partially close this gap and offers the ability to perform model-free reinforcement learning in a highly customizable framework.


Using the devtools package, one can easily install the latest development version of ReinforcementLearning as follows.


# Option 1: download and install latest version from GitHub
devtools::install_github("nproellochs/ReinforcementLearning")

# Option 2: install directly from bundled archive

Package loading

Afterwards, one merely needs to load the ReinforcementLearning package as follows.

library(ReinforcementLearning)

The following sections present the usage and main functionality of the ReinforcementLearning package.

Data format

The ReinforcementLearning package uses experience replay to learn an optimal policy based on past experience in the form of sample sequences consisting of states, actions and rewards. Here each training example consists of a state transition tuple (s,a,r,s_new), as described below.

  • s The current environment state.
  • a The selected action in the current state.
  • r The immediate reward received after transitioning from the current state to the next state.
  • s_new The next environment state.


  • The input data must be a dataframe in which each row represents a state transition tuple (s,a,r,s_new).
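
A minimal sketch of such a dataframe is shown below. The three transitions are hypothetical, and the sketch assumes that states and actions are stored as character columns and rewards as numeric values; the column names themselves are free to choose.

```r
# Three hypothetical transitions; each row is one tuple (s, a, r, s_new)
transitions <- data.frame(
  State     = c("s1", "s2", "s3"),
  Action    = c("down", "right", "up"),
  Reward    = c(-1, -1, 10),
  NextState = c("s2", "s3", "s4"),
  stringsAsFactors = FALSE
)
str(transitions)
```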

Read-in sample experience

The state transition tuples can be collected from an external source and easily read into R. The sample experience can then be used to train a reinforcement learning agent without requiring further interaction with the environment. The following example shows a representative dataset containing game states of 100,000 randomly sampled Tic-Tac-Toe games.

head(tictactoe, 5)
##       State Action NextState Reward
## 1 .........     c7 ......X.B      0
## 2 ......X.B     c6 ...B.XX.B      0
## 3 ...B.XX.B     c2 .XBB.XX.B      0
## 4 .XBB.XX.B     c8 .XBBBXXXB      0
## 5 .XBBBXXXB     c1 XXBBBXXXB      0

Experience sampling using an environment function

The ReinforcementLearning package ships with the built-in capability to sample experience from a function that defines the dynamics of the environment. If the dynamics of the environment are known a priori, one can set up an arbitrarily complex environment function in R and sample state transition tuples. This function has to be implemented manually and must take a state and an action as input. The return value must be a list containing the name of the next state and the reward. As a main advantage, this method of experience sampling allows one to easily validate the performance of reinforcement learning by applying the learned policy to newly generated samples.

environment <- function(state, action) {
  # ... compute newState and reward from state and action ...
  return(list("NextState" = newState,
              "Reward" = reward))
}

The following example illustrates how to generate sample experience using an environment function. Here we collect experience from an agent that navigates from a random starting position to a goal position on a simulated 2×2 grid (see figure below).

| s1  | s4  |
| s2     s3 |

Each cell on the grid represents one state, which yields a total of 4 states. The grid is surrounded by a wall, which makes it impossible for the agent to move off the grid. In addition, the agent faces a wall between s1 and s4. At each state, the agent randomly chooses one out of four possible actions, i.e. to move up, down, left, or right. The agent encounters the following reward structure: crossing each square on the grid leads to a reward of -1. If the agent reaches the goal position (s4), it earns a reward of 10.

# Load exemplary environment (gridworld)
env <- gridworldEnvironment
print(env)
## function(state, action) {
##   next_state <- state
##   if(state == state("s1") && action == "down") next_state <- state("s2")
##   if(state == state("s2") && action == "up") next_state <- state("s1")
##   if(state == state("s2") && action == "right") next_state <- state("s3")
##   if(state == state("s3") && action == "left") next_state <- state("s2")
##   if(state == state("s3") && action == "up") next_state <- state("s4")
##   if(next_state == state("s4") && state != state("s4")) {
##     reward <- 10
##   } else {
##     reward <- -1
##   }
##   out <- list("NextState" = next_state, "Reward" = reward)
##   return(out)
## }
## <environment: namespace:ReinforcementLearning>
# Define state and action sets
states <- c("s1", "s2", "s3", "s4")
actions <- c("up", "down", "left", "right")

# Sample N = 1000 random sequences from the environment
data <- sampleExperience(N = 1000, env = env, states = states, actions = actions)
head(data)
##   State Action Reward NextState
## 1    s4   left     -1        s4
## 2    s2  right     -1        s3
## 3    s2  right     -1        s3
## 4    s3   left     -1        s2
## 5    s4     up     -1        s4
## 6    s1   down     -1        s2

Performing reinforcement learning

The following example shows how to teach a reinforcement learning agent using input data in the form of sample sequences consisting of states, actions and rewards. The ‘data’ argument must be a dataframe object in which each row represents a state transition tuple (s,a,r,s_new). Moreover, the user is required to specify the column names of the individual tuple elements in ‘data’.

# Define reinforcement learning parameters
control <- list(alpha = 0.1, gamma = 0.5, epsilon = 0.1)

# Perform reinforcement learning
model <- ReinforcementLearning(data, s = "State", a = "Action", r = "Reward", s_new = "NextState", control = control)

# Print result
print(model)
## State-Action function Q
##         right         up       down       left
## s1 -0.6633782 -0.6687457  0.7512191 -0.6572813
## s2  3.5806843 -0.6893860  0.7760491  0.7394739
## s3  3.5702779  9.1459425  3.5765323  0.6844573
## s4 -1.8005634 -1.8567931 -1.8244368 -1.8377018
## Policy
##      s1      s2      s3      s4 
##  "down" "right"    "up" "right" 
## Reward (last iteration)
## [1] -263

The result of the learning process is a state-action table and an optimal policy that defines the best possible action in each state.

# Print policy
##      s1      s2      s3      s4 
##  "down" "right"    "up" "right"

Updating an existing policy

Specifying an environment function to model the dynamics of the environment allows one to easily validate the performance of the agent. In order to do this, one simply applies the learned policy to newly generated samples. For this purpose, the ReinforcementLearning package comes with an additional predefined action selection mode, namely ‘epsilon-greedy’. In this strategy, the agent explores the environment by selecting an action at random with probability ε. Alternatively, the agent exploits its current knowledge by choosing the optimal action with probability 1-ε.
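
The selection rule itself can be sketched in base R as follows. The helper select_action and the Q-values are hypothetical and only illustrate the strategy described above; they are not the package's own code.

```r
# Hypothetical helper: epsilon-greedy choice for one state, given a Q table
select_action <- function(Q, state, epsilon) {
  if (runif(1) < epsilon) {
    sample(colnames(Q), 1)                # explore: uniformly random action
  } else {
    colnames(Q)[which.max(Q[state, ])]    # exploit: best known action
  }
}

# Illustrative Q-values for a single state
Q <- matrix(c(0.2, 0.9, -0.1, 0.0), nrow = 1,
            dimnames = list("s3", c("right", "up", "down", "left")))

set.seed(1)
choices <- replicate(1000, select_action(Q, "s3", epsilon = 0.1))
table(choices)
```

Note that a random exploration step can also land on the best action, so in this sketch the greedy action is chosen somewhat more often than 1-ε.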

The following example shows how to sample new experience from an existing policy using ‘epsilon-greedy’ action selection. The result is new state transition samples (‘data_new’) with significantly higher rewards compared to the original sample (‘data’).

# Define reinforcement learning parameters
control <- list(alpha = 0.1, gamma = 0.5, epsilon = 0.1)

# Sample N = 1000 sequences from the environment using epsilon-greedy action selection
data_new <- sampleExperience(N = 1000, env = env, states = states, actions = actions, model = model, actionSelection = "epsilon-greedy", control = control)
head(data_new)
##   State Action Reward NextState
## 1    s2  right     -1        s3
## 2    s4  right     -1        s4
## 3    s4  right     -1        s4
## 4    s4  right     -1        s4
## 5    s2  right     -1        s3
## 6    s1   down     -1        s2
# Update the existing policy using new training data
model_new <- ReinforcementLearning(data_new, s = "State", a = "Action", r = "Reward", s_new = "NextState", control = control, model = model)

# Print result
print(model_new)
## State-Action function Q
##        right         up       down       left
## s1 -0.643587 -0.6320560  0.7657318 -0.6314927
## s2  3.530829 -0.6407675  0.7714129  0.7427914
## s3  3.548196  9.0608344  3.5521760  0.7382102
## s4 -1.939574 -1.8922783 -1.8835278 -1.8856132
## Policy
##      s1      s2      s3      s4 
##  "down" "right"    "up"  "down" 
## Reward (last iteration)
## [1] 1211
summary(model_new)
## Model details
## Learning rule:           experienceReplay
## Learning iterations:     2
## Number of states:        4
## Number of actions:       4
## Total Reward:            1211
## Reward details (per iteration)
## Min:                     -263
## Max:                     1211
## Average:                 474
## Median:                  474
## Standard deviation:      1042.275
# Plot reinforcement learning curve
plot(model_new)


Parameter configuration

The ReinforcementLearning package allows for the adjustment of the following parameters in order to customize the learning behavior of the agent.

  • alpha The learning rate, set between 0 and 1. Setting it to 0 means that the Q-values are never updated and, hence, nothing is learned. Setting a high value, such as 0.9, means that learning can occur quickly.
  • gamma Discount factor, set between 0 and 1. Determines the importance of future rewards. A factor of 0 will render the agent short-sighted by only considering current rewards, while a factor approaching 1 will cause it to strive for a greater reward over the long term.
  • epsilon Exploration parameter, set between 0 and 1. Defines the exploration mechanism in ε-greedy action selection. In this strategy, the agent explores the environment by selecting an action at random with probability ε. Alternatively, the agent exploits its current knowledge by choosing the optimal action with probability 1-ε. This parameter is only required for sampling new experience based on an existing policy.
  • iter Number of repeated learning iterations. Must be an integer greater than 0. The default is set to 1.

# Define control object
control <- list(alpha = 0.1, gamma = 0.1, epsilon = 0.1)

# Pass learning parameters to reinforcement learning function
model <- ReinforcementLearning(data, s = "State", a = "Action", r = "Reward", s_new = "NextState", iter = 10, control = control)
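
To get a feeling for the discount factor gamma, the following sketch computes the discounted sum of a short, made-up reward sequence for three values of gamma. The helper discounted_return is hypothetical and independent of the package.

```r
# Illustrative reward sequence: two steps at -1, then a goal reward of 10
rewards <- c(-1, -1, 10)

# Discounted return: sum over t of gamma^t * r_t, with t starting at 0
discounted_return <- function(rewards, gamma) {
  sum(gamma^(seq_along(rewards) - 1) * rewards)
}

returns <- sapply(c(0, 0.5, 0.9), function(g) discounted_return(rewards, g))
names(returns) <- c("gamma=0", "gamma=0.5", "gamma=0.9")
returns
```

With gamma = 0 only the immediate reward of -1 counts, whereas larger values let the later reward of 10 dominate, which is the short-sighted versus long-term behavior described in the parameter list.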

Working example: Learning Tic-Tac-Toe

The following example shows the use of ReinforcementLearning in an applied setting. More precisely, we utilize a dataset containing 406,541 game states of Tic-Tac-Toe to learn the optimal actions for each state of the board.

# Load dataset
data("tictactoe")

# Define reinforcement learning parameters
control <- list(alpha = 0.2, gamma = 0.4, epsilon = 0.1)

# Perform reinforcement learning
model <- ReinforcementLearning(tictactoe, s = "State", a = "Action", r = "Reward", s_new = "NextState", iter = 1, control = control)

# Print optimal policy


  • All states are observed from the perspective of player X, who is also assumed to have played first.
  • The player who succeeds in placing three of their marks in a horizontal, vertical, or diagonal row wins the game. Reward for player X is +1 for ‘win’, 0 for ‘draw’, and -1 for ‘loss’.


Unit Testing in R

Software testing describes several means to investigate program code regarding its quality. The underlying approaches provide means to handle errors once they occur. Furthermore, software testing also shows techniques to reduce the probability of errors occurring in the first place.

R is becoming an increasingly prominent programming language. This not only includes pure statistical settings but also machine learning, dashboards via Shiny and beyond. This development is simultaneously fueled by business schools teaching R to their students. While software testing is usually covered from a theoretical viewpoint, our slides teach the basics of software testing in an easy-to-understand fashion with the help of R.

Our slide deck aims at bridging R programming and software testing. The slides outline the need for software testing and describe general approaches, such as the V model. In addition, we present the built-in features for error handling in R and also show how to do unit testing with the help of the “testthat” package.
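
Both ideas can be illustrated with a short sketch. The function safe_divide is a hypothetical example and is not taken from the slides; the unit test only runs if the testthat package is installed.

```r
# Hypothetical function under test
safe_divide <- function(x, y) {
  if (y == 0) stop("division by zero")
  x / y
}

# Built-in error handling: catch the error and fall back to NA
result <- tryCatch(safe_divide(1, 0), error = function(e) NA_real_)
print(result)

# Unit test with testthat (skipped when the package is not available)
if (requireNamespace("testthat", quietly = TRUE)) {
  testthat::test_that("safe_divide divides and rejects zero", {
    testthat::expect_equal(safe_divide(6, 3), 2)
    testthat::expect_error(safe_divide(1, 0), "division by zero")
  })
}
```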

We hope that the slide deck supports practitioners to unleash the power of unit testing in R. Moreover, it should equip scholars in business schools with knowledge on software testing.

Download the slides here

The content was republished on with permission.

Ensemble Learning in R

Previous research in data mining has devised numerous different algorithms for learning tasks. While an individual algorithm might already work decently, one can usually obtain better predictive performance by combining several. This approach is referred to as ensemble learning.
Common examples include random forests and boosting, AdaBoost in particular.
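
As a minimal sketch of the combination idea, the code below averages the predictions of several linear models, each fitted on a bootstrap resample (bagging). Bagging, the built-in mtcars dataset and the model formula are chosen here purely for illustration and are not taken from the slides.

```r
set.seed(42)
n_models <- 25

# Fit each base model on a bootstrap resample and predict on the full data
predictions <- sapply(seq_len(n_models), function(b) {
  idx <- sample(nrow(mtcars), replace = TRUE)
  fit <- lm(mpg ~ wt + hp, data = mtcars[idx, ])
  predict(fit, newdata = mtcars)
})

# Ensemble prediction: average over all base models
bagged <- rowMeans(predictions)
head(round(bagged, 1))
```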

Our slide deck is positioned at the intersection of teaching the basic idea of ensemble learning and providing practical insights in R.
Therefore, each algorithm comes with an easy-to-understand explanation on how to use it in R.

We hope that the slide deck enables practitioners to quickly adopt ensemble learning for their applications in R. Moreover, the materials might lay the groundwork for courses on data mining and machine learning.

Download the slides here
Download the exercise sheet here
The content was republished on with permission.

Reinforcement Learning in R

Reinforcement learning has gained considerable traction as it mines real experiences with the help of trial-and-error learning to model decision-making. Thus, this approach attempts to imitate the fundamental method used by humans of learning optimal behavior without the requirement of an explicit model of the environment. In contrast to many other approaches from the domain of machine learning, reinforcement learning works well with learning tasks of arbitrary length and can be used to learn complex strategies for many scenarios, such as robotics and game playing.

Our slide deck is positioned at the intersection of teaching the basic idea of reinforcement learning and providing practical insights into R. While existing packages, such as MDPtoolbox, are well suited to tasks that can be formulated as a Markov decision process, we also provide practical guidance regarding how to set up reinforcement learning in more vague environments. Therefore, each algorithm comes with an easy-to-understand explanation of how to use it in R.

We hope that the slide deck enables practitioners to quickly adopt reinforcement learning for their applications in R. Moreover, the materials might lay the groundwork for courses on human decision-making and machine learning.

Download the slides here

Download the exercise sheet here (solutions are available on request)

Sentiment Analysis in R

Current research in finance and the social sciences utilizes sentiment analysis to understand human decisions in response to textual materials. While sentiment analysis has received great traction lately, the available tools are not yet living up to the needs of researchers. In particular, R does not yet offer the capabilities that most researchers need.

Our package “SentimentAnalysis” performs a sentiment analysis of textual contents in R. This implementation utilizes various existing dictionaries, such as General Inquirer, Harvard IV or Loughran-McDonald. Furthermore, it can also create customized dictionaries. The latter uses LASSO regularization as a statistical approach to select relevant terms based on an exogenous response variable.
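
The principle of dictionary-based scoring can be sketched in base R as follows. The word lists and the function score_sentiment are hypothetical and deliberately tiny; they do not reproduce the package's API or any of the dictionaries named above.

```r
# Hypothetical mini-dictionary; real dictionaries contain thousands of terms
positive <- c("gain", "growth", "strong")
negative <- c("loss", "decline", "weak")

# Score = (number of positive words - number of negative words) / total words
score_sentiment <- function(text) {
  words <- strsplit(tolower(text), "[^a-z]+")[[1]]
  words <- words[nzchar(words)]
  if (length(words) == 0) return(0)
  (sum(words %in% positive) - sum(words %in% negative)) / length(words)
}

docs <- c("Strong growth lifted earnings.",
          "The decline led to a loss.")
scores <- sapply(docs, score_sentiment, USE.NAMES = FALSE)
scores
```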

This immediately reveals manifold implications for practitioners, as well as those involved in the fields of finance research and the social sciences: researchers can use R to extract text components that are relevant for readers and test their hypotheses on this basis. By the same token, practitioners can measure which wording actually matters to their readership and enhance their writing accordingly. We demonstrate the added benefits in two case studies drawn from finance and the social sciences.


Optimization and Operations Research in R

Authors: Stefan Feuerriegel and Joscha Märkle-Huß

R is widely taught in business courses and, hence, known by most data scientists with a business background. However, when it comes to optimization and Operations Research, many other languages are used. Especially for optimization, solutions range from Microsoft Excel solvers to modeling environments such as Matlab and GAMS. Most of these are non-free and require students to learn yet another language. Because of this, we propose to use R in optimization problems of Operations Research, since R is open source, free of charge and widely known. Furthermore, R provides a multitude of numerical optimization packages that are readily available. At the same time, R is widely used in industry, making it a suitable tool for leveraging the potential of numerical optimization.

The materials start with a review of numerical and linear algebra basics for optimization. Here, participants learn how to derive a problem statement that is compatible with solving algorithms. This is followed by an overview of problem classes, such as one- and multi-dimensional problems. Starting with linear and quadratic algorithms, we also cover convex optimization, followed by non-linear approaches such as gradient-based (gradient descent methods), Hessian-based (Newton and quasi-Newton methods) and non-gradient-based (Nelder-Mead) methods. We finally demonstrate the potent capabilities of R for Operations Research: we show how to solve optimization problems in industry and business, as well as illustrate the use in methods for statistics and data mining (e.g. quantile regression). All examples are supported by appropriate visualizations.
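
As a brief illustration of two of the method families mentioned above, the sketch below minimizes the Rosenbrock test function with base R's optim(), once with the derivative-free Nelder-Mead method and once with the quasi-Newton BFGS method. The test function and starting point are chosen for illustration and are not taken from the course materials.

```r
# Rosenbrock test function with its minimum at (1, 1)
rosenbrock <- function(p) {
  (1 - p[1])^2 + 100 * (p[2] - p[1]^2)^2
}
start <- c(-1.2, 1)

# Derivative-free simplex search
nm <- optim(start, rosenbrock, method = "Nelder-Mead")

# Quasi-Newton method; the gradient is approximated numerically here
bfgs <- optim(start, rosenbrock, method = "BFGS")

rbind(NelderMead = nm$par, BFGS = bfgs$par)
```

Both runs should end close to the known minimum at (1, 1); supplying an analytical gradient via the gr argument would typically make the BFGS run more accurate.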



1 – Motivation
2 – Introduction to R
3 – Advanced R
4 – Numerical Analysis


1 – Homework
2 – Homework
3 – Homework
4 – Homework

caffeR: an R wrapper for ‘caffe’

Authors: Christof Naumzik & Stefan Feuerriegel

Caffe provides a powerful framework for deep learning. It is developed and maintained by the Berkeley Vision and Learning Center (BVLC) and has received a great deal of traction lately.

Caffe enables users to define and train custom-made neural networks without hard-coding. Furthermore, it allows users to execute all computations on CPUs as well as GPUs. Recent research has created a vast zoo of models. This rich prevalence of existing models makes it easy for users to leverage pre-trained neural networks that are known to perform well in various machine learning tasks.

While caffe already offers Matlab and Python interfaces, R is not currently supported. Our package caffeR aims at providing wrapper functions that allow its users to run caffe from R. These include data preprocessing and setup of networks, as well as monitoring and evaluation of training processes. For this purpose, caffeR prepares the correct configuration files and then passes routine calls directly to caffe.

Download of caffeR via GitHub:

Deep Learning in R



Deep learning is a recent trend in machine learning that models highly non-linear representations of data. In the past years, deep learning has gained tremendous momentum and prevalence for a variety of applications (Wikipedia 2016a). Among these are image and speech recognition, driverless cars, natural language processing and many more. Interestingly, the majority of mathematical concepts for deep learning have been known for decades. However, it is only through several recent developments that the full potential of deep learning has been unleashed (Nair and Hinton 2010; Srivastava et al. 2014).

Previously, it was hard to train artificial neural networks due to vanishing gradients and overfitting problems. Both problems can now be largely mitigated by using different activation functions, dropout regularization and a massive amount of training data. For instance, the Internet can nowadays be utilized to retrieve large volumes of both labeled and unlabeled data. In addition, the availability of GPUs and GPGPUs has made computations much cheaper and faster.
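
The vanishing gradient problem can be made tangible with a small calculation. The sketch below is a simplification that ignores the weights and only multiplies the derivatives of the activation function across layers; the function names are hypothetical.

```r
# Derivative of the sigmoid; its maximum is 0.25, reached at z = 0
sigmoid_grad <- function(z) {
  s <- 1 / (1 + exp(-z))
  s * (1 - s)
}

# Derivative of the rectified linear unit: 1 for positive inputs, 0 otherwise
relu_grad <- function(z) as.numeric(z > 0)

depth <- c(1, 5, 10, 20)

# Best case for the sigmoid: every layer operates at z = 0
gradient_factor <- data.frame(
  layers  = depth,
  sigmoid = sigmoid_grad(0)^depth,
  relu    = relu_grad(1)^depth
)
gradient_factor
```

Even in the best case the sigmoid factor shrinks geometrically with depth, while the factor for an active rectified linear unit stays at 1, which is one reason why other activation functions made deeper networks easier to train.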

Today, deep learning has shown itself to be very effective for almost any task which requires machine learning. However, it is particularly suited to complex, hierarchical data. Its underlying artificial neural network models highly non-linear representations; these are usually composed of multiple layers together with non-linear transformations and tailored architectures. A typical representation of a deep neural network is depicted in Figure 1.

Figure 1. Model of a deep neural network.

The success of deep learning has led to a wide range of frameworks and libraries for various programming languages. Examples include Caffe, Theano, Torch and TensorFlow, amongst others. This blog entry aims to provide an overview and comparison of different deep learning packages available for the programming language R. We compare performance and ease-of-use across different datasets.

Packages for deep learning in R

The R programming language has gained considerable popularity among statisticians and data miners for its ease-of-use, as well as its sophisticated visualizations and analyses. With the advent of the deep learning era, support for deep learning in R has grown steadily, with an increasing number of packages becoming available. This section presents an overview of deep learning in R as provided by the following packages: MXNetR, darch, deepnet, H2O and deepr.

First of all, we note that the underlying learning algorithms greatly vary from one package to another. As such, Table 1 shows a list of the available methods/architectures in each of the packages.

Table 1. List of available deep learning methods across the R packages.

| Package | Available architectures of neural networks |
|---------|--------------------------------------------|
| MXNetR  | Feed-forward neural network, convolutional neural network (CNN) |
| darch   | Restricted Boltzmann machine, deep belief network |
| deepnet | Feed-forward neural network, restricted Boltzmann machine, deep belief network, stacked autoencoders |
| H2O     | Feed-forward neural network, deep autoencoders |
| deepr   | Simplifies some functions from the H2O and deepnet packages |

Package “MXNetR”

The MXNetR package is an interface to the MXNet library written in C++. It contains feed-forward neural networks and convolutional neural networks (CNN) (MXNetR 2016a). It also allows one to construct customized models. This package is distributed in two versions: a CPU-only version and a GPU version. The CPU version can be easily installed directly from inside R, whereas the GPU version depends on third-party libraries like cuDNN and requires building the library from its source code (MXNetR 2016b).

A feed-forward neural network (multi-layer perceptron) can be built in MXNetR with the function call:

mx.mlp(data, label, hidden_node=1, dropout=NULL, activation="tanh", out_activation="softmax", device=mx.ctx.default(), ...)

The parameters are as follows:

  • data – input matrix
  • label – training labels
  • hidden_node – a vector containing the number of hidden nodes in each hidden layer
  • dropout – a number in [0,1) containing the dropout ratio from the last hidden layer to the output layer
  • activation – either a single string or a vector containing the names of activation functions. Valid values are {'relu', 'sigmoid', 'softrelu', 'tanh'}
  • out_activation – a single string containing the name of the output activation function. Valid values are {'rmse', 'softmax', 'logistic'}
  • device – whether to train on mx.cpu (default) or mx.gpu
  • ... – other parameters passing to mx.model.FeedForward.create

The function mx.model.FeedForward.create is used internally in mx.mlp and takes the following parameters:

  • symbol – the symbolic configuration of the neural network
  • y – array of labels
  • X – training data
  • ctx – context, i.e. a device (CPU/GPU) or list of devices (multiple CPUs or GPUs)
  • num.round – number of iterations to train the model
  • optimizer – string (default is 'sgd')
  • initializer – initialization scheme for parameters
  • – validation set used during the process
  • eval.metric – evaluation function on the results
  • epoch.end.callback – callback when iteration ends
  • batch.end.callback – callback when one mini-batch iteration ends
  • array.batch.size – batch size used for array training
  • array.layout – can be {'auto', 'colmajor', 'rowmajor'}
  • kvstore – synchronization scheme for multiple devices

Sample call:

model <- mx.mlp(train.x, train.y, hidden_node=c(128,64), out_node=2, activation="relu", out_activation="softmax",num.round=100, array.batch.size=15, learning.rate=0.07, momentum=0.9, device=mx.cpu())

To use the trained model afterwards, we simply need to invoke the function predict() specifying the model as the first parameter and testset as the second:

preds <- predict(model, testset)
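
For classification, the predicted probabilities usually still need to be turned into class labels. The sketch below assumes that the prediction matrix holds one row per class and one column per sample and that labels are coded 0, 1, ...; both are assumptions about the output layout, and the matrix is a hand-made stand-in so that the example runs without MXNetR.

```r
# Stand-in for the output of predict(): 2 classes (rows) x 4 samples (columns)
preds <- matrix(c(0.9, 0.1,
                  0.3, 0.7,
                  0.6, 0.4,
                  0.2, 0.8), nrow = 2)

# Pick the most probable class per sample; subtract 1 for labels coded 0, 1, ...
pred.label <- max.col(t(preds)) - 1
pred.label
```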

The function mx.mlp() is essentially a proxy to the more flexible but lengthy process of defining a neural network using the ‘Symbol’ system of MXNetR. The equivalent of the previous network in symbolic definition is:

data <- mx.symbol.Variable("data")
fc1 <- mx.symbol.FullyConnected(data, num_hidden=128)
act1 <- mx.symbol.Activation(fc1, name="relu1", act_type="relu")
fc2 <- mx.symbol.FullyConnected(act1, name="fc2", num_hidden=64)
act2 <- mx.symbol.Activation(fc2, name="relu2", act_type="relu")
fc3 <- mx.symbol.FullyConnected(act2, name="fc3", num_hidden=2)
lro <- mx.symbol.SoftmaxOutput(fc3, name="sm")
model2 <- mx.model.FeedForward.create(lro, X=train.x, y=train.y, ctx=mx.cpu(),
                                      num.round=100, array.batch.size=15,
                                      learning.rate=0.07, momentum=0.9)

When the architecture of the network is finally created, MXNetR provides a simple way to graphically inspect it using the following function call:





Here, the parameter is the trained model represented by the symbol. The first network is constructed by mx.mlp() and the second using the symbol system.

The definition goes layer-by-layer from input to output, while also allowing for a different number of neurons and specific activation functions for each layer separately. Additional options are available via mx.symbol: mx.symbol.Convolution, which applies convolution to the input and then adds a bias. It can create convolutional neural networks. The reverse is mx.symbol.Deconvolution, which is usually used in segmentation networks along with mx.symbol.UpSampling in order to reconstruct the pixel-wise classification of an image. Another type of layer used in CNNs is mx.symbol.Pooling; this essentially reduces the data by usually picking signals with the highest response. The layer mx.symbol.Flatten is needed to link convolutional and pooling layers to a fully connected network. Additionally, mx.symbol.Dropout can be used to cope with the overfitting problem. It takes as a parameter previous_layer and a float value fraction of the input that is dropped.
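
What a dropout layer does to its input can be sketched in base R. The code below implements the common "inverted dropout" variant, in which the retained values are rescaled; whether MXNetR rescales in exactly this way is an assumption, and the function dropout is a hypothetical helper rather than a package function.

```r
set.seed(1)

# Hypothetical helper: drop each input with probability p, rescale the rest
dropout <- function(x, p) {
  keep <- rbinom(length(x), size = 1, prob = 1 - p)
  x * keep / (1 - p)    # rescaling keeps the expected activation unchanged
}

activations <- rep(1, 10000)
dropped <- dropout(activations, p = 0.5)

c(share_zero = mean(dropped == 0), mean_value = mean(dropped))
```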

As we can see, MXNetR can be used for quick design of standard multi-layer perceptrons with the function mx.mlp() or for more extensive experiments regarding symbolic representation.

Example of LeNet network:

# CPU context used for training (defined here so that the example is complete)
device.cpu <- mx.cpu()

data <- mx.symbol.Variable('data')
# First convolution block
conv1 <- mx.symbol.Convolution(data=data, kernel=c(5,5), num_filter=20)
tanh1 <- mx.symbol.Activation(data=conv1, act_type="tanh")
pool1 <- mx.symbol.Pooling(data=tanh1, pool_type="max", kernel=c(2,2), stride=c(2,2))
# Second convolution block
conv2 <- mx.symbol.Convolution(data=pool1, kernel=c(5,5), num_filter=50)
tanh2 <- mx.symbol.Activation(data=conv2, act_type="tanh")
pool2 <- mx.symbol.Pooling(data=tanh2, pool_type="max", kernel=c(2,2), stride=c(2,2))
# Fully connected layers
flatten <- mx.symbol.Flatten(data=pool2)
fc1 <- mx.symbol.FullyConnected(data=flatten, num_hidden=500)
tanh3 <- mx.symbol.Activation(data=fc1, act_type="tanh")
fc2 <- mx.symbol.FullyConnected(data=tanh3, num_hidden=10)
lenet <- mx.symbol.SoftmaxOutput(data=fc2)
model <- mx.model.FeedForward.create(lenet, X=train.array, y=train.y, ctx=device.cpu,
                                     num.round=5, array.batch.size=100,
                                     learning.rate=0.05, momentum=0.9)

Altogether, the MXNetR package is highly flexible, while supporting both multiple CPUs and multiple GPUs. It has a shortcut to build standard feed-forward networks, but also grants flexible functionality to build more complex, customized networks such as CNN LeNet.
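
As a rough check on the LeNet definition, the size of the data after each layer can be worked out by hand. The base R sketch below does this arithmetic under assumptions that are not stated in the text: a 28x28 single-channel input image, convolutions with stride 1 and no padding, and the kernel and stride values from the example. The helper layer_out() is a hypothetical name.

```r
# Output size of a convolution or pooling layer along one spatial dimension.
layer_out <- function(size, kernel, stride = 1, pad = 0) {
  floor((size + 2 * pad - kernel) / stride) + 1
}

size <- 28                                        # assumed input: 28 x 28 image
size <- layer_out(size, kernel = 5)               # after conv1 (5x5 kernel)
size <- layer_out(size, kernel = 2, stride = 2)   # after pool1 (2x2, stride 2)
size <- layer_out(size, kernel = 5)               # after conv2 (5x5 kernel)
size <- layer_out(size, kernel = 2, stride = 2)   # after pool2 (2x2, stride 2)

flat_units <- size * size * 50                    # conv2 uses 50 filters
print(flat_units)
```

Under these assumptions, the flatten layer hands a vector of this length to the first fully connected layer with 500 hidden units.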

Package “darch”

The darch package (darch 2015) implements the training of deep architectures, such as deep belief networks, which consist of layer-wise pre-trained restricted Boltzmann machines. The package also entails backpropagation for fine-tuning and, in the latest version, makes pre-training optional.

Training of a deep belief network is performed via the darch() function.

Sample call:

darch <- darch(train.x, train.y,
               rbm.numEpochs = 0,
               rbm.batchSize = 100,
               rbm.trainOutputLayer = FALSE,
               layers = c(784,100,10),
               darch.batchSize = 100,
               darch.learnRate = 2,
               darch.retainData = FALSE,
               darch.numEpochs = 20)

This function takes several parameters with the most important ones as follows:

  • x – input data
  • y – target data
  • layers – vector containing one integer for the number of neurons in each layer (including input and output layers)
  • rbm.batchSize – pre-training batch size
  • rbm.trainOutputLayer – boolean used in pre-training. If true, the output layer of RBM is trained as well
  • rbm.numCD – number of full steps for which contrastive divergence is performed
  • rbm.numEpochs – number of epochs for pre-training
  • darch.batchSize – fine-tuning batch size
  • darch.fineTuneFunction – fine-tuning function
  • darch.dropoutInput – dropout rate on the network input
  • darch.dropoutHidden – dropout rate on the hidden layers
  • darch.layerFunctionDefault – default activation function for DBN, available options are {'sigmoidUnitDerivative', 'binSigmoidUnit', 'linearUnitDerivative', 'linearUnit', 'maxoutUnitDerivative', 'sigmoidUnit', 'softmaxUnitDerivative', 'softmaxUnit', 'tanSigmoidUnitDerivative', 'tanSigmoidUnit' }
  • darch.stopErr – stops training if the error is smaller than or equal to a threshold
  • darch.numEpochs – number of epochs for fine-tuning
  • darch.retainData – boolean, indicates whether to store the training data in the darch instance after training

Based on the previous parameters, we can train our model resulting in an object darch. We can later apply this to a test dataset test.x to make predictions. In that case, an additional parameter type specifies the output type of the prediction. For example, it can be ‘raw’ to give probabilities, ‘bin’ for binary vectors and ‘class’ for class labels. Finally, the prediction is made when calling predict() as follows:

predictions <- predict(darch, test.x, type="bin")
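
The relation between probability output and class labels can be illustrated in base R. The sketch below is not darch's internal code; the matrix raw and the vector true_class are hypothetical stand-ins for a ‘raw’ prediction and the known labels. It picks the most probable class per row and compares it with the truth.

```r
# Hypothetical raw output for 3 observations and 3 classes (each row sums to 1).
raw <- matrix(c(0.7, 0.2, 0.1,
                0.1, 0.3, 0.6,
                0.2, 0.5, 0.3), nrow = 3, byrow = TRUE)
true_class <- c(1, 3, 1)   # hypothetical true class indices

# Index of the largest probability in each row, i.e. the predicted class
pred_class <- max.col(raw, ties.method = "first")

# Share of observations where prediction and truth agree
accuracy <- mean(pred_class == true_class)
print(accuracy)
```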

Overall, the basic usage of darch is very simple. It requires only one function to train the network. But on the other hand, the package is limited to deep belief networks, which usually require much more extensive training.

Package “deepnet”

deepnet (deepnet 2015) is a relatively small, yet quite powerful package with a variety of architectures to pick from. It can train a feed-forward network using the function nn.train() or initialize weights for the deep belief network with dbn.dnn.train(). This function internally uses rbm.train() to train a restricted Boltzmann machine (which can also be used individually). Furthermore, deepnet can also handle stacked autoencoders via sae.dnn.train().

Sample call (for nn.train()):

nn.train(x, y, initW=NULL, initB=NULL, hidden=c(50,20), activationfun="sigm", learningrate=0.8, momentum=0.5, learningrate_scale=1, output="sigm", numepochs=3, batchsize=100, hidden_dropout=0, visible_dropout=0)

One can set initial weights initW and biases initB, which are otherwise randomly generated. In addition, hidden controls the number of units in the hidden layers, whereas activationfun specifies the activation function of the hidden layers (can be ‘sigm’, ‘linear’ or ‘tanh’) and output that of the output layer (can be ‘sigm’, ‘linear’ or ‘softmax’).

As an alternative, the following example trains a neural network where the weights are initialized by a deep belief network (via dbn.dnn.train()). The difference is mainly in the contrastive divergence algorithm that trains the restricted Boltzmann machines. It is set via cd, giving the number of iterations for Gibbs sampling inside the learning algorithm.

dbn.dnn.train(x, y, hidden=c(1), activationfun="sigm", learningrate=0.8, momentum=0.5, learningrate_scale=1, output="sigm", numepochs=3, batchsize=100, hidden_dropout=0, visible_dropout=0, cd=1)
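
To illustrate what the number of Gibbs sampling iterations controls, the following base R sketch performs a single contrastive divergence update for a binary restricted Boltzmann machine. It is a simplified illustration, not deepnet's implementation: biases are omitted, the function cd_step() and all variable names are hypothetical, and the data is random toy input.

```r
sigm <- function(z) 1 / (1 + exp(-z))

# One CD-k weight update for a binary RBM (biases omitted for brevity).
# v0: visible data (rows = observations); W: visible x hidden weight matrix.
cd_step <- function(v0, W, k = 1, lr = 0.1) {
  h0 <- sigm(v0 %*% W)                        # positive phase: P(h = 1 | v0)
  vk <- v0
  hk <- h0
  for (step in seq_len(k)) {                  # k rounds of Gibbs sampling
    h_sample <- (matrix(runif(length(hk)), nrow = nrow(hk)) < hk) * 1
    vk <- sigm(h_sample %*% t(W))             # reconstruct the visible units
    hk <- sigm(vk %*% W)                      # hidden probabilities for the reconstruction
  }
  # Difference between data-driven and reconstruction-driven statistics
  grad <- (t(v0) %*% h0 - t(vk) %*% hk) / nrow(v0)
  W + lr * grad
}

set.seed(1)
v <- matrix(rbinom(20 * 6, 1, 0.5), nrow = 20)   # 20 toy observations, 6 visible units
W <- matrix(rnorm(6 * 3, sd = 0.1), nrow = 6)    # 3 hidden units
W_new <- cd_step(v, W, k = 1)
```

A larger k runs the sampling chain for longer before the negative statistics are collected, at the cost of more computation per update.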

Similarly, it is possible to initialize weights from stacked autoencoders. In addition to the parameter output, this example sets sae_output for the autoencoders, though it works the same as before.

sae.dnn.train(x, y, hidden=c(1), activationfun="sigm", learningrate=0.8, momentum=0.5, learningrate_scale=1, output="sigm", sae_output="linear", numepochs=3, batchsize=100, hidden_dropout=0, visible_dropout=0)

Finally, we can use a trained network to predict results via nn.predict(). Subsequently, we can transform the predictions with the help of nn.test() into an error rate. The first call requires a neural network and corresponding observations as inputs. The second call additionally needs the correct labels and a threshold when making predictions (default is 0.5).

predictions = nn.predict(nn, test.x)
error_rate = nn.test(nn, test.x, test.y, t=0.5)
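
The role of the threshold can be sketched in base R without deepnet. The example below is one plausible way to turn scores into an error rate and is not taken from the package source; the matrices scores and labels are hypothetical.

```r
# Hypothetical predicted scores and true one-hot labels (3 observations, 2 classes).
scores <- matrix(c(0.9, 0.2,
                   0.4, 0.7,
                   0.6, 0.3), nrow = 3, byrow = TRUE)
labels <- matrix(c(1, 0,
                   0, 1,
                   0, 1), nrow = 3, byrow = TRUE)

threshold <- 0.5
binary <- (scores >= threshold) * 1        # 1 where the score reaches the threshold
wrong <- rowSums(binary != labels) > 0     # an observation is wrong if any entry differs
error_rate <- mean(wrong)
print(error_rate)
```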

Altogether, deepnet represents a lightweight package with a restricted set of parameters; however, it offers a variety of architectures.

Package “H2O”

H2O is an open-source software platform with the ability to exploit distributed computer systems (H2O 2015). Its core is coded in Java and requires the latest version of JVM and JDK. The package provides interfaces for many languages and was originally designed to serve as a cloud-based platform (Candel et al. 2015). Accordingly, one starts H2O by calling h2o.init():

h2o.init(nthreads = -1)

The parameter nthreads specifies how many cores will be used for computation. A value of -1 means that H2O will try to use all available cores on the system, though the default is 2. This routine can also work with the parameters ip and port in case H2O is installed on a different machine. By default, it uses the local IP address together with port 54321. Thus, it is possible to open the address ‘localhost:54321’ in the browser in order to access a web-based interface. Once your work with the current H2O instance is finished, you need to disconnect via:


Sample call:

All training operations are performed by h2o.deeplearning() as follows:

model <- h2o.deeplearning(
  x=x,
  y=y,
  training_frame=train,
  validation_frame=test,
  distribution="multinomial",
  activation="RectifierWithDropout",
  hidden=c(32,32,32),
  input_dropout_ratio=0.2,
  sparse=TRUE,
  l1=1e-5,
  epochs=100)

The interface for passing data in H2O is slightly different from other packages: x is a vector containing the names of the columns with training data and y is the name of the column with the labels. The next two parameters, training_frame and validation_frame, are H2O frame objects. They can be created by calling h2o.uploadFile(), which takes a file path as an argument and loads a csv file into the environment. The use of a specific data class is motivated by the distributed environment, since the data should be available across the whole cluster. The parameter distribution is a string and can take the values ‘bernoulli’, ‘multinomial’, ‘poisson’, ‘gamma’, ‘tweedie’, ‘laplace’, ‘huber’ or ‘gaussian’, while ‘AUTO’ automatically picks a distribution based on the data. The following parameter specifies the activation function (possible values are ‘Tanh’, ‘TanhWithDropout’, ‘Rectifier’, ‘RectifierWithDropout’, ‘Maxout’ or ‘MaxoutWithDropout’). The parameter sparse is a boolean value denoting a high degree of zeros in the data, which allows H2O to handle it more efficiently. The remaining parameters are intuitive and do not differ much from other packages. Many more parameters are available for fine-tuning, but it will probably not be necessary to change them since they come with recommended, pre-defined values.
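
The name-based interface can be illustrated with plain R objects. In the sketch below, the built-in iris data frame merely stands in for an H2O frame (an assumption for illustration); the point is only how the vectors x and y are derived from column names.

```r
# The built-in iris data stands in for an H2O frame here (illustration only).
y <- "Species"                   # name of the column holding the labels
x <- setdiff(names(iris), y)     # names of all remaining columns (the predictors)
print(x)
```

Since only names are passed, the same x and y can be reused for the training and the validation frame as long as both share the same columns.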

Finally, we can make predictions using h2o.predict() with the following signature:

predictions <- h2o.predict(model, newdata=test_data)

Another powerful tool that H2O offers is the grid search for optimizing the hyperparameters. It is possible to specify sets of values for each parameter and then find the best combination via h2o.grid().

Hyperparameter optimization

hidden_par <- list(c(50,20,50), c(32,32,32))
l1_par <- c(1e-3,1e-8)
hyperp <- list(hidden=hidden_par, l1=l1_par)
model_grid <- h2o.grid("deeplearning",
                       hyper_params=hyperp,
                       x=x,
                       y=y,
                       distribution="multinomial",
                       training_frame=train,
                       validation_frame=test)

The H2O package will train four different models, combining two architectures with two L1-regularization weights. Therefore, it is possible to easily try a number of combinations of hyperparameters and see which one performs better:

for (model_id in model_grid@model_ids) {
    model <- h2o.getModel(model_id)
    mse <- h2o.mse(model, valid=TRUE)
    print(sprintf("MSE on the test set %f", mse))
}
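
The number of models in such a grid is simply the size of the Cartesian product of the parameter sets. The base R sketch below enumerates the combinations for the values used above; it does not call H2O, and the column names hidden_idx and hidden are introduced here for illustration.

```r
hidden_par <- list(c(50, 20, 50), c(32, 32, 32))
l1_par <- c(1e-3, 1e-8)

# Enumerate index pairs first, since the architectures are vectors rather than scalars
combos <- expand.grid(hidden_idx = seq_along(hidden_par), l1 = l1_par)

# Readable label for each architecture, e.g. "50-20-50"
combos$hidden <- sapply(hidden_par[combos$hidden_idx], paste, collapse = "-")
print(combos[, c("hidden", "l1")])
```

Adding a third hyperparameter with three candidate values would already raise the count to twelve models, so grids grow quickly.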

Deep autoencoders

H2O can also exploit deep autoencoders. To train such a model, the same function h2o.deeplearning() is used, but the set of parameters is slightly different:

anomaly_model <- h2o.deeplearning(
  x = names(train),
  training_frame = train,
  activation = "Tanh",
  autoencoder = TRUE,
  hidden = c(50,20,50),
  sparse = TRUE,
  l1 = 1e-4,
  epochs = 100)

Here, we use only the training data, without the test set and labels. The fact that we need a deep autoencoder instead of a feed-forward network is specified by the autoencoder parameter. As before, we can choose how many hidden units should be in different layers. If we use one integer value, we will get a naive autoencoder.

After training, we can study the reconstruction error. We compute it by the specific h2o.anomaly() function.

# Compute reconstruction error (MSE between output and input layers)
recon_error <- h2o.anomaly(anomaly_model, test)
# Convert reconstruction error data into R data frame
recon_error <-
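
What a per-observation reconstruction error looks like can be shown with a few lines of base R. The matrices input and output below are hypothetical stand-ins for the data and the autoencoder's reconstruction; the sketch does not call H2O.

```r
# Hypothetical inputs and autoencoder reconstructions (rows = observations).
input  <- matrix(c(1.0, 0.0, 0.5,
                   0.2, 0.8, 0.4), nrow = 2, byrow = TRUE)
output <- matrix(c(0.9, 0.1, 0.5,
                   0.2, 0.2, 0.4), nrow = 2, byrow = TRUE)

# Mean squared error between input and reconstruction, one value per observation
recon_mse <- rowMeans((input - output)^2)
print(recon_mse)
```

Observations with an unusually large error are poorly reconstructed by the model, which is what makes this quantity useful for anomaly detection.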

Overall, H2O is a highly user-friendly package that can be used to train feed-forward networks or deep autoencoders. It supports distributed computations and provides a web interface.

Package “deepr”

The package deepr (deepr 2015) doesn’t implement any deep learning algorithms itself but forwards its tasks to H2O. The package was originally designed at a time when the H2O package was not yet available on CRAN. As this is no longer the case, we will exclude it from our comparison. We also note that its function train_rbm() uses the deepnet implementation of RBM to train a model with some additional output.

Comparison of Packages

This section compares the aforementioned packages across different metrics. Among these are ease-of-use, flexibility, ease-of-installation, support of parallel computations and assistance in choosing hyperparameters. In addition, we measure the performance across three common datasets ‘Iris’, ‘MNIST’ and ‘Forest Cover Type’. We hope that our comparison aids practitioners and researchers in choosing their preferred package for deep learning.

Installation

Installing packages that are available via CRAN is usually very simple and smooth. However, some packages depend on third-party libraries. For example, H2O requires the latest version of Java, as well as the Java Development Kit. The darch and MXNetR packages allow the use of a GPU. For that purpose, darch depends on the R package gputools, which is only supported on Linux and MacOS systems. MXNetR is by default shipped without GPU support due to its dependence on cuDNN, which cannot be included in the package because of licensing restrictions. Thus, the GPU version of MXNetR requires Rtools and a modern compiler with C++11 support to compile MXNet from source with the CUDA SDK and cuDNN.

Flexibility

With respect to flexibility, MXNetR is most likely at the top of the list. It allows one to experiment with different architectures due to its layer-wise approach of defining the network, not to mention the rich variety of parameters. In our opinion, both H2O and darch share second place. H2O predominantly addresses feed-forward networks and deep autoencoders, while darch focuses on restricted Boltzmann machines and deep belief networks. Both packages offer a broad range of tuning parameters. Last but not least, deepnet is a rather lightweight package, but it might be beneficial when one wants to play around with different architectures. However, we do not recommend it for day-to-day use with huge datasets, as its current version lacks GPU support and the relatively small set of parameters does not allow fine-tuning to the fullest.

Ease-of-use

H2O and MXNetR stand out for their speed and ease of use. MXNetR requires little to no preparation of data to start training, and H2O offers a very intuitive wrapper in the as.h2o() function, which converts data to an H2OFrame object. Both packages provide additional tools to examine models. deepnet takes labels in the form of a one-hot encoded matrix. This usually requires some pre-processing, since most datasets have their classes in a vector format. Moreover, deepnet does not report very detailed information regarding the progress during training, and it lacks additional tools for examining models. darch, on the other hand, has a very nice and verbose output.
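
The pre-processing needed for one-hot encoded labels is short in base R. The following sketch converts a class vector into such a matrix; the label values are hypothetical examples, and this is one of several possible ways to do the conversion.

```r
labels <- factor(c("setosa", "virginica", "versicolor", "setosa"))

# Row i of the identity matrix is the one-hot code of class i,
# so indexing by the integer class codes yields one row per observation.
one_hot <- diag(nlevels(labels))[as.integer(labels), ]
colnames(one_hot) <- levels(labels)
print(one_hot)
```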

Overall, we see H2O or MXNetR as the winners in this category, since both are fast and provide feedback during training. This allows one to quickly adjust parameters and improve the predictive performance.

Parallelization

Deep learning is common when dealing with massive datasets. As such, it can be of tremendous help when the packages allow for some degree of parallelization. Table 2 compares the support of parallelization. It shows only information that is explicitly stated in the documentation.

Table 2. Comparison of parallelization.

Package   Multiple CPU   [Multiple] GPU   Cluster   Platforms
MXNetR    X              X                -         Linux/MacOS/Windows
darch     -              X                -         Linux/MacOS
H2O       X              -                X         Linux/MacOS/Windows
deepnet   No information

Choice of parameters

Another crucial aspect is the choice of hyperparameters. The H2O package uses a fully-automated per-neuron adaptive learning rate for fast convergence. It also has an option to use n-fold cross-validation and offers the function h2o.grid() for grid search in order to optimize hyperparameters and support model selection.

MXNetR displays the training accuracy after each iteration. darch shows the error after each epoch. Both allow for manually experimenting with different hyperparameters without waiting for the convergence, since the training phase can be terminated earlier in case the accuracy doesn’t improve. In contrast, deepnet doesn’t display any information until training is completed, which makes tweaking the hyperparameters very challenging.

Performance and runtime

We prepared a very simple comparison of performance in order to provide our readers with information on the efficiency. All subsequent measurements were made on a system with CPU Intel Core i7 and GPU NVidia GeForce 750M, Windows OS. The comparison is carried out on three datasets: ‘MNIST’ (LeCun et al. 2012), ‘Iris’ (Fisher 1936) and ‘Forest Cover Type’ (Blackard and Dean 1998). Details are provided in the appendix.

As a baseline, we use the random forest algorithm as implemented in the H2O package. The random forest is an ensemble learning method that works by constructing multiple decision trees (Wikipedia 2016b). Interestingly, it has proved able to achieve high performance out of the box, largely without parameter tuning.


The results of the measurements are presented in Table 3 and also visualized in Figures 2, 3, and 4 for the ‘MNIST’, ‘Iris’ and ‘Forest Cover Type’ datasets, respectively.

  • ‘MNIST’ dataset. According to Table 3 and Figure 2, MXNetR and H2O achieve a superior trade-off between runtime and predictive performance on the ‘MNIST’ dataset. darch and deepnet take a relatively long time to train the networks while simultaneously achieving a lower accuracy.
  • ‘Iris’ dataset. Here, we see again that MXNetR and H2O perform best. As can be seen from Figure 3, deepnet has the lowest accuracy, probably because it is such a tiny dataset where the pre-training is misleading. Because of this, darch 100 and darch 500/300 were trained through backpropagation, omitting a pre-training phase. This is marked by the * symbol in the table.
  • ‘Forest Cover Type’ dataset. H2O and MXNetR show an accuracy of only around 67%, but this is still better than the remaining packages. We note that the training of darch 100 and darch 500/300 didn’t converge, and the models have thus been excluded from this comparison.

We hope that even this simple performance comparison can provide valuable insights for practitioners when choosing their preferred R package.

Note: It can be seen from Figures 3 and 4 that the random forest can perform better than the deep learning packages. There are several valid reasons for this. First, the datasets are too small, as deep learning usually requires big data or the use of data augmentation to function properly. Second, the data in these datasets consists of hand-crafted features, which negates the advantage of deep architectures in learning those features from raw data; therefore, traditional methods might be sufficient. Finally, we chose very similar (and probably not the most efficient) architectures in order to compare the different implementations.

Table 3. Comparison of accuracy and runtime across different deep learning packages in R.
* Models that were trained with backpropagation only (no pre-training).

Model            MNIST                        Iris                         Forest Cover Type
                 Accuracy (%)  Runtime (sec)  Accuracy (%)  Runtime (sec)  Accuracy (%)  Runtime (sec)
MXNetR (CPU)     98.33         147.78         83.04         1.46           66.80         30.24
MXNetR (GPU)     98.27         336.94         84.77         3.09           67.75         80.89
darch 100        92.09         1368.31        69.12 *       1.71           -             -
darch 500/300    95.88         4706.23        54.78 *       2.10           -             -
deepnet DBN      97.85         6775.40        30.43         0.89           14.06         67.97
deepnet DNN      97.05         2183.92        78.26         0.42           26.01         25.67
H2O              98.08         543.14         89.56         0.53           67.36         5.78
Random Forest    96.77         125.28         91.30         2.89           86.25         9.41

Figure 2. Comparison of runtime and accuracy for the ‘MNIST’ dataset.

Figure 3. Comparison of runtime and accuracy for the ‘Iris’ dataset.

Figure 4. Comparison of runtime and accuracy for the ‘Forest Cover Type’ dataset.

Conclusion

As part of this article, we have compared five different packages in R for the purpose of deep learning: (1) the current version of deepnet might represent the most differentiated package in terms of available architectures. However, due to its implementation, it might be neither the fastest nor the most user-friendly option. Furthermore, it might not offer as many tuning parameters as some of the other packages. (2) H2O and MXNetR, on the contrary, offer a highly user-friendly experience. Both also provide output of additional information, perform training quickly and achieve decent results. H2O might be more suited for cluster environments, where data scientists can use it for data mining and exploration within a straightforward pipeline. When flexibility and prototyping are more of a concern, then MXNetR might be the most suitable choice. It provides an intuitive symbolic tool that is used to build custom network architectures from scratch. Additionally, it is well optimized to run on a personal computer by exploiting multi CPU/GPU capabilities. (3) darch offers a limited but targeted functionality focusing on deep belief networks.

Altogether, we see that R support for deep learning is well on its way. Initially, the offered capabilities of R were lagging behind other programming languages. However, this is no longer the case. With H2O and MXNetR, R users have two powerful tools at their fingertips. In the future, it would be desirable to see further interfaces – e.g. for Caffe or Torch.

Appendix

‘MNIST’ is a well-known digit recognition dataset. It contains 60,000 training samples and 10,000 test samples with labels and can be downloaded in csv format. The ‘Forest Cover Type’ dataset originates from a Kaggle challenge. It contains 15,120 labeled observations that we divide into a 70% training set and a 30% test set. It has 54 features and 7 output classes of cover type. The ‘Iris’ dataset is also very popular in machine learning. It is a tiny dataset with 3 classes and 150 samples, and we also subdivide it in a 70/30 ratio for training and testing. We immediately observe that practical applications require far larger datasets to unleash the full potential of deep learning. We are aware of this issue but, nevertheless, want to provide a very plain comparison. However, our experiments indicate that not all packages might be suitable for big data and can thus still provide decision support to practitioners.
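
A 70/30 split of this kind can be sketched in base R as follows. The example uses the built-in iris data; the seed value and the variable names are arbitrary choices made here, not the settings used for the reported experiments.

```r
set.seed(42)                                    # arbitrary seed for reproducibility
n <- nrow(iris)                                 # 150 observations
train_idx <- sample(n, size = round(0.7 * n))   # row indices of the training set
train <- iris[train_idx, ]
test  <- iris[-train_idx, ]                     # all remaining rows
print(c(train = nrow(train), test = nrow(test)))
```

Because sample() draws without replacement by default, no observation ends up in both sets.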

For the ‘MNIST’ dataset, all networks were designed to have 2 hidden layers with 500 and 300 units, respectively. One exception is darch 100, which has one hidden layer with 100 units. For the other datasets, the number of hidden units was reduced by a factor of ten and, hence, architectures have 2 hidden layers with 50 and 30 units, respectively. Where possible, the array batch size was set to 500 elements, momentum to 0.9, learning rate to 0.07 and dropout ratio to 0.2. The number of rounds (in MXNetR) or epochs (in other packages) was set to 50. darch architectures used pre-training with 15 epochs and batch size 100.

The ‘Iris’ dataset is tiny compared to the others. It has only 150 samples that were randomly shuffled and divided into training and test sets. Because results on such a small dataset vary from run to run, all numbers in tables referring to it were averaged across 5 runs. The batch size parameter was reduced to 5 and the learning rate to 0.007.

The third dataset is the ‘Forest Cover Type’, which has 15,120 samples. The architecture of the networks was the same as for the ‘Iris’ dataset. As this dataset is more challenging, the number of epochs was increased from 50 to 100.

References

Blackard, J. A., and Dean, D. J. 1998. “Comparative accuracies of neural networks and discriminant analysis in predicting forest cover types from cartographic variables,” in Proc. Second Southern Forestry GIS Conf., pp. 189–199.

Candel, A., Parmar, V., LeDell, E., and Arora, A. 2015. “Deep learning with H2O,”

darch. 2015. “Package darch,” (available at

deepnet. 2015. “Package deepnet,” (available at

deepr. 2015. “Deepr,” (available at; retrieved January 9, 2016).

Fisher, R. A. 1936. “The use of multiple measurements in taxonomic problems,” Annals of Eugenics (7:2), pp. 179–188.

H2O. 2015. “Package h2o,” (available at

LeCun, Y. A., Bottou, L., Orr, G. B., and Müller, K.-R. 2012. “Efficient backprop,” in Neural networks: Tricks of the trade, Springer, pp. 9–48.

MXNetR. 2016a. “MXNet R package: MXNet 0.5.0 documentation,” (available at; retrieved January 9, 2016).

MXNetR. 2016b. “Installation guide: MXNet 0.5.0 documentation,” (available at; retrieved January 9, 2016).

Nair, V., and Hinton, G. E. 2010. “Rectified linear units improve restricted Boltzmann machines,” in Proceedings of the 27th International Conference on Machine Learning (ICML-10), pp. 807–814.

Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. 2014. “Dropout: A simple way to prevent neural networks from overfitting,” Journal of Machine Learning Research (15:1), pp. 1929–1958.

Wikipedia. 2016a. “Wikipedia: Deep learning,” (available at; retrieved March 17, 2016).

Wikipedia. 2016b. “Wikipedia: Random forest,” (available at; retrieved February 3, 2016).