Top 7 Libraries and Packages for Data Science and AI: Python & R

This is a list of the best libraries and packages that changed our lives this year, compiled from my weekly digests.

Introduction

If you follow me, you know that this year I started a series called Weekly Digest for Data Science and AI: Python & R, where I highlighted the best libraries, repos, packages, and tools that help us be better data scientists for all kinds of tasks.

The great folks at Heartbeat sponsored a lot of these digests, and they asked me to create a list of the best of the best—those libraries that really changed or improved the way we worked this year (and beyond).

If you want to read the past digests, take a look here:

Top 7 for Python

7. AdaNet — Fast and flexible AutoML with learning guarantees.

AdaNet is a lightweight and scalable TensorFlow AutoML framework for training and deploying adaptive neural networks using the AdaNet algorithm [Cortes et al. ICML 2017]. AdaNet combines several learned subnetworks in order to mitigate the complexity inherent in designing effective neural networks.

This package will help you select optimal neural network architectures by implementing an adaptive algorithm that learns a neural architecture as an ensemble of subnetworks.

You will need to know TensorFlow to use the package because it implements a TensorFlow Estimator. The upside is that the Estimator interface simplifies your machine learning code by encapsulating training, evaluation, prediction, and export for serving.

You can build an ensemble of neural networks, and the library will help you optimize an objective that balances the trade-offs between the ensemble’s performance on the training set and its ability to generalize to unseen data.
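To make that trade-off concrete, here is a toy sketch in plain Python. It does not use the adanet API: the candidate names, losses, complexity values, and the penalty weight `lam` are all made-up numbers for illustration. It simply picks the candidate whose training loss plus a complexity penalty is lowest, which is the general idea behind a complexity-regularized objective.

```python
# Toy illustration of a complexity-penalized selection rule.
# All candidates and numbers below are hypothetical, not AdaNet output.

def penalized_objective(train_loss, complexity, lam):
    """Training loss plus a penalty that grows with model complexity."""
    return train_loss + lam * complexity


def pick_candidate(candidates, lam):
    """Return the name of the candidate with the lowest penalized objective."""
    def score(name):
        train_loss, complexity = candidates[name]
        return penalized_objective(train_loss, complexity, lam)

    return min(candidates, key=score)


# name -> (training loss, complexity measure), both invented for this sketch
candidates = {
    "small_subnetwork": (0.30, 1.0),
    "medium_subnetwork": (0.22, 2.0),
    "large_subnetwork": (0.20, 6.0),
}

no_penalty = pick_candidate(candidates, lam=0.0)
with_penalty = pick_candidate(candidates, lam=0.02)
print(no_penalty, with_penalty)
```

Without a penalty, the largest candidate wins on training loss alone. With a small penalty, the mid-sized one comes out ahead because its slightly higher loss costs less than the extra complexity.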

Installation

adanet depends on bug fixes and enhancements not present in TensorFlow releases prior to 1.7. You must install or upgrade your TensorFlow package to at least 1.7:

Installing from source

To install from source, you’ll first need to install bazel following their installation instructions.

Next clone adanet and cd into its root directory:

From the adanet root directory run the tests:

Once you have verified that everything works well, install adanet as a pip package.

You’re now ready to experiment with adanet.

Usage

Here you can find two examples on the usage of the package:

You can read more about it in the original blog post:

6. TPOT— An automated Python machine learning tool that optimizes machine learning pipelines using genetic programming.

Previously I talked about Auto-Keras, a great library for AutoML in the Pythonic world. Well, I have another very interesting tool for that.

The name is TPOT (Tree-based Pipeline Optimization Tool), and it’s an amazing library. It’s basically a Python automated machine learning tool that optimizes machine learning pipelines using genetic programming.

TPOT can automate a lot of stuff like feature selection, model selection, feature construction, and much more. Luckily, if you’re a Python machine learner, TPOT is built on top of Scikit-learn, so all of the code it generates should look familiar.

What it does is automate the most tedious parts of machine learning by intelligently exploring thousands of possible pipelines to find the best one for your data, and then it provides you with the Python code for the best pipeline it found so you can tinker with the pipeline from there.
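If "genetic programming over pipelines" sounds abstract, the following toy sketch may help. It is not TPOT's code or API: the step names, the made-up `BONUS` scores standing in for cross-validated accuracy, and the `evolve` function are all assumptions for illustration. The loop keeps the best half of a population of pipelines, then refills it with mutated crossovers of the survivors.

```python
import random

# Hypothetical building blocks; a real search would use scikit-learn operators.
SCALERS = ["none", "standard", "minmax"]
SELECTORS = ["none", "top_k", "variance"]
MODELS = ["tree", "forest", "logistic", "knn"]
CHOICES = [SCALERS, SELECTORS, MODELS]

# Made-up contributions standing in for a cross-validated score.
BONUS = {"standard": 0.05, "minmax": 0.03, "top_k": 0.04, "variance": 0.01,
         "tree": 0.70, "forest": 0.80, "logistic": 0.75, "knn": 0.72}


def fitness(pipeline):
    """Pretend evaluation: sum the made-up contribution of each step."""
    return sum(BONUS.get(step, 0.0) for step in pipeline)


def mutate(pipeline, rng):
    """Swap one randomly chosen step for another option in the same slot."""
    slot = rng.randrange(len(CHOICES))
    child = list(pipeline)
    child[slot] = rng.choice(CHOICES[slot])
    return tuple(child)


def crossover(parent_a, parent_b, rng):
    """Build a child by taking each step from one parent or the other."""
    return tuple(rng.choice(pair) for pair in zip(parent_a, parent_b))


def evolve(generations=20, population_size=8, seed=0):
    rng = random.Random(seed)
    population = [tuple(rng.choice(options) for options in CHOICES)
                  for _ in range(population_size)]
    for _ in range(generations):
        # Keep the better half, then refill with mutated crossovers.
        population.sort(key=fitness, reverse=True)
        survivors = population[: population_size // 2]
        children = []
        while len(survivors) + len(children) < population_size:
            parent_a, parent_b = rng.sample(survivors, 2)
            children.append(mutate(crossover(parent_a, parent_b, rng), rng))
        population = survivors + children
    return max(population, key=fitness)


best = evolve()
print(best, round(fitness(best), 2))
```

Because the best pipelines always survive to the next generation, the top score never gets worse from one generation to the next.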

This is how it works:

For more details you can read these great articles by Matthew Mayo:

and Randy Olson:

Installation

You actually need to follow some instructions before installing TPOT. Here they are:

After that you can just run:

Examples:

First let’s start with the basic Iris dataset:

from tpot import TPOTClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load iris dataset
iris = load_iris()

# Split the data

X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target,
                                                    train_size=0.75, test_size=0.25)

# Fit the TPOT classifier 

tpot = TPOTClassifier(verbosity=2, max_time_mins=2)
tpot.fit(X_train, y_train)

# Export the pipeline
tpot.export('tpot_iris_pipeline.py')

So here we built a very basic TPOT pipeline that will try to look for the best ML pipeline to predict the iris.target. And then we save that pipeline. After that, what we have to do is very simple — load the .py file you generated and you’ll see:

And that’s it. You built a classifier for the Iris dataset in a simple but powerful way.

Let’s move on to the digits dataset that ships with scikit-learn (a small, MNIST-style set of handwritten digits):

from tpot import TPOTClassifier
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

# load and split dataset 
digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target,
                                                    train_size=0.75, test_size=0.25)

# Fit the TPOT classifier 
tpot = TPOTClassifier(verbosity=2, max_time_mins=5, population_size=40)
tpot.fit(X_train, y_train)

# Export pipeline
tpot.export('tpot_mnist_pipeline.py')

As you can see, we did the same! Let’s load the .py file you generated again and you’ll see:

Super easy and fun. Try it out, and please give the project a star!

5. SHAP — A unified approach to explain the output of any machine learning model

Explaining machine learning models isn’t always easy. Yet it’s so important for a range of business applications. Luckily, there are some great libraries that help us with this task. In many applications, we need to know, understand, or prove how input variables are used in the model, and how they impact final model predictions.

SHAP (SHapley Additive exPlanations) is a unified approach to explain the output of any machine learning model. SHAP connects game theory with local explanations, uniting several previous methods and representing the only possible consistent and locally accurate additive feature attribution method based on expectations.
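To see what "additive feature attribution" means in practice, here is a minimal sketch that computes exact Shapley values by brute force for a tiny made-up model. It does not use the shap package: the `model` function, the input `x`, and the all-zeros `baseline` are assumptions for illustration, and replacing "missing" features with a single baseline is a simplification of the expectation over background data that SHAP works with.

```python
from itertools import combinations
from math import factorial


def model(x):
    """A tiny hypothetical model with an interaction between features 0 and 1."""
    return 2.0 * x[0] + 1.0 * x[1] + 3.0 * x[0] * x[1] - 0.5 * x[2]


def value(subset, x, baseline):
    """Model output when only the features in `subset` take their real values."""
    mixed = [x[i] if i in subset else baseline[i] for i in range(len(x))]
    return model(mixed)


def shapley_values(x, baseline):
    """Average each feature's marginal contribution over all feature subsets."""
    n = len(x)
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                without_i = set(subset)
                with_i = without_i | {i}
                gain = value(with_i, x, baseline) - value(without_i, x, baseline)
                phi[i] += weight * gain
    return phi


x = [1.0, 2.0, 4.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(x, baseline)
print(phi)
print(sum(phi), model(x) - model(baseline))
```

The attributions add up to the difference between the model's output at `x` and at the baseline, and the interaction term is split evenly between the two features involved. Brute force is exponential in the number of features, which is why approximations matter for real models.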

Installation

SHAP can be installed from PyPI

or conda-forge

Usage

There are tons of different models and ways to use the package. Here, I’ll take one example from the DeepExplainer.

Deep SHAP is a high-speed approximation algorithm for SHAP values in deep learning models that builds on a connection with DeepLIFT, as described in the SHAP NIPS paper that you can read here:

Here you can see how SHAP can be used to explain the result of a Keras model for the MNIST dataset:

# Front Page DeepExplainer MNIST Example
#
# A simple example showing how to explain an MNIST CNN trained using Keras with DeepExplainer.

# this is the code from https://github.com/keras-team/keras/blob/master/examples/mnist_cnn.py
from __future__ import print_function
import keras
from keras.datasets import mnist
from keras.models import Sequential
from keras.layers import Dense, Dropout, Flatten
from keras.layers import Conv2D, MaxPooling2D
from keras import backend as K

batch_size = 128
num_classes = 10
epochs = 12

# input image dimensions
img_rows, img_cols = 28, 28

# the data, split between train and test sets
(x_train, y_train), (x_test, y_test) = mnist.load_data()

if K.image_data_format() == 'channels_first':
    x_train = x_train.reshape(x_train.shape[0], 1, img_rows, img_cols)
    x_test = x_test.reshape(x_test.shape[0], 1, img_rows, img_cols)
    input_shape = (1, img_rows, img_cols)
else:
    x_train = x_train.reshape(x_train.shape[0], img_rows, img_cols, 1)
    x_test = x_test.reshape(x_test.shape[0], img_rows, img_cols, 1)
    input_shape = (img_rows, img_cols, 1)

x_train = x_train.astype('float32')
x_test = x_test.astype('float32')
x_train /= 255
x_test /= 255
print('x_train shape:', x_train.shape)
print(x_train.shape[0], 'train samples')
print(x_test.shape[0], 'test samples')

# convert class vectors to binary class matrices
y_train = keras.utils.to_categorical(y_train, num_classes)
y_test = keras.utils.to_categorical(y_test, num_classes)

model = Sequential()
model.add(Conv2D(32, kernel_size=(3, 3),
                 activation='relu',
                 input_shape=input_shape))
model.add(Conv2D(64, (3, 3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))
model.add(Flatten())
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(num_classes, activation='softmax'))

model.compile(loss=keras.losses.categorical_crossentropy,
              optimizer=keras.optimizers.Adadelta(),
              metrics=['accuracy'])

model.fit(x_train, y_train,
          batch_size=batch_size,
          epochs=epochs,
          verbose=1,
          validation_data=(x_test, y_test))
score = model.evaluate(x_test, y_test, verbose=0)
print('Test loss:', score[0])
print('Test accuracy:', score[1])

# Recorded notebook output (from a run that shows 2 epochs):
# x_train shape: (60000, 28, 28, 1)
# 60000 train samples
# 10000 test samples
# Test loss: 0.045611984347738325
# Test accuracy: 0.9858

# ...include code from https://github.com/keras-team/keras/blob/master/examples/mnist_cnn.py

import shap
import numpy as np

# select a set of background examples to take an expectation over
background = x_train[np.random.choice(x_train.shape[0], 100, replace=False)]

# explain predictions of the model on three images
e = shap.DeepExplainer(model, background)
# ...or pass tensors directly
# e = shap.DeepExplainer((model.layers[0].input, model.layers[-1].output), background)
shap_values = e.shap_values(x_test[1:5])

# plot the feature attributions
shap.image_plot(shap_values, -x_test[1:5])

[Plot: shap.image_plot output for the four test images]

The plot above shows the explanations for each class on four predictions. Note that the explanations are ordered for the classes 0-9 going left to right along the rows.

You can find more examples here:

Take a look. You’ll be surprised 🙂

4. Optimus — 🚚 Agile Data Science Workflows made easy with Python and Spark.

Ok, so full disclosure, this library is like my baby. I’ve been working on it for a long time now, and I’m very happy to show you version 2.

Optimus V2 was created to make data cleaning a breeze. The API was designed to be super easy for newcomers and very familiar for people that come from working with pandas. Optimus expands the Spark DataFrame functionality, adding .rows and .cols attributes.

With Optimus you can clean your data, prepare it, analyze it, create profilers and plots, and perform machine learning and deep learning, all in a distributed fashion, because on the back-end we have Spark, TensorFlow, and Keras.

It’s super easy to use. It’s like the evolution of pandas, with a piece of dplyr, joined by Keras and Spark. The code you create with Optimus will work on your local machine, and with a simple change of masters, it can run on your local cluster or in the cloud.

You will see a lot of interesting functions created to help with every step of the data science cycle.

Optimus is perfect as a companion for an agile methodology for data science because it can help you in almost all the steps of the process, and it can easily connect to other libraries and tools.

If you want to read more about an Agile DS Methodology check this out:

Installation (pip):

Usage:

As one example, you can load data from a url, transform it, and apply some predefined cleaning functions:

You can transform this:

+---+--------------------+--------------------+---------+----------+-----+----------+--------+
| id|           firstName|            lastName|billingId|   product|price|     birth|dummyCol|
+---+--------------------+--------------------+---------+----------+-----+----------+--------+
|  1|                Luis|         Alvarez$$%!|      123|      Cake|   10|1980/07/07|   never|
+---+--------------------+--------------------+---------+----------+-----+----------+--------+
|  2|               André|              Ampère|      423|      piza|    8|1950/07/08|   gonna|
+---+--------------------+--------------------+---------+----------+-----+----------+--------+
|  3|               NiELS|          Böhr//((%%|      551|     pizza|    8|1990/07/09|    give|
+---+--------------------+--------------------+---------+----------+-----+----------+--------+
|  4|                PAUL|              dirac$|      521|     pizza|    8|1954/07/10|     you|
+---+--------------------+--------------------+---------+----------+-----+----------+--------+
|  5|              Albert|            Einstein|      634|     pizza|    8|1990/07/11|      up|
+---+--------------------+--------------------+---------+----------+-----+----------+--------+
|  6|             Galileo|             GALiLEI|      672|     arepa|    5|1930/08/12|   never|
+---+--------------------+--------------------+---------+----------+-----+----------+--------+
|  7|                CaRL|            Ga%%%uss|      323|      taco|    3|1970/07/13|   gonna|
+---+--------------------+--------------------+---------+----------+-----+----------+--------+
|  8|               David|          H$$$ilbert|      624|  taaaccoo|    3|1950/07/14|     let|
+---+--------------------+--------------------+---------+----------+-----+----------+--------+
|  9|            Johannes|              KEPLER|      735|      taco|    3|1920/04/22|     you|
+---+--------------------+--------------------+---------+----------+-----+----------+--------+
| 10|               JaMES|         M$$ax%%well|      875|      taco|    3|1923/03/12|    down|
+---+--------------------+--------------------+---------+----------+-----+----------+--------+
| 11|               Isaac|              Newton|      992|     pasta|    9|1999/02/15|  never |
+---+--------------------+--------------------+---------+----------+-----+----------+--------+
| 12|              Emmy%%|            Nöether$|      234|     pasta|    9|1993/12/08|   gonna|
+---+--------------------+--------------------+---------+----------+-----+----------+--------+
| 13|              Max!!!|           Planck!!!|      111|hamburguer|    4|1994/01/04|    run |
+---+--------------------+--------------------+---------+----------+-----+----------+--------+
| 14|                Fred|            Hoy&&&le|      553|    pizzza|    8|1997/06/27|  around|
+---+--------------------+--------------------+---------+----------+-----+----------+--------+
| 15|(((   Heinrich )))))|               Hertz|      116|     pizza|    8|1956/11/30|     and|
+---+--------------------+--------------------+---------+----------+-----+----------+--------+
| 16|             William|          Gilbert###|      886|      BEER|    2|1958/03/26|  desert|
+---+--------------------+--------------------+---------+----------+-----+----------+--------+
| 17|               Marie|               CURIE|      912|      Rice|    1|2000/03/22|     you|
+---+--------------------+--------------------+---------+----------+-----+----------+--------+
| 18|              Arthur|          COM%%%pton|      812|    110790|    5|1899/01/01|       #|
+---+--------------------+--------------------+---------+----------+-----+----------+--------+
| 19|               JAMES|            Chadwick|      467|      null|   10|1921/05/03|       #|
+---+--------------------+--------------------+---------+----------+-----+----------+--------+

into this:

+---+---------+--------+---------+-----------------+-----+----------+----------+-------------+
| id|firstname|lastname|billingid|          product|price|     birth|  new_date|years_between|
+---+---------+--------+---------+-----------------+-----+----------+----------+-------------+
| 10|    james| maxwell|      875|             taco|    3|1923/03/12|12-03-1923|      95.4355|
+---+---------+--------+---------+-----------------+-----+----------+----------+-------------+
| 11|    isaac|  newton|      992|            pasta|    9|1999/02/15|15-02-1999|      19.5108|
+---+---------+--------+---------+-----------------+-----+----------+----------+-------------+
| 12|     emmy| noether|      234|            pasta|    9|1993/12/08|08-12-1993|      24.6962|
+---+---------+--------+---------+-----------------+-----+----------+----------+-------------+
| 13|      max|  planck|      111|       hamburguer|    4|1994/01/04|04-01-1994|      24.6237|
+---+---------+--------+---------+-----------------+-----+----------+----------+-------------+
| 14|     fred|   hoyle|      553|            pizza|    8|1997/06/27|27-06-1997|      21.1452|
+---+---------+--------+---------+-----------------+-----+----------+----------+-------------+
| 15| heinrich|   hertz|      116|            pizza|    8|1956/11/30|30-11-1956|      61.7204|
+---+---------+--------+---------+-----------------+-----+----------+----------+-------------+
| 16|  william| gilbert|      886|             BEER|    2|1958/03/26|26-03-1958|      60.3978|
+---+---------+--------+---------+-----------------+-----+----------+----------+-------------+
| 17|    marie|   curie|      912|             Rice|    1|2000/03/22|22-03-2000|      18.4086|
+---+---------+--------+---------+-----------------+-----+----------+----------+-------------+
| 18|   arthur| compton|      812|this was a number|    5|1899/01/01|01-01-1899|     119.6317|
+---+---------+--------+---------+-----------------+-----+----------+----------+-------------+
| 19|    james|chadwick|      467|             null|   10|1921/05/03|03-05-1921|       97.293|
+---+---------+--------+---------+-----------------+-----+----------+----------+-------------+
|  7|     carl|   gauss|      323|             taco|    3|1970/07/13|13-07-1970|      48.0995|
+---+---------+--------+---------+-----------------+-----+----------+----------+-------------+
|  8|    david| hilbert|      624|             taco|    3|1950/07/14|14-07-1950|      68.0968|
+---+---------+--------+---------+-----------------+-----+----------+----------+-------------+
|  9| johannes|  kepler|      735|             taco|    3|1920/04/22|22-04-1920|      98.3253|
+---+---------+--------+---------+-----------------+-----+----------+----------+-------------+

Pretty cool, right?
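To give a feel for the kind of cleaning behind that before/after, here is a minimal sketch in plain Python. It is not Optimus code and runs on ordinary Python dicts rather than a Spark DataFrame; the helper names `clean_name` and `reformat_date` are my own, and the three sample rows are copied from the first table above.

```python
import re
import unicodedata
from datetime import datetime


def clean_name(text):
    """Lowercase, strip accents, and drop everything that is not a letter."""
    decomposed = unicodedata.normalize("NFKD", text)
    without_accents = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return re.sub(r"[^a-z]", "", without_accents.lower())


def reformat_date(text):
    """Turn 'YYYY/MM/DD' into 'DD-MM-YYYY'."""
    return datetime.strptime(text, "%Y/%m/%d").strftime("%d-%m-%Y")


rows = [
    {"firstName": "NiELS", "lastName": "Böhr//((%%", "birth": "1990/07/09"},
    {"firstName": "(((   Heinrich )))))", "lastName": "Hertz", "birth": "1956/11/30"},
    {"firstName": "JaMES", "lastName": "M$$ax%%well", "birth": "1923/03/12"},
]

cleaned = [
    {
        "firstname": clean_name(row["firstName"]),
        "lastname": clean_name(row["lastName"]),
        "new_date": reformat_date(row["birth"]),
    }
    for row in rows
]

for row in cleaned:
    print(row)
```

The point of a library like Optimus is that you get this sort of transformation as ready-made column operations that scale out on Spark, instead of hand-written helpers like these.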

You can do a thousand more things with the library, so please check it out:

3. spacy — Industrial-strength Natural Language Processing (NLP) with Python and Cython

From the creators:

spaCy is the best way to prepare text for deep learning. It interoperates seamlessly with TensorFlow, PyTorch, Scikit-learn, Gensim, and the rest of Python’s awesome AI ecosystem. With spaCy, you can easily construct linguistically sophisticated statistical models for a variety of NLP problems.

Installation:

Here, we’re also downloading the English language model. You can find models for German, Spanish, Italian, Portuguese, French, and more here:

Here’s an example from the main webpage:

In this example, we first download the English tokenizer, tagger, parser, NER, and word vectors. Then we create some text, and finally we print the entities, phrases, and concepts found, and then we determine the semantic similarity of the two phrases. If you run this code you get this:

Very simple and super useful. There is also a spaCy Universe, where you can find great resources developed with or for spaCy. It includes standalone packages, plugins, extensions, educational materials, operational utilities, and bindings for other languages:

By the way, the usage page is great, with very good explanations and code:

Take a look at the visualizers page. Awesome features, here:

2. jupytext — Jupyter notebooks as Markdown Documents, Julia, Python or R scripts

For me, this is one of the packages of the year. It’s such an important part of what we do as data scientists. Almost all of us work in notebooks like Jupyter, but we also use IDEs like PyCharm for more hardcore parts of our projects.

The good news is that plain scripts, which you can draft and test in your favorite IDE, open transparently as notebooks in Jupyter when using Jupytext. Run the notebook in Jupyter to generate the outputs, associate an .ipynb representation, and save and share your research as either a plain script or as a traditional Jupyter notebook with outputs.
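To illustrate how a plain script can map onto notebook cells, here is a toy parser for scripts that mark cells with `# %%` lines, one of the script layouts Jupytext can read. This is only a sketch of the idea, not Jupytext's implementation; the function name `split_percent_script` and the sample script are assumptions for illustration.

```python
def split_percent_script(source):
    """Split a script into cells at lines that start with '# %%'."""
    cells = []
    current = None
    for line in source.splitlines():
        if line.startswith("# %%"):
            # A marker line opens a new cell; '[markdown]' flags a text cell.
            kind = "markdown" if "[markdown]" in line else "code"
            current = {"cell_type": kind, "source": []}
            cells.append(current)
            continue
        if current is None:
            # Lines before the first marker go into an implicit code cell.
            current = {"cell_type": "code", "source": []}
            cells.append(current)
        current["source"].append(line)
    # Drop trailing blank lines so cells end on real content.
    for cell in cells:
        while cell["source"] and not cell["source"][-1].strip():
            cell["source"].pop()
    return cells


script = """# %% [markdown]
# # Iris summary
# A short note before the code.

# %%
values = [5.1, 4.9, 4.7]
print(sum(values) / len(values))
"""

cells = split_percent_script(script)
for cell in cells:
    print(cell["cell_type"], cell["source"])
```

Since the script stays a valid Python file, you can edit it in an IDE and keep it under version control as ordinary text.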

You can see a workflow of what you can do with the package in the gif below:

Installation

Install Jupytext with:

Then, configure Jupyter to use Jupytext:

  • generate a Jupyter config, if you don’t have one yet, with jupyter notebook --generate-config
  • edit .jupyter/jupyter_notebook_config.py and append the following:
  • and restart Jupyter, i.e. run:

You can give it a try here:

1. Chartify — Python library that makes it easy for data scientists to create charts.

This, for me, is the winner of the year, for Python. If you are in the Python world, most likely you waste a lot of your time trying to create a decent plot. Luckily, we have libraries like Seaborn that make our life easier. But the issue is that their plots are not dynamic.

Then you have Bokeh—an amazing library—but creating interactive plots with it can be a pain in the a**. If you want to know more about Bokeh and interactive plots for Data Science, take a look at these great articles by William Koehrsen:

Chartify is built on top of Bokeh. But it’s also so much simpler.

From the authors:

Why use Chartify?

  • Consistent input data format: Spend less time transforming data to get your charts to work. All plotting functions use a consistent tidy input data format.
  • Smart default styles: Create pretty charts with very little customization required.
  • Simple API: We’ve attempted to make the API as intuitive and easy to learn as possible.
  • Flexibility: Chartify is built on top of Bokeh, so if you do need more control you can always fall back on Bokeh’s API.
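As a rough illustration of what a tidy input looks like (one row per observation), here is a small plain-Python sketch that reshapes a wide table into that form. It does not use Chartify or pandas; the `wide_to_tidy` helper, the column names, and the numbers are made up for this example.

```python
def wide_to_tidy(rows, id_column, variable_name="variable", value_name="value"):
    """Turn one row per id into one row per (id, variable) pair."""
    tidy = []
    for row in rows:
        for key, val in row.items():
            if key == id_column:
                continue
            tidy.append({id_column: row[id_column],
                         variable_name: key,
                         value_name: val})
    return tidy


# Hypothetical wide data: one column per fruit.
wide = [
    {"country": "US", "apples": 10, "bananas": 4},
    {"country": "CA", "apples": 7, "bananas": 9},
]

tidy = wide_to_tidy(wide, id_column="country",
                    variable_name="fruit", value_name="quantity")
for row in tidy:
    print(row)
```

Once the data is in this shape, a chart only needs to be told which column holds the categories and which holds the values.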

Installation

  1. Chartify can be installed via pip:

  2. Install chromedriver requirement (Optional. Needed for PNG output):

  • Install Google Chrome.
  • Download the appropriate version of chromedriver for your OS here.
  • Copy the executable file to a directory within your PATH.
  • View directories in your PATH variable: echo $PATH
  • Copy chromedriver to the appropriate directory, e.g.: cp chromedriver /usr/local/bin

Usage

Let’s say we want to create this chart:

Now that we have some example data loaded let’s do some transformations:

And now we can plot it:

Super easy to create a plot, and it’s interactive. If you want more examples to create stuff like this:

And more, check the original repo:

Top 7 for R

7. infer — An R package for tidyverse-friendly statistical inference

Inference, or statistical inference, is the process of using data analysis to deduce properties of an underlying probability distribution.

The objective of this package is to perform statistical inference using an expressive statistical grammar that coheres with the tidyverse design framework.

Installation

To install the current stable version of infer from CRAN:

Usage

Let’s try a simple example on the mtcars dataset to see what the library can do for us.

First, let’s overwrite mtcars so that the variables cyl, vs, am, gear, and carb are factors.

We’ll try hypothesis testing. Here, a hypothesis is proposed so that it’s testable on the basis of observing a process that’s modeled via a set of random variables. Normally, two statistical data sets are compared, or a data set obtained by sampling is compared against a synthetic data set from an idealized model.

Here, we first specify the response and explanatory variables, then we declare a null hypothesis. After that, we generate resamples using bootstrap and finally calculate the median. The result of that code is:
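To make the resample-then-summarize step concrete, here is a minimal Python sketch of a bootstrap distribution of the median. It is not infer's API, and it uses Python rather than R: the function names, the small `mpg` sample, the number of replicates, and the seed are all assumptions for illustration.

```python
import random
import statistics


def bootstrap_medians(sample, reps=1000, seed=42):
    """Resample with replacement `reps` times and record each median."""
    rng = random.Random(seed)
    n = len(sample)
    return [statistics.median(rng.choices(sample, k=n)) for _ in range(reps)]


def percentile_interval(values, level=0.95):
    """Rough percentile interval taken from the sorted bootstrap statistics."""
    ordered = sorted(values)
    tail = (1 - level) / 2
    lower = int(tail * (len(ordered) - 1))
    upper = int((1 - tail) * (len(ordered) - 1))
    return ordered[lower], ordered[upper]


# Hypothetical fuel-economy sample (miles per gallon).
mpg = [21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2]

medians = bootstrap_medians(mpg)
low, high = percentile_interval(medians)
print("observed median:", statistics.median(mpg))
print("bootstrap interval:", low, high)
```

The list of resampled medians plays the role of the simulated distribution: you can summarize it with an interval as here, or plot it as a histogram.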

One of the greatest parts of this library is the visualize function. This will allow you to visualize the distribution of the simulation-based inferential statistics or the theoretical distribution (or both). For an example, let’s use the flights data set. First, let’s do some data preparation:

And now we can run a randomization approach to χ2-statistic:

or see the theoretical distribution:

For more on this package visit:

6. janitor — simple tools for data cleaning in R

Data cleansing is a topic very close to me. I’ve been working with my team at Iron-AI to create a tool for Python called Optimus. You can see more about it here:

But this tool I’m showing you is a very cool package with simple functions for data cleaning.

It has three main functions:

  • perfectly format data.frame column names;
  • create and format frequency tables of one, two, or three variables (think an improved table()); and
  • isolate partially-duplicate records.
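For readers coming from Python, here is a rough sketch of the first and third ideas in that list: standardizing column names and isolating records that repeat on some keys. This is not janitor's implementation and the exact naming rules differ; the helper functions, the numeric-suffix scheme, and the sample rows are my own assumptions.

```python
import re
from collections import Counter


def clean_name(name):
    """Lowercase a column name and join its alphanumeric parts with underscores."""
    return "_".join(re.findall(r"[a-z0-9]+", name.lower()))


def clean_names(columns):
    """Clean every name and add a numeric suffix to repeats so names stay unique."""
    seen = Counter()
    cleaned = []
    for column in columns:
        base = clean_name(column)
        seen[base] += 1
        cleaned.append(base if seen[base] == 1 else f"{base}_{seen[base]}")
    return cleaned


def duplicated_rows(rows, keys):
    """Return the rows whose values for `keys` appear more than once."""
    counts = Counter(tuple(row[k] for k in keys) for row in rows)
    return [row for row in rows if counts[tuple(row[k] for k in keys)] > 1]


columns = ["First Name", "Last Name", "Hire Date", "Certification", "Certification"]
print(clean_names(columns))

# Hypothetical records: two share the same first and last name.
people = [
    {"first_name": "Ada", "last_name": "Lovelace", "subject": "Math"},
    {"first_name": "Alan", "last_name": "Turing", "subject": "Logic"},
    {"first_name": "Ada", "last_name": "Lovelace", "subject": "Computing"},
]
dupes = duplicated_rows(people, ["first_name", "last_name"])
print(dupes)
```

Note that the two flagged records differ in `subject`, which is what "partially duplicate" means here: they match on the chosen keys but not everywhere.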

Oh, and it’s a tidyverse-oriented package. Specifically, it works nicely with the %>% pipe and is optimized for cleaning data brought in with the readr and readxl packages.

Installation

Usage

I’m using the example from the repo, and the data dirty_data.xlsx.

With this:

With the clean_names() function, we standardize the column names. Then we remove the empty rows and columns, and then, using dplyr, we change the format of the dates, create a new column that combines the information in certification and certification_1, and then drop those two columns.

And with this piece of code…

we can find duplicated records that have the same name and last name.

The package also introduces the tabyl function that tabulates the data, like table but pipe-able, data.frame-based, and fully featured. For example:

You can do a lot more things with the package, so visit their site and give them some love 🙂

5. Esquisse — RStudio add-in to make plots with ggplot2

This add-in allows you to interactively explore your data by visualizing it with the ggplot2 package. It allows you to draw bar graphs, curves, scatter plots, and histograms, and then export the graph or retrieve the code generating the graph.

Installation

Install from CRAN with:

Usage

Then launch the add-in via the RStudio menu. If you don’t have a data.frame in your environment, datasets in ggplot2 are used.

ggplot2 builder addin

Launch the add-in via the RStudio menu or with:

The first step is to choose a data.frame:

Or you can use a dataset directly with:

After that, you can drag and drop variables to create a plot:

You can find information about the package and sub-menus in the original repo:

4. DataExplorer — Automate data exploration and treatment

Exploratory Data Analysis (EDA) is an initial and important phase of data analysis/predictive modeling. During this process, analysts/modelers take a first look at the data, and thus generate relevant hypotheses and decide on next steps. However, the EDA process can be a hassle at times. This R package aims to automate most of the data handling and visualization, so that users can focus on studying the data and extracting insights.

Installation

The package can be installed directly from CRAN.

Usage

With the package you can create reports, plots, and tables like this:

You can find much more like this on the package’s official webpage:

And in this vignette:

3. Sparklyr — R interface for Apache Spark

Sparklyr will allow you to:

  • Connect to Spark from R. The sparklyr package provides a
    complete dplyr backend.
  • Filter and aggregate Spark datasets, and then bring them into R for
    analysis and visualization.
  • Use Spark’s distributed machine learning library from R.
  • Create extensions that call the full Spark API and provide
    interfaces to Spark packages.

Installation

You can install the Sparklyr package from CRAN as follows:

You should also install a local version of Spark for development purposes:

Usage

The first part of using Spark is always creating a context and connecting to a local or remote cluster.

Here we’ll connect to a local instance of Spark via the spark_connect function:

Using sparklyr with dplyr and ggplot2

We’ll start by copying some datasets from R into the Spark cluster (note that you may need to install the nycflights13 and Lahman packages in order to execute this code):

To start with, here’s a simple filtering example:

Let’s plot the data on flight delays:

Machine Learning with Sparklyr

You can orchestrate machine learning algorithms in a Spark cluster via the machine learning functions within Sparklyr. These functions connect to a set of high-level APIs built on top of DataFrames that help you create and tune machine learning workflows.

Here’s an example where we use ml_linear_regression to fit a linear regression model. We’ll use the built-in mtcars dataset to see if we can predict a car’s fuel consumption (mpg) based on its weight (wt), and the number of cylinders the engine contains (cyl). We’ll assume in each case that the relationship between mpg and each of our features is linear.
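To show what fitting that kind of model involves, here is a small plain-Python sketch of ordinary least squares with two predictors. It does not use Spark or sparklyr, and the data is synthetic: the (wt, cyl) rows are invented and mpg is generated from a known linear rule, so the fit should recover the coefficients used to generate it. All function names are my own.

```python
def solve_linear_system(a, b):
    """Solve a x = b with Gaussian elimination and partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        known = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - known) / m[r][r]
    return x


def fit_linear_regression(features, target):
    """Ordinary least squares with an intercept, via the normal equations."""
    design = [[1.0] + list(row) for row in features]
    k = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(design, target)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Synthetic (wt, cyl) rows; mpg follows an invented rule: 37 - 3*wt - 1.5*cyl.
features = [(2.6, 6), (2.3, 4), (3.2, 6), (3.4, 8), (3.6, 8), (1.9, 4), (2.8, 6)]
target = [37.0 - 3.0 * wt - 1.5 * cyl for wt, cyl in features]

intercept, wt_coef, cyl_coef = fit_linear_regression(features, target)
print(round(intercept, 3), round(wt_coef, 3), round(cyl_coef, 3))
```

On real data the fit will not be exact, which is why the summary statistics of the fitted model matter.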

For linear regression models produced by Spark, we can use summary() to learn a bit more about the quality of our fit and the statistical significance of each of our predictors.

Spark machine learning supports a wide array of algorithms and feature transformations, and as illustrated above, it’s easy to chain these functions together with dplyr pipelines.

Check out more about machine learning with sparklyr here:

And more information in general about the package and examples here:

2. Drake — An R-focused pipeline toolkit for reproducibility and high-performance computing

Nope, just kidding. But the name of the package is drake!

This is such an amazing package. I’ll create a separate post with more details about it, so wait for that!

Drake is a package created as a general-purpose workflow manager for data-driven tasks. It rebuilds intermediate data objects when their dependencies change, and it skips work when the results are already up to date.

Also, not every run-through starts from scratch, and completed workflows have tangible evidence of reproducibility.

Reproducibility, good management, and tracking experiments are all necessary for easily testing others’ work and analysis. It’s a huge deal in Data Science, and you can read more about it here:

From Zach Scott:

And in an article by me 🙂

With drake, you can automatically:

  1. Launch the parts that changed since last time.
  2. Skip the rest.
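A small sketch of that behavior, using a hypothetical two-target plan (the targets `numbers` and `total` are made up for illustration). Note that `make()` writes a `.drake` cache into the working directory:

```r
library(drake)

# Hypothetical plan: `total` depends on `numbers`.
plan <- drake_plan(
  numbers = seq_len(10),
  total = sum(numbers)
)

make(plan)  # first run: builds both targets
make(plan)  # second run: nothing changed, so there is nothing to rebuild

# Ask drake which targets still need work (older API style, matching the
# drake_config() usage in this post; newer versions also accept the plan).
config <- drake_config(plan)
outdated(config)

# Read a finished target back from the cache.
readd(total)
```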

Installation

There are some known errors when installing from CRAN. For more on these errors, visit:

I ran into one of these errors myself, so for now I recommend installing the package from GitHub.

Ok, so let’s reproduce a simple example with a twist:

#library(devtools)
#install_github("ropensci/drake")
library(dplyr)
library(ggplot2)
library(drake)


# Download the necessary data
drake_example("main")

# Check that the data and report exist
file.exists("main/raw_data.xlsx")
file.exists("main/report.Rmd")

# Create a custom plot function
create_plot <- function(data) {
  ggplot(data, aes(x = Petal.Width, fill = Species)) +
    geom_histogram(binwidth = 0.25) +
    theme_gray(20)
}

plot_lm <- function(data) {
  ggplot(data = data, aes(x = Petal.Width, y = Sepal.Width)) + 
    geom_point(color='red') +
    stat_smooth(method = "lm", col = "red")
}

# Create the plan
plan <- drake_plan(
  raw_data = readxl::read_excel(file_in("main/raw_data.xlsx")),
  data = raw_data %>%
    mutate(Species = forcats::fct_inorder(Species)) %>%
    select(-X__1),
  hist = create_plot(data),
  cor = cor(data$Petal.Width,data$Sepal.Width),
  fit = lm(Sepal.Width ~ Petal.Width + Species, data),
  plot = plot_lm(data),
  report = rmarkdown::render(
    knitr_in("main/report.Rmd"),
    output_file = file_out("main/report.html"),
    quiet = TRUE
  )
)

plan

# Execute the plan
make(plan)

# Interactive graph: hover, zoom, drag, etc.
config <- drake_config(plan)
vis_drake_graph(config)

I added a simple plot to see the linear model within drake’s main example. With this code, you’re creating a plan for executing your whole project.

First, we read the data. Then we prepare it for analysis, create a simple histogram, calculate the correlation, fit the model, plot the linear model, and finally render an R Markdown report.

The code I used for the final report is here:

---
title: "Example R Markdown drake file target"
author: Will Landau, Kirill Müller and Favio ;)
output: html_document
---

Run `make.R` to generate the output `report.html` and its dependencies. Because we use `loadd()` and `readd()` below, `drake` knows `report.html` depends on targets `fit`, `cor`, `hist`, and `plot`.

```{r content}
library(drake)
loadd(fit)
print(fit)
loadd(cor)
print(cor)
readd(hist)
readd(plot)
```

More:

- Walkthrough: [this chapter of the user manual](https://ropenscilabs.github.io/drake-manual/intro.html)
- Slides: [https://krlmlr.github.io/drake-pitch](https://krlmlr.github.io/drake-pitch)
- Code: `drake_example("main")`

If we change some of our functions or analysis, when we execute the plan, drake will know what has changed and will only run those changes. It creates a graph so you can see what’s happening:

In RStudio, this graph is interactive, and you can save it to HTML for later analysis.
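Here is a rough sketch of that change-detection cycle, with a hypothetical helper `scale_values()` and a toy plan (neither is part of the example above). It assumes the visNetwork package is installed for the graph, and the output file name is my choice:

```r
library(drake)

# Hypothetical helper and plan, for illustration only.
scale_values <- function(x) x * 2

plan <- drake_plan(
  raw = c(1, 2, 3),
  scaled = scale_values(raw)
)
make(plan)

# Redefine the helper: targets that depend on it become outdated.
scale_values <- function(x) x * 10
config <- drake_config(plan)
outdated(config)

# Rebuild: only the targets affected by the change need to run again.
make(plan)
readd(scaled)

# Save the interactive dependency graph to an HTML file.
vis_drake_graph(config, file = "drake_graph.html")
```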

There are more awesome things that you can do with drake that I’ll show in a future post 🙂

1. DALEX — Descriptive mAchine Learning EXplanations

Explaining machine learning models isn’t always easy. Yet it’s so important for a range of business applications. Luckily, there are some great libraries that help us with this task. For example:

(By the way, sometimes a simple visualization with ggplot can help you explain a model. For more on this check the awesome article below by Matthew Mayo)

In many applications, we need to know, understand, or prove how input variables are used in the model, and how they impact final model predictions. DALEX is a set of tools that helps explain how complex models are working.

To install from CRAN, just run:
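The install call, followed by a small usage sketch. The model and the label are my own illustration (a plain `lm` on mtcars), not an example from the DALEX docs:

```r
# install.packages("DALEX")  # run once to install from CRAN

library(DALEX)

# Fit any model; a simple linear regression on mtcars is used here.
model <- lm(mpg ~ wt + cyl, data = mtcars)

# Wrap the model in an explainer: the model, the predictors, and the true response.
explainer <- DALEX::explain(
  model,
  data  = mtcars[, c("wt", "cyl")],
  y     = mtcars$mpg,
  label = "lm (mpg ~ wt + cyl)"
)

# Summarise how well the wrapped model performs.
model_performance(explainer)
```

The explainer object is the common entry point: the other DALEX functions take it as input, so the same calls work regardless of which ML package produced the model.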

They have amazing documentation on how to use DALEX with different ML packages:

Great cheat sheets:

Here’s an interactive notebook where you can learn more about the package:

And finally, some book-style documentation on DALEX, machine learning, and explainability:

Check it out in the original repository:

and remember to star it 🙂

Thanks to the amazing team at Ciencia y Datos for helping with these digests.

Thanks also for reading this. I hope you found something interesting here :). If these articles are helping you, please share them with your friends!

If you have questions, just follow me on Twitter:

and LinkedIn:

See you there 🙂

Editor’s Note: Heartbeat is a contributor-driven online publication and community dedicated to providing premier educational resources for data science, machine learning, and deep learning practitioners. We’re committed to supporting and inspiring developers and engineers from all walks of life.

Editorially independent, Heartbeat is sponsored and published by Comet, an MLOps platform that enables data scientists & ML teams to track, compare, explain, & optimize their experiments. We pay our contributors, and we don’t sell ads.

If you’d like to contribute, head on over to our call for contributors. You can also sign up to receive our weekly newsletters (Deep Learning Weekly and the Comet Newsletter), join us on Slack, and follow Comet on Twitter and LinkedIn for resources, events, and much more that will help you build better ML models, faster.
