Create your Machine Learning library from scratch with R! (3/5) – KNN

This is the third post of the "Create your Machine Learning library from scratch with R!" series. Today, we will see how you can implement K-nearest neighbors (KNN) using only the linear algebra available in R. Previously, we managed to implement PCA, and next time we will deal with SVM and decision trees. K-nearest neighbors (KNN) is a simple yet efficient classification and regression algorithm. KNN assumes that an observation will be similar to its K closest neighbors: for instance, if most of the neighbors of a given point belong to a given class, it seems reasonable to assume that the point belongs to that class as well. The mathematics of KNN: now, let's quickly derive the mathematics used for KNN regression (they are similar for classification). Let be the observations of our training dataset. The…
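The neighbor-averaging rule described above can be sketched in a few lines of R. This is a minimal illustration written for this summary, not the post's actual code: the function name and interface are assumptions.

```r
# KNN regression sketch: predict the mean response of the k nearest
# training observations (Euclidean distance), one query point at a time.
knn_regression <- function(X_train, y_train, X_test, k = 3) {
  apply(X_test, 1, function(x) {
    # distance from the query point x to every training observation
    d <- sqrt(rowSums(sweep(X_train, 2, x)^2))
    # average the responses of the k closest neighbors
    mean(y_train[order(d)[1:k]])
  })
}

# Toy check: two well-separated groups of points
X <- matrix(c(0, 0.1, 0.2, 10, 10.1, 10.2), ncol = 1)
y <- c(1, 1, 1, 5, 5, 5)
knn_regression(X, y, matrix(c(0.05, 10.05), ncol = 1), k = 3)
# -> 1 and 5
```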
Original Post: Create your Machine Learning library from scratch with R! (3/5) – KNN

Create your Machine Learning library from scratch with R! (2/5) – PCA

This is the second post of the "Create your Machine Learning library from scratch with R!" series. Today, we will see how you can implement Principal components analysis (PCA) using only the linear algebra available in R. Previously, we managed to implement linear regression and logistic regression from scratch, and next time we will deal with K-nearest neighbors (KNN). Principal components analysis: PCA is a dimensionality reduction method which seeks the vectors that explain most of the variance in the dataset. From a mathematical standpoint, PCA is just a change of coordinates that represents the points in a more appropriate basis; picking a few of these coordinates is enough to explain an important part of the variance in the dataset. The mathematics of PCA: let be the observations of our dataset, the points are in . We assume…
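One standard way to realize this change of basis is an eigendecomposition of the covariance matrix. A minimal sketch in R, written for this summary (the function name and return values are assumptions, not the post's code):

```r
# PCA sketch: the principal axes are the eigenvectors of the covariance
# matrix of the centered data, ordered by decreasing eigenvalue (variance).
pca <- function(X, n_components = 2) {
  Xc <- scale(X, center = TRUE, scale = FALSE)   # center each column
  e  <- eigen(cov(Xc))                           # eigenvectors = principal axes
  comps <- e$vectors[, 1:n_components, drop = FALSE]
  list(components = comps,                       # the new basis vectors
       variance   = e$values[1:n_components],    # variance along each axis
       scores     = Xc %*% comps)                # coordinates in the new basis
}

# Project the iris measurements onto their first two principal components
res <- pca(as.matrix(iris[, 1:4]), n_components = 2)
res$variance  # variance explained by the first two components
```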
Original Post: Create your Machine Learning library from scratch with R! (2/5) – PCA

Machine Learning Explained: Vectorization and matrix operations

Today in Machine Learning Explained, we will tackle a central (yet often overlooked) aspect of Machine Learning: vectorization. Let's say you want to compute the sum of the values of an array. The naive way to do so is to loop over the elements and sum them sequentially. This naive way is slow and tends to get even slower with large amounts of data and large data structures. With vectorization, these operations can be seen as matrix operations, which are often more efficient than standard loops. Vectorized versions of an algorithm are several orders of magnitude faster and are easier to understand from a mathematical perspective. A basic example of vectorization. Preliminary example – Python: let's compare the naive way and the vectorized way of computing the sum of the elements of an array. To do so, we will create a large…
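The post's preliminary example is in Python; the same loop-versus-vectorized comparison can be sketched in R (this snippet is my own illustration, not the post's code):

```r
# Naive loop vs vectorized sum of a large array
x <- runif(1e6)

naive_sum <- function(v) {
  total <- 0
  for (e in v) total <- total + e   # one interpreted addition per element
  total
}

# Both give the same result, but sum() dispatches to optimized C code
# and is typically orders of magnitude faster than the explicit loop.
system.time(naive_sum(x))
system.time(sum(x))
```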
Original Post: Machine Learning Explained: Vectorization and matrix operations

Create your Machine Learning library from scratch with R! (1/3)

When dealing with Machine Learning problems in R, most of the time you rely on existing libraries. This speeds up the analysis, but do you really understand what is behind the algorithms? Could you implement a logistic regression from scratch with R? The goal of this post is to create our own basic machine learning library from scratch with R. We will only use the linear algebra tools available in R. There will be three posts: linear and logistic regression (this one); PCA and k-nearest neighbors classifiers and regressors; tree-based methods and SVM. Linear regression (least squares): the goal of linear regression is to estimate a continuous variable given a matrix of observations . Before dealing with the code, we need to derive the solution of the linear regression. Solution derivation of linear regression: given a matrix of observations and the…
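The standard closed-form least-squares solution comes from solving the normal equations. A minimal R sketch of that idea, written for this summary (the function name and toy data are assumptions, not the post's code):

```r
# Closed-form least squares: solve the normal equations (X'X) b = X'y.
# Using solve() with two arguments avoids forming an explicit inverse.
least_squares <- function(X, y) {
  X1 <- cbind(1, X)                  # prepend an intercept column
  solve(t(X1) %*% X1, t(X1) %*% y)
}

# Toy data generated from known coefficients (intercept 1, slopes 2 and -3)
set.seed(42)
X <- matrix(rnorm(200), ncol = 2)
y <- 1 + 2 * X[, 1] - 3 * X[, 2] + rnorm(100, sd = 0.01)
least_squares(X, y)   # estimates close to the true (1, 2, -3)
```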
Original Post: Create your Machine Learning library from scratch with R! (1/3)

Machine Learning Explained: Kmeans

Kmeans is one of the most popular and simplest algorithms to discover underlying structures in your data. The goal of kmeans is simple: split your data into k different groups, each represented by its mean. The mean of each group is assumed to be a good summary of every observation in that cluster. Kmeans algorithm: we assume that we want to split the data into k groups, so we need to find and assign k centers. How do we define and find these centers? They are the solution to the minimization of the sum, over all observations i and centers j, of a_ij times the squared distance between observation i and center j, where a_ij = 1 if the observation i is assigned to the center j and 0 otherwise. Basically, this equation means that we are looking for the k centers which minimize the distance between them and the points of their cluster. This is an optimization problem, but since the function we want to…
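The alternating scheme this objective suggests (assign each point, then update each center) can be sketched as follows. This is my own minimal Lloyd-style illustration, not the post's code, and the names are assumptions:

```r
# Minimal k-means sketch: alternate between assigning each point to its
# nearest center and moving each center to the mean of its cluster.
kmeans_sketch <- function(X, k, n_iter = 20) {
  centers <- X[sample(nrow(X), k), , drop = FALSE]   # random initial centers
  for (it in 1:n_iter) {
    # assignment step: distances from every point to every center
    d <- as.matrix(dist(rbind(centers, X)))[-(1:k), 1:k]
    cluster <- max.col(-d)                           # index of nearest center
    # update step: each center becomes the mean of its assigned points
    for (j in 1:k) {
      pts <- X[cluster == j, , drop = FALSE]
      if (nrow(pts) > 0) centers[j, ] <- colMeans(pts)
    }
  }
  list(centers = centers, cluster = cluster)
}
```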
Original Post: Machine Learning Explained: Kmeans

Explore your McDonalds Meal with Shiny and D3partitionR

Have you ever wondered what was in your McDonald's menu? Or in your Double Cheeseburger (well, it's my favorite one)? A wonderful dataset was released a few months ago: it contains all the nutrition facts for McDonald's items. You can find the dataset here. In addition to this, I released a new version of D3partitionR a few weeks ago and was looking for use cases. Hierarchical charts like the sunburst or the treemap are very useful to split and analyze the composition of categories and items. Hence, I decided to make a small Shiny application to analyze the composition and the nutrition value of a McDonald's menu. And here is the app! Application functionalities: the application has four main tabs: menu selection, calories explorer, nutrients explorer, and daily value explorer. Menu selection: the menu selection is used to … select the items you want to…
Original Post: Explore your McDonalds Meal with Shiny and D3partitionR

Major update of D3partitionR: Interactive viz’ of nested data with R and D3.js

D3partitionR is an R package to visualize nested and hierarchical data interactively using D3.js and HTML widgets. These last few weeks I've been working on a major D3partitionR update which is now available on GitHub. As soon as enough feedback is collected, the package will be uploaded to CRAN. Until then, you can install it using devtools: library(devtools); install_github("AntoineGuillot2/D3partitionR"). Here is a quick overview of the possibilities using the Titanic data. A major update: this update will break code written for version 0.3.1. New functionalities: additional data for nodes. Additional data can be added for some given nodes; for instance, if a comment or a link needs to be shown in the tooltip or label of some nodes, it can be added through the add_nodes_data function. You can easily add specific…
Original Post: Major update of D3partitionR: Interactive viz’ of nested data with R and D3.js