U.S. Residential Energy Use: Machine Learning on the RECS Dataset

Contributed by Thomas Kassel. He is currently enrolled in the NYC Data Science Academy remote bootcamp program taking place from January to May 2017. This post is based on his final capstone project, focusing on the use of machine learning techniques learned throughout the course. Introduction: The residential sector accounts for up to 40% of annual U.S. electricity consumption, representing a large opportunity for energy efficiency and conservation. A strong understanding of the main electricity end-uses in residences can allow homeowners to make more informed decisions to lower their energy bills, help utilities maximize efficiency/incentive programs, and allow governments or NGOs to better forecast energy demand and address climate concerns. The Residential Energy Consumption Survey, RECS, collects energy-related data on a nationally representative sample of U.S. homes. First conducted in 1978 and administered every 4-5 years by the Energy Information Administration, it is…
Original Post: U.S. Residential Energy Use: Machine Learning on the RECS Dataset

Complete Subset Regressions, simple and powerful

By Gabriel Vasconcelos. The complete subset regressions (CSR) is a forecasting method proposed by Elliott, Gargano and Timmermann in 2013. It is a very simple but powerful technique. Suppose you have a set of variables and you want to forecast one of them using information from the others. If your variables are highly correlated and the variable you want to predict is noisy, you will have collinearity problems and in-sample overfitting because the model will try to fit the noise. These problems may be solved if you estimate a smaller model using only a subset of the explanatory variables; however, if you do not know which variables are important, you may lose information. What if we estimate models from many different subsets and combine their forecasts? Even better, what if we estimate models for all possible combinations of variables? This…
Original Post: Complete Subset Regressions, simple and powerful
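
The idea in the excerpt above, fitting a model on every subset of the predictors and combining the forecasts, can be sketched in a few lines. This is only a minimal illustration, not the author's code: it assumes ordinary least squares with an intercept, a fixed subset size `k`, and an equal-weight average of the forecasts. The function name `csr_forecast` and the simulated data are hypothetical.

```python
from itertools import combinations

import numpy as np


def csr_forecast(X, y, x_new, k):
    """Average the OLS forecasts from every subset of k columns of X.

    Assumption: equal weights across subsets and an intercept in each model.
    """
    n, p = X.shape
    forecasts = []
    for cols in combinations(range(p), k):
        cols = list(cols)
        # Design matrix: intercept plus the k selected predictors.
        A = np.column_stack([np.ones(n), X[:, cols]])
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        forecasts.append(beta[0] + x_new[cols] @ beta[1:])
    return float(np.mean(forecasts))


# Simulated data (illustrative only): six predictors, two of them relevant.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=200)
x_new = rng.normal(size=6)

print(csr_forecast(X, y, x_new, k=2))  # averages 15 two-variable models
```

With `k` equal to the number of predictors there is only one subset, so the result reduces to the ordinary full-model forecast; smaller values of `k` average over more, simpler models.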

Euler Problem 23: Non-Abundant Sums

A demonstration of the abundance of the number 12 using Cuisenaire rods (Wikipedia). Euler problem 23 asks us to solve a problem with abundant or excessive numbers. These are numbers for which the sum of the proper divisors is greater than the number itself. 12 is an abundant number because the sum of its proper divisors (the aliquot sum) is larger than 12: (1 + 2 + 3 + 4 + 6 = 16). All highly composite numbers or anti-primes greater than six are abundant numbers. These are numbers that have so many divisors that they are considered the opposite of primes, as explained in the Numberphile video below. Euler Problem 23 Definition: A perfect number is a number for which the sum of its proper divisors is exactly equal to the number. For example, the sum of the proper divisors…
Original Post: Euler Problem 23: Non-Abundant Sums
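
The definitions in the excerpt above (aliquot sum, abundant number, perfect number) can be checked with a short sketch. The function names `aliquot_sum` and `classify` are hypothetical, and the label "deficient" for the remaining case is the standard term rather than something stated in the excerpt. The code covers only the classification step, not the full Euler problem.

```python
def aliquot_sum(n: int) -> int:
    """Sum of the proper divisors of n (all divisors except n itself)."""
    if n < 2:
        return 0
    total = 1  # 1 divides every n >= 2
    d = 2
    while d * d <= n:
        if n % d == 0:
            total += d
            if d != n // d:  # avoid counting a square root twice
                total += n // d
        d += 1
    return total


def classify(n: int) -> str:
    """Label n as 'abundant', 'perfect', or 'deficient'."""
    s = aliquot_sum(n)
    if s > n:
        return "abundant"
    if s == n:
        return "perfect"
    return "deficient"


print(aliquot_sum(12), classify(12))  # 1 + 2 + 3 + 4 + 6 = 16, so abundant
print([n for n in range(1, 31) if classify(n) == "abundant"])
```

Trial division up to the square root keeps the check cheap enough to classify every number in a range of a few tens of thousands.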

To eat or not to eat! That’s the question? Measuring the association between categorical variables

1. Introduction: I serve as a reviewer for several ISI- and Scopus-indexed journals in Information Technology. Recently, I was reviewing an article in which the researchers had made a critical mistake in data analysis. They converted the original categorical data to continuous data without providing a rigorous statistical treatment or any justification for the resulting loss of information, if any. My motivation for this study is thus born of their error. We know the standard association measure between continuous variables is the product-moment correlation coefficient introduced by Karl Pearson. This measure determines the degree of linear association between continuous variables and is both normalized to lie between -1 and +1 and symmetric: the correlation between variables x and y is the same as that between y and x. The best-known association measure between two categorical variables is probably the chi-square…
Original Post: To eat or not to eat! That’s the question? Measuring the association between categorical variables
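
The two measures named in the excerpt above can be illustrated with a small sketch. It is not taken from the original post: the function names `pearson_r` and `chi_square_statistic` and the sample data are hypothetical. The chi-square part assumes the standard Pearson statistic computed from a contingency table of counts, with no continuity correction and no p-value.

```python
import numpy as np


def pearson_r(x, y):
    """Product-moment correlation; symmetric in x and y, bounded by -1 and +1."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc)))


def chi_square_statistic(table):
    """Pearson chi-square statistic for a contingency table of counts."""
    observed = np.asarray(table, dtype=float)
    row_totals = observed.sum(axis=1, keepdims=True)
    col_totals = observed.sum(axis=0, keepdims=True)
    # Expected counts if the row and column variables were independent.
    expected = row_totals @ col_totals / observed.sum()
    return float(((observed - expected) ** 2 / expected).sum())


x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
print(pearson_r(x, y), pearson_r(y, x))  # same value in both directions

# Illustrative 2x2 table: rows and columns are two categorical variables.
print(chi_square_statistic([[30, 10], [15, 25]]))
```

Unlike the correlation coefficient, the chi-square statistic is not normalized: it grows with the sample size, which is one reason normalized categorical measures are often preferred for comparison.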

My new DataCamp course: Forecasting Using R

For the past few months I’ve been working on a new DataCamp course teaching Forecasting using R. I’m delighted that it is now available for anyone to take. Course blurb: Forecasting involves making predictions about the future. It is required in many situations: deciding whether to build another power generation plant in the next ten years requires forecasts of future demand; scheduling staff in a call center next week requires forecasts of call volumes; stocking an inventory requires forecasts of stock requirements. To leave a comment for the author, please follow the link and comment on their blog: R on Rob J Hyndman.
Original Post: My new DataCamp course: Forecasting Using R

let there be progress

The ‘wrapr’ package for use with dplyr programming – UPDATED POST. I’m the first to admit I’m not an R expert (even duffers can blog about it though), but when I began thinking about writing some dplyr functions to help me create and analyse run charts, I had no idea that I was going to struggle quite so much getting to grips with Non-Standard Evaluation and Lazy Evaluation (see my last post for links and more background). To clarify, I wanted to create flexible dplyr functions, without hardcoded grouping parameters, so I could query different SQL Server tables and apply the same functions reliably to transform and plot the data. Eventually I sussed things out and created the functions I needed, but it was yet another example of how R makes me feel really stupid on a regular basis. Please tell…
Original Post: let there be progress

Shiny: data presentation with an extra

A Shiny app with three tabs presenting different sections of the same data. Shiny is an application based on R/RStudio which enables an interactive exploration of data through a dashboard with drop-down lists and checkboxes, with no programming required. The apps can be useful for both the data analyst and the public. Shiny apps are web-based: this allows for private consultation of the data in one’s own browser as well as for online publication. Free apps can handle quite a lot of data, and the limit can be raised with a subscription. The target audience of Shiny is extremely broad. Let’s take science, and open science in particular. At a time when openly archiving all data is becoming standard practice (e.g., OSF.io, Figshare.com, Github.com), Shiny can be used to go the extra mile by letting people tour the data at once without programming. It’s the right tool for acknowledging…
Original Post: Shiny: data presentation with an extra

Top KDnuggets tweets, May 24-30: #DataScience for Beginners; 10 Free Must-Read Books for #MachineLearning and #DataScience

Real-time face detection and emotion/gender classification; Top 20 #Python #MachineLearning Open Source Projects, updated; Stanford CS231n lecture slides: #DeepLearning software; #DataScience platforms are on the rise
Original Post: Top KDnuggets tweets, May 24-30: #DataScience for Beginners; 10 Free Must-Read Books for #MachineLearning and #DataScience

Watch presentations from R/Finance 2017

It was another great year for the R/Finance conference, held earlier this month in Chicago. This is normally a fairly private affair: with attendance capped at around 300 people every year, it’s a somewhat exclusive gathering of the best and brightest minds from industry and academia in financial data analysis with R. But for the first time this year (and with thanks to sponsorship from Microsoft), videos of the presentations are available for viewing by everyone. I’ve included the complete list (copied from the R/Finance website) below, but here are a few of my favourites: You can find an up-to-date version of the table below at the R/Finance website (click on the “Program” tab), and you can also browse the videos at Channel 9. Note that the lightning talk sessions (in orange) are bundled together in a single video, which you…
Original Post: Watch presentations from R/Finance 2017
