Mind Bytes: Solving Societal Challenges with Artificial Intelligence

By Francesca Lazzeri (@frlazzeri), Data Scientist at Microsoft. Artificial intelligence (AI) solutions are playing a growing role in our everyday lives and are being adopted broadly, in both private and public domains. While the notion of AI has been around for over sixty years, real-world AI scenarios and applications have only taken off in the last decade, thanks to three simultaneous developments: improved computing power, the capability to capture and store massive amounts of data, and faster algorithms. AI solutions help determine the commercials you see online, the movie you will watch with your family, and the route you take to work. Beyond the most popular apps, these systems are also being implemented in critical areas such as health care, immigration policy, finance, and the workplace. The design and implementation of these AI tools presents deep societal challenges…
Original Post: Mind Bytes: Solving Societal Challenges with Artificial Intelligence

An introduction to seplyr

by John Mount, Win-Vector LLC. seplyr is an R package that supplies improved standard-evaluation interfaces for many common data wrangling tasks. The core of seplyr is a re-skinning of dplyr's functionality to seplyr conventions (similar to how stringr re-skins the implementing package stringi). Standard Evaluation and Non-Standard Evaluation: "standard evaluation" is the name we are using for the value-oriented calling convention found in many programming languages. The idea is: functions are only allowed to look at the values of their arguments, not at how those values arise (i.e., they cannot look at source code or variable names). This evaluation principle allows one to transform, optimize, and reason about code. It is what lets us say the following two snippets of code are equivalent:

    x <- 4; sqrt(x)
    x <- 4; sqrt(4)

The mantra is: "variables can be replaced with their values."…
Original Post: An introduction to seplyr
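To see what a standard-evaluation interface buys you, here is a minimal sketch of the seplyr convention; this is our illustration, not code from the post, and it assumes seplyr's group_by_se()/summarize_se() variants plus the built-in iris data. Column names travel as plain character values:

    # A minimal sketch of the seplyr convention (our illustration, not from
    # the original post). Column names are ordinary character values.
    library("seplyr")                   # also re-exports the %>% pipe

    grouping_cols <- "Species"          # a column name held as plain data
    iris %>%
      group_by_se(grouping_cols) %>%    # standard-evaluation group_by
      summarize_se(c(mean_sepal_length = "mean(Sepal.Length)"))

Because the grouping columns are ordinary values, the same pipeline can be parameterized or generated programmatically, which is exactly the property the "variables can be replaced with their values" mantra describes.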

How to make Python easier for the R user: revoscalepy

by Siddarth Ramesh, Data Scientist, Microsoft. I'm an R programmer. To me, R has been great for data exploration, transformation, statistical modeling, and visualization. However, there is a huge community of data scientists and analysts who turn to Python for these tasks. Moreover, both R and Python experts exist in most analytics organizations, and it is important for both languages to coexist. Often this means that R coders will develop a workflow in R but then must redesign and recode it in Python for their production systems. If the coder is lucky, this is easy: the R model can be exported as a serialized object and read into Python, and there are packages that do this, such as pmml. Unfortunately, it is often more challenging, because the production system might demand that the entire end-to-end workflow is built…
Original Post: How to make Python easier for the R user: revoscalepy
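For the lucky case the excerpt mentions, a hedged sketch of the serialized-model hand-off might look like the following; it assumes the CRAN packages pmml and XML, and the lm model is only a toy example:

    # A hedged sketch of exporting an R model so a Python system can read it.
    # Assumes the CRAN packages 'pmml' and 'XML'; the model is a toy example.
    library(pmml)
    library(XML)

    fit <- lm(mpg ~ wt + hp, data = mtcars)    # fit a simple model in R
    saveXML(pmml(fit), "mtcars_lm.pmml")       # PMML is an XML standard that
                                               # Python-side scorers can consume

As the excerpt notes, the harder case is when the entire end-to-end workflow, not just the fitted model, has to move; that is the gap revoscalepy targets.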

Scale up your parallel R workloads with containers and doAzureParallel

by JS Tan (Program Manager, Microsoft). The R language is by far the most popular statistical language and has seen massive adoption in both academia and industry. In our new data-centric economy, the models and algorithms that data scientists build in R are not just being used for research and experimentation; they are now also being deployed into production environments, and directly into products themselves. However, taking your R workload and deploying it at production capacity, and at scale, is no trivial matter. Because of R's rich and robust package ecosystem, and the many versions of R, reproducing the environment of your local machine in a production setting can be challenging, let alone ensuring your model's reproducibility! This is why containers are extremely important when it comes to operationalizing your R workloads. I'm happy to announce that…
Original Post: Scale up your parallel R workloads with containers and doAzureParallel
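For flavor, here is a minimal sketch of the doAzureParallel pattern, with function names per the package README; the credential and cluster JSON files are assumed to be already configured:

    # A minimal sketch of the doAzureParallel workflow (function names per the
    # package README; credential/cluster JSON files assumed to be set up).
    library(doAzureParallel)

    setCredentials("credentials.json")       # Azure Batch + storage credentials
    cluster <- makeCluster("cluster.json")   # provision the pool of VMs; the
                                             # cluster config can also name a
                                             # container image for reproducibility
    registerDoAzureParallel(cluster)         # register as the foreach backend

    results <- foreach(i = 1:100) %dopar% {  # iterations fan out across the pool
      mean(rnorm(1e6))
    }
    stopCluster(cluster)

Pinning the environment in the cluster config, rather than hoping the pool's VMs match your laptop, is what addresses the reproducibility concern raised above.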

Recap: EARL Boston 2017

By Emmanuel Awa, Francesca Lazzeri and Jaya Mathew, data scientists at Microsoft. A few of us attended the EARL conference in Boston last week, which brought together a group of talented R users from academia and industry and highlighted various Enterprise Applications of R. Despite being a small conference, the quality of the talks was high, showcasing innovative uses of some of the newer packages available in the R language. Attendees ranged from veteran R users to newcomers, so there was a mix of proficiency levels. R currently has a vibrant community of users, with over 11,000 open-source packages. The conference also encouraged women to join their local R Ladies chapter…
Original Post: Recap: EARL Boston 2017

Role Playing with Probabilities: The Importance of Distributions

by Jocelyn Barker, Data Scientist at Microsoft. I have a confession to make: I am not just a statistics nerd; I am also a role-playing games geek. I have been playing Dungeons and Dragons (DnD) and its variants since high school. While playing with my friends the other day, it occurred to me that DnD may have some lessons to share in my job as a data scientist. Hidden in its dice-rolling mechanics is a perfect little experiment for demonstrating at least one reason why practitioners may resist statistical methods even when we can show better average performance than previous methods. It is all about distributions: while our averages may be higher, the distribution of individual data points can be disastrous. Why Use Role-Playing Games as an Example? Partially because it means I get to think about one…
Original Post: Role Playing with Probabilities: The Importance of Distributions
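The dice mechanics make the point concrete. As a quick illustration of our own (the specific comparison is not necessarily the one in the post), 3d6 and 1d20 share the same mean of 10.5 but have very different spreads, and it is the tails, not the mean, that a player actually experiences:

    # Our illustration, not code from the original post: same average, very
    # different distributions. Both 3d6 and 1d20 average 10.5.
    set.seed(42)
    n <- 1e5
    rolls_3d6  <- replicate(n, sum(sample(1:6, 3, replace = TRUE)))
    rolls_1d20 <- sample(1:20, n, replace = TRUE)

    c(mean(rolls_3d6), mean(rolls_1d20))   # both ~10.5
    c(sd(rolls_3d6),   sd(rolls_1d20))     # ~2.96 vs ~5.77
    mean(rolls_3d6  <= 4)                  # ~0.019: rolls this bad are rare
    mean(rolls_1d20 <= 4)                  # 0.20: one roll in five is a flop

A method with the better average but the fatter lower tail can still lose a practitioner's trust, which is the post's point about distributions.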