Understanding Naïve Bayes Classifier Using R

The Best Algorithms are the Simplest. The field of data science has progressed from simple linear regression models to complex ensembling techniques, but the most preferred models are still the simplest and most interpretable: among them linear regression, logistic regression, decision trees and Naive Bayes. The Naive Bayes algorithm, in particular, is a logic-based technique that is simple yet so powerful that it often outperforms complex algorithms on very large datasets. Naive Bayes is commonly used in medical science, especially for cancer detection. This article explains the logic underlying the Naive Bayes algorithm and walks through an example implementation. How Probability Defines Everything. We calculate probability as the proportion of cases in which an event happens, and call it the probability of the event. Just as there is a probability for a single event, we…
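To give a flavor of what such an implementation looks like, here is a minimal sketch using naiveBayes() from the e1071 package on the built-in iris data; the article's own code and dataset may well differ.

  # Minimal Naive Bayes sketch in R (e1071), with iris as stand-in data
  library(e1071)

  set.seed(42)
  idx   <- sample(nrow(iris), 0.7 * nrow(iris))   # 70/30 train/test split
  train <- iris[idx, ]
  test  <- iris[-idx, ]

  # Fit class-conditional distributions per feature, assuming independence
  fit <- naiveBayes(Species ~ ., data = train)

  # Predict on held-out rows and check accuracy
  pred <- predict(fit, test)
  mean(pred == test$Species)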
Original Post: Understanding Naïve Bayes Classifier Using R

Exploring handwritten digit classification: a tidy analysis of the MNIST dataset

In a recent post, I offered a definition of the distinction between data science and machine learning: data science is focused on extracting insights, while machine learning is interested in making predictions. I also noted that the two fields greatly overlap, and I use both in my work: I might fit a model on Stack Overflow traffic data to determine which users are likely to be looking for a job (machine learning), but then construct summaries and visualizations that examine why the model works (data science). This is an important way to discover flaws in your model and to combat algorithmic bias, and it is one reason that data scientists are often responsible for developing the machine learning components of a product. I’d like to further explore how data science and machine learning complement each other, by…
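To make the "tidy analysis" angle concrete, here is a hypothetical sketch of the kind of reshaping such an analysis relies on: turning one digit's pixel matrix into tidy (x, y, value) rows for ggplot2. The object names are illustrative, not the post's actual code.

  # Hypothetical: reshape an image matrix into tidy pixel rows
  library(tidyverse)

  img <- matrix(runif(28 * 28), nrow = 28)   # stand-in for one MNIST digit

  pixels <- crossing(y = 1:28, x = 1:28) %>%
    mutate(value = img[cbind(y, x)])         # look up each pixel's intensity

  ggplot(pixels, aes(x, y, fill = value)) +
    geom_tile() +
    scale_y_reverse()                        # image rows count downward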
Original Post: Exploring handwritten digit classification: a tidy analysis of the MNIST dataset

Automating Basic EDA

In any model development exercise, a considerable amount of time is spent understanding the underlying data, visualizing relationships and validating preliminary hypotheses (broadly categorized as Exploratory Data Analysis). A key element of EDA involves visually analyzing the data to glean valuable insights and understand the underlying relationships and patterns. While EDA is defined more as a philosophy than as a fixed set of procedures and techniques, there is a certain set of standard analyses that you would most likely perform as part of EDA to gain an initial understanding of the data. This post provides an overview of RtutoR, a package I developed some time back and have recently extended with a few new functionalities that automate some elements of EDA. In a nutshell, the functionalities provided in the package help you: Automatically generate common univariate…
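This is not RtutoR's actual interface, but a generic sketch of what "automatically generate common univariate plots" can look like: pick a plot type per column based on its class.

  # Generic automated univariate EDA; not RtutoR's API
  library(ggplot2)

  univariate_plots <- function(df) {
    lapply(names(df), function(col) {
      if (is.numeric(df[[col]])) {
        ggplot(df, aes(.data[[col]])) + geom_histogram(bins = 30)
      } else {
        ggplot(df, aes(.data[[col]])) + geom_bar()
      }
    })
  }

  plots <- univariate_plots(iris)   # one plot per column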
Original Post: Automating Basic EDA

Rblpapi 0.3.8: Strictly maintenance

Another Rblpapi release, now at version 0.3.8, arrived on CRAN yesterday. Rblpapi provides a direct interface between R and the Bloomberg Terminal via the C++ API provided by Bloomberg Labs (but note that a valid Bloomberg license and installation is required). This is the eighth release since the package first appeared on CRAN in 2016. This release wraps up a few smaller documentation and setup changes, but also includes an improvement to the (less frequently used) subscription mode which Whit cooked up on the weekend. Details below:

Changes in Rblpapi version 0.3.8 (2018-01-20)

- The 140-day limit for intra-day data histories is now mentioned in the getTicks help (Dirk in #226 addressing #215 and #225).
- The Travis CI script was updated to use run.sh (Dirk in #226).
- The install_name_tool invocation under macOS was corrected (@spennihana in #232).
- The blpAuthenticate help page…
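For context, a typical Rblpapi session looks roughly like this; it needs a live Bloomberg terminal connection, and the tickers and fields are just examples.

  # Illustrative only: requires a running Bloomberg terminal session
  library(Rblpapi)
  blpConnect()                                # connect via the C++ API

  # Daily history for an example ticker and field
  px <- bdh("SPY US Equity", "PX_LAST", start.date = Sys.Date() - 30)

  # Intra-day ticks; note the 140-day history limit mentioned above
  ticks <- getTicks("ES1 Index", "TRADE",
                    startTime = Sys.time() - 60 * 60,
                    endTime   = Sys.time())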
Original Post: Rblpapi 0.3.8: Strictly maintenance

How to recruit data analysts for the public sector by @ellis2013nz

A management challenge. Between 2011 and 2017 I selected somewhere between 20 and 30 staff and contractors for New Zealand public sector roles with titles like Analyst, Senior Analyst and Principal Analyst. Alternative names for these roles could have been (and sometimes were) “researcher”, “data analyst”, “statistician”, “R developer”, “data scientist” and in one case “Shiny developer” – with or without “Senior” or “Principal” in front of the name. I must have been part of over 50 job interviews for such roles, and read some hundreds of technical exercises and perhaps more than 1,000 job applications. Perhaps it was only 500 applicants, counting people who applied more than once; but it was certainly lots. This was when my own positions had titles such as Manager Tourism Research and Evaluation, Manager Sector Trends, and General Manager – Evidence and Insights. This…
Original Post: How to recruit data analysts for the public sector by @ellis2013nz

Major update to BatchGetSymbols

Making it even easier to download and organize stock prices from Yahoo Finance. I just released a long-due update to the package BatchGetSymbols. The files are under review on CRAN and you should get the update soon. Meanwhile, you can install the new version from GitHub:

  if (!require(devtools)) install.packages('devtools')
  devtools::install_github('msperlin/BatchGetSymbols')

The main innovations are:

- Clever cache system: by default, every new download of data will be saved in a local file located in a directory chosen by the user. Every new request for data is compared against the available local information; if data is missing, the function downloads only the piece that is missing. This makes calls to BatchGetSymbols a lot faster! When updating an existing dataset of prices, the function downloads only the newly available data that is missing from the local files.
- Returns calculation: the function now returns a return vector in df.tickers. Returns are…
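A minimal call looks like the following; the tickers and dates are examples, and argument names beyond these basics may differ across versions.

  # Minimal usage sketch; tickers and dates are examples
  library(BatchGetSymbols)

  out <- BatchGetSymbols(tickers    = c("AAPL", "MSFT"),
                         first.date = Sys.Date() - 365,
                         last.date  = Sys.Date())

  head(out$df.tickers)   # prices (and, with this update, returns)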
Original Post: Major update to BatchGetSymbols

Fatal Journeys: Visualizing the Horror

In war, truth is the first casualty (Aeschylus). I am not a pessimistic person; on the contrary, I always try to look at the bright side of life. I also believe that living conditions are better now than they were years ago, as these plots show. But reducing the complexity of our world to just six graphs is riskily simplistic. Our world is quite far from being a fair place, and one example is the immigration drama. Last year there were 934 incidents around the world involving people looking for a better life, in which more than 5,300 people lost their lives or went missing, 60% of them in the Mediterranean. Around 8 out of every 100 were children. The Missing Migrants Project tracks deaths of migrants, including refugees and asylum-seekers, who have gone missing along mixed migration routes worldwide. You can find a huge amount…
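As a hypothetical sketch of how such data might be explored in R (the file and column names here are placeholders, not the project's actual schema):

  # Placeholder file and column names, for illustration only
  library(tidyverse)

  incidents <- read_csv("missing_migrants.csv")   # stand-in file name

  incidents %>%
    count(region, wt = dead_and_missing) %>%      # total per region
    ggplot(aes(reorder(region, n), n)) +
    geom_col() +
    coord_flip() +
    labs(x = NULL, y = "Dead and missing")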
Original Post: Fatal Journeys: Visualizing the Horror

Predicting Conflict Duration with (gg)plots using Keras

An Unlikely Pairing. Last week, Marc Cohen from Google Cloud was on campus to give a hands-on workshop on image classification using TensorFlow. Consequently, I spent most of my time thinking about how I can incorporate image classifiers into my work. As my research is primarily on forecasting armed conflict duration, it’s not really straightforward to make a connection between the two. I mean, what are you going to do, analyse portraits of US presidents to see whether you can predict military use of force based on their facial features? Also, I’m sure someone, somewhere has already done that, given this. For the purposes of this blog post, I went ahead with the second most ridiculous idea that popped into my mind: why don’t I generate images from my research and use them to answer my own research question? This…
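For reference, the bare bones of an image classifier in the keras R interface look like this; the input shape and layer sizes are generic placeholders, not the post's actual model.

  # Generic image-classifier skeleton; shapes are placeholders
  library(keras)

  model <- keras_model_sequential() %>%
    layer_conv_2d(filters = 32, kernel_size = c(3, 3),
                  activation = "relu", input_shape = c(64, 64, 1)) %>%
    layer_max_pooling_2d(pool_size = c(2, 2)) %>%
    layer_flatten() %>%
    layer_dense(units = 64, activation = "relu") %>%
    layer_dense(units = 2, activation = "softmax")   # e.g. short vs long duration

  model %>% compile(optimizer = "adam",
                    loss      = "categorical_crossentropy",
                    metrics   = "accuracy")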
Original Post: Predicting Conflict Duration with (gg)plots using Keras

Wrapping Access to Web-Services in R-functions.

One of the great features of R is the possibility to quickly access web services. While some companies have the habit and policy of documenting their APIs, there is still a large chunk of undocumented but great web services that help the regular data scientist. In the following short post, I will show how we can turn a simple web service into a nice R function. The example I am going to use is the Linguee translation service DeepL. Just like Google Translate, DeepL features a simple text field: when a user types in text, the translation appears in a second text box, and users can choose between the languages. In order to see how the service works in the backend, let’s have a quick look at the network traffic. For that we open the browser’s developer tools and jump to the network tab. Next, we type in a…
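The shape of such a wrapper, once the network tab has revealed the request, is roughly the following; the URL and parameter names here are placeholders, not DeepL's actual endpoint.

  # Illustrative wrapper with httr; endpoint and parameters are placeholders
  library(httr)

  translate <- function(text, from = "EN", to = "DE") {
    res <- POST("https://example-translator/api",       # hypothetical URL
                body = list(text        = text,
                            source_lang = from,
                            target_lang = to),
                encode = "json")
    stop_for_status(res)                                # fail loudly on errors
    content(res, as = "parsed")                         # parsed response body
  }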
Original Post: Wrapping Access to Web-Services in R-functions.

#15: Tidyverse and data.table, sitting side by side … (Part 1)

Welcome to the fifteenth post in the rarely rational R rambling series, or R4 for short. There are two posts I have been meaning to get out for a bit, and hope to get to shortly, but in the meantime we are going to start something else. Another longer-running idea I had was to present some simple application cases with (one or more) side-by-side code comparisons. Why? Well, at times it feels like R, and the R community, are being split. You’re either with one side (increasingly “religious” in its defense of its deemed-superior approach), or the other. And that is of course utter nonsense: it’s all R after all. Programming, just like other fields using engineering methods and thinking, is about making choices and trading off between certain aspects. A simple example is the fairly well-known trade-off between memory use and…
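As a taste of the side-by-side format, here is one and the same aggregation written in both dialects, using a built-in dataset rather than anything from the post itself.

  # The same group-wise mean, in dplyr and in data.table
  library(dplyr)
  library(data.table)

  # dplyr
  mtcars %>%
    group_by(cyl) %>%
    summarise(mpg_mean = mean(mpg))

  # data.table
  as.data.table(mtcars)[, .(mpg_mean = mean(mpg)), by = cyl]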
Original Post: #15: Tidyverse and data.table, sitting side by side … (Part 1)