Calling all coders and data scientists to join McKinsey’s hackathon. Prize: 5,000 USD + NIPS (Montreal, Canada) + Flights + Accommodation.
Original Post: McKinsey Analytics Online Hackathon, July 20-22
This article investigates the future of map-making and the role of Sensors, Artificial Intelligence and Machine Learning within it.
Original Post: The Future of Map-Making is Open and Powered by Sensors and AI
In this tutorial, I use raw bash commands and regular expressions to process a raw, messy JSON file and a raw HTML page. The tutorial helps us understand the text-processing mechanics under the hood.
Original Post: Text Mining on the Command Line
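The excerpt doesn't include the tutorial's actual commands, but the flavor of regex-based text processing it describes can be sketched in Python as a hypothetical stand-in for grep/sed-style pipelines (the input string below is made up, not taken from the tutorial):

```python
import re

# A messy JSON-ish record with embedded HTML; a made-up example.
raw = '{"user": "alice", "msg": "hello <b>world</b>", "ts": "2018-07-20"}'

# Strip HTML tags with a regex (adequate for simple, well-formed fragments).
no_html = re.sub(r"<[^>]+>", "", raw)

# Pull out key/value pairs without a JSON parser, much as the tutorial
# does with command-line tools and regex alone.
pairs = dict(re.findall(r'"(\w+)":\s*"([^"]*)"', no_html))
print(pairs["user"])   # alice
print(pairs["msg"])    # hello world
```

This mirrors the tutorial's point: for quick, one-off extraction, pattern matching on the raw text is often enough, with the caveat that regexes are brittle against arbitrary HTML or nested JSON.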
In this post, I am going to verify this statement by using Principal Component Analysis (PCA) to try to improve the classification performance of a neural network on a dataset.
Original Post: Dimensionality Reduction : Does PCA really improve classification outcome?
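As a hedged illustration of the projection step the post evaluates (not the author's exact setup; the toy data, dimensions, and component count below are invented), PCA can be computed via SVD in NumPy before feeding the reduced features to a classifier:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy two-class data: an informative signal in 2 dims plus 8 noisy dims,
# standing in for the post's real dataset.
n = 200
signal = rng.normal(size=(n, 2))
labels = (signal[:, 0] + signal[:, 1] > 0).astype(int)  # classification target
noise = rng.normal(scale=3.0, size=(n, 8))
X = np.hstack([signal, noise])

def pca(X, k):
    """Project centered data onto the top-k principal components via SVD."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T

Z = pca(X, k=2)
print(Z.shape)  # (200, 2)
```

Note that PCA keeps directions of maximum variance, not maximum class separability, which is exactly why the post's question ("does PCA really improve classification outcome?") is worth testing empirically rather than assuming.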
I’m delighted to announce a new dataviz project called ‘Data to Viz’ —> data-to-viz.com. What it is: From Data to Viz is a classification of chart types based on input data format. It comes in the form of a decision tree leading to a set of potentially appropriate visualizations to represent the dataset. The decision trees are available in a poster that was presented at UseR in Brisbane. Philosophy: The project is built on two underlying philosophies. First, that most data analyses can be summarized in about twenty different dataset formats. Second, that both the data and the context determine the appropriate chart. Thus, our suggested method consists in identifying and trying all feasible chart types to find out which suits your data and idea best. Once this set of graphics is identified, data-to-viz.com aims to guide you toward the best decision. Content…
Original Post: From Data to Viz | Find the graphic you need
Stencila launches the first version of its (open source) word processor and spreadsheet editor designed for researchers. By Michael Aufreiter, Substance, and Aleksandra Pawlik and Nokome Bentley, Stencila. Stencila is an open source office suite designed for researchers. It allows the authoring of interactive, data-driven publications in visual interfaces, similar to those in conventional office suites, but is built from the ground up for reproducibility. Stencila aims to make it easier for researchers with differing levels of computational skills to collaborate on the same research article. Researchers used to tools like Microsoft Word and Excel will find Stencila’s interfaces intuitive and familiar. And those who use tools such as Jupyter Notebook or R Markdown are still able to embed code for data analysis within their research articles. Once published, Stencila documents are self-contained, interactive and reusable, containing all the text, media,…
Original Post: Stencila – an office suite for reproducible research
Introduction During a recent negotiation of an informed consent form for use in a clinical trial, the opposing lawyer and I skirmished over the applicability of the Genetic Information Nondiscrimination Act of 2008, commonly known as GINA. Specifically, the opposing lawyer thought that guidance issued by the U.S. Office for Human Research Protections in 2009 was now outdated, in part because enforcement efforts were erratic. The argument was primarily driven by policy, rather than data. Being a data-driven guy, I wanted to see whether the data supported the argument advanced by the other lawyer. Fortunately, the U.S. Equal Employment Opportunity Commission (EEOC), which is responsible for administering GINA complaints, maintains statistics regarding GINA claims and resolutions. I’m not great at making sense of numbers in a table, so I thought this presented the perfect opportunity to rvest some data! libraries…
Original Post: Is GINA really about to die?!?
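The post harvests the EEOC statistics with R's rvest; as an analogous, purely illustrative sketch in Python using only the standard library (the table markup and figures below are made up for the example, not actual EEOC data):

```python
from html.parser import HTMLParser

# A tiny, hypothetical HTML table standing in for a statistics page;
# the real post scrapes the live EEOC site with rvest.
html = """
<table>
  <tr><th>Year</th><th>Charges</th></tr>
  <tr><td>2010</td><td>100</td></tr>
  <tr><td>2011</td><td>120</td></tr>
</table>
"""

class TableParser(HTMLParser):
    """Collect each <tr> as a list of its cell texts."""
    def __init__(self):
        super().__init__()
        self.rows, self.row, self.in_cell = [], None, False
    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.row = []
        elif tag in ("td", "th"):
            self.in_cell = True
    def handle_endtag(self, tag):
        if tag == "tr" and self.row is not None:
            self.rows.append(self.row)
            self.row = None
        elif tag in ("td", "th"):
            self.in_cell = False
    def handle_data(self, data):
        if self.in_cell:
            self.row.append(data.strip())

p = TableParser()
p.feed(html)
print(p.rows)  # [['Year', 'Charges'], ['2010', '100'], ['2011', '120']]
```

In practice rvest's `html_table()` (or pandas' `read_html` in Python) does this extraction in one call; the hand-rolled parser just shows what that step is doing.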
In the first part of Introducing the Kernelheaping Package I showed how to compute and plot kernel density estimates on rounded or interval censored data using the Kernelheaping package. Now, let’s make a big leap forward to the 2-dimensional case. Interval censoring can be generalised to rectangles or even arbitrary shapes. That may include counties, zip codes, electoral districts or administrative districts. Standard area-level mapping methods such as choropleth maps suffer from very different area sizes or odd area shapes, which can greatly distort the visual impression. The Kernelheaping package provides a way to convert these area-level data to a smooth point estimate. For the German capital city of Berlin, for example, there exists an open data initiative where data on, e.g., demographics are publicly available. We first load a dataset on the Berlin population, which can be downloaded…
Original Post: Introducing the Kernelheaping Package II
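For orientation, here is a minimal sketch of a plain 2-D Gaussian kernel density estimate evaluated on a grid. This is the generic building block, not the Kernelheaping estimator itself (which additionally iterates to handle area-level/interval-censored observations); the sample data and bandwidth are invented:

```python
import numpy as np

rng = np.random.default_rng(1)
# 500 points scattered around (2, -1), standing in for exact locations.
points = rng.normal(loc=[2.0, -1.0], size=(500, 2))

def kde2d(points, xs, ys, bandwidth=0.5):
    """Evaluate an isotropic Gaussian KDE on the grid xs x ys."""
    gx, gy = np.meshgrid(xs, ys)
    grid = np.stack([gx.ravel(), gy.ravel()], axis=1)   # (G, 2)
    diff = grid[:, None, :] - points[None, :, :]        # (G, N, 2)
    sq = (diff ** 2).sum(axis=2) / bandwidth ** 2
    dens = np.exp(-0.5 * sq).sum(axis=1)
    dens /= len(points) * 2 * np.pi * bandwidth ** 2    # normalize
    return dens.reshape(gy.shape)

xs = np.linspace(-1, 5, 25)
ys = np.linspace(-4, 2, 25)
d = kde2d(points, xs, ys)
print(d.shape)  # (25, 25)
```

Kernelheaping's contribution is that when only the containing area (district, zip code) is observed rather than the point itself, it imputes plausible point locations inside each area and re-estimates the density iteratively, yielding the smooth map the post describes.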
Machine Learning (ML) is still an underdog in the field of economics, though it has gained more and more recognition in recent years. One reason for its underdog status is that in economics and other social sciences one is not only interested in prediction but also in making causal inferences. Thus, many “off-the-shelf” ML algorithms solve a fundamentally different problem. We here at STATWORX also face a variety of such problems, e.g. dynamic pricing optimization. “Prediction by itself is only occasionally sufficient. The post office is happy with any method that predicts correct addresses from hand-written scrawls…[But] most statistical surveys have the identification of causal factors as their ultimate goal.” – Bradley Efron Introduction However, the literature on combining ML and causal inference is growing by the day. One common problem in causal inference is the estimation of heterogeneous…
Original Post: Using Machine Learning for Causal Inference
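One common baseline for the heterogeneous-treatment-effect estimation the excerpt cuts off at is a "T-learner": fit one outcome model per treatment arm and take the difference of their predictions. The sketch below is a hedged illustration under strong assumptions (randomized treatment, linear outcomes, simulated data), not the method from the post:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1000
x = rng.uniform(-1, 1, size=(n, 1))       # observed covariate
t = rng.integers(0, 2, size=n)            # randomized treatment assignment
tau = 1.0 + 2.0 * x[:, 0]                 # true effect varies with x
y = 0.5 * x[:, 0] + t * tau + rng.normal(scale=0.1, size=n)

def fit_linear(x, y):
    """Ordinary least squares with an intercept."""
    X = np.hstack([np.ones((len(x), 1)), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

# T-learner: one outcome model per arm, effect = difference of predictions.
b0 = fit_linear(x[t == 0], y[t == 0])
b1 = fit_linear(x[t == 1], y[t == 1])

def cate(xnew):
    X = np.hstack([np.ones((len(xnew), 1)), xnew])
    return X @ b1 - X @ b0

est = cate(np.array([[0.0], [0.5]]))
print(est)  # roughly [1.0, 2.0], matching tau = 1 + 2x
```

With observational data the same recipe needs the confounding adjustments (propensity weighting, orthogonalization, etc.) that the causal-ML literature the post surveys is largely about.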
For our July 13th 2018 LIBD rstats club meeting we decided to review as much of the useR!2018 conference as we could. Here’s what we were able to figure out about it in about an hour. Hopefully our quick notes will help other rstats enthusiasts, users and developers get a glimpse of the conference, although there are bound to be more videos and material about the conference coming out in the following days. Main links: First of all, search the full Twitter history for tweets related to the conference by checking the user2018 hashtag. Next, check the videos of the talks. There are more videos there than we can check right now, but we hope to come back sometime later and watch more talks. All of the #useR2018 presentations (unless speakers specifically requested otherwise), including tutorials, are being recorded. These will be available…
Original Post: LIBD rstats club remote useR!2018 notes