Data Science for Doctors – Part 1 : Data Display

Data science enhances people’s decision making. Doctors and researchers make critical decisions every day, so a substantial knowledge of data science is clearly valuable to them. This series aims to help people in the medical field enhance their data science skills. We will work with a health-related dataset, the famous “Pima Indians Diabetes Database”, generously donated by Vincent Sigillito from Johns Hopkins University. Please find further information regarding the dataset here. This first part of the series is about data display. Before proceeding, it might be helpful to look over the help pages for table, pie, geom_bar, coord_polar, barplot, stripchart, geom_jitter, density, geom_density, hist, geom_histogram, boxplot, geom_boxplot, qqnorm, qqline, geom_point, and plot. install.packages('ggplot2'); library(ggplot2). I have also changed the values of…
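As a taste of the base-R display functions listed above, here is a minimal sketch using a simulated stand-in for a glucose column (the real Pima data are not loaded here, so the values below are hypothetical):

```r
# Simulated stand-in for a plasma-glucose variable (hypothetical values)
set.seed(42)
glucose <- rnorm(768, mean = 120, sd = 30)

# Histogram: bin the data first, then draw it
h <- hist(glucose, breaks = 20, plot = FALSE)
plot(h, main = "Glucose distribution", xlab = "Plasma glucose")

# Boxplot and normal Q-Q plot of the same variable
boxplot(glucose, horizontal = TRUE, main = "Glucose boxplot")
qqnorm(glucose)
qqline(glucose)
```

The same displays have ggplot2 counterparts (geom_histogram, geom_boxplot), which the post goes on to cover.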
Original Post: Data Science for Doctors – Part 1 : Data Display

Upcoming Win-Vector LLC public speaking engagements

I am happy to announce a couple of exciting upcoming Win-Vector LLC public speaking engagements. BARUG Meetup, Tuesday February 7, 2017, ~7:50pm, Intuit, Building 20, 2600 Marine Way, Mountain View, CA. Win-Vector LLC’s John Mount will be giving a “lightning talk” (15 minutes) on R calling conventions (standard versus non-standard) and showing how to use our replyr package to greatly improve scripting or programming over dplyr. Some articles on replyr can be found here. Strata & Hadoop World West, Tuesday March 14, 2017, 1:30pm–5:00pm, San Jose Convention Center, CA, Location: LL21 C/D. Win-Vector LLC’s John Mount will teach how to use R to control big data analytics and modeling: in-depth training to prepare you to use R, Spark, sparklyr, h2o, and rsparkling. In partnership with RStudio. Hope to see you there!
Original Post: Upcoming Win-Vector LLC public speaking engagements

This is just a test page for forecastersblog.org

This is just a test page for forecastersblog.org. Ignore it, please. es() allows selecting between AIC (Akaike Information Criterion), AICc (Akaike Information Criterion corrected) and BIC (Bayesian Information Criterion, also known as Schwarz IC). The most basic information criterion is AIC. It is calculated for a chosen model using the formula:
\begin{equation} \label{eq:AIC}
\text{AIC} = -2 \ell \left(\theta, \hat{\sigma}^2 \mid Y \right) + 2k,
\end{equation}
where \(k\) is the number of parameters of the model. Not going too much into details, the model with the smallest AIC is considered to be the closest to the true model. Obviously an IC cannot be calculated without model fitting, which implies that a human being needs to form a pool of models, fit each of them to the data, calculate an information criterion for each of them, and after that select the one model that has the lowest…
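The pool-fit-and-compare procedure described above can be sketched in base R (using arima() as a stand-in for es(); the candidate orders below are arbitrary):

```r
# Simulate a series from a known AR(1) process
set.seed(1)
y <- arima.sim(model = list(ar = 0.6), n = 200)

# Form a small pool of candidate models, fit each of them to the data,
# compute AIC for each, and keep the model with the lowest value
orders <- list(c(0, 0, 0), c(1, 0, 0), c(2, 0, 0), c(1, 0, 1))
fits   <- lapply(orders, function(ord) arima(y, order = ord))
aics   <- sapply(fits, AIC)
best   <- fits[[which.min(aics)]]
```

The same logic carries over to any other criterion (AICc, BIC) by swapping the extractor function.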
Original Post: This is just a test page for forecastersblog.org

On occasion of the 10,000th R package: The eoda Top 10

R just passed another milestone: 10,000 packages on CRAN. From A3 to zyp, from ABC analysis to zero-inflated models – 10,000 R packages mean great variety and methods for almost every use case. On occasion of this event we collected the top 10 R packages in collaboration with the ones who should know best: our data scientists. The eoda top 10 R packages. Hmisc: This was one of the first R packages to be used at eoda on a regular basis. Today we barely use Hmisc anymore, but nonetheless it had to be part of our top 10, simply for nostalgic reasons. data.table: R as an in-memory database with data.table? Who said R was slow? TraceR: Excellent profiling package. Will find every bottleneck. dplyr: Not only fast when it comes to evaluation but also easy to…
Original Post: On occasion of the 10,000th R package: The eoda Top 10

The “Ten Simple Rules for Reproducible Computational Research” are easy to reach for R users

“Ten Simple Rules for Reproducible Computational Research” is a freely available paper in PLOS Computational Biology. As I’m currently very interested in the subject of reproducible data analysis, I will review these ten rules and their possible implementation in R from my point of view as an epidemiologist interested in healthcare data reuse. I will also check whether my workflow complies with these rules. For those in a hurry, I summarise the rules and their possible implementation in R in a table at the end of this post. The author of this paper, Geir Kjetil Sandve, is an assistant professor in a Norwegian biomedical informatics lab. It is possible he wrote these rules with R in mind, since he cites R in rule 7. However, biomedical informatics is quite far from my practice, and he describes other integrated frameworks allowing reproducibility. Two successive…
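For a flavour of how cheaply some of these rules can be met in R, here is a minimal sketch (the rule specifics are in the paper; the seed value below is arbitrary):

```r
# Fixing the random seed makes stochastic steps replay exactly
set.seed(20170131)
x <- rnorm(10)

set.seed(20170131)   # same seed ...
y <- rnorm(10)       # ... same draws

# Recording the session state documents R and package versions
info <- sessionInfo()
print(identical(x, y))   # [1] TRUE
```

Combined with a scripted workflow (e.g. R Markdown), these two habits already cover the rules on recording random seeds and documenting the computing environment.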
Original Post: The “Ten Simple Rules for Reproducible Computational Research” are easy to reach for R users

CRAN now has 10,000 R packages. Here’s how to find the ones you need.

CRAN, the global repository of open-source packages that extend the capabilities of R, reached a milestone today. There are now more than 10,000 R packages available for download*. (Incidentally, that count doesn’t even include all the R packages out there. There are another 1294 packages for genomic analysis in the BioConductor repository, hundreds of R packages published only on GitHub, commercial R packages from vendors such as Microsoft and Oracle, and an unknowable number of private, unpublished packages.) Why so many packages? R has a very active developer community, which contributes new packages to CRAN on a daily basis. As a result, R is unparalleled in its capabilities for statistical computing, data science, and data visualization: almost anything you might care to do with data has likely already been implemented for you and released as an R package. There are several…
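One quick, offline starting point is to search the packages already on your own machine; searching all of CRAN needs extra tooling (for example the sos package or the CRAN task views). A sketch, with the keyword chosen arbitrarily:

```r
# List installed packages together with their DESCRIPTION Titles
pkgs <- installed.packages(fields = "Title")

# Filter by a keyword appearing in the Title field
hits <- pkgs[grepl("regression", pkgs[, "Title"], ignore.case = TRUE), ,
             drop = FALSE]
rownames(hits)   # names of matching packages, if any
```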
Original Post: CRAN now has 10,000 R packages. Here’s how to find the ones you need.

January ’17 Tips and Tricks

by Sean Lopp This month’s collection of Tips and Tricks comes from an excellent talk given at the 2017 RStudio::Conf in Orlando by RStudio Software Engineer Kevin Ushey. The slides from his talk are embedded below and cover features from autocompletion to R Markdown shortcuts. Use the left and right arrow keys to change slides. Enjoy! To leave a comment for the author, please follow the link and comment on their blog: RStudio.
Original Post: January ’17 Tips and Tricks

Simulating from a specified seasonal ARIMA model

From my email today: “You use an illustration of a seasonal ARIMA model: ARIMA(1,1,1)(1,1,1)4. I would like to simulate data from this process and then fit a model… but I am unable to find any information as to how this can be conducted… If I set phi1, Phi1, theta1, and Theta1, it would be reassuring that for large n the parameters returned by Arima(foo, order=c(1,1,1), seasonal=c(1,1,1)) are in agreement…” My answer: Unfortunately arima.sim() won’t handle seasonal ARIMA models. I wrote simulate.Arima() to handle them, but it is designed to simulate from a fitted model rather than a specified model. However, you can use the following code to do it. It first “estimates” an ARIMA model with the specified coefficients, then simulates from it.

library(forecast)
# "Estimate" a model whose coefficients are fixed at the desired values
model <- Arima(ts(rnorm(100), freq = 4), order = c(1, 1, 1),
               seasonal = c(1, 1, 1),
               fixed = c(phi = 0.5, theta = -0.4, Phi = 0.3, Theta = -0.2))
# Simulate a long series from the model, then refit to check the parameters
foo <- simulate(model, nsim = 1000)
fit <- Arima(foo, order = c(1, 1, 1), seasonal = c(1, 1, 1))
Original Post: Simulating from a specified seasonal ARIMA model

The Genetic Map Comparator: a user-friendly application to display and compare genetic maps

The Genetic Map Comparator is an R Shiny application made to compare and characterize genetic maps. You can use it through the online version and read the related publication in Bioinformatics. The biological perspective: a genetic map provides the position of genetic markers along chromosomes. Geneticists often have to visualize these maps and calculate their basic statistics (length, number of markers, gap sizes, etc.). When several segregating populations are studied, one must often deal with multiple maps that share some markers. These maps are compared to highlight their overall relative strengths and weaknesses (e.g. via marker distributions or map lengths), or local marker inconsistencies. The Genetic Map Comparator is an effective, user-friendly tool for these tasks. You can upload your own genetic maps and explore them using the various sheets accessible via links at the top of the…
Original Post: The Genetic Map Comparator: a user-friendly application to display and compare genetic maps

Reproducible Finance with R: Sector Correlations

by Jonathan Regenstein Welcome to the first installment of reproducible finance for 2017. It’s a new year, a new President takes office soon, and we could be entering a new political-economic environment. What better time to think about a popular topic of the last few years: equity correlations. Elevated correlations are important for several reasons – life is hard for active managers and diversification gains are vanishing – but I personally enjoy thinking about them more from an inference or data exploration perspective. Are changing correlations telling us something about the world? Are sectors diverging? How much can be attributed to the Central Bank regime at hand? So many questions, so many hypotheses to be explored. Let’s get started. Today, we will build a Notebook and start exploring the historical rolling correlations between sector ETFs and the S&P 500.…
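As a sketch of the rolling-correlation idea (the Notebook in the post uses real sector-ETF data; simulated daily returns and a 60-day window stand in here):

```r
# Simulated stand-ins for daily market and sector returns
set.seed(123)
n <- 500
market <- rnorm(n, sd = 0.01)                    # stand-in for S&P 500 returns
sector <- 0.8 * market + rnorm(n, sd = 0.006)    # correlated sector returns

# Rolling correlation over a trailing window, NA until the window fills
roll_cor <- function(x, y, width = 60) {
  out <- rep(NA_real_, length(x))
  for (i in width:length(x)) {
    w <- (i - width + 1):i
    out[i] <- cor(x[w], y[w])
  }
  out
}

rc <- roll_cor(sector, market, width = 60)
range(rc, na.rm = TRUE)   # how the 60-day correlation varies over time
```

Plotting rc against time shows whether the sector is drifting away from, or converging toward, the broad market.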
Original Post: Reproducible Finance with R: Sector Correlations