Hardwired..for tidy text

Song lyric and sentiment analysis for all. So, a while back I did a tidy text analysis on Faith No More lyrics. I had thought about doing this with Metallica album lyrics, as they have had a long career, spanning their late teens/twenties to their 50s. However, I found the process of obtaining lyrics and getting them into shape for analysis too painful, so I chose a band with slightly less output. Good news though – things have changed with the release of the geniusr package from Josiah Parry (@JosiahParry). This makes getting song/album lyrics a piece of cake. With my FNM analysis, I obtained individual tracks, organised them into folders by album, and then went through a lot of manual processing (the site I obtained the lyrics from concatenated each line into a single string). This…
Original Post: Hardwired..for tidy text

Introducing DataFramed, a Data Science Podcast

We are super pumped to be launching a weekly data science podcast called DataFramed, in which Hugo Bowne-Anderson (me), a data scientist and educator at DataCamp, speaks with industry experts about what data science is, what it’s capable of, what it looks like in practice and the direction it is heading over the next decade and into the future. You can check out the podcast here and make sure to subscribe, rate and review! For a sneak peek, check out the trailer above! Instead of answering “what is data science?” merely through the lens of related technologies, tools and skill-sets, a methodology commonly invoked to discover what data science is, we have decided to answer this question by delving into what modern data science looks like in practice via in-depth conversations with practitioners. These are the types of conversations we…
Original Post: Introducing DataFramed, a Data Science Podcast

Speed up R with Parallel Programming in the Cloud

This past weekend I attended the R User Day at Data Day Texas in Austin. It was a great event, mainly because so many awesome people from the R community came to give some really interesting talks. Lucy D’Agostino McGowan has kindly provided a list of the talks and links to slides, and I thoroughly recommend checking it out: you’re sure to find a talk (or two, or five or ten) that interests you. My own talk was on Speeding up R with Parallel Programming in the Cloud, where I talked about using the doAzureParallel package to launch clusters for use with the foreach function, and using aztk to launch Spark clusters for use with the sparklyr package. I’ve embedded the slides below: it’s not quite the same without the demos (and sadly there was no video recording), but I’ve…
Original Post: Speed up R with Parallel Programming in the Cloud

The EARLy career scholarship

At Mango, we’re passionate about R and promoting its use in enterprise – it’s why we created the EARL Conferences. We understand the importance of sharing knowledge to generate new ideas and change the way organisations use R for the better. This year we are on a mission to actively encourage the attendance of R users who are either in a very early stage of their career or are finishing their academic studies and looking at employment options. We’re offering EARLy career R users a chance to come to EARL – we have a number of 2-day conference passes for EARL London and tickets for each 1-day event in the US. This year’s dates are: London, 12-13 September; Seattle, 7 November; Houston, 9 November; Boston, 13 November. Who can apply? Anyone in their first year of employment; anyone doing an internship or work placement…
Original Post: The EARLy career scholarship

The “cluster of six”

Unsupervised machine learning. Hansard reports what’s said in the UK Parliament, sets out details of divisions, and records decisions taken during a sitting. The hansard R package provides functions to import its data. Using the Hansard API (Application Programming Interface), we’ll apply unsupervised machine learning to analyze the voting patterns of 219 Labour Members of Parliament (MPs). We’ll consider all divisions (results of the votes) in the UK House of Commons since the 2017 general election. Supervised machine learning makes predictions from labeled training data. The unsupervised flavour looks for hidden structure in “unlabeled” data, i.e. a classification or categorisation not included in the observations. Hierarchical clustering will identify a cluster of six MPs as the most “distant” from the wider party. The full methodology, including the code, is published here. This extended narrative confirms the suitability of the data for clustering; reviews…
Original Post: The “cluster of six”
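
For readers who want a feel for what hierarchical clustering does with voting records like those in the “cluster of six” post, here is a minimal sketch in Python. It is not the post's code (the post uses R and the hansard package). The MP names, the vote matrix, and the helper functions `disagreement`, `average_linkage` and `cluster` are all invented for illustration, and average linkage is an assumed choice rather than the method the author necessarily used.

```python
from itertools import combinations

# Hypothetical toy data: one row per MP, one column per division
# (1 = voted aye, 0 = voted no). All values are invented for illustration.
votes = {
    "MP_A": [1, 1, 0, 1, 0, 1],
    "MP_B": [1, 1, 0, 1, 0, 0],
    "MP_C": [1, 1, 0, 1, 1, 1],
    "MP_D": [0, 0, 1, 0, 1, 0],
    "MP_E": [0, 0, 1, 0, 1, 1],
}


def disagreement(a, b):
    """Share of divisions in which two MPs voted differently."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def average_linkage(c1, c2):
    """Mean pairwise disagreement between the members of two clusters."""
    total = sum(disagreement(votes[m], votes[n]) for m in c1 for n in c2)
    return total / (len(c1) * len(c2))


def cluster(names, k):
    """Agglomerative clustering: merge the two closest clusters until k remain."""
    clusters = [frozenset([n]) for n in names]
    while len(clusters) > k:
        # Pick the pair of clusters with the smallest average disagreement.
        c1, c2 = min(combinations(clusters, 2),
                     key=lambda pair: average_linkage(*pair))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 | c2]
    return clusters


groups = cluster(sorted(votes), k=2)
for g in groups:
    print(sorted(g))
```

With this toy data, the two MPs who vote against the majority end up in their own group. That is the same idea as the “distant” cluster the post describes, on a much smaller scale.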

Deep Learning for Cancer Immunotherapy

Introduction. In my research, I apply deep learning to unravel molecular interactions in the human immune system. One application of my research is within cancer immunotherapy (Immuno-oncology or Immunooncology) – a cancer treatment strategy, where the aim is to utilize the cancer patient’s own immune system to fight the cancer. The aim of this post is to illustrate how deep learning is successfully being applied to model key molecular interactions in the human immune system. Molecular interactions are highly context dependent and therefore non-linear. Deep learning is a powerful tool to capture non-linearity and has therefore proven invaluable and highly successful, in particular in modelling the molecular interaction between the Major Histocompatibility Complex type I (MHCI) and peptides (the state-of-the-art model netMHCpan identifies 96.5% of natural peptides at a very high specificity of 98.5%). Adoptive T-cell therapy. Some brief background…
Original Post: Deep Learning for Cancer Immunotherapy

sparklyr 0.7

We are excited to share that sparklyr 0.7 is now available on CRAN! Sparklyr provides an R interface to Apache Spark. It supports dplyr syntax for working with Spark DataFrames and exposes the full range of machine learning algorithms available in Spark. You can also learn more about Apache Spark and sparklyr at spark.rstudio.com and in our new webinar series on Apache Spark. Features in this release: in this blog post, we highlight Pipelines, new ML functions, and enhanced support for data serialization. To follow along in the examples below, you can upgrade to the latest stable version from CRAN. ML Pipelines. The ML Pipelines API is a high-level interface for building ML workflows in Spark. Pipelines provide a uniform approach to compose feature transformers and ML routines, and are interoperable across the different Spark APIs (R/sparklyr, Scala, and Python). First,…
Original Post: sparklyr 0.7

Introducing Maëlle Salmon, rOpenSci’s new Research Software Engineer

We’re very pleased to be introducing someone who needs no introduction in the R community. Join us in welcoming Maëlle Salmon to rOpenSci as a Research Software Engineer (part time, working from Nancy, France). We’d like to formally introduce her here and share a bit about the kinds of things she’ll be working on. Maëlle did a B.Sc. in Biology with an emphasis on maths and quantitative work, two Masters degrees – one in Ecology and one in Public Health – and a Ph.D. in epidemiological statistics at the Ludwig-Maximilian University in Germany. Her thesis dealt with statistical algorithms for aberration detection in time series of counts of reported cases of infectious diseases. Most recently, Maëlle worked as a data manager and statistician for the CHAI project. Maëlle has contributed six packages to rOpenSci to date, and has written about…
Original Post: Introducing Maëlle Salmon, rOpenSci’s new Research Software Engineer

Year 2 of Locke Data

Hey folks, I wanted to give y’all an update about Locke Data one year on from when I started it up. In the past year, I’ve delivered more than 32 days of training, written and published 2 books, worked with 3 clients, and generally whimpered at my schedule. It has been amazing how much support the community has given me, and I’ve tried to give back where possible by giving away books each month, doing the usual presenting, holding free office hours, and offering community workshops.
Original Post: Year 2 of Locke Data