R⁶ — Disproving Approval

I couldn’t let this stand unchallenged:

“The new Rasmussen Poll, one of the most accurate in the 2016 Election, just out with a Trump 50% Approval Rating. That’s higher than O’s #’s!” — Donald J. Trump (@realDonaldTrump) June 18, 2017

Rasmussen makes their presidential polling data available for both 🍊 & O. Why not compare their ratings from day 1 in office (skipping days that Rasmussen doesn’t poll)?

```r
library(hrbrthemes)
library(rvest)
library(tidyverse)

list(
  Obama="http://m.rasmussenreports.com/public_content/politics/obama_administration/obama_approval_index_history",
  Trump="http://m.rasmussenreports.com/public_content/politics/trump_administration/trump_approval_index_history"
) %>%
  map_df(~{
    read_html(.x) %>%
      html_table() %>%
      .[[1]] %>%
      tbl_df() %>%
      select(date=Date, approve=`Total Approve`, disapprove=`Total Disapprove`)
  }, .id="who") -> ratings

mutate_at(ratings, c("approve", "disapprove"), function(x) as.numeric(gsub("%", "", x, fixed=TRUE))/100) %>%
  mutate(date = lubridate::dmy(date)) %>%
  filter(!is.na(approve)) %>%
  group_by(who) %>%
  arrange(date) %>%
  mutate(dnum = 1:n()) %>%
  ungroup() %>%
  ggplot(aes(dnum, approve, color=who)) +
  geom_hline(yintercept = 0.5, size=0.5) +
  geom_point(size=0.25) +
  scale_y_percent(limits=c(0,1)) +
  scale_color_manual(name=NULL, values=c("Obama"="#313695", "Trump"="#a50026")) +
  labs(x="Day in office", y="Approval…
```
Original Post: R⁶ — Disproving Approval

Replicating the Apache Drill ‘Yelp’ Academic Dataset Analysis with sergeant

The Apache Drill folks have a nice walk-through tutorial on how to analyze the Yelp Academic Dataset with Drill. It’s a bit out of date (the current Yelp data set structure is different enough that the tutorial will error out at various points), but it’s a great example of how to work with large, nested JSON files as a SQL data source. By ‘large’ I mean around 4GB of JSON data spread across 5 files. If you have enough memory and wanted to work with “flattened” versions of the files in R you could use my ndjson package (there are other JSON “flattener” packages as well, and a new one — corpus::read_ndjson — is even faster than mine, but it fails to read this file). Drill doesn’t necessarily load the entire JSON structure into memory (you can check out…
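The flattening approach mentioned above can be sketched with the ndjson package; this is a minimal example, and the filename here is a placeholder for one of the Yelp dataset files rather than the actual path:

```r
library(ndjson)

# stream_in() reads newline-delimited JSON and flattens nested
# structures into dot-separated column names, returning a data.table;
# the filename below is a placeholder
yelp_biz <- ndjson::stream_in("yelp_academic_dataset_business.json")

# nested fields (e.g. business attributes) come back as columns like
# "attributes.WiFi" instead of list-columns, ready for dplyr verbs
str(head(yelp_biz))
```

The tradeoff versus Drill is memory: `stream_in()` materializes the whole flattened table in R, which is why the post leans on Drill for the full ~4GB.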
Original Post: Replicating the Apache Drill ‘Yelp’ Academic Dataset Analysis with sergeant

Keeping Users Safe While Collecting Data

I caught a mention of this project by Pete Warden on Four Short Links today. If his name sounds familiar, he’s the creator of the DSTK, an O’Reilly author, and now works at Google. A decidedly clever and decent chap. The project goal is noble: crowdsource and make a repository of open speech data for researchers to make a better world. Said sourcing is done by asking folks to record themselves saying “Yes”, “No” and other short words. As I meandered over the blog post I looked in horror at the URL for the application that did the recording: https://open-speech-commands.appspot.com/. Why would the goal of the project combined with that URL give pause? Read on! You’ve Got Scams! Picking up the phone and saying something as simple as ‘Yes’ has been a major scam this year. By recording your…
Original Post: Keeping Users Safe While Collecting Data

Engaging the tidyverse Clean Slate Protocol

I caught the 0.7.0 release of dplyr on my home CRAN server early Friday morning and immediately set out to install it since I’m eager to finish up my sergeant package and get it on CRAN. “Tidyverse” upgrades aren’t trivial for me as I tinker quite a bit with the tidyverse and create packages that depend on various components. The sergeant package provides — amongst other things — a dplyr back-end for Apache Drill, so it has more tidyverse tendrils than other bits of code I maintain. macOS binaries weren’t available yet (it generally takes 24-48 hrs for that) so I did an `install.packages("dplyr", type="source")` and was immediately hit with gcc 7 compilation errors. This seemed odd, but switching back to clang worked fine. I then proceeded to run chunks in an Rmd I’m working on and hit “Encoding”…
Original Post: Engaging the tidyverse Clean Slate Protocol

R⁶ — Scraping Images To PDFs

I’ve been doing intermittent prep work for a follow-up to an earlier post on store closings and came across this CNN Money “article” on it. Said “article” is a deliberately obfuscated or lazily crafted series of GIF images that contain all the Radio Shack impending store closings. It’s the most comprehensive list I’ve found, but the format is terrible and there’s no easy, in-browser way to download them all. CNN has ToS that prevent automated data gathering from CNN-proper. But, they used Adobe Document Cloud for these images which has no similar restrictions from a quick glance at their ToS. That means you get an R⁶ post on how to grab the individual 38 images and combine them into one PDF. I did this all with the hopes of OCRing the text, which has not panned out too well since…
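The grab-and-combine step described above can be sketched with the magick package; the URL pattern below is a placeholder, not the actual Adobe Document Cloud links:

```r
library(magick)

# hypothetical image URLs -- the real Document Cloud links are not shown here
img_urls <- sprintf("https://example.com/store-closings-%02d.gif", 1:38)

# read each GIF and concatenate them into a single magick image vector
pages <- Reduce(c, lapply(img_urls, image_read))

# write the stack out as one multi-page PDF, one image per page
image_write(pages, path = "radio-shack-closings.pdf", format = "pdf")
```

From there an OCR pass (e.g. via the tesseract package) is the natural next step, which is where the post notes things stopped panning out.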
Original Post: R⁶ — Scraping Images To PDFs

Drilling Into CSVs — Teaser Trailer

I used reading a directory of CSVs as the foundational example in my recent post on idioms. During my exchange with Matt, Hadley and a few others — in the crazy Twitter thread that spawned said post — I mentioned that I’d personally “just use Drill”. I’ll use this post as a bit of a teaser trailer for the actual post (or, more likely, series of posts) that goes into detail on where to get Apache Drill, basic setup of Drill for standalone workstation use and then organizing data with it. You can get ahead of those posts by doing two things: Download, install and test your Apache Drill setup (it’s literally 10 minutes on any platform) Review the U.S. EPA annual air quality data archive (they have individual, annual CSVs that are perfect for the example) My goals for…
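As a preview of the workflow those posts will cover, querying a directory of CSVs from R via sergeant might look something like this (the `dfs` workspace and path are placeholders for wherever you stage the EPA files):

```r
library(sergeant)

# connect to a standalone Drill instance running on this workstation
dc <- drill_connection("localhost")

# Drill treats a directory of CSVs as one queryable table;
# "dfs.tmp" and the glob below are placeholder locations
drill_query(dc, "SELECT COUNT(*) FROM dfs.tmp.`/epa/annual/*.csv`")
```

No ingest step is required: Drill reads the files in place, which is much of the appeal over slurping each CSV into R first.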
Original Post: Drilling Into CSVs — Teaser Trailer

L.A. Unconf-idential : a.k.a. an rOpenSci #runconf17 Retrospective

Last year, I was able to sit back and lazily “RT” Julia Silge’s excellent retrospective on her 2016 @rOpenSci “unconference” experience. Since Julia was not there this year, and the unconference experience is still in primary storage (LMD v2.0 was a success!) I thought this would be the perfect time for a mindful look-back. And Now, A Word From… Hosting a conference is an expensive endeavour. These organizations made the event possible: At most “conferences” you are inundated with advertising from event sponsors. These folks provided resources and said “do good work”. That makes them all pretty amazing but is also an indicator of the awesomeness of this particular unconference. All For “Un” and “Un” For All Over the years, I’ve become much less appreciative of “talking heads” events. Don’t get me wrong. There’s great benefit in being part of…
Original Post: L.A. Unconf-idential : a.k.a. an rOpenSci #runconf17 Retrospective

R⁶ — Idiomatic (for the People)

NOTE: I’ll do my best to ensure the next post will have nothing to do with Twitter, and this post might not completely meet my R⁶ criteria. A single, altruistic, nigh exuberant R tweet about slurping up a directory of CSVs devolved quickly — at least in my opinion, and partly (sadly) with my aid — into a thread that ultimately strayed from a crucial point: idiomatic is in the eye of the beholder. I’m not linking to the twitter thread, but there are enough folks with sufficient Klout scores on it (is Klout even still a thing?) that you can easily find it if you feel so compelled. I’ll take a page out of the U.S. High School “write an essay” playbook and start with a definition of idiomatic: using, containing, or denoting expressions that are natural to a…
Original Post: R⁶ — Idiomatic (for the People)

A Very Palette-able Post

Many of my posts seem to begin with a link to a tweet, and this one falls into that pattern: And @_inundata is already working on a #rstats palette. https://t.co/bNfpL7OmVl — Timothée Poisot (@tpoi) May 21, 2017 I’d seen the Ars Tech post about the named color palette derived from some training data. I could tell at a glance of the resultant palette that it would not be ideal for visualizations (use this site to test the final image in this post and verify that on your own) but this was a neat, quick project to take on, especially since it let me dust off an old GH package, adobecolor, and it was likely I could beat Karthik to creating a palette 😉 The “B+” goal is to get a color palette that “matches” the one in the Tumblr post. The…
Original Post: A Very Palette-able Post

R⁶ — Using R With Amazon Athena & AWS Temporary Security Credentials

Most of the examples of working with most of the AWS services show basic username & password authentication. That’s all well-and-good, but many shops use the AWS Security Token Service to provide temporary credentials and session tokens to limit exposure and provide more uniform multi-factor authentication. At my workplace, Frank Mitchell created a nice electron app to make it super easy to create and re-up these credentials. The downside of this is that all AWS service usage for work requires using these credentials and I was having the darndest time trying to get Athena’s JDBC driver working with it (but I wasn’t spending a lot of time on it as I tend to mirror research data to a local, beefy Apache Drill server). I finally noticed the com.amazonaws.athena.jdbc.shaded.com.amazonaws.auth.EnvironmentVariableCredentialsProvider class and decided to give the following a go (you will need to…
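The env-var approach boils down to exporting the temporary STS credentials before opening the JDBC connection, so the provider class named above can find them. A rough sketch, with the caveats that the driver class name, jar location, region, and staging bucket are all assumptions for illustration:

```r
library(RJDBC)

# the temporary STS credentials go into the environment, where the
# EnvironmentVariableCredentialsProvider will pick them up;
# all three values below are placeholders
Sys.setenv(
  AWS_ACCESS_KEY_ID     = "ASIA...placeholder",
  AWS_SECRET_ACCESS_KEY = "placeholder",
  AWS_SESSION_TOKEN     = "placeholder"
)

# driver class and jar filename are assumptions about the
# era-appropriate Athena JDBC driver
drv <- JDBC("com.amazonaws.athena.jdbc.AthenaDriver", "AthenaJDBC41.jar")

con <- dbConnect(
  drv,
  "jdbc:awsathena://athena.us-east-1.amazonaws.com:443/",
  s3_staging_dir = "s3://your-query-results-bucket/",   # placeholder bucket
  aws_credentials_provider_class =
    "com.amazonaws.athena.jdbc.shaded.com.amazonaws.auth.EnvironmentVariableCredentialsProvider"
)
```

The key line is `aws_credentials_provider_class`: it tells the driver to read the key, secret, and session token from the environment instead of expecting a static username/password pair.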
Original Post: R⁶ — Using R With Amazon Athena & AWS Temporary Security Credentials