R⁶ — Exploring macOS Applications with codesign, Gatekeeper & R

(A general reminder about “R⁶” posts: they are heavy on code examples and minimal on exposition. I try to design them with 2-3 “nuggets” embedded for those who take the time to walk through the code examples on their systems. I’ll always provide further exposition if requested in a comment, so don’t hesitate to ask if something is confusing.) I had to check something on the macOS systems across the abode today and — on a lark — decided to do all the “shell” scripting in R vs bash for a change. After performing the tasks, it occurred to me that not all R users on macOS realize there are hidden gems of information spread across the “boring” parts of the filesystem in SQLite databases. So, I put together a small example that identifies all the top-level apps in /Applications, extracts…
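For a flavor of what the post walks through, here is a minimal sketch (not the post’s actual code): enumerate the apps, shell out to codesign for signature details, and open one of those SQLite “gems” (here, the LaunchServices quarantine-events database) with DBI. The database path is my assumption; the post may poke at different files.

library(DBI)
library(RSQLite)

# enumerate top-level apps in /Applications
apps <- list.files("/Applications", pattern = "\\.app$", full.names = TRUE)

# ask codesign for signing details on one of them (codesign writes to stderr)
system2("codesign", c("-dvv", shQuote(apps[1])), stderr = TRUE)

# one SQLite "hidden gem" (path is an assumption; systems vary)
qdb <- path.expand("~/Library/Preferences/com.apple.LaunchServices.QuarantineEventsV2")
con <- dbConnect(SQLite(), qdb)
dbListTables(con)
dbDisconnect(con)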

R⁶ — Reticulating Parquet Files

The reticulate package provides a very clean & concise interface bridge between R and Python, which makes it handy for working with modules that have yet to be ported to R (going native is always better when you can do it). This post shows how to create parquet files directly from R, using reticulate as a bridge to the pyarrow module, which can natively create parquet files. Now, you can create parquet files through R with Apache Drill — and I’ll provide another example of that here — but you may need to generate such files without the ability to run Drill. The Python parquet process is pretty simple, since you can convert a pandas DataFrame directly to a pyarrow Table which can be written out in parquet format with…
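The core of the approach looks roughly like this; a sketch, assuming pyarrow and pandas are installed in the Python that reticulate finds:

library(reticulate)

pa <- import("pyarrow")
pq <- import("pyarrow.parquet")

# r_to_py() hands the data frame over as a pandas DataFrame; pyarrow turns
# that into an Arrow Table, and write_table() emits it as a parquet file
df  <- r_to_py(mtcars)
tbl <- pa$Table$from_pandas(df)
pq$write_table(tbl, "mtcars.parquet")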

Analyzing “Wait-Delay” Settings in Common Crawl robots.txt Data with R

One of my tweets that referenced an excellent post about the ethics of web scraping garnered some interest: Apologies for a Medium link but if you do ANY web scraping, you need to read this #rstats // Ethics in Web Scraping https://t.co/y5YxvzB8Fd — boB Rudis (@hrbrmstr) July 26, 2017 If you load up that tweet and follow the thread, you’ll see a really good question by @kennethrose82 regarding what an appropriate setting should be for a delay between crawls. The answer is a bit nuanced, as there are some written and unwritten “rules” for those who would seek to scrape web site content. For the sake of brevity in this post, we’ll only focus on “best practices” (ugh) for being kind to web site resources when it comes to timing requests, after a quick mention that “Step 0”…
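The post digs into the Common Crawl corpus at scale, but as a small, single-site illustration of reading Crawl-delay directives, here is a sketch using the robotstxt package (my stand-in; not necessarily what the post uses):

library(robotstxt)

# fetch & parse one site's robots.txt, then inspect any Crawl-delay rules
rt <- robotstxt(domain = "rud.is")
rt$crawl_delay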

Reading PCAP Files with Apache Drill and the sergeant R Package

It’s no secret that I’m a fan of Apache Drill. One big strength of the platform is that it normalizes access to diverse data sources down to ANSI SQL calls, which means that I can pull data from parquet, Hive, HBase, Kudu, CSV, JSON, MongoDB and MariaDB with the same SQL syntax. This also means that I get access to all those platforms in R centrally through the sergeant package that rests atop d[b]plyr. However, it further means that when support for a new file type is added, I get that same functionality without any extra effort. Why am I calling this out? Well, the intrepid Drill developers are in the process of finalizing the release candidate for version 1.11.0, and one feature they’ve added is the ability to query individual PCAP files and entire directories full of them from…
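Once that release lands, querying a capture from R should look something like this sketch (the dfs.pcaps workspace, file name and column name are assumptions for illustration, not the final Drill semantics):

library(sergeant)
library(dplyr)

# assumes Drill >= 1.11.0 on localhost with a (hypothetical) dfs.pcaps
# workspace pointed at a directory of capture files
db <- src_drill("localhost")

tbl(db, "dfs.pcaps.`capture.pcap`") %>%
  count(src_ip, sort = TRUE) %>%
  collect()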

R⁶ — General (Attys) Distributions

Matt @stiles is a spiffy data journalist at the @latimes and he posted an interesting chart on U.S. Attorneys General longevity (given that the current US AG is on thin ice): Only Watergate and the Civil War have prompted shorter tenures as AG (if Sessions were to leave now). A daily viz: https://t.co/aJ4KDsC5kC pic.twitter.com/ZoiEV3MhGp — Matt Stiles (@stiles) July 25, 2017 I thought it would be neat (since Matt did the data scraping part already) to look at AG tenure distribution by party, while also pointing out where Sessions falls. Now, while Matt did scrape the data, it’s tucked away in a JavaScript variable in an iframe on the page that contains his viz. It’s still easier to get it from there vs re-scraping Wikipedia (like Matt did) thanks to the V8 package by @opencpu. The following code: grabs…
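That excerpt cuts off before the code itself, so as a stand-in, the general V8 pattern (pull the page, find the script node that defines the data variable, evaluate it in an embedded JavaScript context, fetch the result back into R) looks like this sketch; the URL and variable name are placeholders, not the ones from the post:

library(rvest)
library(V8)

pg <- read_html("https://example.com/ag-tenure-iframe")  # placeholder URL

# grab the <script> text that defines the data variable (name hypothetical)
js <- pg %>%
  html_nodes("script") %>%
  html_text() %>%
  grep("var agData", ., value = TRUE)

ctx <- v8()
ctx$eval(js)
ag <- ctx$get("agData")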

Ten-HUT! The Apache Drill R interface package — sergeant — is now on CRAN

I’m extremely pleased to announce that the sergeant package is now on CRAN or will be hitting your local CRAN mirror soon. sergeant provides JDBC, DBI and dplyr/dbplyr interfaces to Apache Drill. I’ve also wrapped a few goodies into the dplyr custom functions that work with Drill, and if you have Drill UDFs that don’t work “out of the box” with sergeant‘s dplyr interface, file an issue and I’ll make a special one for it in the package. I’ve written about Drill on the blog before, so check out those posts for some history and stay tuned for more examples. The README should get you started using sergeant and/or Drill (if you aren’t running Drill now, take a look and you’ll likely get hooked). I’d like to take a moment to call out special thanks to Edward Visel for…
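If you want to kick the tires, a minimal session against a local Drill instance (using Drill’s built-in classpath sample data) looks like:

library(sergeant)

dc <- drill_connection("localhost")  # REST API connection
drill_active(dc)                     # is Drill answering?
drill_query(dc, "SELECT full_name, salary FROM cp.`employee.json` LIMIT 5")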

R⁶ — Disproving Approval

I couldn’t let this stand unchallenged: The new Rasmussen Poll, one of the most accurate in the 2016 Election, just out with a Trump 50% Approval Rating. That’s higher than O’s #’s! — Donald J. Trump (@realDonaldTrump) June 18, 2017 Rasmussen makes their Presidential polling data available for both 🍊 & O. Why not compare their ratings from day 1 in office (skipping days that Rasmussen doesn’t poll)?

library(hrbrthemes)
library(rvest)
library(tidyverse)

list(
  Obama = "http://m.rasmussenreports.com/public_content/politics/obama_administration/obama_approval_index_history",
  Trump = "http://m.rasmussenreports.com/public_content/politics/trump_administration/trump_approval_index_history"
) %>%
  map_df(~{
    read_html(.x) %>%
      html_table() %>%
      .[[1]] %>%
      tbl_df() %>%
      select(date = Date, approve = `Total Approve`, disapprove = `Total Disapprove`)
  }, .id = "who") -> ratings

mutate_at(ratings, c("approve", "disapprove"),
          function(x) as.numeric(gsub("%", "", x, fixed = TRUE)) / 100) %>%
  mutate(date = lubridate::dmy(date)) %>%
  filter(!is.na(approve)) %>%
  group_by(who) %>%
  arrange(date) %>%
  mutate(dnum = 1:n()) %>%
  ungroup() %>%
  ggplot(aes(dnum, approve, color = who)) +
  geom_hline(yintercept = 0.5, size = 0.5) +
  geom_point(size = 0.25) +
  scale_y_percent(limits = c(0, 1)) +
  scale_color_manual(name = NULL, values = c("Obama" = "#313695", "Trump" = "#a50026")) +
  labs(x = "Day in office", y = "Approval…

Replicating the Apache Drill ‘Yelp’ Academic Dataset Analysis with sergeant

The Apache Drill folks have a nice walk-through tutorial on how to analyze the Yelp Academic Dataset with Drill. It’s a bit out of date (the current Yelp data set structure is different enough that the tutorial will error out at various points), but it’s a great example of how to work with large, nested JSON files as a SQL data source. By ‘large’ I mean around 4GB of JSON data spread across 5 files. If you have enough memory and want to work with “flattened” versions of the files in R, you could use my ndjson package (there are other JSON “flattener” packages as well, and a new one — corpus::read_ndjson — is even faster than mine, but it fails to read this file). Drill doesn’t necessarily load the entire JSON structure into memory (you can check out…
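For a flavor of the Drill-side approach from R via sergeant (the workspace name is hypothetical; point a dfs workspace at wherever the Yelp JSON lives, and note the file name varies across dataset versions):

library(sergeant)
library(dplyr)

db <- src_drill("localhost")

# hypothetical dfs.yelp workspace over the Yelp academic dataset directory
biz <- tbl(db, "dfs.yelp.`yelp_academic_dataset_business.json`")

biz %>%
  count(state, sort = TRUE) %>%
  collect()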

Keeping Users Safe While Collecting Data

I caught a mention of this project by Pete Warden on Four Short Links today. If his name sounds familiar, he’s the creator of the DSTK, an O’Reilly author, and now works at Google. A decidedly clever and decent chap. The project goal is noble: crowdsource a repository of open speech data that researchers can use to make a better world. Said sourcing is done by asking folks to record themselves saying “Yes”, “No” and other short words. As I meandered through the blog post I looked in horror at the URL for the application that does the recording: https://open-speech-commands.appspot.com/. Why would the goal of the project combined with that URL give pause? Read on!

You’ve Got Scams!

Picking up the phone and saying something as simple as ‘Yes’ has been at the heart of a major scam this year. By recording your…

Engaging the tidyverse Clean Slate Protocol

I caught the 0.7.0 release of dplyr on my home CRAN server early Friday morning and immediately set out to install it, since I’m eager to finish up my sergeant package and get it on CRAN. “Tidyverse” upgrades aren’t trivial for me, as I tinker quite a bit with the tidyverse and create packages that depend on various components. The sergeant package provides — amongst other things — a dplyr back-end for Apache Drill, so it has more tidyverse tendrils than other bits of code I maintain. macOS binaries weren’t available yet (it generally takes 24-48 hrs for that) so I did an install.packages("dplyr", type = "source") and was immediately hit with gcc 7 compilation errors. This seemed odd, but switching back to clang worked fine. I then proceeded to run chunks in an Rmd I’m working on and hit “Encoding”…
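For reference, the source-install dance (plus a quick version check afterwards) is just:

# install from source while the binaries catch up, then confirm the version
install.packages("dplyr", type = "source")
packageVersion("dplyr")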