A Call to Tweets (& Blog Posts)!

Way back in July of 2009, the first version of the twitteR package was published to CRAN by Geoff Jentry. Since then it has seen 28 updates, finally breaking the 0.x.y barrier into 1.x.y territory in March of 2013 and receiving its last update in July of 2015. For a very long time, the twitteR package was the way to siphon precious nuggets of 140-character data from that platform and is the top hit when one searches for r twitter package. It even ha[sd] its own mailing list and is quite popular, judging by RStudio’s CRAN logs total downloads stats. I blog today to suggest there is a better way to work with Twitter data from R, especially if your central use case is searching Twitter and mining tweet data. This new way is rtweet by Michael Kearney. It…
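If searching and mining is indeed your use case, the core rtweet workflow is compact. A minimal sketch, assuming you have already worked through rtweet’s OAuth token setup (the query and counts below are just placeholders):

library(rtweet)

# grab a small batch of recent #rstats tweets, skipping retweets
rstats <- search_tweets("#rstats", n = 200, include_rts = FALSE)

# the result is a tidy data frame, so normal dplyr verbs apply from here
head(rstats$text)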
Original Post: A Call to Tweets (& Blog Posts)!

Enabling Concerned Visitors & Ethical Security Researchers with security.txt Web Security Policies (plus analyze them at-scale with R)

I’ve blogged a bit about robots.txt — the rules file that documents a site’s “robots exclusion” standard, instructing web crawlers what they can and cannot do (and how frequently they should do things when they are allowed to). This is a well-known and well-defined standard, but it’s not mandatory and often ignored by crawlers and content owners alike. There’s an emerging IETF draft for a different type of site metadata that content owners should absolutely consider adopting. This one defines “web security policies” for a given site and has much in common with the robots exclusion standard, including the name (security.txt) and format (policy directives are defined with simple syntax — see Chapter 5 of the Debian Policy Manual). One core difference is that this file is intended for humans. If you are a general user and visit a…
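Even without a dedicated package, grabbing and inspecting one of these files from R is straightforward. A minimal sketch with httr (the example.com URL is a placeholder):

library(httr)

res <- GET("https://example.com/.well-known/security.txt")
if (status_code(res) == 200) {
  txt <- content(res, as = "text", encoding = "UTF-8")
  # keep the "Field: value" directive lines, dropping comments and blank lines
  grep("^[A-Za-z-]+:", strsplit(txt, "\n")[[1]], value = TRUE)
}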
Original Post: Enabling Concerned Visitors & Ethical Security Researchers with security.txt Web Security Policies (plus analyze them at-scale with R)

Retrieve & process TV News chyrons with newsflash

The Internet Archive recently announced a new service they’ve dubbed ‘Third Eye’. This service scrapes the chyrons that annoyingly scroll across the bottom-third of TV news broadcasts. IA has a vast historical archive of TV news that they’ll eventually process, but — for now — the more recent broadcasts from four channels are readily available. There’s tons of information about the project on its main page, where you can interactively work with the API if that’s how you roll. Since my newsflash🔗 package already had a “news” theme and worked with the joint IA-GDELT project TV data, it seemed like a good home for a Third Eye interface. Basic usage You can read long-form details of the Third Eye service on their site. The TL;DR is that they provide two feeds: a “raw” one which has…
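Pulling the cleaned feed into a data frame from R is about as simple as it gets. A minimal sketch, treating the function name and date argument as assumptions recalled from the package README rather than a definitive interface:

library(newsflash)

# fetch one day of the Third Eye "clean" chyron feed
ch <- read_chyrons("2017-09-30")
head(ch)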
Original Post: Retrieve & process TV News chyrons with newsflash

Identify & Analyze Web Site Tech Stacks With rappalyzer

Modern websites are complex beasts. They house photo galleries, interactive visualizations, web fonts, analytics code and other diverse types of content. Despite the potential for diversity, many web sites share similar “tech stacks” — the components that come together to make them what they are. These stacks consist of web servers (often with special capabilities), cache managers and a wide array of front-end web components. Unless a site goes to great lengths to cloak what it is using, most of these stack components leave a fingerprint — bits and pieces that we can piece together to identify them. Wappalyzer is one tool that we can use to take these fingerprints and match them against a database of known components. If you’re not familiar with that service, go there now and enter the URL of your own blog or…
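To get a feel for how much a site gives away, you do not even need the full fingerprint database — a handful of response headers is often enough. A crude, hand-rolled check with httr (not rappalyzer itself; example.com is a placeholder):

library(httr)

res <- HEAD("https://example.com")
hdrs <- headers(res)

# "server", "x-powered-by" & friends frequently name stack components outright
hdrs[intersect(c("server", "x-powered-by", "via", "x-generator"), names(hdrs))]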
Original Post: Identify & Analyze Web Site Tech Stacks With rappalyzer

SODD — StackOverflow Driven-Development

I occasionally hang out on StackOverflow and often use an answer as an opportunity to fill a package void for a particular need. docxtractr and qrencoder are two (of many) packages that were birthed from SO answers. I usually try to answer with inline code first, then expand the functionality into a package (if warranted). Some make it to CRAN (like those two), others stay on GitHub. This (short) post is about two new ones: webhose🔗 and pigeon🔗. The webhose package is an API interface to https://webhose.io/, which is an interesting service that scrapes the web & “dark web” and provides a short but handy API for retrieving the content using a fairly intuitive query language. The pigeon package is a hastily-hacked-together wrapper around pgn-extract, a cross-platform utility written in C for working with chess game data in…
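For a sense of what the webhose package wraps, here is the underlying REST call made directly with httr — a sketch only, with the endpoint path and parameter names taken as assumptions about the webhose.io web-content API rather than the package’s own interface:

library(httr)

res <- GET(
  "https://webhose.io/filterWebContent",
  query = list(token = Sys.getenv("WEBHOSE_TOKEN"), format = "json", q = "rstats")
)
found <- content(res, as = "parsed")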
Original Post: SODD — StackOverflow Driven-Development

Speeding Up Digital Arachnids

spiderbar, spiderbar
Reads robots rules from afar.
Crawls the web, any size;
Fetches with respect, never lies.
Look out! Here comes the spiderbar.

Is it fast? Listen, bud,
It’s got C++ under the hood.
Can you scrape from a site?
Test with can_fetch(); TRUE == alright.
Hey there, there goes the spiderbar.

(Check the end of the post if you don’t recognize the lyrical riff.) Face front, true believer! I’ve used and blogged about Peter Meissner’s most excellent robotstxt package before. It’s an essential tool for any ethical web scraper. But (there’s always a “but“, right?), it was a definite bottleneck for an unintended package use case earlier this year (yes, I still have not rounded out the corners on my forthcoming “crawl delay” post). I needed something faster for my bulk Crawl-Delay analysis, which led me to this small,…
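The interface mirrors robotstxt’s spirit but leans on the C++ parser for speed. A minimal sketch of the calls name-checked above — robxp() building the parsed object, can_fetch() and crawl_delays() querying it — with the exact robxp() input handling treated as an assumption:

library(spiderbar)

rules <- c("User-agent: *", "Crawl-delay: 5", "Disallow: /private/")

rt <- robxp(rules)

can_fetch(rt, "/private/report.html")  # FALSE — disallowed path
can_fetch(rt, "/index.html")           # TRUE
crawl_delays(rt)                       # per-agent delay table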
Original Post: Speeding Up Digital Arachnids

Pirating Web Content Responsibly With R

International Code Talk Like A Pirate Day almost slipped by without me noticing (September has been a crazy busy month), but it popped up in the calendar notifications today and I was glad that I had prepped the meat of a post a few weeks back. There will be no ‘rrrrrr’ abuse in this post, I’m afraid, but there will be plenty of R code. We’re going to combine pirate day with “pirating” data, in the sense that I’m going to show one way to use the web scraping powers of R responsibly to collect data on and explore modern-day pirate encounters. Scouring The Seas Web For Pirate Data Interestingly enough, there are many sources of pirate data. I’ve blogged a few in the past, but I came across a new (to me) one by the International…
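Responsible scraping boils down to a small, repeatable pattern: check robots.txt first, then fetch slowly. A minimal sketch of that pattern (the URL is a placeholder, not the data source used in the post):

library(robotstxt)
library(rvest)

target <- "https://example.com/pirate-reports"

if (paths_allowed(target)) {
  pg <- read_html(target)
  # html_nodes()/html_table() extraction would go here
  Sys.sleep(5)  # pause between requests out of courtesy to the site
}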
Original Post: Pirating Web Content Responsibly With R

Mapping Fall Foliage with sf

I was socially engineered by @yoniceedee into creating today’s post due to being prodded with this tweet: Where to see the best fall foliage, based on your location: https://t.co/12pQU29ksB pic.twitter.com/JiywYVpmno — Vox (@voxdotcom) September 18, 2017 Since there aren’t nearly enough sf and geom_sf examples out on the wild, wild #rstats web, here’s a short one that shows how to do basic sf operations, including how to plot sf objects in ggplot2 and animate a series of them with magick. I’m hoping someone riffs off of this to make an interactive version with Shiny. If you do, definitely drop a link+note in the comments! Full RStudio project file (with pre-cached data) is on GitHub.

library(rprojroot)
library(sf)
library(magick)
library(tidyverse) # NOTE: Needs github version of ggplot2

root <- find_rstudio_root_file()

# “borrow” the files from SmokyMountains.com, but be nice and cache them…
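The post’s own code is cut off above, but the core geom_sf + magick pattern it builds toward looks roughly like this — a sketch using sf’s bundled North Carolina shapefile instead of the SmokyMountains.com foliage layers:

library(sf)
library(ggplot2)  # geom_sf needed the GitHub version of ggplot2 at the time
library(magick)

nc <- st_read(system.file("shape/nc.shp", package = "sf"), quiet = TRUE)

# draw one frame per fill color into a magick graphics device, then animate
frames <- image_graph(width = 700, height = 500, res = 96)
for (col in c("yellow2", "orange2", "firebrick3")) {
  print(ggplot(nc) + geom_sf(fill = col) + theme_void())
}
dev.off()

image_animate(frames, fps = 1)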
Original Post: Mapping Fall Foliage with sf

It’s a FAKE (📦)! Revisiting Trust In FOSS Ecosystems

I’ve blathered about trust before, but said blatherings were in a “what if” context. Unfortunately, the if has turned into a when, which begged for further blathering on a recent FOSS ecosystem cybersecurity incident. The gg_spiffy @thomasp85 linked to a post by the SK-CSIRT detailing the discovery and take-down of a series of malicious Python packages. Here’s their high-level incident summary: SK-CSIRT identified malicious software libraries in the official Python package repository, PyPI, posing as well known libraries. A prominent example is a fake package urllib-1.21.1.tar.gz, based upon a well known package urllib3-1.21.1.tar.gz. Such packages may have been downloaded by unwitting developer or administrator by various means, including the popular “pip” utility (pip install urllib). There is evidence that the fake packages have indeed been downloaded and incorporated into software multiple times between June 2017 and September 2017. Words are great but, unlike some other…
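The same typosquatting trick would work against the R ecosystem, and a rough defensive check is cheap. An illustration (not from the post): compare what is installed locally against the official CRAN package list and flag names that sit one edit away from a real CRAN name:

cran  <- rownames(available.packages(repos = "https://cloud.r-project.org"))
local <- rownames(installed.packages())

# anything installed that isn't on CRAN (base & GitHub-only packages land here too)
not_cran <- setdiff(local, cran)

# near-misses of genuine CRAN names (edit distance of 1) deserve a second look
suspect <- not_cran[sapply(not_cran, function(p) any(adist(p, cran) == 1))]
suspect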
Original Post: It’s a FAKE (📦)! Revisiting Trust In FOSS Ecosystems

Revisiting Readability With RStudio

I’ve blogged about my in-development R package hgr before and it’s slowly getting to a CRAN release. There are two new features that are more useful in an interactive session than in a programmatic context. Since they build on each other, we’ll take them in order. New S3 print() Method Objects created with hgr::just_the_facts() used to be just list objects. Technically they are still list objects, but they are also classed as hgr objects. The main reason for this was to support the new default print() method. When you print() an hgr object, the new S3 method will extract the $content part of the object and pass it through some htmltools functions to display the readability-enhanced content in a browser (whatever R is configured to use for that on your system…you likely use RStudio if you read…
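The general shape of such a method is tiny — a sketch of the idea, not hgr’s actual source:

print.hgr <- function(x, ...) {
  # render the readability-extracted HTML in whatever browser/viewer R is using
  htmltools::html_print(htmltools::HTML(x$content))
  invisible(x)
}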
Original Post: Revisiting Readability With RStudio