PySpark SQL Cheat Sheet: Big Data in Python

PySpark is the Spark Python API, which exposes the Spark programming model to Python. With it, you can speed up analytic applications. Spark is a good entry point to big data processing, as it has built-in modules for streaming, SQL, machine learning, and graph processing.
Original Post: PySpark SQL Cheat Sheet: Big Data in Python

Actionable Insights: Obliterating BI, Data Warehousing as We Know It

There is strong demand from the business side for quick insights and real-time analytics, but traditional BI and data warehouse architectures lack this real-time capability. Here we discuss real-time analytics architecture in detail.
Original Post: Actionable Insights: Obliterating BI, Data Warehousing as We Know It

Caserta: Big Data Solutions Architects

Seeking a Big Data Solutions Architect to work in small agile teams delivering innovative solutions with the latest in modern architecture across the US, and to participate in and contribute to industry thought leadership by attending conferences, speaking at events, blogging, etc.
Original Post: Caserta: Big Data Solutions Architects

[Webinar] Data Science for Big Data with Anaconda Enterprise, Oct 19

This Team Anaconda webinar, Oct 19, will demonstrate how easily the Anaconda Enterprise data science platform integrates with Hadoop and Spark clusters, giving your data scientists access to the libraries they need and empowering you to extract the most value from your Big Data.
Original Post: [Webinar] Data Science for Big Data with Anaconda Enterprise, Oct 19

Tutorial: Azure Data Lake analytics with R

The Azure Data Lake store is an Apache Hadoop file system compatible with HDFS, hosted and managed in the Azure Cloud. You can store and access the data directly via the API, by connecting the file system to Azure HDInsight services, or via HDFS-compatible open-source applications. And for data science applications, you can also access the data directly from R, as this tutorial explains. To interface with Azure Data Lake, you'll use U-SQL, a SQL-like language extensible using C#. The R Extensions for U-SQL allow you to reference an R script from a U-SQL statement and pass data from Data Lake into the R script. There's a 500 MB limit on the data passed to R, but the basic idea is that you perform the main data munging tasks in U-SQL and then pass the prepared data to R for analysis. With this…
Original Post: Tutorial: Azure Data Lake analytics with R

Announcing dplyrXdf 1.0

I'm delighted to announce the release of version 1.0.0 of the dplyrXdf package. dplyrXdf began as a (relatively speaking) simple backend to dplyr for Microsoft Machine Learning Server/Microsoft R Server's Xdf file format, but has now become a broader suite of tools to ease working with Xdf files. This update to dplyrXdf brings the following new features:

- Support for the new tidyeval framework that powers the current release of dplyr
- Support for Spark and Hadoop clusters, including integration with the sparklyr package to process Hive tables in Spark
- Integration with dplyr to process SQL Server tables in-database
- Simplified handling of parallel processing for grouped data
- Several utility functions for Xdf and file management
- Workarounds for various glitches and unexpected behaviour in MRS and dplyr

Spark, Hadoop and HDFS: new in version 1.0.0 of dplyrXdf is support for Xdf files and datasets stored…
Original Post: Announcing dplyrXdf 1.0