Introducing the CVPR 2018 Learned Image Compression Challenge

Posted by Michele Covell, Research Scientist, Google Research

Image compression is critical to digital photography: without it, a 12-megapixel image would take 36 megabytes of storage, making most websites prohibitively large. While the signal-processing community has significantly improved image compression beyond JPEG (which was introduced in the 1980s) with modern image codecs (e.g., BPG, WebP), many of the techniques used in these modern codecs still rely on the same family of pixel transforms as JPEG. Multiple recent Google projects have advanced the field of image compression with end-to-end machine learning, compression through superresolution, and the creation of perceptually improved JPEG images, but we believe that even greater improvements to image compression can be obtained by bringing this research challenge to the attention of the larger machine learning community.

To encourage progress in this field, Google, in collaboration with ETH and…
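To make the storage figure above concrete, here is a quick back-of-the-envelope calculation; the compression ratio used at the end is an illustrative assumption, not a measured value:

```python
# Why an uncompressed 12-megapixel RGB image needs ~36 MB:
# each pixel stores 3 bytes (one per 8-bit R, G, B channel).
megapixels = 12
bytes_per_pixel = 3

uncompressed_bytes = megapixels * 1_000_000 * bytes_per_pixel
print(f"Uncompressed size: {uncompressed_bytes / 1e6:.0f} MB")  # -> 36 MB

# A typical JPEG compresses such an image by roughly an order of
# magnitude; the 15x ratio below is an assumed, illustrative figure.
assumed_ratio = 15
print(f"Compressed (assumed {assumed_ratio}x): "
      f"{uncompressed_bytes / assumed_ratio / 1e6:.1f} MB")
```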
Original Post: Introducing the CVPR 2018 Learned Image Compression Challenge

Tacotron 2: Generating Human-like Speech from Text

Posted by Jonathan Shen and Ruoming Pang, Software Engineers, on behalf of the Google Brain and Machine Perception Teams

Generating very natural-sounding speech from text (text-to-speech, TTS) has been a research goal for decades. There has been great progress in TTS research over the last few years, and many individual pieces of a complete TTS system have greatly improved. Incorporating ideas from past work such as Tacotron and WaveNet, we added further improvements to arrive at our new system, Tacotron 2. Our approach does not use complex linguistic and acoustic features as input. Instead, we generate human-like speech from text using neural networks trained on only speech examples and the corresponding text transcripts.

A full description of our new system can be found in our paper “Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions.” In a nutshell it works…
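As a rough illustration of the intermediate representation named in the paper title, the sketch below computes a log mel spectrogram of the kind the seq2seq stage predicts and the WaveNet vocoder consumes. It uses librosa; the window, hop, and sample-rate settings are illustrative assumptions, not the paper's published hyperparameters:

```python
# A minimal sketch of extracting mel-spectrogram features, the
# representation that connects Tacotron 2's two stages
# (seq2seq text -> spectrogram, then WaveNet spectrogram -> waveform).
import librosa
import numpy as np

def mel_features(wav_path, sr=24000, n_mels=80):
    y, sr = librosa.load(wav_path, sr=sr)   # load and resample audio
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr,
        n_fft=1024,       # analysis window length (assumed)
        hop_length=256,   # frame shift (assumed)
        n_mels=n_mels)    # number of mel channels (assumed)
    # Log compression, standard for spectrogram prediction targets.
    return np.log(np.clip(mel, a_min=1e-5, a_max=None))

# mel = mel_features("sample.wav")  # shape: (n_mels, n_frames)
```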
Original Post: Tacotron 2: Generating Human-like Speech from Text

Introducing NIMA: Neural Image Assessment

Posted by Hossein Talebi, Software Engineer, and Peyman Milanfar, Research Scientist, Machine Perception

Quantification of image quality and aesthetics has been a long-standing problem in image processing and computer vision. While technical quality assessment deals with measuring pixel-level degradations such as noise, blur, and compression artifacts, aesthetic assessment captures semantic-level characteristics associated with emotions and beauty in images. Recently, deep convolutional neural networks (CNNs) trained with human-labelled data have been used to address the subjective nature of image quality for specific classes of images, such as landscapes. However, these approaches can be limited in their scope, as they typically categorize images into just two classes, low and high quality. Our proposed method instead predicts the distribution of ratings. This leads to more accurate quality predictions with higher correlation to the ground-truth ratings, and it is applicable to general images.

In “NIMA:…
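As a minimal sketch of the "predict a distribution, not a binary class" idea, the snippet below takes a hypothetical softmax output over ten rating buckets (the 1-10 scale used in aesthetic-rating datasets such as AVA) and recovers a mean quality score and its spread; the probabilities are made up for illustration:

```python
# Turn a predicted distribution over ratings 1..10 into a scalar
# quality score (the distribution's mean) and an uncertainty (its std).
import numpy as np

def score_stats(rating_probs):
    """rating_probs: array of 10 probabilities for ratings 1..10."""
    ratings = np.arange(1, 11)
    mean = np.sum(ratings * rating_probs)
    std = np.sqrt(np.sum(((ratings - mean) ** 2) * rating_probs))
    return mean, std

# Hypothetical model output for one image:
probs = np.array([0.01, 0.02, 0.05, 0.10, 0.20,
                  0.25, 0.20, 0.10, 0.05, 0.02])
mean, std = score_stats(probs)
print(f"predicted quality: {mean:.2f} +/- {std:.2f}")
```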
Original Post: Introducing NIMA: Neural Image Assessment

Improving End-to-End Models For Speech Recognition

Posted by Tara N. Sainath, Research Scientist, Speech Team, and Yonghui Wu, Research Scientist, Google Brain Team

Traditional automatic speech recognition (ASR) systems, used for a variety of voice search applications at Google, are composed of an acoustic model (AM), a pronunciation model (PM) and a language model (LM), all of which are independently trained, and often manually designed, on different datasets [1]. The AM takes acoustic features and predicts a set of subword units, typically context-dependent or context-independent phonemes. Next, a hand-designed lexicon (the PM) maps the sequence of phonemes produced by the acoustic model to words. Finally, the LM assigns probabilities to word sequences. Training these components independently adds complexity and is suboptimal compared to training all of them jointly. Over the last several years, there has been growing interest in developing end-to-end systems, which attempt to learn these separate…
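To make the factored pipeline concrete, here is a toy sketch with made-up components: the lexicon, language model, and scores are all illustrative, but the structure mirrors the AM, PM, and LM decomposition described above, in which independently trained scores are simply summed at decoding time:

```python
# Toy decoder over the traditional AM -> PM -> LM factorization.
# All entries and log-probabilities below are illustrative.

# Pronunciation model: a hand-designed lexicon mapping phonemes to words.
LEXICON = {
    ("k", "ae", "t"): "cat",
    ("k", "aa", "t"): "cot",
}

# Language model: unigram log-probabilities (stand-in for a real LM).
LM = {"cat": -1.0, "cot": -3.0}

def decode(am_hypotheses):
    """am_hypotheses: list of (phoneme_tuple, acoustic_log_prob)."""
    best = None
    for phones, am_score in am_hypotheses:
        word = LEXICON.get(phones)   # PM: phonemes -> word
        if word is None:
            continue
        total = am_score + LM[word]  # independently trained scores, summed
        if best is None or total > best[1]:
            best = (word, total)
    return best

# AM output (hypothetical): "cot" is slightly preferred acoustically,
# but the LM prior flips the final decision toward "cat".
print(decode([(("k", "ae", "t"), -2.0), (("k", "aa", "t"), -1.5)]))
# -> ('cat', -3.0)
```

Note how each component is tuned in isolation and their scores are only combined at the end; end-to-end systems instead learn these stages jointly, avoiding that hand coordination.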
Original Post: Improving End-to-End Models For Speech Recognition

A Summary of the First Conference on Robot Learning

Posted by Vincent Vanhoucke, Principal Scientist, Google Brain Team, and Melanie Saldaña, Program Manager, University Relations

Whether in the form of autonomous vehicles, home assistants, or disaster rescue units, robotic systems of the future will need to be able to operate safely and effectively in human-centric environments. In contrast to their industrial counterparts, they will require a very high level of perceptual awareness of the world around them and will need to adapt to continuous changes in both their goals and their environment. Machine learning is a natural answer to both the problem of perception and that of generalization to unseen environments, and with the recent rapid progress in computer vision and learning capabilities, applying these new technologies to the field of robotics has become a central research question.

This past November, Google helped kickstart and host the First Conference on Robot Learning (CoRL)…
Original Post: A Summary of the First Conference on Robot Learning