Search Logger

Google at ACL 2011

7:00 pm - May 18, 2011 in Google Research Blog


The Annual Meeting of the Association for Computational Linguistics is one of the premier conferences for language and text technologies. Many employees at Google have strong roots in the community of researchers who attend this meeting, including many of our researchers working on machine translation and speech.

At this year’s conference, Google is particularly well represented. The General Chair is Dekang Lin and a few Googlers are serving as technical Area Chairs (in addition to the plethora of Googlers who reviewed papers for the conference). Google is also a Platinum Sponsor of ACL this year.

Research advances at Google can be seen throughout the conference’s technical content. Below is a complete list of Googler-authored or co-authored papers in the main conference. We want to give special emphasis to this year’s best paper award, given to “Unsupervised Part-of-Speech Tagging with Bilingual Graph-Based Projections” by CMU graduate student and Google intern Dipanjan Das and his internship advisor Slav Petrov. ACL is an extremely selective conference, and this award speaks volumes about the importance of syntactic analysis and of using bilingual corpora to project syntactic resources from resource-rich languages (like English) to other languages. Congratulations, Dipanjan and Slav!

Googlers are also involved in two of this year’s tutorials. Marius Pasca will present “Web Search Queries as a Corpus” and Kuzman Ganchev and his colleagues will teach about “Rich Prior Knowledge in Learning for Natural Language Processing”. Finally, Katja Filippova and her colleagues are running a workshop on “Monolingual Text-to-Text Generation”.

ACL will take place this year in Portland from June 19th to June 24th.

Papers by Googlers (a * indicates a paper that will be linked to after the conference):

Ranking Class Labels Using Query Sessions*
Marius Pasca

Fine-Grained Class Label Markup of Search Queries*
Joseph Reisinger and Marius Pasca

Unsupervised Part-of-Speech Tagging with Bilingual Graph-Based Projections
Dipanjan Das and Slav Petrov

Large-Scale Cross-Document Coreference Using Distributed Inference and Hierarchical Models
Sameer Singh, Amarnag Subramanya, Fernando Pereira and Andrew McCallum

Piggyback: Using Search Engines for Robust Cross-Domain Named Entity Recognition
Stefan Rüd, Massimiliano Ciaramita, Jens Müller and Hinrich Schütze

Beam-Width Prediction for Efficient Context-Free Parsing
Nathan Bodenstab, Aaron Dunlop, Keith Hall and Brian Roark

Language-independent Compound Splitting with Morphological Operations
Klaus Macherey, Andrew Dai, David Talbot, Ashok Popat and Franz Och

Model-Based Aligner Combination Using Dual Decomposition
John DeNero and Klaus Macherey

Binarized Forest to String Translation
Hao Zhang, Licheng Fang, Peng Xu and Xiaoyun Wu

Semi-supervised Latent Variable Models for Fine-grained Sentiment Analysis
Oscar Täckström and Ryan McDonald
 

Google Scribe: Now with automatic text for links and faster formatting options

1:00 pm - May 26, 2011 in Google Research Blog


Since Google Scribe's first release on Google Labs last year, we have been poring over your feedback and busy adding the top features you asked for. Today, we're excited to announce a new version of Google Scribe that brings more features to word processing.

Besides formatting, Google Scribe provides features that help you author high quality documents quickly:
  1. Automatic text for links
    Adding a hyperlink to your document has been a two-step process of choosing the link and the text to display for it. Google Scribe now makes it easier. Just paste or type any link into your document and Google Scribe will set an appropriate link text.

  2. Smart toolbar
    Do you repeatedly spend time reaching for the toolbar to format your document? To speed up formatting, Google Scribe now displays an abridged toolbar close by when you select a portion of the document.



  3. Text completion in 12 languages
    Google Scribe auto-completes text as you type. In addition to saving keystrokes, the suggestions indicate correct or popular phrases to use. Google Scribe now auto-detects document language, so you no longer need to choose a language.


    You can view other applicable suggestions by clicking on the options button next to the Google Scribe icon and choosing “Show Multiple Suggestions”.


    We have extended auto-complete support to Arabic, Dutch, French, German, Hungarian, Italian, Polish, Portuguese, Russian, Spanish and Swedish, in addition to English, which we already supported.

  4. Correct your document as you type
    Google Scribe now has basic support for checking spelling, punctuation and phrases in your document. Google Scribe underlines incorrect usage, and clicking on underlined words or phrases displays a menu of suggested corrections to choose from.


    We are continuously working on expanding the list of proofreading features. Stay tuned.
Try out the new Google Scribe at scribe.googlelabs.com and let us know what you think.
 

Instant Mix for Music Beta by Google

4:16 pm - June 8, 2011 in Google Research Blog


Music Beta by Google was announced at the Day One Keynote of Google I/O 2011. This service allows users to stream their music collections from the cloud to any supported device, including a web browser. It’s a first step in creating a platform that gives users a range of compelling music experiences. One key component of the product, Instant Mix, is a playlist generator developed by Google Research. Instant Mix uses machine hearing to extract attributes from audio which can be used to answer questions such as “Is there a Hammond B-3 organ?” (instrumentation / timbre), “Is it angry?” (mood), “Can I jog to it?” (tempo / meter) and so on. Machine learning algorithms relate these audio features to what we know about music on the web, such as the fact that Jimmy Smith is a jazz organist or that Arcade Fire and Wolf Parade are similar artists. From this we can predict similar tracks for a seed track and, with some additional sequencing logic, generate Instant Mix playlists from songs in a user’s locker.

Because we combine audio analysis with information about which artists and albums go well together, we can use both dimensions of similarity to compare songs. If you pick a mellow track from an album, we will make a mellower playlist than if you pick a high energy track from the same album. For example, here we compare short Instant Mixes made from two very different tracks by U2. The first Instant Mix comes from "Mysterious Ways," an upbeat, danceable track from Achtung Baby with electric guitar and heavy percussion.


  1. U2 "Mysterious Ways"
  2. David Bowie "Fame"
  3. Oingo Boingo "Gratitude"
  4. Infectious Grooves "Spreck"
  5. Red Hot Chili Peppers "Special Secret Song Inside"
Compare this to a short Instant Mix made from a much more laid-back U2 cut, "MLK" from the album The Unforgettable Fire. This track has delicate vocals on top of a sparse synthesizer background and no percussion.


  1. U2 "MLK"
  2. Jewel "Don't"
  3. Antony and the Johnsons "What Can I Do?"
  4. The Beatles "And I Love Her"
  5. Van Morrison "Crazy Love"
As you can hear, the “Mysterious Ways” Instant Mix is funky, with strong percussion and high-energy vocals, while the “MLK” mix carries on with that track's laid-back lullaby feeling.
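
The idea of comparing songs along two dimensions at once (how they sound, and which artists go well together) could be sketched roughly as follows. This is only an illustration and not Instant Mix's actual model: the audio feature vectors, the `artist_similarity` table, the `audio_weight` parameter and both function names are assumptions made up for this example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_similar_tracks(seed, locker, artist_similarity, audio_weight=0.5):
    """Rank locker tracks by a weighted blend of audio and artist similarity.

    seed / locker entries: dicts with "title", "artist" and "audio" (a list
    of numbers standing in for extracted audio attributes; hypothetical).
    artist_similarity: dict mapping (artist_a, artist_b) to a score in [0, 1].
    """
    scored = []
    for track in locker:
        if track["title"] == seed["title"]:
            continue  # do not recommend the seed itself
        audio = cosine_similarity(seed["audio"], track["audio"])
        artist = artist_similarity.get((seed["artist"], track["artist"]), 0.0)
        blended = audio_weight * audio + (1.0 - audio_weight) * artist
        scored.append((blended, track["title"]))
    # Highest blended score first.
    return [title for _, title in sorted(scored, reverse=True)]


# Made-up two-dimensional "audio" vectors: [energy, mellowness].
locker = [
    {"title": "seed", "artist": "X", "audio": [1.0, 0.0]},
    {"title": "upbeat", "artist": "Y", "audio": [0.9, 0.1]},
    {"title": "mellow", "artist": "Z", "audio": [0.0, 1.0]},
]
print(rank_similar_tracks(locker[0], locker, artist_similarity={}))
```

With a high-energy seed, the high-energy track ranks ahead of the mellow one; raising or lowering `audio_weight` shifts how much the artist-level signal counts.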

Our approach also allows us to create mixes from music in the long tail. Are you the lead singer in an unknown Dylan cover band? Even if your group is new or otherwise unknown, Instant Mix can still use audio similarity to match your tracks to real Dylan tracks (provided, of course, that you sing like Bob and your band sounds like The Band).

Our goal with Instant Mix is to build awesome playlists from your music collection. We achieve this by using machine learning to blend a wide range of information sources, including features derived from the music audio itself. Though we’re still in beta, and still have a lot of work to do, we believe Instant Mix is a great tool for music discovery that stands out from the crowd. Give it a try!

Further reading by Google Researchers:
Machine Hearing: An Emerging Field
Richard F. Lyon.

Sound Ranking Using Auditory Sparse-Code Representations
Martin Rehn, Richard F. Lyon, Samy Bengio, Thomas C. Walters, Gal Chechik.

Large-Scale Music Annotation and Retrieval: Learning to Rank in Joint Semantic Spaces
Jason Weston, Samy Bengio, Philippe Hamel.
 

Our first round of Google Research Awards for 2011

8:00 am - June 9, 2011 in Google Research Blog


We’ve just finished awarding the latest round of Google Research Awards, which provide funding to full-time faculty working on research in areas of mutual interest with Google. A record number of submissions came in this round, and we are delighted to be funding 112 awards across 21 different focus areas for a total of more than $6.75 million. The subject areas that received the highest level of support were systems and infrastructure, human-computer interaction, Geo/maps and machine learning. Thanks to strong international collaborations, 23% of the funding in this round was awarded to universities outside the U.S.

In prior years, we’ve used this blog post to highlight some of our top-ranked projects, but this year, we’d like to give you an inside look into how we determine the award recipients.

Designating the awards involves a careful and detailed review process. First, we have a set of internal research leads, each a well-known expert in their field, review all the proposals in their area. They assess the proposals on merit, innovation, connection to Google’s products and services and fit with our overall research agenda. The research leads then assign several volunteer reviewers—culled from experts on their team or other Google engineers holding PhDs—to weigh each proposal.

All these reviews are recorded in an internal grant administration system, and the research leads make their funding recommendations. These recommendations are aggregated, and a series of committee meetings is held, one for each research area. The research lead attends, along with members of the university relations team and executives in research. This committee reviews each proposal that the research lead has recommended for funding, using the same criteria mentioned above. This additional review process may change the proposal rankings and sometimes brings back other proposals for reconsideration.

Once the committee meetings are complete, we make the final funding decisions, which are based on the available budget and balancing the funding across research areas and geographic regions. The final decisions are reviewed one last time by research management, and then we distribute the awards to the selected faculty.

As the number of submissions for these research awards continues to grow, we remain committed to a merit-based review process with effective checks and balances. Congratulations to the well-deserving recipients of this round’s awards, and if you are interested in applying for the next round (deadline is August 1), please visit our website for more information.
 

Google at CVPR 2011

4:00 pm - June 16, 2011 in Google Research Blog


The computer vision community will get together in Colorado Springs the week of June 20th for the IEEE International Conference on Computer Vision and Pattern Recognition (CVPR 2011). This year will see a record number of people attending the conference and 27 co-located workshops and tutorials. Registration closed at 1,500 attendees before the conference even started.

Computer Vision is at the core of many Google products, such as Image Search, YouTube, Street View, Picasa, and Goggles, and as always, Google is involved in several ways with CVPR. Andrew Senior is serving as an area chair of CVPR 2011 and many Googlers are reviewers. Googlers also co-authored these papers:


If you are attending the conference, stop by Google’s exhibition booth. In addition to talking with Google researchers, you will get to see examples of exciting computer vision research that has made it into Google products including, among others, the following:

  • Google Earth Facade Shadow Removal by Mei Han, Vivek Kwatra, and Shengyang Dai
    We will demonstrate our technique for removing shadows and other lighting/texture artifacts from building facades in Google Earth. We obtain cleaner, clearer, and more uniform textures which provide users with an improved visual experience.
  • Video Stabilization on YouTube Editor by Matthias Grundmann, Vivek Kwatra, and Irfan Essa
    Casually shot videos captured by handheld or mobile cameras suffer from a significant amount of shake. In contrast, professionally shot video usually employs stabilization equipment such as tripods or camera dollies, and employs ease-in and ease-out for transitions. Our technique mimics these cinematographic principles by optimally dividing the original, shaky camera path into a set of segments and approximating each with either constant, linear or parabolic motion using a computationally efficient and stable algorithm. We will showcase a live version of our algorithm, featuring real-time performance and interactive control, which is publicly available at youtube.com/editor.
  • Tag Suggest for YouTube by George Toderici and Mehmet Emre Sargin
    YouTube offers millions of users the opportunity to upload videos and share them with their friends. Many users would love to have their videos discoverable but don't annotate them properly. One new feature on YouTube that seeks to address this problem is tag prediction based on video content and, independently, on text metadata.

6/17/2011 UPDATE: "Posted by" was changed to include Sergey Ioffe.
 

Auto-Directed Video Stabilization with Robust L1 Optimal Camera Paths

10:00 am - June 20, 2011 in Google Research Blog


Earlier this year, we announced the launch of new features on the YouTube Video Editor, including stabilization for shaky videos, with the ability to preview them in real-time. The core technology behind this feature is detailed in this paper, which will be presented at the IEEE International Conference on Computer Vision and Pattern Recognition (CVPR 2011).

Casually shot videos captured by handheld or mobile cameras suffer from a significant amount of shake. Existing in-camera stabilization methods dampen high-frequency jitter but do not suppress low-frequency movements and bounces, such as those observed in videos captured by a walking person. On the other hand, most professionally shot videos consist of carefully designed camera configurations, using specialized equipment such as tripods or camera dollies, and employ ease-in and ease-out for transitions. Our goal was to devise a completely automatic method for converting casual shaky footage into more pleasant and professional-looking videos.



Our technique mimics the cinematographic principles outlined above by automatically determining the best camera path using a robust optimization technique. The original, shaky camera path is divided into a set of segments, each approximated by either a constant, linear or parabolic motion. Our optimization finds the best of all possible partitions using a computationally efficient and stable algorithm.
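
One way to make the three motion models concrete: a constant path has zero first differences, a linear path has zero second differences, and a parabolic path has zero third differences. The sketch below works on a one-dimensional path, uses hypothetical function names and illustrative unit weights, and simply classifies a segment this way and evaluates an L1 cost over those differences. It is an assumption about the general shape of such an objective, not the paper's solver, which searches for the best path by optimization.

```python
def finite_difference(path, order):
    """Apply the forward difference `order` times to a list of positions."""
    for _ in range(order):
        path = [b - a for a, b in zip(path, path[1:])]
    return path


def l1_path_cost(path, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of absolute 1st, 2nd and 3rd differences.

    The weights are illustrative placeholders, not values from the paper.
    """
    return sum(
        weight * sum(abs(d) for d in finite_difference(path, order))
        for order, weight in enumerate(weights, start=1)
    )


def motion_model(segment, tol=1e-9):
    """Name the simplest model (constant, linear, parabolic) the segment fits."""
    for order, name in ((1, "constant"), (2, "linear"), (3, "parabolic")):
        if all(abs(d) <= tol for d in finite_difference(segment, order)):
            return name
    return "other"


print(motion_model([2, 2, 2, 2]))       # constant
print(motion_model([0, 1, 2, 3]))       # linear
print(motion_model([0, 1, 4, 9, 16]))   # parabolic
```

Because absolute values (rather than squares) are penalized, a minimizer of this kind of cost tends to drive many differences exactly to zero, which is what produces clean constant, linear and parabolic segments instead of a uniformly smoothed curve.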

To achieve real-time performance on the web, we distribute the computation across multiple machines in the cloud. This enables us to provide users with a real-time preview and interactive control of the stabilized result. Above we provide a video demonstration of how to use this feature on the YouTube Editor. We will also demo this live at Google’s exhibition booth at CVPR 2011.

For further details, please read our paper.
 

Google Translate welcomes you to the Indic web

11:30 am - June 21, 2011 in Google Research Blog


(Cross-posted on the Translate Blog and the Official Google Blog)



 

Beginning today, you can explore the linguistic diversity of the Indian subcontinent with Google Translate, which now supports five new experimental alpha languages: Bengali, Gujarati, Kannada, Tamil and Telugu. In India and Bangladesh alone, more than 500 million people speak these five languages. Since 2009, we’ve launched a total of 11 alpha languages, bringing the current number of languages supported by Google Translate to 63.

Indic languages differ from English in many ways, presenting several exciting challenges when developing their respective translation systems. Indian languages often use the Subject Object Verb (SOV) ordering to form sentences, unlike English, which uses Subject Verb Object (SVO) ordering. This difference in sentence structure makes it harder to produce fluent translations; the more words that need to be reordered, the more chance there is to make mistakes when moving them. Tamil, Telugu and Kannada are also highly agglutinative, meaning a single word often includes affixes that represent additional meaning, like tense or number. Fortunately, our research to improve Japanese (an SOV language) translation helped us with the word order challenge, while our work translating languages like German, Turkish and Russian provided insight into the agglutination problem.

You can expect translations for these new alpha languages to be less fluent and include many more untranslated words than some of our more mature languages—like Spanish or Chinese—which have much more of the web content that powers our statistical machine translation approach. Despite these challenges, we release alpha languages when we believe that they help people better access the multilingual web. If you notice incorrect or missing translations for any of our languages, please correct us; we enjoy learning from our mistakes and your feedback helps us graduate new languages from alpha status. If you’re a translator, you’ll also be able to take advantage of our machine translated output when using the Google Translator Toolkit.

Since these languages each have their own unique scripts, we’ve enabled a transliterated input method for those of you without Indian language keyboards. For example, if you type in the word “nandri,” it will generate the Tamil word நன்றி (see what it means). To see all these beautiful scripts in action, you’ll need to install fonts* for each language.

We hope that the launch of these new alpha languages will help you better understand the Indic web and encourage the publication of new content in Indic languages, taking us five alpha steps closer to a web without language barriers.

*Download the fonts for each language: Tamil, Telugu, Bengali, Gujarati and Kannada.
 

Languages of the World (Wide Web)

7:15 pm - July 7, 2011 in Google Research Blog


The web is vast and infinite. Its pages link together in a complex network, containing remarkable structures and patterns. Some of the clearest patterns relate to language.

Most web pages link to other pages on the same web site, and the few off-site links they have are almost always to other pages in the same language. It's as if each language has its own web which is loosely linked to the webs of other languages. However, there is a small but significant number of off-site links between languages. These give tantalizing hints of the world beyond the virtual.

To see the connections between languages, start by taking the several billion most important pages on the web in 2008, including all pages in smaller languages, and look at the off-site links between these pages. The particular choice of pages in our corpus here reflects decisions about what is 'important'. For example, in a language with few pages every page is considered important, while for languages with more pages some selection method is required, based on PageRank for example.

We can use our corpus to draw a very simple graph of the web, with a node for each language and an edge between two languages if more than one percent of the off-site links in the first language land on pages in the second. To make things a little clearer, we only show the languages which have at least a hundred thousand pages and have a strong link with another language, meaning at least 1% of off-site links go to that language. We also leave out English, which we'll discuss more in a moment. (Figure 1)
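
The graph construction described above can be sketched in a few lines. The data layout (`link_counts[src][dst]` holding off-site link counts, `page_counts` holding pages per language), the function name and the sample numbers are assumptions for illustration only; they are not taken from the actual corpus.

```python
def language_edges(link_counts, page_counts, threshold=0.01,
                   min_pages=100_000, exclude=("en",)):
    """Return (source, target, share) edges of the language graph.

    An edge is drawn when more than `threshold` of the source language's
    off-site links land on pages in the target language. Languages with
    fewer than `min_pages` pages, and those in `exclude`, are left out.
    """
    def shown(lang):
        return lang not in exclude and page_counts.get(lang, 0) >= min_pages

    edges = []
    for src, targets in link_counts.items():
        if not shown(src):
            continue
        # Denominator: all off-site links, same-language ones included.
        total = sum(targets.values())
        if total == 0:
            continue
        for dst, count in targets.items():
            if dst == src or not shown(dst):
                continue
            share = count / total
            if share > threshold:
                edges.append((src, dst, share))
    return sorted(edges)


# Made-up counts, purely to exercise the function.
page_counts = {"uk": 500_000, "ru": 9_000_000, "pl": 2_000_000, "en": 90_000_000}
link_counts = {
    "uk": {"uk": 900, "ru": 80, "en": 200, "pl": 5},
    "ru": {"ru": 5000, "uk": 10, "en": 900},
}
for src, dst, share in language_edges(link_counts, page_counts):
    print(f"{src} -> {dst}: {share:.1%}")
```

Note that the edges are directed: in this toy data the first language links strongly to the second, but not the other way around.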

Looking at the language web in 2008, we see a surprisingly clear map of Europe and Asia.
The language linkages invite explanations around geopolitics, linguistics, and historical associations.


Figure 1: Language links on the web. 

The outlines of the Iberian and Scandinavian Peninsulas are clearly visible, which suggests geographic rather than purely linguistic associations.

Examining links between other languages, it seems that many are explained by people and communities which speak both languages.

The language webs of many former Soviet republics link back to the Russian web, with the strongest link from Ukrainian. While Russia is the major importer of Ukrainian products, the bilingual nature of Ukraine is a more plausible explanation. Most Ukrainians speak both languages, and Russian is even the dominant language in large parts of the country.

The link from Arabic to French speaks to the long connection between France and its former colonies. In many of these countries Arabic and French are now commonly spoken together, and there has been significant emigration from these countries to France.

Another strong link is between the Malay/Malaysian and Indonesian webs. Malaysia and Indonesia share a border, but more importantly the languages are nearly eighty percent cognate, meaning speakers of one can easily understand the other.

What about the sizes of each language web? Both the number of sites in each language and the number of URLs seen by Google's crawler follow an exponential distribution, although the ordering for each is slightly different (Figure 2). The exact number of pages in each language in 2008 is unknown, since multiple URLs may point to the same page and some pages may not have been seen at all. However, the language of an un-crawled URL can be guessed by the dominant language of its site. In fact, calendar pages and other infinite spaces mean that there really are an unlimited number of pages on the web, though some are more useful than others.

Figure 2: The number of sites and seen URLs per language are roughly exponentially distributed.

The largest language on the web, in terms of size and centrality, has always been English, but where is it on our map?

Every language on the web has strong links to English, usually with around twenty percent of off-site links and occasionally over forty-five percent, such as from Tagalog/Filipino, spoken in the Philippines, and Urdu, principally spoken in Pakistan (Figure 3). Both the Philippines and Pakistan are former British colonies where English is one of the two official languages.

Figure 3: Language links to and from English 

You might wonder whether off-site links landing on English pages can be explained simply by the number of English pages available to be linked to. The webs of other languages in our corpus typically have sixty to eighty percent of their out-language links to English pages. However, only 38 percent of the pages and 42 percent of sites in our set are English, while English attracts 79 percent of all out-language links from other languages.

Chinese and Japanese also seem unusual because there are relatively few links from pages in these languages to pages in English. This is despite the fact that Japanese and Chinese sites are the most popular non-English sites for English sites to link to. However, the number of sites in a language is a strong predictor of its 'introversion', or fraction of off-site links to pages in the same language. Taking this into account shows that Chinese and Japanese webs are not unusually introverted given their size. In general, language webs with more sites are more introverted, perhaps due to better availability of content. (Figure 4)

Figure 4: Language size vs introversion. 

There is a roughly linear relationship between the (log) number of sites in a language and the fraction of off-site links which point to pages in the same language, with a correlation of 0.9 if English is removed. However, only 45 percent of off-site links from English pages are to other English pages, making English the most extroverted web language given its size. Other notable outliers are the Hindi web, which is unusually introverted, and the Tagalog and Malay webs which are unusually extroverted.
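
The two quantities in this paragraph are straightforward to compute. The sketch below assumes the same hypothetical `link_counts[src][dst]` layout of off-site link counts and uses made-up numbers; the function names are illustrative.

```python
import math


def introversion(link_counts, lang):
    """Fraction of a language's off-site links that stay in that language."""
    targets = link_counts[lang]
    total = sum(targets.values())
    return targets.get(lang, 0) / total if total else 0.0


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Made-up data: site counts and off-site link counts for three languages.
site_counts = {"a": 1_000, "b": 100_000, "c": 10_000_000}
link_counts = {
    "a": {"a": 20, "b": 30, "c": 50},
    "b": {"a": 5, "b": 50, "c": 45},
    "c": {"a": 1, "b": 19, "c": 80},
}
langs = sorted(site_counts)
log_sites = [math.log10(site_counts[lang]) for lang in langs]
fractions = [introversion(link_counts, lang) for lang in langs]
print(pearson(log_sites, fractions))
```

Taking the logarithm of the site count before correlating matters: site counts span several orders of magnitude, and the relationship reported here is linear in the log, not in the raw count.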

We can generate another map by connecting languages if the number of links from one to the other is 50 times greater than expected given the number of out-of-language links and the size of the language linked to (Figure 5). This time, the native languages of India show up clearly. Surprising links include those from Hindi to Ukrainian, Kurdish to Swedish, Swahili to Tagalog and Bengali, and Esperanto to Polish.
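
The post does not spell out how the "expected" number of links is computed. A minimal sketch, assuming that a language's out-of-language links would otherwise be spread over the other languages in proportion to their number of sites, might look like this. The baseline, the data layout, the function name and the numbers are all assumptions for illustration.

```python
def unexpected_links(link_counts, site_counts, factor=50.0):
    """Return (source, target) pairs with far more links than size predicts.

    Assumed baseline: a source language's out-of-language links are
    distributed across the other languages in proportion to their site
    counts. A pair is reported when the observed count exceeds `factor`
    times that expectation.
    """
    pairs = []
    for src, targets in link_counts.items():
        out_total = sum(n for dst, n in targets.items() if dst != src)
        other_sites = sum(s for lang, s in site_counts.items() if lang != src)
        if out_total == 0 or other_sites == 0:
            continue
        for dst, observed in targets.items():
            if dst == src:
                continue
            expected = out_total * site_counts.get(dst, 0) / other_sites
            if expected > 0 and observed > factor * expected:
                pairs.append((src, dst))
    return sorted(pairs)


# Made-up numbers: "b" is tiny, yet receives most of "a"'s outgoing links.
site_counts = {"a": 1_000, "b": 10, "c": 100_000}
link_counts = {"a": {"a": 500, "b": 60, "c": 40}}
print(unexpected_links(link_counts, site_counts))
```

Normalizing by the target's size is what lets small languages show up on this map: a handful of links to a tiny language web can be far more surprising than thousands of links to a huge one.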

Figure 5: Unexpected connections, given the size of each language. 

What's happened since 2008? The languages of the web have become more densely connected. There is now significant content in even more languages, and these languages are more closely linked. We hope that tools like Google page translation, voice translation, and other services will accelerate this process and bring more people in the world closer together, whichever languages they speak.


UPDATE 9 July 2011: As has been pointed out in the comments, in both the Philippines and Pakistan, English is one of the two official languages; however, the Philippines was not a British colony.
 

What You Capture Is What You Get: A New Way for Task Migration Across Devices

4:45 pm - July 12, 2011 in Google Research Blog


We constantly move from one device to another while carrying out everyday tasks. For example, we might find an interesting article on a desktop computer at work, then bring the article with us on a mobile phone during the commute and keep reading it on a laptop or a TV when we get home. Cloud computing and web applications have made it possible to access the same data and applications on different devices and platforms. However, there are not many ways to easily move tasks across devices that are as intuitive as drag-and-drop in a graphical user interface.

Since last year, our research team has been developing new technologies for users to easily migrate their tasks across devices. In a project named Deep Shot, we demonstrated how a user can easily move web pages and applications, such as Google Maps directions, between a laptop and an Android phone by using the phone camera. With Deep Shot, a user can simply take a picture of their monitor with a phone camera, and the captured content automatically shows up and becomes instantly interactive on the mobile phone.

This project was inspired by our observations that many people tend to take a picture of map directions on the monitor using their mobile phone camera, rather than using other approaches such as email. Taking pictures feels more direct and convenient, and fits well with our everyday activities, which are often opportunistic. Instead of just capturing raw pixels, Deep Shot recovers the actual contents and applications on the mobile phone based on these pixels. You can find out how Deep Shot keeps user interaction simple and what happens behind the scenes here. Similar to WYSIWYG (What You See Is What You Get) for graphical user interfaces, Deep Shot demonstrates WYCIWYG (What You Capture Is What You Get) for cross-device interaction. We are exploring this interaction style for various task migration situations in our everyday life.



Deep Shot remains a research project at Google. With increasing capabilities of mobile phones and fast-growing web applications, we hope to explore more exciting ways to help users carry out their everyday activities.
 

Studies Show Search Ads Drive 89% Incremental Traffic

10:10 am - July 21, 2011 in Google Research Blog


Advertisers often wonder whether search ads cannibalize their organic traffic. In other words, if search ads were paused, would clicks on organic results increase, and make up for the loss in paid traffic? Google statisticians recently ran over 400 studies on paused accounts to answer this question.

In what we call “Search Ads Pause Studies”, our group of researchers observed organic click volume in the absence of search ads. Then they built a statistical model to predict the click volume for given levels of ad spend using spend and organic impression volume as predictors. These models generated estimates for the incremental clicks attributable to search ads (IAC), or in other words, the percentage of paid clicks that are not made up for by organic clicks when search ads are paused.

The results were surprising. On average, the incremental ad clicks percentage across verticals is 89%. This means that a full 89% of the traffic generated by search ads is not replaced by organic clicks when ads are paused. This number was consistently high across verticals. The full study can be found here.
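
As a back-of-the-envelope illustration of the definition, the percentage can be computed from three click counts. This is simplified arithmetic rather than the statistical model used in the studies, and the function name and the numbers below are made up to show how an 89% figure could arise.

```python
def incremental_ad_clicks(paid_clicks, organic_with_ads, organic_without_ads):
    """Share of paid clicks NOT made up for by organic clicks when ads pause.

    paid_clicks:          clicks on search ads while ads were running
    organic_with_ads:     organic clicks over the same period
    organic_without_ads:  organic clicks over a comparable period with ads paused
    """
    if paid_clicks <= 0:
        raise ValueError("paid_clicks must be positive")
    # Organic clicks gained once ads are paused (never negative here).
    recovered = max(organic_without_ads - organic_with_ads, 0)
    return 1.0 - recovered / paid_clicks


# Illustrative numbers: 1,000 paid clicks are lost, organic clicks rise by 110.
print(f"{incremental_ad_clicks(1000, 2000, 2110):.0%}")
```

In this example only 110 of the 1,000 lost paid clicks come back as organic clicks, so the remaining 89% count as incremental.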
 
 
 
 
 
 