Google and Bing: a story of non-stolen results for hiybbprqag

Filed under Politics, Technology.

Wow. I must be seriously missing something, because the story about Bing supposedly stealing results from Google just doesn’t make much sense.

I’ve read the entry. I was quite tempted to support Google’s “they’re cheating” scream, but then I stopped and re-read the story more carefully.

So, Google engineers tested the assumption that their search results somehow leak into Bing. To test it, they worked with the long tail: they added a made-up word whose result showed exactly one irrelevant page that couldn’t possibly be associated with that word through any normal means.
Then they gave laptops with Windows, IE8 and the Bing toolbar to engineers, who googled that funky word and clicked on the result. Within a couple of weeks Bing.com started to respond with exactly the same page for that same made-up word. Voilà, Bing must be stealing Google’s data!

Except there’s one very simple explanation. The Bing toolbar reports your “interactions with search engines” to Bing, including the click-stream. And since the test used a word that matched absolutely nothing on the whole internet (until today, when search results for “hiybbprqag” are probably exploding, thanks to all the blog posts), Bing had exactly one relevant source for associating the link with the magic word: a number of clicks from Google engineers, who look like regular users.
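
To make the single-signal argument concrete, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the signal names, the weights, and the example URLs are hypothetical, and nothing here describes how Bing actually ranks pages. It only shows that a signal carrying a 1% weight still decides the outcome when it is the only signal a query has.

```python
from collections import defaultdict

# Hypothetical weights: click-stream is deliberately a tiny hint.
WEIGHTS = {"content_match": 0.60, "inbound_links": 0.39, "clickstream": 0.01}


def rank(signals):
    """Order URLs by weighted score.

    signals: iterable of (url, signal_name, strength), strength in [0, 1].
    """
    scores = defaultdict(float)
    for url, name, strength in signals:
        scores[url] += WEIGHTS[name] * strength
    return sorted(scores, key=scores.get, reverse=True)


# Ordinary query: clicks on b.example barely matter next to other signals.
normal = [
    ("a.example", "content_match", 0.9),
    ("a.example", "inbound_links", 0.8),
    ("b.example", "content_match", 0.2),
    ("b.example", "clickstream", 1.0),
]

# Synthetic query: the click is the only data the engine has.
synthetic = [("honeypot.example", "clickstream", 1.0)]

print(rank(normal))
print(rank(synthetic))
```

In the ordinary query the clicked page still loses to the page with strong content and link signals. In the synthetic query the clicked page is the whole result set, however small its weight.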

Of course, the privacy statement for Bing Rewards, for example, explicitly says that search engine use is tracked:

In order to reward you for your participation in the Bing Rewards Preview, you need to download the Bing Bar which contains the Reward Counter. The Reward Counter collects information about your interaction with Bing and different search engines including the number of web searches you do each day, the types of searches you complete (such as for news or images), and the number of search ads you click on.

And I’m sure there is other, vaguer wording (thank you, legalese-speaking lawyers) about how all your online wandering is used to improve Microsoft’s services, including that very same Bing search.

So a carefully crafted, clearly engineered long-tail result gets associated with a fake page thanks to click-stream data. Does that qualify as “stealing results”? Hardly, because the source is user interaction. Did engineers capture all of the net traffic? Can they say with 100% certainty “No data was transmitted between our laptop and any of Microsoft services that contained anything remotely seeming to link our magic word and a URL”? I haven’t seen it in their blog post. If they did cover that aspect of data collection, why didn’t they say so?

I guess another possible reason for Google-Bing-Gate is that someone overreacted. “Bing results are too similar to ours”, that someone thought, and the abovementioned experiment “confirmed” their suspicion (though frankly I think the only confirmation would be if the “magical” results appeared in Bing without any interaction by humans), and thus the blog post was born.

Maybe the mere thought of using click-stream to “add” to results was heresy in the mind of any Googler (“We’ll be google-bombed into oblivion!”, except here there are no results to be bombed, remember?). Maybe the experiment was their attempt to prove that Google is the only engine that could possibly give relevant results?

I agree, Google is quite relevant in most cases, and I use it daily. But simple logic dictates that if other engines improve over time, all results will converge on some abstract “best”, regardless of how the engines learn to get to that point. And if someone is watching the click-stream, then the fact that users clicked on something on Google’s search results page would kinda imply “using Google’s results”. Along with everything else.

So, the whole story is a bit too hysterical, a bit too stretched, and a bit nonsensical. I hope there will be a clarification that absolutely rules out click-stream or the “user experience improvement program” as the hint telling Bing.com that blah-blah-unique-word should go to some specific URL. Pretty please? It would also be interesting if they had actually created two results and had everyone click on only one. Would only the clicked one appear in Bing’s reply? Or both? Alas, we’ll never know, because even if this experiment were conducted now, the outcome would be written off as them “changing logic after the accusation went public”. *sigh*

15 Responses to “Google and Bing: a story of non-stolen results for hiybbprqag”

  1. Jon

    I’ve programmed for years, including making a simple search engine. Regardless of how advanced both search engines are, they would have to be using the exact same code, or nearly the same, in order to get the exact same results for 100 different tests. Google has been developing their algorithms for years. Bing is relatively new; it’s supposedly a complete overhaul of the MSN and Yahoo search engines (Microsoft bought the Yahoo search rights and uses their search to aid Bing). Google’s system is their PageRank system, which they’ve never released to the public, so for Microsoft to suddenly have the exact same system and the same results is pretty fishy.

    Plus do you really think that Google would post on their blog something that big which is just asking for a lawsuit and further drama, without being as sure as they could be about it?

    • Max Smolev

      Nearly the same and the same is not exactly the same thing. I believe there is more than one way to arrive at about the same result:
      – click-stream, to see where users click (Google News seems to be doing that, given how clunky their redirect system is)
      – analytics and ad impressions that give MS/Bing almost exact traffic numbers
      – links from other pages, with a PageRank-equivalent algorithm

      Also, if you and someone else used, for example, Lucene as a part of your engine, your results will be quite similar.

      For top-rated pages, you will have more links, more traffic and more clicks. Generally while I see similar top-rated results for real-life queries, Bing seems to be sufficiently different (for example, they usually place more emphasis on page title, versus content) but I don’t use them all the time.

      But again, all bets are off when the “real” search result set is too small. I also haven’t seen them use their own site to check whether the Bing robot actually went and indexed the page. That is: create another “fake” search, give it a result with a couple of links to a website created specifically for that purpose, and examine the server log to see when, or if, the Bing bot visited the page. If no visits are recorded, I’d be more inclined to think there’s something very fishy going on (click-stream used directly without bothering to read the page? That’d be weird).
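
The log check suggested above could be sketched in Python roughly as follows. This assumes the test site writes Apache-style combined access logs and that the crawler identifies itself with “bingbot” or “msnbot” in its User-Agent header; the sample log lines and the /honeypot.html path are invented for illustration.

```python
import re

# Matches the request, status, size, referrer and user-agent fields
# of a combined-format access log line.
LOG_RE = re.compile(
    r'"(?:GET|HEAD) (?P<path>\S+) [^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def crawler_visits(log_lines, path, markers=("bingbot", "msnbot")):
    """Return the log lines where a crawler-looking agent requested `path`."""
    hits = []
    for line in log_lines:
        match = LOG_RE.search(line)
        if not match or match.group("path") != path:
            continue
        agent = match.group("agent").lower()
        if any(marker in agent for marker in markers):
            hits.append(line)
    return hits


# Invented sample data: one crawler request, one ordinary browser request.
sample = [
    '203.0.113.5 - - [02/Feb/2011:10:00:00 +0000] "GET /honeypot.html HTTP/1.1" '
    '200 512 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"',
    '198.51.100.7 - - [02/Feb/2011:10:05:00 +0000] "GET /honeypot.html HTTP/1.1" '
    '200 512 "http://www.google.com/" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1)"',
]

for hit in crawler_visits(sample, "/honeypot.html"):
    print(hit)
```

A User-Agent header can be spoofed, so an empty result is weak evidence at best; it would only suggest that the page was never fetched by something calling itself the Bing crawler.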

      I think Google could give different result sets for IE8 for a while and see how that affects drift of results in Bing. That way the real impact of click-stream on results could be determined.

      Blog posts are generally considered to be an opinion. For corporations to state facts, there are generally press releases. Google could theoretically get sued for libel but for that Bing would need to show that authors of the post knew for a fact that their statement was a lie, but published it anyways. When someone “speculates/believes/infers” (i.e. experiments and makes a determination on how the search engine works) I don’t think there’s any chance of a lawsuit. But then, I’m not a lawyer 🙂

      I also wonder if there’s any remedy in the law for a situation where, say, some search engine pulls results, slightly changes them, and shows them as its own (ethically that’d be horrifyingly wrong, of course).

      • Ericable

        “Nearly the same and the same is not exactly the same thing.”
        That’s like saying, copying a little and copying exactly is not the same thing.

        Copying is copying. Period. Full stop. (:P)

        • Max Smolev

          I disagree. Unless you’re trying to say that Google automatically owns all results they have produced, to the fullest extent. In other words if you make a search engine, and it dares to include any sites that are present on Google’s result, regardless of how they found those sites, then you’ve just copied Google.

          Copying would be Bing running the same query through Google, scrape the result, show it as its own. But it’s not the case. In this case the outcome is driven by user behavior — user clicked, result added into consideration for the output later.

          • Ericable

            “Unless you’re trying to say that Google automatically owns all results they have produced, to the fullest extent.”

            Are you saying that Google needs to own data in order to demonstrate that Bing copied it? If that’s the case, then a student cannot accuse another student of copying test answers, because test answers can’t be owned.

            “In other words if you make a search engine, and it dares to include any sites that are present on Google’s result, regardless of how they found those sites, then you’ve just copied Google.”

            No. That’s not what’s happening here at all. We’re talking about specifically fabricated queries that lead to specific query results. (mbzrxpgjys.)

            “Copying would be Bing running the same query through Google, scrape the result, show it as its own.”

            Yes, that is one example of copying. Another way of copying is replacing a scraper with a user toolbar that performs the same essential task of reporting back query-to-results data.

  2. Ericable

    “Does that qualify as “stealing results”? Hardly, because the source is user interaction.”

    Bing’s source of the data is the user interaction.
    The user’s source of the data is Google.
    Flow of data:
    Bing <- User <- Google
    If you replace "User" with WebCrawler, you'll arrive at the same result, but the ethics become clear.
    Thus, my claim is that the origin of Bing's data source is actually Google.

    "Did engineers capture all of the net traffic? Can they say with 100% certainty “No data was transmitted between our laptop and any of Microsoft services that contained anything remotely seeming to link our magic word and a URL”?"

    Why do they need to prove that data was sent or not sent via the Bing bar? They just need to demonstrate cause and effect.

    Cause: Google's honeypots.
    Effect: Bing's search results.

    Claim: Unique link data that originates from Google can propagate over to Bing via the Bing bar.

    • Max Smolev

      I think the User part is the most critical one in the normal course of action. No, you won’t get the same result with user-webcrawler substitution. Mechanical copying would be stealing. If the user ranks or submits the result (by clicking), it’s not. If Bing had a small army of people copying and pasting all results from Google into their index, it would be stealing.

      I do agree with the claim that Google results can propagate to Bing (via users that use the Bing bar), no questions there. But it can’t be presented as Bing stealing. If the human element is not there, the [honeypot] result will never make it to Bing. The migration is not automated; it relies on humans.

      I wonder if Google engineers are upset about existence of meta search engines too.

    • Max Smolev

      Re: your reply above
      Think about the honeypot scenario in this light:
      One of users says “neat, I’ll submit this URL as a relevant result to request mbzrxpgjys”. Which user just did by clicking on it with Bing bar installed (that says “we’ll look at all your search engine activities”).

      I wonder if Bing was originally thinking about scraping the whole result set and including it in their engine, but then just decided to limit it to what the user clicked, to let user interaction filter out SEO junk. Unfiltered scraping would be stealing.

      As I wrote in the blog post, my primary objection is that Googlers chose a scenario where no other data was available for search engine. Even if action “user clicked on this URL when searching for this keyword” has a weight of a fraction of percent (a hint that this or that URL is more relevant) for synthetic single-result scenario it’s the only incoming data search engine has.

      Create two results, don’t click on the second one (at all) and see if it shows up in the Bing. If it does, it means full scale scraping and it’s not good. Otherwise the human element is critical and breaks the chain of mechanical copying that’s 100% stealing.

      Again, they didn’t announce it as “Google search results can be filtered into Bing via users that click on links with bing toolbar installed, and we don’t like it”; I would have no problem with that whatsoever. No. Instead of explaining that, they just gave a blanket statement of Bing stealing results from Google. It’s like kindergarten. “Jimmy drew a unique picture, showed it to Jane who is a friend of Steve and she described it to Steve who drew the same thing. Steve is stealing Jimmy’s idea!”

      • Ericable

        Hi Max,

        I didn’t know you were the blog author. Thanks for the response.

        “One of users says “neat, I’ll submit this URL as a relevant result to request mbzrxpgjys”. Which user just did by clicking on it with Bing bar installed (that says “we’ll look at all your search engine activities”).”

        You know what’s interesting? When you click on a link from a Google query, you don’t get the actual link to the website, you get a sort of encrypted referral with a bunch of Google ref. data. Google uses this data for ranking purposes primarily, so they know how to parse and interpret the link. But, how did Bing figure out how to associate a Google query with a Google query result?

        >> Query: test
        >> Result: http://www.test.com
        >> Link referral: http://www.google.com/url?sa=t&source=web&cd=1&ved=0CCEQFjAA&url=http%3A%2F%2Fwww.test.com%2F&rct=j&q=test&ei=VJ1MTfigIJG0sAOow4iiCg&usg=AFQjCNH21KLjC0CBkjon2DwD_CZ0HApLMw&cad=rja

        If Bing is relying purely on user clickstream data, then that link needs to be parsed before it can be used by Bing. That means Bing would need to understand the underlying structure of the referring link before it can be used as a Bing search result. This may indicate that Bing is knowingly parsing this data, which would be a blow to their defense.
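
For a sense of how much parsing that would take, here is a small Python sketch run against the example referral link quoted above. It is not a claim about what the Bing bar actually does; the function name is made up, and the only assumption is the one visible in the example link itself, namely that the query sits in the `q` parameter and the destination in the `url` parameter.

```python
from urllib.parse import parse_qs, urlsplit


def query_and_target(redirect_url):
    """Pull (query, destination) out of a Google-style /url redirect link.

    Returns None when the link does not look like such a redirect.
    """
    parts = urlsplit(redirect_url)
    if not parts.netloc.endswith("google.com") or parts.path != "/url":
        return None
    params = parse_qs(parts.query)  # also undoes the percent-encoding
    query = params.get("q", [None])[0]
    target = params.get("url", [None])[0]
    return query, target


link = (
    "http://www.google.com/url?sa=t&source=web&cd=1&ved=0CCEQFjAA"
    "&url=http%3A%2F%2Fwww.test.com%2F&rct=j&q=test"
    "&ei=VJ1MTfigIJG0sAOow4iiCg&usg=AFQjCNH21KLjC0CBkjon2DwD_CZ0HApLMw&cad=rja"
)

print(query_and_target(link))  # ('test', 'http://www.test.com/')
```

In this example link the query and the destination are ordinary percent-encoded parameters, so recovering the pair needs nothing beyond a standard URL parser. Whether reading them counts as "knowingly parsing" Google's data is the open question in this thread.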

        So, say, they don’t do that.

        If Bing is relying on some other recorded metrics, such as submitted form data (which they haven’t formally disclosed, but is covered in the EULA), then when the user logs on to Google, the Bing bar needs to record the query that was sent to Google, wait a while for the user to click the link, record the user click, ignore that user click because it was a Google referral link, and then finally report back to Bing with the destination site. On top of that, if the user decided the link was insufficient and went back to click a few more links, Bing needs to continue to associate the original Google query with multiple additional webpages after some time. How on earth would it be able to do that without having some context that it was on a Google search?

        In the former scenario, the Bing bar needed to parse Google’s link for data. In the latter scenario, the Bing bar needed to be tailored to the Google UX. Either way, they’ve got some custom code written to interpret a user’s interactions within Google. If this is the case, then the Bing bar is not as “agnostic” as claimed. If you happen to know exactly how the Bing bar actually does it, I’d be very happy to know.

        “As I wrote in the blog post, my primary objection is that Googlers chose a scenario where no other data was available for search engine. Even if action “user clicked on this URL when searching for this keyword” has a weight of a fraction of percent (a hint that this or that URL is more relevant) for synthetic single-result scenario it’s the only incoming data search engine has.”

        I hear you loud and clear. Bing receives and interprets 1000 signals. When only 1 signal is heard, they rely on that since it is the only source. My question to you is, is copying a little the same as copying a lot? Is it okay for that one source to be Google? As we’ve discussed, Bing’s data comes from the user. However, that user’s source of data comes from Google. Is that ethical? IMHO, this defense reeks of money laundering, since a couple of roundabouts suddenly makes things clean again.

        “Create two results, don’t click on the second one (at all) and see if it shows up in the Bing. If it does, it means full scale scraping and it’s not good. Otherwise the human element is critical and breaks the chain of mechanical copying that’s 100% stealing.”

        I don’t think that anyone is claiming that Bing is scraping Google outright. I mean, even with Google’s earnest efforts, they were successful in 7% of their test cases. However, in spite of this small percentage, can we agree that Bing is benefiting from data that once originated from Google, regardless of the degree of copying?

        “Jimmy drew a unique picture, showed it to Jane who is a friend of Steve and she described it to Steve who drew the same thing. Steve is stealing Jimmy’s idea!”

        This is not art, where there are varying degrees of relevance, similarity, or inspiration. What Google has demonstrated is discretely black and white: specific bits of data from Google can (and do) propagate over to Bing. It’s not just vaguely similar, as it can be in art and design, because they didn’t find ‘mbzrxpgjys’ with a missing ‘z’. They found it exactly the way Google constructed it, with a link pointing to http://www.rim.com, where the only source of this data could have been Google. Since this is concrete data, it cannot be considered “inspiration” as it can be construed in art.

  3. Ericable

    “No, you won’t get the same result with user-webcrawler substitution.”

    Can you elaborate? Because the results disagree. Data flows from Google to Bing.

    “Mechanical copying would be stealing.”

    What’s the difference between having a mechanical system do work versus humans? The output is the same.

    Thanks.

    • Max Smolev

      With webcrawler you would have a case of full mechanical copying. Whatever Google produced, be it relevant to users or not, gets copied. That’s stealing. A step above mechanical copying is meta search engines, that rank exact copies of results from multiple engines and show it to the user.

      If you have a user who complains to you “I am looking for blah but your engine doesn’t give me result A, fix it!” and you fix it, you’ve done it because of the user. Now, if result A was a honeypot made by Google, did you just steal Google’s data? If the user hadn’t told you anything, you’d never have come to that result. So no. That’s why I don’t think Bing is stealing Google’s data.

      When you’re creating search results, you do it for the users. You can take the user’s input into account via multiple means: URL submission, click-tracking (heck, hover-tracking, if you want to, whatever). It means you have your own algorithm, one you developed. It will give results based on different inner workings than a competitor’s. If a user gives you fake data, then yes, you will give back fake results. But you haven’t stolen those results.

  4. Ericable

    Well, I see where you’re coming from, about this issue being a result of users reporting such search data back to Bing, and that it is, in fact, the users improving Bing. I agree, but it is on the back of Google-sourced data. This is not just links we’re talking about. This is the association between SEARCH QUERY and SEARCH RESULT. This association is the product of Google’s proprietary algorithms, and as such, it’s concrete data that originated from Google, which Bing utilizes. I see that you’ve stressed the human aspect of this all. However, it is as human as it is mechanical. Here, you’ve got an automated reporting service that runs on the backs of users manually clicking around on Google. I do hope Bing soon discloses precisely how they get this data, so that we can better understand the nature of the situation.

    • Max Smolev

      Sure I’d love to know how exactly they incorporate click-stream, though I wonder how much information would be disclosed — search algorithms are usually very closely guarded secrets. Even now I can imagine some black hat SEOs preparing ads for Amazon Mechanical Turk service “10 cents for 100 clicks — search for this term with Bing toolbar installed and click on ‘our’ link!”.

      In addition to Google, data is gathered from other search engines as well, and non-search things too, though I have no idea what kind of associations clicks from some random page would bring (for example Chrome knows how to run searches on blogs — when I type “hyperom.com something” it automatically submits search for “something”). If this improves Bing, would they really stop? Should they abandon this meaningful improvement?

      Perhaps as a compromise they can give a clear warning when installing the toolbar, instead of tucking it into three pages of small print that I had to read to figure out they were warning about it. Google would still object, but they probably don’t have a real legal reason to force Bing to stop using user-reported data.

      If Bing collects data with equal weight from all sources, it would be unfair to others to exclude just Google (versus all blogs, Yandex, etc). That would mean Google is special, while everyone else is fair game for picking off user-submitted URLs and terms one at a time.

      Another compromise would be for Google to use a similar mechanism too. But if they don’t do it already, I wonder why (heck, adding Google ads automatically means the page you put the ad on is scanned by Google, even if you didn’t submit it to their index). If they do, then it would make the whole story even more odd.

      In reality I think both sides will stay with their opinion. Google will continue to claim that Bing is stealing their data one click at a time. Bing will continue to collect click-stream from wherever users surf. Users will forget about the whole thing in 3-5 months.

  5. Ericable

    Great points. It’s only been a week and I had already forgotten about my comments. One thing that really irked me the last time I was here was that I made 2 repeated attempts to post really long comments- they were really long, but I no longer see them here. 🙁

    Thank you, Max, for sharing your thoughts.

    I thoroughly enjoyed our discourse.

    Cheers.

    • Max Smolev

      So did I, thanks for your comments.

      I don’t know what was wrong with the comment system. I don’t think I’ve configured the system well enough to support sizable discussions (for example, I ran into the depth limit myself 🙂 ), for which I apologize. I did find a comment from you that Akismet had marked as spam, and I unmarked it, but only one.

      Again, thanks for the discussion.
