
February 7, 2014

PageRank Crawls & The Missing G+ PR Phenomena - G+ SEO 2014 ③

Joshua Berg - 2:29 AM

Jump into the Grid of Information Retrieval Technology for Answers



1. The phenomenon of the missing Google+ profile PageRanks.
2. A deep dive into PageRank Iteration Optimization for answers.
3. Recrawl Scheduling Based on Information Longevity.
4. Google Realtime - the social search break-up in favor of Google+.
5. John Mueller speaks out - on not using G+ just to build PageRank.









┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅
Google Plus Authority is Powered by People - G+ SEO 2014 ①
My Google+ SEO Ideas Confirmed At 322 Days - G+ SEO 2014 ②
┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅



In this article I'm going to explore several phenomena, the first being an older mystery and the second a new one, and by delving into answers for the first, see if we can find answers to the latter. So let's jump into the Information Retrieval technology grid and see what we can find.


Phenomenon #1

Most Toolbar PageRank (TBPR) updates show considerable lag between the time new pages are created & the time rank appears for them. When TBPR updates occur, the data there is never entirely current & usually appears to be a few months old.

Question: Why do all the TBPR "updates" not show the most recent PageRanks?



Phenomenon #2

After December's PageRank 1213 update, as many as two thirds of Google+ profiles & pages are no longer showing any TBPR. Their ranks, formerly visible from the PR 0213 update, have now been blanked out from all available PR viewing tools.

Question: Where did the G+ profile TBPRs go, and why? When will we be able to see them again?



Phenomenon #3

While new PageRank collected for pages never appears immediately in the TBPR updates, penalties that have been manually applied to websites & pages, even just prior to the update, can be reflected there immediately. And there are stories from afar (now confirmed) of another phenomenon: manual actions that are lifted might be quickly reflected in the TBPR, without even having to wait for the next update.

Question: Is some of the data in the TBPR more current than other data?


[Due to limited space, I will continue the story of Phenomenon #3 in a follow-up article, for which there are still many Google+ details to share. Be sure to follow me on G+ at +Joshua Berg and SHARE THIS BLOG.]




1. The phenomenon of the missing Google+ profile PageRanks.

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔



In Phenomenon #2 I observed that, after December's PageRank 1213 update, as many as two thirds of Google+ profiles & pages are no longer showing any TBPR. Their ranks, formerly visible from the PR 0213 update, have now been blanked out from all available PR viewing tools.

I recently did an in-depth analysis of the PageRank of Google+ profiles & pages, comparing over 700 profile PRs from the February PR 0213 update with the recent PR 1213 update, so you are welcome to look at those comparisons here.





*Note: Allow for a margin of error of a couple of percent, as I have not had the time to go back through & verify every profile. Let me also mention here that all of these G+ profiles and pages are already public & so is their TBPR, so to the best of my knowledge none of this data was, or is, private. Profiles that are set to private are not indexed by Google Search & do not get PR.


What you will see is that all of the profiles & pages that upgraded to the new Google+ custom URL anytime between the first week of Sep 2013 & the first week of Dec 2013 will not show any Toolbar PR. And in case you're wondering if there is a special way to view these, the answer is NO: the PR will not be visible in any of the tools until the next Toolbar PR update.


[As with the previous TBPR updates, there are a few exceptions to the above: there are still a few Google+ profiles that already had their custom URL & yet recently had their TBPR blanked. Was that intentional, or some other quirk? I don't know. But there are a few things that could have caused it: your profile being temporarily set to private anytime during key update crawls, or your custom URL changing. Capitalization of custom URLs can be changed by individual users, and this may or may not have affected the TBPR.]


As most people got their custom URL acceptance emails in the big release starting 10/31/13, that leaves the majority of G+ profiles without viewable TBPR. The emails looked like this...



Get a custom URL for your Google+ profile

10/31/13

Dear Joshua Berg

You're now eligible for a unique Google+ custom URL that lets you easily point folks to your profile (no more long URLs!). Here's what we've reserved for you:
google.com/+JoshuaBerg



Before I go any further, let me say this: the profiles & pages all still have their actual Google PageRank. It still exists at these entities' root canonical URL, which is actually the profile number, like this https://plus.google.com/110133760398936676625, and not the custom URL.

However, because the custom URLs replaced these original profile number URLs from October, they were not yet validated & calculated into the last Toolbar PR update. And by the way, the canonical profile number URL only uses a 302 temporary redirect to the new G+ custom URL, so the profile number stays & always will be the unchangeable root URL of your profile. As for the PR flow implications of this, Google eventually sorts it out internally (namely through the canonical tag) & you can use either of these URLs however, or wherever, you like.
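If you'd like to check that redirect behavior for yourself, here is a minimal sketch using Python's requests library. The profile-number URL is the public example from above; the exact status codes & headers Google serves may of course vary, or change entirely, over time.

# A minimal sketch for inspecting how a Google+ profile-number URL redirects
# to its custom URL. Assumes the `requests` library is installed; the exact
# responses Google serves may differ over time.
import requests

PROFILE_NUMBER_URL = "https://plus.google.com/110133760398936676625"

# Don't follow redirects automatically, so we can see the first response.
response = requests.get(PROFILE_NUMBER_URL, allow_redirects=False)

print("Status code:", response.status_code)           # a 302 indicates a temporary redirect
print("Redirects to:", response.headers.get("Location"))

# Following the redirect chain shows the URL the profile finally resolves to.
final = requests.get(PROFILE_NUMBER_URL)
print("Final URL:", final.url)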


So as far as Google Search & your G+ profile authority go, there's nothing to worry about here. This is really just a technical issue from here on out, which admittedly may be more interesting to SEO analysts than to everyone else. There are, however, a number of useful things to learn from this about how Google crawls & verifies its data, and they will have practical application in this story.



Now, there were a number of verified or famous persons who received their custom URLs earlier, & they currently show their TBPR quite clearly through their custom URLs.

The last of these that I have specific data on is an analyst friend of mine, +Lee Smallwood, who received his custom URL on Aug 8, 2013, only 3 weeks before the Sept PR update cutoff. And his profile clearly shows a PR 4.


Shared publicly - Aug 8, 2013

What a fantastic surprise...!

I'm over the moon to have this notification from Google+ 


So this date, being just prior to Sept, helps to confirm my timeline (there are a great many other profiles I have since analyzed, with confirmatory results), as you can see here...







The question then remains: why don't the Google+ profiles' original number URLs show Toolbar PR anymore, even if the new vanity URLs do not?

That's because the canonical profile URL is now 302-redirected to the new custom URL & so fails the confirmatory recrawls & ongoing PageRank Iterations (PRI). The vanity URL also does not qualify for Toolbar PR, because it did not exist prior to that time period.




2. A deep dive into PageRank Iteration Optimization for answers.

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔



Google bots (or spiders), like all other search engine bots, constantly crawl the web collecting data & then recrawl to add more data & verify the data already collected. It's a never-ending process. The science behind it is called Information Retrieval technology, & a huge volume of papers & studies have been written on it, continuing to this day.

More specifically, PageRank data itself is collected & calculated over a period of time, and the computations that make up the final PageRanks go through many passes called iterations. The WWW and its links are constantly changing, so PR will never be exact from every link, everywhere, all the time. Instead it must be computed from all the collected links, after fragile edges have been temporarily separated, with the PR iterations computed until they converge at an acceptable level for usable PR.



The PageRank computations require several passes, called "iterations", through the collection to adjust approximate PageRank values to more closely reflect the theoretical true value.
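To make the idea of these iterations concrete, here is a minimal power-iteration sketch in Python. The tiny four-page link graph, the damping factor & the convergence tolerance are all illustrative assumptions of mine, not Google's actual data or parameters.

# A minimal sketch of the PageRank power iteration described above.
# The link graph and parameters are hypothetical, for illustration only.

DAMPING = 0.85      # commonly cited damping factor
TOLERANCE = 1e-6    # stop once ranks change less than this between passes

# Hypothetical link graph: page -> pages it links to
links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],       # D has no inbound links
}

def pagerank(links, damping=DAMPING, tolerance=TOLERANCE, max_iterations=100):
    pages = list(links)
    n = len(pages)
    ranks = {page: 1.0 / n for page in pages}   # start from a uniform guess

    for iteration in range(max_iterations):
        new_ranks = {}
        for page in pages:
            # Sum the rank flowing in from every page that links here.
            incoming = sum(
                ranks[src] / len(outlinks)
                for src, outlinks in links.items()
                if page in outlinks
            )
            new_ranks[page] = (1 - damping) / n + damping * incoming

        # Each pass ("iteration") nudges the approximate values closer
        # to the theoretical true PageRank; stop once they converge.
        delta = sum(abs(new_ranks[p] - ranks[p]) for p in pages)
        ranks = new_ranks
        if delta < tolerance:
            break

    return ranks, iteration + 1

ranks, passes = pagerank(links)
print(f"Converged after {passes} iterations: {ranks}")

Even on this toy graph you can watch the values settle over successive passes, which is exactly the convergence the quote above is describing, just at web scale.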



From here I'll go further into the various iterative algorithms for PageRank, with excerpts from just a few of the related papers & theses. Consider, though, that the excerpts I'm sharing here only show a small part of the story, as this topic is far more in-depth than we're going to cover comprehensively here. I actually went through a vast quantity of studies not included here, but had to drag myself out of those papers & say: enough is enough, just finish writing the story.







One of the challenges of calculating PageRank is deciding what to do with fragile links, the edges between pages (nodes) that disappear or change frequently. To solve this problem there have been numerous approaches studied & papers written, the Policy Iteration algorithm being one of the more famous among them.

For an in-depth study on this, Romain Hollanders' 2010 thesis provides some exceptional insight:
On the Policy Iteration Algorithm for PageRank Optimization


Sometimes, it happens that the manipulated graph is not fixed and can be modified in some ways. In particular, we may have that a given subset of edges are fragile, meaning that the edges of that set can be either present or absent. In this context, optimizing the PageRank of a page depending on which fragile edges are active or not is a natural concern already investigated.

This problem has several real life applications. For example, a webmaster could be interested in increasing the PageRank of one of its webpages by determining which links under his control he should activate and which ones he should not.

Another typical example is when some links of the graph can disappear and then reappear uncontrollably. Typically, these links can be broken because the server is down or because of traffic problems. In that case, we still may want to be able to estimate the importance of a node by computing its minimum and maximum PageRank, depending on the presence or absence of the fragile links.
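As a toy illustration of that min/max idea (and emphatically not Hollanders' policy-iteration algorithm itself, which is far more efficient), the sketch below simply enumerates every on/off combination of a few hypothetical fragile edges and records the lowest & highest PageRank a target page can receive.

# A brute-force sketch of min/max PageRank over fragile edges. The graph,
# the fragile-edge set and the target page are all hypothetical; real
# policy-iteration methods avoid enumerating every combination.
from itertools import product

DAMPING = 0.85

def pagerank(links, iterations=50):
    """Plain power iteration over a dict of page -> outlinks."""
    pages = list(links)
    n = len(pages)
    ranks = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        ranks = {
            p: (1 - DAMPING) / n + DAMPING * sum(
                ranks[src] / len(out)
                for src, out in links.items() if p in out
            )
            for p in pages
        }
    return ranks

# Hypothetical graph: these links always exist...
fixed_links = {"A": ["B"], "B": ["C"], "C": ["A"], "D": ["A"]}
# ...while these edges may be present or absent (broken server, removed link).
fragile_edges = [("C", "D"), ("D", "C")]
target = "C"

lowest, highest = None, None
for switches in product([False, True], repeat=len(fragile_edges)):
    links = {page: list(outs) for page, outs in fixed_links.items()}
    for (src, dst), active in zip(fragile_edges, switches):
        if active:
            links[src].append(dst)
    rank = pagerank(links)[target]
    lowest = rank if lowest is None else min(lowest, rank)
    highest = rank if highest is None else max(highest, rank)

print(f"PageRank of {target}: min {lowest:.4f}, max {highest:.4f}")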



Here are a couple of excerpts from Larry Page's second paper on PageRank at the Stanford InfoLab, regarding the changing nature of the web & calculating PR until all the iterations come to a reasonable convergence. Of course this barely covers the topic & you can read more at the link... The PageRank Citation Ranking: Bringing Order to the Web.

To implement PageRank, the web crawler simply needs to build an index of links as it crawls. While a simple task, it is non-trivial because of the huge volumes involved...

Much time has been spent making the system resilient in the face of many deeply and intricately flawed web artifacts...

...Of course it is impossible to get a correct sample of the "entire web" since it is always changing. Sites are sometimes down, and some people decide to not allow their sites to be indexed. Despite all this, we believe we have a reasonable representation of the actual link structure of publicly accessible web.



Excerpt from Page, Lawrence; Brin, Sergey; Motwani, Rajeev; and Winograd, Terry (1998), The PageRank Citation Ranking: Bringing Order to the Web.




So there are PageRank iterations & there's data crawling & verifying, which is a phenomenal, never-ending task. All this to say: the reason PR data is never entirely current is that the data must be recrawled & verified to a high enough level of accuracy to ensure its integrity. That said, the time gaps between the new PR cutoff & the actual update have varied by quite a few weeks across different updates, so it is still assumed that Google does not push out the TBPR data as fresh as it could be.



This, then, becomes the answer to the above-mentioned Phenomenon #1 and Phenomenon #2: finding the balance between data freshness & data integrity.




3. Recrawl Scheduling Based on Information Longevity.

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔



One of the biggest challenges in Information Retrieval (IR) technology is finding better ways of keeping data accurate & fresh, while it continues to grow exponentially, in a way that is efficient & economical. Search engine bots simply cannot crawl the whole web for indexing everywhere all the time, so content crawling must be prioritized & scheduled.

For a simple example, imagine trying to watch all your cable TV channels at the same time, while simultaneously rewatching every movie you've ever seen, then categorizing & rating those shows faster than new ones are coming in. Of course you couldn't; you'd need fast & expensive computers, incredible bandwidth, & you'd still need an incredible scheduling system to get it done.

Three Googlers, Carrie Grimes, Daniel Ford, & Eric Tassone, expound on the challenges in depth in their paper Keeping a Search Engine Index Fresh: Risk and optimality in estimating refresh rates for web pages (2008).



Search engines crawl the web to download a corpus of web pages to index for user queries. Since websites update all the time, the process of indexing new or updated pages requires continual refreshing of the crawled content. In order to provide useful results for time-sensitive or new-content queries, a search engine would ideally maintain a perfectly up-to-date and comprehensive copy of the web at all times. This goal requires that a search engine acquire new links as they appear and refresh any new content that appears on existing pages.

As the web grows to many billions of documents and search engine indexes scale similarly, the cost of re-crawling every document all the time becomes increasingly high. One of the most efficient ways to maintain an up-to-date corpus of web pages for a search engine to index is to crawl pages preferentially based on their rate of content update.

If a crawler had perfect foresight, it could download each page immediately after the page updates, and similarly acquire new pages as they appear. However, no complete listing or update schedule exists today, although some limited collection of new-content sites, such as news pages, provide a feed of such structured data to search engines. For most of the web, a crawler must attempt to estimate the correct time scale on which to sample a web page for new content...
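To make that refresh-rate idea a bit more concrete, here is a toy sketch, my own simplification rather than the paper's actual model, that estimates how often a page changes from past crawl observations & uses the estimate to choose the next recrawl interval.

# A toy sketch of refresh-rate estimation for recrawl scheduling: estimate
# an average change rate per page and recrawl sooner the faster a page
# appears to change. The pages and numbers are hypothetical.
from dataclasses import dataclass

@dataclass
class CrawlHistory:
    url: str
    changes_seen: int      # past fetches that found new content
    hours_observed: float  # total time span covered by those fetches

def estimated_change_rate(history: CrawlHistory) -> float:
    """Rough changes-per-hour estimate, with a small prior to avoid zero."""
    return (history.changes_seen + 1) / (history.hours_observed + 1)

def next_crawl_interval_hours(history: CrawlHistory,
                              min_hours: float = 1.0,
                              max_hours: float = 24 * 30) -> float:
    """Recrawl roughly once per expected change, clamped to sane bounds."""
    rate = estimated_change_rate(history)
    return max(min_hours, min(max_hours, 1.0 / rate))

# Hypothetical observations: a news page that changes constantly versus a
# mostly static "about" page.
pages = [
    CrawlHistory("example.com/news", changes_seen=40, hours_observed=48),
    CrawlHistory("example.com/about", changes_seen=1, hours_observed=720),
]

for page in pages:
    hours = next_crawl_interval_hours(page)
    print(f"{page.url} -> recrawl in about {hours:.1f} hours")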




Recrawl Scheduling Based on Information Longevity by Christopher Olston and Sandeep Pandey (2008)



It is crucial for a web crawler to distinguish between ephemeral and persistent content. Ephemeral content (e.g., quote of the day) is usually not worth crawling, because by the time it reaches the index it is no longer representative of the web page from which it was acquired. On the other hand, content that persists across multiple page updates (e.g., recent blog postings) may be worth acquiring, because it matches the page's true content for a sustained period of time.

In this paper we characterize the longevity of information found on the web, via both empirical measurements and a generative model that coincides with these measurements. We then develop new recrawl scheduling policies that take longevity into account. As we show via experiments over real web data, our policies obtain better freshness at lower cost, compared with previous approaches.
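As a rough illustration of that ephemeral-versus-persistent distinction (a toy heuristic of my own, not the paper's generative model), the sketch below compares successive snapshots of a page & keeps only the fragments that survive across most of them.

# A toy sketch of separating ephemeral from persistent content across page
# snapshots. The snapshots and the "survives in more than half of them"
# threshold are illustrative choices, not the paper's actual model.
from collections import Counter

snapshots = [
    {"post: crawl scheduling 101", "quote of the day: carpe diem", "sidebar: trending now"},
    {"post: crawl scheduling 101", "quote of the day: veni vidi vici", "sidebar: top stories"},
    {"post: crawl scheduling 101", "post: pagerank iterations", "quote of the day: alea iacta est"},
]

counts = Counter(fragment for snapshot in snapshots for fragment in snapshot)
threshold = len(snapshots) / 2

persistent = {frag for frag, seen in counts.items() if seen > threshold}
ephemeral = set(counts) - persistent

print("Worth indexing (persistent):", persistent)
print("Probably skip (ephemeral):", ephemeral)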




The Impact of Crawl Policy on Web Search Effectiveness - Using PageRank to prioritize crawls.

Crawl selection policy has a direct influence on Web search effectiveness, because a useful page that is not selected for crawling will also be absent from search results. Yet there has been little or no work on measuring this effect. We introduce an evaluation framework, based on relevance judgments pooled from multiple search engines, measuring the maximum potential NDCG that is achievable using a particular crawl. This allows us to evaluate different crawl policies and investigate important scenarios like selection stability over multiple iterations. We conduct two sets of crawling experiments at the scale of 1 billion and 100 million pages respectively. These show that crawl selection based on PageRank, indegree and trans-domain indegree all allow better retrieval effectiveness than a simple breadth-first crawl of the same size. PageRank is the most reliable and effective method.
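And here is a minimal sketch of what PageRank-prioritized crawl selection looks like in principle: given a limited crawl budget, fetch the highest-scoring uncrawled pages first. The frontier URLs, scores & budget below are made-up numbers for illustration.

# A minimal sketch of crawl selection prioritized by a PageRank-like score.
# The frontier, scores and budget are hypothetical; a real crawler would
# keep updating both as it discovers new links.
import heapq

# Uncrawled pages discovered so far, with an estimated PageRank-like score.
frontier = {
    "example.com/popular-guide": 0.31,
    "example.com/old-press-release": 0.02,
    "example.com/category/widgets": 0.12,
    "example.com/tag/misc": 0.01,
    "example.com/docs/api": 0.22,
}

CRAWL_BUDGET = 3  # how many pages we can afford to fetch this cycle

# Select the highest-scoring pages up to the budget, instead of crawling
# in plain breadth-first discovery order.
selected = heapq.nlargest(CRAWL_BUDGET, frontier, key=frontier.get)

for url in selected:
    print("crawl", url, "score", frontier[url])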



Lastly, this article by Greg Linden (Dec 3, 2009), Recrawling and keeping search results fresh, has a lot of interesting insights on this from an SEO perspective.




4. Google Realtime & the social search break-up in favor of Google+.

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔




Since I'm talking about the importance of Google+ for Search, you might be wondering what happened with using the other big networks for SEO as well. In Part 2 of this series I told you Matt Cutts had just said...



As far as doing special specific work to say, "oh you have this many followers on Twitter, or this many likes on Facebook..." To the best of my knowledge, we don't currently have any signals like that in our web search ranking algorithms.



You catch that? Facebook, Twitter social signals = NO! As far as I'm concerned, that also effectively means that for social signals used by Google in search, Google Plus SEO is the way to go.




Facebook, Twitter social signals? -- NO!
That leaves Google+ SEO, the way to go.



So when did Google specifically stop using social signals from these other networks in their algorithms? On July 3, 2011, the attuned folks over at Search Engine Land noticed the Google Realtime (Google's experimental social search engine) site had gone offline...









The next day the explanation came from Google, which you can read about in the Search Engine Land article here...



As Deal With Twitter Expires, Google Realtime Search Goes Offline

Today comes the reason why: Google's agreement with Twitter to carry its results has expired, taking with it much of the content that was in the service.

Google also stressed that when Google Realtime Search relaunches — something it says will happen but with no set time frame — it will include content from a variety of sources and not just be solely devoted to Google+ material. The company said:

"Our vision is to have google.com/realtime include Google+ information along with other realtime data from a variety of sources."



That social aspect of search is what then came along with Search Plus Your World, Google's personalized, social-friendly (to Google Plus) search algorithm built right into the Google.com Search site. But not everyone was happy about Twitter & Facebook being left out in the cold. The BBC writes:



Twitter unhappy about Google's social search changes

Twitter has complained about changes made by Google to integrate its social network Google+ into search results. The new feature, called Search plus Your World, will automatically push results from Google+ up the search rankings.

The three changes are:
  •     Personal Results
  •     Profiles in Search [Google+ authors attached to search content as Authorship]
  •     People and Pages [Google+ profiles & pages themselves showing as entities]
   
Twitter's general counsel Alex Macgillivray tweeted in response to the changes: "Bad day for the internet. Having been there, I can imagine the dissension @Google to search being warped this way."



Google hit back at the criticisms.



[In other words, Twitter links don't pass PageRank for search rankings.]


This refers to a technical barrier which makes it difficult for Google to rank Twitter information, a spokeswoman explained.

There is also little sharing between Google and its other big rival Facebook.

Stepping into the growing row between the three firms, Google chief executive Eric Schmidt told MarketingLand magazine that his company was not favouring its own social network over Facebook and Twitter.  

He said that all would be treated equally if the two rivals granted the search giant the right permissions to access their content. [This probably means more back-end technical information, such as IP addresses, or verification data, to more accurately define data integrity. All of this they get in Google+.]


Search expert John Battelle said in his blog post that social search would mean little until Facebook and Google settled their differences...

"The unwillingness of Facebook and Google to share a public commons when it comes to the intersection of search and social is corrosive to the connective tissue of our shared culture," he said.



In Matt Cutts' just-released video, which I referred to last time, he says:


Because we're sampling an imperfect web, we have to worry a lot about identity, when identity is already hard. [i.e. "we already have this in G+, so we don't need others."] And so, unless we were able to get some way to solve that impasse where we had better information, that's another reason why the engineers would be a little bit wary, or a little bit leery, about trying to extract data when that data might change & we wouldn't know it, because we were only crawling the web.



Of course, you already know where Google can get verified entity data; that's why you're reading this.

No, this doesn't mean that Google Search won't index any Facebook & Twitter content. They still follow & index much of it, as they are able. And the public content that is indexed is eligible for & receives PageRank, but that PR can't be passed on through any links shared there. The priority & criteria for finding & indexing it all go back to how much of that content Google reaches from crawling the web & the PageRank that public content has received, which, although it can't be passed on, does affect the SERPs of those same pages.




5. John Mueller cautions not to use Google+ just to build PageRank.

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔



There's always something good from John Mueller's Google Webmaster Central Office Hours Hangouts, & in our meeting just last week he did not disappoint, offering a valuable caution on this topic. Someone submitted the following question...







Question:
What's the PR value of a followed link on my Google Plus account? Is it based on the number of followers I have, the number of +1's, reshares, or followed links? How does it compare to an external link from a website?



John Mueller:
So I think if you're using Google Plus as a way of link building to try to drive PageRank to your site, you're probably doing it wrong. So that's not something where I'd focus on anything like PageRank, but instead use Google Plus to kind of engage with your audience & make sure that you're doing the right thing on your website, see what people are interested in & see what you could be improving on your website.
And I wouldn't use that as a way to kind of drive PageRank to your site. Uh, we generally try to treat these pages as we would normal other pages on the web & if there are links on there that don't have a nofollow, we try to take those into account.
But at the same time we also see that, this is essentially user generated content & to some extent we have to be careful with how we handle that. So, it's definitely not the case that I'd say, "this is the proper way of building links for your website," or that, "this is a good way to get PageRank for your website."
I'd really try to use Google Plus more as a way to engage with people and to interact with them, share your content, see how people react to it & think about ways that you could improve your content together.


I would like to point out here that what John Mueller is saying is right for several reasons. The PageRank received from Google+ posts behaves differently from what a site traditionally receives from a webpage environment, & that will be part of the next story. Stay tuned...


