Search Bots can be your friend as well as something that's going to eat you alive. If you're on a shared host or on your own little IIS server running from your Windows 7 computer, it's good to know a little about what the crawl stats reports are offering.

Crawler or Search Bot Stats offer a good site health guide.

At a Glance: You can use bots to identify issues you might have with your website. In some cases, if you're able to do a little programming, you can have page status reported to you via text message when a bot runs into a problem.

I've used bots over the years to tell me about my websites. It's a way of making a few of them earn their time on the site.

I have a script on some of my sites (not all) that monitors whether a spider actually does what my robots.txt file asks. I know this might be a bit off topic, but follow me here. When my script notices XYZ bot at the site, it starts recording every page the bot visits to a database. When I see it hit a robots.txt-restricted folder and index more than one page in that folder, the bot's IP is tossed into the 403.6 category and the script ends.

It's a very good method of protecting copyright and material that you don't want certain indexes to use. I don't follow all bots, only those I don't trust.
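
A minimal sketch of that idea, in Python rather than the classic ASP these sites actually run, might look like the following. It is not my actual script: the log file name, column positions, disallowed folders, and bot name are placeholder assumptions you would swap for your own, and the output is just a plain block list that an IIS 403.6 (IP denied) rule or a firewall script could pick up.

```python
# Sketch only: watch an access log for distrusted bots that ignore robots.txt.
# Column positions follow a simplified W3C-style log; check your own #Fields: line.
from collections import defaultdict

IP_COL, PATH_COL, AGENT_COL = 2, 6, 9          # assumed column positions
DISALLOWED = ("/private/", "/drafts/")         # hypothetical robots.txt Disallow folders
WATCHED_BOTS = ("XYZbot",)                     # only the bots you don't trust

restricted_hits = defaultdict(int)
blocked = set()

with open("access.log") as log:                # placeholder log file name
    for line in log:
        if line.startswith("#"):
            continue
        cols = line.split()
        if len(cols) <= AGENT_COL:
            continue
        if not any(bot in cols[AGENT_COL] for bot in WATCHED_BOTS):
            continue
        if any(cols[PATH_COL].startswith(folder) for folder in DISALLOWED):
            restricted_hits[cols[IP_COL]] += 1
            if restricted_hits[cols[IP_COL]] > 1:   # more than one restricted page
                blocked.add(cols[IP_COL])

# Feed this list to your 403.6 rule or firewall.
with open("block_list.txt", "w") as out:
    out.write("\n".join(sorted(blocked)) + "\n")
```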

With all that off-topic information out of the way, I hope you're still here.

The 3 things I get out of my Google Webmaster Tools Crawl Stats are:

  1. Server response time averages
  2. Bandwidth spikes (check your firewalls; when you think it's one thing, you find another: DDoS vs. SgW, a spider gone wild!)
  3. Overall health of the site.

Some other things that are clearly identified are (a quick way to cross-check them from your own logs follows this list):

  1. Pages crawled per day
  2. Kilobytes downloaded per day
  3. Time spent downloading a page

One more side benefit is..

  1. How much does Google Really love my website?

You think I'm joking about that last benefit?
I've been in the sandbox before, and it had nothing to do with SEO practices. I had a writing style in the technical community that some of the big boys didn't enjoy. I don't publish those articles anymore because my income needs to keep coming in. But if I could... OK, let's focus on what's at hand.

Your love-to-hate ratio will be explained at the bottom, after the good, the bad, and the time-to-move-to-a-different-server examples.

In this first example you'll see a website that has some serious server issues, page loading issues or a combination of both.

The image below shows serious page loading times, which the Google bot will start ignoring sooner than you expect. In fact, from my notes, November 28th to December 2nd was all it took to bring this active site to its knees.

The shared hosting service and server just crapped out. Once the search engine bot gave up on page loads of over 30 seconds, it stopped indexing completely.
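
This is exactly where the "page status reported to you via text" idea from the top earns its keep. A bare-bones watchdog, sketched in Python with placeholder URLs and a threshold borrowed from this story's 30-second pain point, could look like this; swap the print for whatever SMS or email gateway you use.

```python
# Sketch only: time a few key pages and flag anything slower than the crawler's patience.
import time
import urllib.request

PAGES = ["https://www.example.com/", "https://www.example.com/articles/"]  # placeholders
THRESHOLD_SECONDS = 30.0   # roughly where the bot in this story gave up

for url in PAGES:
    start = time.time()
    try:
        with urllib.request.urlopen(url, timeout=60) as resp:
            resp.read()
            status = resp.status
    except Exception as exc:
        print(f"ALERT {url}: {exc}")           # replace print with your SMS/email hook
        continue
    elapsed = time.time() - start
    if elapsed > THRESHOLD_SECONDS or status >= 500:
        print(f"ALERT {url}: HTTP {status} in {elapsed:.1f}s")
    else:
        print(f"OK    {url}: HTTP {status} in {elapsed:.1f}s")
```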

I have noticed Google using two methods of identifying your page load issues: one with the media bot and the other with the spider bot. If a page in Google Analytics shows a high average time but only one sampling, my guess is that the media bot reported it on a page that had a rendering issue. I've cross-checked this across different sites and it seems consistent, but don't quote me on this page speed measurement; it's only my guess.

Once the server move was complete, the search bot came back, but very slowly. The spider did keep returning once each day to a single page, and when that page responded slowly it abandoned all indexing.

Once the site was back online, it took about 7 days for the spider bot to consider the site healthy again and start indexing as it had for years.

Image: Server issues on mydomain.com, Google bot stopped completely

Below is another time window for the same site you see above. The page timing jumped due to a bad server from an outsourced company.

Image: Server issues on mydomain.com

The image below shows how the site is running today. You can see from the graph above and the one below that once the spider bot cleared the site of its health issues, it went back to indexing it. But it waited about 45 days before completely reviewing the pages.

You can see the spike during a week-long re-index of the site, and the average daily indexing, which has been about the same for years.

Image: Google crawled pages (A1-31k)

Another site's graph is shown below; you can see the same delay, since this site had issues matching the one above. This might be something you have seen and need to research: why has the Google bot slowed or stopped indexing your site? In today's web it should always be visiting pages, because it determines your site's health and response times with its league of spiders.

Image: Google crawled pages (B1-340)

Now let's take a look at healthy sites that aren't having server or page issues. These examples are the same websites as above, but with a timeline up to today.

The image below starts with a site that was rebuilt, which is always something we hate doing when it costs time and hundreds if not thousands of redirects to create.

The new CMS site had 10 published pages in the first week, then hit 140 by the following week. The blast in indexing didn't happen until the website's 404 pages were all redirected to valid pages. This site had about 400 articles published which, until the redirects were added, didn't get the indexing time they needed.

Image: Webmaster Tools site crawler report (X-1)

From the graph you can see the jump, which matched the time when I finished most of the redirects; even if they didn't match the actual article, visitors were no longer landing on a 404 page. I started with a 302 temporary redirect, and then, as I published the old articles to their new pages, I switched them from the 302 temporary redirect to the 301 permanent redirect.

You'll find a lot of discussion about the proper method of redirecting pages, and no matter which method you select you have some work to do. I spent the extra time to make sure my top 300 pages stayed on top and did the move by the book, telling the bots, "Hey, I'm sending you here for a bit, but don't remove the index; I'll get it right!"
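
To make the 302-then-301 switch concrete, here is a toy sketch in Python (a tiny WSGI app) rather than the IIS/ASP setup the real site uses. The paths are invented; the only point is the flag that flips a mapping from temporary to permanent once the article is republished.

```python
# Sketch only: map old URLs to new ones, serving 302 until the new article is
# published, then 301 once the move is final. Paths are placeholders.
from wsgiref.simple_server import make_server

REDIRECTS = {
    # old path: (new path, move_is_final)
    "/old/apples-article.asp": ("/fruit/apples", False),   # still a 302 for now
    "/old/oranges-article.asp": ("/fruit/oranges", True),  # republished, so 301
}

def redirect_app(environ, start_response):
    target = REDIRECTS.get(environ.get("PATH_INFO", ""))
    if target is None:
        start_response("404 Not Found", [("Content-Type", "text/plain")])
        return [b"Not found"]
    new_path, final = target
    status = "301 Moved Permanently" if final else "302 Found"
    start_response(status, [("Location", new_path)])
    return [b""]

if __name__ == "__main__":
    make_server("", 8000, redirect_app).serve_forever()   # test server only
```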

From the graph I can say my extra week's worth of work paid off big. But I still have about 1,200 articles to publish, and when they do go up they should be indexed quickly, because the health of the site is what the spyder is looking for (or is that spider?).

OK, now that I have covered some bad server issues, how quickly your spiders will abandon your website, and the rebound that followed, I hope you look at your reports carefully and keep those page times in check. They depend on your site content and the number of scripts running, as well as the health of your server.

Then I covered how to bring back a site that was rebuilt. The 302-to-301 process can save you the waiting time if you focus on the site and do things right the first time. It's easy to get listed in the index; it's harder to correct a bad listing. Don't redirect many pages to a single page with a 301 unless you are delisting them, and don't add more than a few that way. Let some of the pages you don't want simply fall into the 404 response. It's best not to confuse the search index by redirecting a page that had content unrelated to the content of the page you are redirecting to.

If that's not very clear: do not redirect your Apples page to your Oranges page unless you plan on having your Apples and your Oranges on that same page.

Next up! How your new website can be indexed the way you designed it, by following a few simple "oh my gosh, I should have thought about that beforehand" steps.

First, when you roll out a website, some of us start off with two or three pages and then build on that. CMS templates often offer default text that we have all seen before, but you really want to be indexed, so you launch those pages anyway. Guess what: now you're working to have those search bots come back and re-index the pages, but they just seem to be taking their time.

Get all your early launch pages set up before you build that search engine sitemap. Don't ever include a lorem ipsum page! It shows you didn't plan, and if you're advertising a web service, that's one heck of a way to show them you're a template genius.

OK, so what does it take to launch your site and get it indexed fast? Good timing, a good server host, and a good clean sitemap that leaves nothing to guess about.
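
For reference, a clean sitemap really is that small. Something along these lines, listing only launch-ready pages (the URLs here are placeholders), is all the bots need:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.example.com/</loc></url>
  <url><loc>https://www.example.com/articles/first-real-article</loc></url>
  <url><loc>https://www.example.com/contact</loc></url>
</urlset>
```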

Below is an image of an indexing nightmare that actually worked very well.

This is a new website that had never been on the internet until mid-May of this year. I started by setting all my pages to DO NOT INDEX EVER!!

Then I set my robots.txt file to block it all.

I needed time to build out my site and check the pages in a production environment, and I didn't need pesky bots indexing pages that didn't have content at the time. So blocking everything was my best solution, along with monitoring the bots that don't honor robots.txt and using a firewall to take care of them.
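
For anyone who hasn't done this before, the block-everything setup is only a few lines. The robots.txt below shuts the polite bots out of the whole site, and the meta tag is the per-page DO NOT INDEX switch; both are standard, just remember to remove them at launch.

```
# robots.txt while the site is being built (remove before launch)
User-agent: *
Disallow: /
```

```html
<!-- per-page version, in the <head> of any page you never want indexed -->
<meta name="robots" content="noindex, nofollow">
```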

It took about 2 weeks to fully index and set up the website, but the time spent paid off. The first spider hit took the bait on 3 pages, then it found the sitemaps, lots of them. The bots took a day or two just mulling over the sitemaps, and then it happened: 3 pages, 100 pages, 1,200 pages, and the next thing you know the site's 6,000 pages were all indexed and returning in search results.

If you see the spike, that was one heck of an indexing day; then it dropped off, and the site is now steady at just over 2,000 crawled pages each day.

Image: Webmaster Tools site crawler report (H-1)

Don't worry about the page loads at a 1.5-second average; this is a serious feed site that uses some very unique scripts and is loaded with old ASP code. But for 20 years that same code has powered a few other sites. Never give up on ASP; just speed up the servers and add lots of memory!!!

I have other charts from my personal websites and other sites that I manage that show 2 to 20 crawls each day; that's because they have 2 to 20 pages. It's interesting to see the average crawls per day. This wasn't the case years ago: once your page was indexed, it was done. Today I believe more search engines are using site health as a major factor in how often a site is indexed or crawled.

Oh, by the way, crawling and indexing are different. You might see some bots hitting thousands of pages with zero downloads. Those are page status checks and server response-time checks, not page loads.
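
If you want to see that split in your own logs, the sketch below (Python again, with assumed column positions) counts how many bot requests were plain status checks, that is 304 Not Modified responses or HEAD requests, versus real page downloads.

```python
# Sketch only: separate "checked but not downloaded" bot hits from real downloads.
from collections import Counter

METHOD_COL, STATUS_COL, AGENT_COL = 3, 5, 9     # assumed W3C column positions

counts = Counter()
with open("u_ex_today.log") as log:             # placeholder log file name
    for line in log:
        if line.startswith("#"):
            continue
        cols = line.split()
        if len(cols) <= AGENT_COL or "bot" not in cols[AGENT_COL].lower():
            continue
        if cols[STATUS_COL] == "304" or cols[METHOD_COL] == "HEAD":
            counts["status checks (no download)"] += 1
        else:
            counts["real downloads"] += 1

print(counts)
```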

I know, it's getting more complicated each day. Lucky for us, we have people who will write long articles about the stuff you worry about.

If what I am seeing is true, SEO people will need to start learning a lot about servers, and that's more than learning PHP, .NET, and ASP. How is that page processed? What about that router? Is the DNS hot or is it dead? Does every DNS server being used have a record for the site? How are the response times for those DNS servers?
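
Those DNS questions are easy to script, too. A small sketch like the one below, which assumes the third-party dnspython package (pip install dnspython) and uses placeholder nameserver IPs and a placeholder domain, asks every nameserver for the site's A record and times the answer.

```python
# Sketch only: check that every DNS server has a record for the site and time each answer.
import time
import dns.resolver   # third-party: pip install dnspython

DOMAIN = "www.example.com"                  # placeholder domain
NAMESERVERS = ["192.0.2.1", "192.0.2.2"]    # every DNS server that should answer for it

for ns in NAMESERVERS:
    resolver = dns.resolver.Resolver(configure=False)
    resolver.nameservers = [ns]
    start = time.time()
    try:
        answer = resolver.resolve(DOMAIN, "A")
        ips = ", ".join(record.address for record in answer)
        print(f"{ns}: {ips} in {(time.time() - start) * 1000:.0f} ms")
    except Exception as exc:
        print(f"{ns}: FAILED ({exc})")      # a dead nameserver or a missing record
```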

Lots of data to collect just to see how well your site looks to a simple application called a search engine indexing spider bot.

 

Have any questions? Need something translated into English? Just ask, I'll get around to your email sometime today.

One last note:

I'm flattered by the offers to build websites. I never thought about building a website for someone else, spending more hours making them show up in the top pages of searches.

Actually, I wouldn't mind working with a few of the best in design and graphics (99designs.com) on a few of the new applications I'll be rolling out this year. Maybe I'll ask you to help me, and the trade will be that we both profit from success or suffer the loss of time.
