When it comes to SEO, most people know that a website must
have content, "search engine friendly" site architecture/HTML, and
meta data (title tags and meta descriptions).
Another element that can also trip up websites, if implemented
improperly, is robots.txt. I was recently reminded of this while
reviewing the website of a large company that had spent extensive money on
building a mobile version of their website in a sub-directory. That's fine,
but a disallow statement in their robots.txt file (Disallow: /mobile/) meant
that the mobile site wasn't accessible to search engines.
Let's review how to properly implement robots.txt to avoid
search ranking problems that could damage your business, as well as how to
correctly disallow search engine crawling.
What is a Robots.txt File?
Simply put, if you go to domain.com/robots.txt, you should
see a list of directories of the website that the site owner is asking the
search engines to "skip" (or "disallow"). However, if you
aren't careful when editing the file, you could add directives that really
hurt your business.
There's tons of information about the robots.txt file
available at the Web Robots Pages, including the proper usage of the disallow
feature, and blocking "bad bots" from indexing your website.
The general rule of thumb is to make sure a robots.txt file
exists at the root of your domain (e.g., domain.com/robots.txt). To exclude all
robots from indexing part of your website, your robots.txt file would look
something like this:
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /junk/
The above syntax tells all robots not to crawl the
/cgi-bin/, /tmp/, and /junk/ directories on your website.
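You can sanity-check a file like this before deploying it. Python's standard library ships a robots.txt parser (urllib.robotparser) that applies the same matching rules well-behaved crawlers do. As a quick sketch, this feeds it the example file above and checks a few paths:

```python
from urllib.robotparser import RobotFileParser

# The example robots.txt from above, inlined here for illustration.
rules = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /junk/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Paths under a disallowed directory are reported as not fetchable...
print(parser.can_fetch("*", "/tmp/cache.html"))       # False
# ...while everything else remains open to crawlers.
print(parser.can_fetch("*", "/products/index.html"))  # True
```

Running a check like this against your live file (RobotFileParser can also load it by URL via set_url() and read()) is a cheap way to catch the kinds of mistakes described below before they cost you rankings.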
Other Real Life Examples of Robots.txt Gone Wrong
In the past, I reviewed a website that had a good amount of
content and several high quality backlinks. However, the website had virtually
no presence in the search engine results pages (SERPs).
What happened? Penalty? Well, no. The site's owner had
included a disallow to "/". They were telling the search engine
robots not to crawl any part of the website.
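For reference, blocking an entire site the way that owner accidentally did takes only two lines:

```
User-agent: *
Disallow: /
```

By contrast, a Disallow line with nothing after it allows everything, so it's an easy mistake to make in either direction.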
In another case, an SEO company edited the robots.txt file to
disallow crawling of all parts of a website after the site's owner stopped
paying the SEO company.
I also remember reviewing a company's website and noticing
that quite a lot of directories from their former site were
disallowed in their robots.txt file. Instead of disallowing the search
engines from crawling the old legacy pages, the company should have set up
301 permanent redirects to pass the value from the old web pages to the
new pages. As it was, all of that value was lost.
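Setting up those redirects is usually a few lines of server configuration. As a sketch, assuming an Apache server and hypothetical old and new page paths, an .htaccess entry could look like:

```
Redirect 301 /old-site/about.html https://domain.com/new-site/about.html
```

Each old URL that had earned links or rankings would get its own line pointing to its closest equivalent on the new site.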
Robots.txt Dos and Don'ts
There are many good SEO reasons to stop the search engines from
crawling certain directories on a website while allowing others.
Let's look at some examples.
Here's what you should do with robots.txt:
Take a look at all of the directories in your website. Most
likely, there are directories that you'd want to disallow the search engines
from indexing, including directories like /cgi-bin/, /wp-admin/, /cart/,
/scripts/, and others that might include sensitive data.
Stop the search engines from indexing certain directories of
your site that might include duplicate content. For example, some websites have
"print versions" of web pages and articles that allow visitors to
print them easily. You should only allow the search engines to index one
version of your content.
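As a sketch, assuming the printable copies live under a hypothetical /print/ directory, a single rule keeps crawlers on the primary versions:

```
User-agent: *
Disallow: /print/
```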
Make sure that nothing stops the search engines from
indexing the main content of your website.
Look for certain files on your site that you might want to
disallow the search engines from indexing, such as certain scripts, or files
that might contain email addresses, phone numbers, or other sensitive data.
Here's what you should not do with robots.txt:
Don't put extra information in comments in your robots.txt file. (Lines beginning with # are technically valid, but there's no reason to leave notes there for competitors to read.)
Don't list all your files in the robots.txt file. Listing
the files allows people to find files that you don't want them to find.
The original robots.txt standard defines no "allow" command
(though some major crawlers, such as Googlebot, support an Allow directive), so there's generally no need to add one to your robots.txt file.
By taking a good look at your website's robots.txt file and
making sure that the syntax is set up correctly, you'll avoid search engine
ranking problems. And by disallowing the search engines from crawling duplicate
content on your website, you can potentially overcome duplicate content issues
that might hurt your search engine rankings.
One last note: if you aren't sure whether you can do this
correctly, please consult with an SEO professional.