Duplicate Content: All Evidence Considered, All Questions Answered
Duplicate content. It’s one of the most hotly contested and mystery-shrouded concepts in SEO. I’m going to tackle this concept right here, right now. I decided to write about this topic for two reasons:
1) I am about to launch a massive Website which I hope to monetize quickly, but I need to know if it’ll be a good idea to syndicate the content from the site as a viable method for obtaining direct referral traffic and backlinks without compromising its organic search traffic (more on this later).
2) I searched for two hours last night and couldn’t find a definitive conclusion to this question.
In this blog post, I’m going to address everything you want and/or need to know about duplicate content. This is partially for my own personal future reference as I will undoubtedly face this question again in the future, but also to share with you the fruits of hours upon hours of research, testing, and analysis that I’ve been working on. After all, sharing makes everything more fun, right? OK, take a deep breath. Here goes.
Duplicate content: What is it?
Here’s Google’s own definition of duplicate content:
Duplicate content generally refers to substantive blocks of content within or across domains that either completely match other content or are appreciably similar.
So basically, there are two types of duplicate content:
- Duplicate content within the same domain
- Duplicate content across different domains
First, let’s cover duplicate content within the same domain.
Q: Is there a duplicate content penalty?
Ever since I started getting my feet wet in SEO, this question has swirled around forums and blogs. Somewhere, someone out there perpetuated the idea that having the same content on page A of your Website as page B of your Website would cause your site to be penalized in search engine rankings. This idea began to percolate in the internet marketing community because a bunch of spammers realized that when they had a piece of content (i.e., an article) that was getting a lot of search traffic, they could fill up every page of their Website with the same content in order to pull even more traffic from the search engines. Obviously, the same article blatantly duplicated across hundreds of pages within a single domain is a malicious attempt to gain search engine traffic without actually adding any value. Google caught on pretty quickly to this method and fixed its algorithms to detect duplicate content and display only one version of it in the search rankings. Websites that engaged in this blatant activity were de-indexed, and their owners cried a river across forums and blogs throughout the internet marketing community. Thus was born the fear of the “duplicate content penalty.”
However, in the vast majority of cases, duplicate content is non-malicious and simply a product of whichever CMS (content management system) the Website happens to be running on. For example, WordPress (the industry-standard CMS) automatically creates “Category” and “tag” pages which list all blog posts within certain categories or tags. This creates multiple URLs within the domain that contain the same content. For example, this particular post will be on the root domain (www.jaysondemers.com, while it remains on the first page), the “single post” version (which you can find by clicking the title of the blog), and in the “Categories” and “Tags” pages. So that means this particular post will be duplicated 4 times on this domain. But am I doing that intentionally in order to get more search engine traffic? No! It’s simply a product of the automatic, behind-the-scenes work that my CMS (WordPress) is doing.
Google knows this, and they are not going to penalize me for it. Millions of Websites are running on WordPress and have the exact same thing happening. But what if I were to take this particular post and re-post it 100 times in a row on my blog? That would definitely raise red flags when Google’s crawler sees it, and one of two things will happen at that point.
1) Google may decide to let me off with a “warning” and simply choose not to index 99 of my 100 duplicate posts, but keep one of them indexed. NOTE: This doesn’t mean my Website’s search rankings would be affected in any way.
2) Google may decide it’s such a blatant attempt at gaming the system that it completely de-indexes my entire Website from all search results. This means that, even if you searched directly for “jaysondemers.com” Google would find no results.
So, one of those two scenarios is guaranteed to happen. Which one it is depends on how egregious Google determines your blunder to be. In Google’s own words:
Duplicate content on a site is not grounds for action on that site unless it appears that the intent of the duplicate content is to be deceptive and manipulate search engine results. If your site suffers from duplicate content issues, and you don’t follow the advice listed above, we do a good job of choosing a version of the content to show in our search results.
This type of non-malicious duplication is fairly common, especially since many CMSs don’t handle this well by default. So when people say that having this type of duplicate content can affect your site, it’s not because you’re likely to be penalized; it’s simply due to the way that web sites and search engines work.
Most search engines strive for a certain level of variety; they want to show you ten different results on a search results page, not ten different URLs that all have the same content. To this end, Google tries to filter out duplicate documents so that users experience less redundancy.
So, what happens when a search engine crawler detects duplicate content? (from http://searchengineland.com/search-illustrated-how-a-search-engine-determines-duplicate-content-13980)
The final word: Duplicate content on the same domain
The final word is that, unless you are really blatantly duplicating your content across tons of URLs within the same domain, there’s nothing to worry about. One of your URLs on which the duplicated content resides will be indexed and chosen as the “representative” of that URL cluster. When users perform search queries in the search engines, that particular piece of content will display as a result for relevant queries, and the other URLs in the dupe cluster will not. Simple as that.
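To make the “dupe cluster” idea concrete, here’s a minimal sketch in Python. It is not how Google does it; it only groups URLs whose text is identical after lowercasing and collapsing whitespace, and it picks the shortest URL as the representative, which is an arbitrary stand-in for whatever signals a real search engine uses. The function names and example URLs are made up for illustration.

```python
import hashlib
import re
from collections import defaultdict


def normalize(text):
    """Lowercase and collapse whitespace so trivial differences don't matter."""
    return re.sub(r"\s+", " ", text.strip().lower())


def cluster_duplicates(pages):
    """Group URLs whose normalized text is identical.

    `pages` maps URL -> page text. Returns a dict mapping each cluster's
    representative URL to the list of other URLs in that cluster.
    ASSUMPTION: the representative is simply the shortest URL; a real
    search engine would rely on many other signals.
    """
    clusters = defaultdict(list)
    for url, text in pages.items():
        digest = hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()
        clusters[digest].append(url)

    result = {}
    for urls in clusters.values():
        ordered = sorted(urls, key=lambda u: (len(u), u))
        result[ordered[0]] = ordered[1:]
    return result


if __name__ == "__main__":
    # Hypothetical URLs standing in for a post and its category-page copy.
    pages = {
        "https://example.com/post": "Duplicate  content explained",
        "https://example.com/category/seo/post": "duplicate content explained",
        "https://example.com/about": "About this blog",
    }
    for representative, hidden in cluster_duplicates(pages).items():
        print(representative, "hides", hidden)
```

In this toy version, only the representative URL would show up in results, and the rest of the cluster is filtered out rather than penalized.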
However, the other side of the coin is duplicate content across different domains. And that’s a whole different monster. Ready to tackle it? Here we go.
Duplicate content across domains: What is it?
Sometimes, the same piece of content can appear word-for-word across different URLs. Some examples of this include:
- News articles (think Associated Press)
- The same article from an article directory being picked up by different Webmasters
- Webmasters submitting the same content to different article directories
- Press releases being distributed across the Web
- Product information from a manufacturer appearing across different e-commerce Websites
All these examples result from content syndication. The Web is full of syndicated content. One press release can create duplicate content across thousands of unique domains. But search engines strive to deliver a good user experience to searchers, and delivering a results page consisting of the same pieces of content would not make very many people happy. So what is a search engine supposed to do? Somehow, it has to decide which location of the content is the most relevant to show the searcher. So how does it do that? Straight from the big G:
When encountering such duplicate content on different sites, we look at various signals to determine which site is the original one, which usually works very well. This also means that you shouldn’t be very concerned about seeing negative effects on your site’s presence on Google if you notice someone scraping your content.
Well, Google, I beg to differ. Unfortunately, I don’t think you’re very good at deciding which site is the originator of the content. Neither does Michael Gray, who laments in his blog post “When Google Gets Duplicate Content Wrong” that Google often attributes his original content to other sites to which he syndicates his content. According to Michael:
However the problem is with Google, their ranking algo IMHO places too much of a bias on domain trust and authority.
And I agree with Michael. For much of my internet marketing career I have syndicated full articles to various article directories in order to expand the reach of my content while also using it as “SEO fuel” to get backlinks to my Websites. According to Google, as long as your syndicated versions contain a backlink to your original, this will help your case when Google decides which piece is the original. Here’s proof:
First, a video featuring Matt Cutts, a well-known blogger and search engine algorithm engineer for Google:
The discussion on syndication starts at about 2:25. At 2:54 he says you can tell people that you’re the “master of the content” by including a link from the syndicated piece back to your original piece.
In cases when you are syndicating your content but also want to make sure your site is identified as the original source, it’s useful to ask your syndication partners to include a link back to your original content.
Syndicate carefully: If you syndicate your content on other sites, Google will always show the version we think is most appropriate for users in each given search, which may or may not be the version you’d prefer. However, it is helpful to ensure that each site on which your content is syndicated includes a link back to your original article. You can also ask those who use your syndicated material to use the noindex meta tag to prevent search engines from indexing their version of the content.
Now, what I think is interesting from this last quote from Google is that they actually admit that the piece of content they choose may not be the right one. In my experience, it’s very likely not to pick the right one if the site that originated the content is relatively young or has a low PageRank. So this raises the next big issue:
How do I get ranked as the original source for the content I syndicate?
I’ve syndicated tons of my articles to EzineArticles only to see Google credit them with higher search results for my content, even when I made fully sure that Google had indexed my content at its original location prior to submitting it to Ezine. Vanessa Fox, who previously worked at Google and built Webmaster Central, attempts to tackle this question in her blog post, “Ranking as the Original Source for the Content you Syndicate.”
Unfortunately, she concludes that there’s basically nothing you can do to guarantee that you rank as the original source. She suggests:
Create a different version of the content to syndicate than what you write for your own site. This method works best for things like product affiliate feeds. I don’t think it works as well for things like blog posts or other types of articles. Instead, you could do something like write a high level summary article for syndication and a blog post with details about that topic for your own site.
Rewriting a piece of content is not my definition of syndication. That’s just rewriting an article in different words and distributing it. Almost all information circulating on the Web has already been posted elsewhere anyway; even this blog post is composed of a ton of information that I found elsewhere on the internet. So to me, writing a new article that says the same thing in different words and distributing that to syndication partners isn’t really syndication of the original article. It’s syndication of a different article. So we’re still left with the question of the results of syndicating the exact same content that already appears on your Website: what are the effects of doing so? Can it harm my rankings in any way?
To me, this is the most important question surrounding duplicate content. Before I jump into that analysis, let’s consider an important foundational question.
Why would I want to syndicate the exact same content from my Website elsewhere?
The internet really operates on a simple economy of give-and-take. The two commodities that are exchanged are unique content and backlinks. Unique content is defined as content which Google does not identify as duplicate. There are various theories about where exactly Google draws the line of deciding whether content should be considered duplicate, but one figure I’ve heard tossed around a lot is 30%. Basically, according to the 30% theory, if Google identifies that more than 30% of a particular piece of content appears elsewhere across the internet, it’ll be categorized as duplicate. Now, I can’t attest to the accuracy of this figure, so take it for what it’s worth. There are also duplicate-content detection tools such as CopyScape, which are designed to help Webmasters check whether their content has been stolen and duplicated across other domains. A tool like this is also a good way to gauge whether your content is likely to be considered duplicate by Google. And that’s what really matters.
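If you want to play with the 30% theory yourself, here’s a rough sketch in Python. To be clear, this is my own illustration and not Google’s or CopyScape’s actual method: it breaks each text into overlapping 5-word sequences (often called “shingles”) and reports what fraction of one text’s shingles also appear in the other. The function names, the shingle size of 5, and the 0.30 threshold are all assumptions.

```python
import re


def shingles(text, k=5):
    """Return the set of k-word sequences (shingles) found in text."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def duplicated_fraction(candidate, other, k=5):
    """Fraction (0.0 to 1.0) of candidate's shingles that also appear in other."""
    cand = shingles(candidate, k)
    if not cand:
        # Text shorter than k words has no shingles to compare.
        return 0.0
    return len(cand & shingles(other, k)) / len(cand)


def looks_duplicate(candidate, other, threshold=0.30, k=5):
    """ASSUMPTION: the unverified 30% figure, used here purely as an example."""
    return duplicated_fraction(candidate, other, k) > threshold


if __name__ == "__main__":
    mine = "one two three four five six seven eight"
    theirs = "one two three four five six nine ten"
    print(duplicated_fraction(mine, theirs))
    print(looks_duplicate(mine, theirs))
```

Note that the measure is one-directional: it asks how much of your text shows up in the other one, not the reverse.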
But I’ve gotten a bit off track; let’s get back to the discussion of why you’d want to syndicate content. I mentioned the internet economy of backlinks and unique content. Unique content is desirable because it will be indexed by Google, giving that particular Website another instance of its “name in the hat” so to speak. Basically, the more content a Website has indexed, the more chances it has of being returned in Google’s search results for relevant queries.
But what about backlinks? Backlinks are simply links from any other Website to your own. Search engines consider it a “vote” when one Website links to another. This vote is used to determine authority and relevance in Google’s search results. In fact, it’s thought that backlinks are the single most important factor in determining how your Website should rank for a given query. There are a ton of factors that play into backlinks and how much their “vote” counts for, but I’ll get into that in a future blog post. For now, what you need to know is that backlinks are valuable because they improve your rankings in the search engines, and that means more traffic to your Website.
OK, so now we’ve covered the basic commodities of the micro-economy of the Web. This is important because when you syndicate your content, assuming you have included a backlink in it linking back to your original source, you get a backlink from each and every Website to which your content was syndicated. Awesome, right?
Maybe not. The first question is how highly Google values a backlink from a piece of content that is known to be duplicate content. Frankly, I don’t know. On the one hand, it’s easy to syndicate content to a bunch of auto-accept blogs if your sole goal is to get backlinks, and this says nothing about the quality of your content or how much the originator of the content should be rewarded. On the other hand, syndication can also be a great indicator of the quality of a particular piece of content. After all, why would it be syndicated so much if it weren’t really great?
In the end, Google probably has signals for how it answers these two questions, but the real answers are probably only known by the software engineers that coded the algorithm. Many folks try to boost the value of their syndicated content by engaging in content “spinning” which is perfectly legitimate as long as it’s not the garbage that’s often spouted out by automated software. I’ll go into more depth about content spinning in a later post. For now, we’re still trying to answer the question of whether syndicating content exactly as it appears on your own Website is a good idea or a bad idea. After careful testing I’ve come to the following conclusion:
I know, I know. That’s not the answer you wanted. Allow me to explain.
I own over 50 domains, and I like to do a lot of testing across them. I spent a couple hours last night performing searches for my content that I had syndicated to various other blogs and directories. And what I found was both disappointing and encouraging.
The disappointing part was that, in many cases, my syndicated content outranked my own original content. Even if a site ranked higher than mine for my own content had a backlink to my site, the originator of the content, it was like Google completely ignored that backlink and still gave more credit to the other sites. In some cases, my own site’s version of the content was nowhere to be found, obviously falling into Google’s duplicate URL cluster and being filtered out of the search results. This means that by syndicating my content, I actually, in effect, got my own content de-indexed.
This is pretty much the worst possible scenario, but it happened. Sometimes, at least. And that’s the weird part; sometimes, my content was recognized as the original content and received the highest ranking. With other sites and pieces of content, it ranked second behind a high-authority site, usually EzineArticles. So I have to conclude the following:
When you syndicate your content, it might:
- Cause your own, original content source (i.e., your Website) to be, in effect, de-indexed for that piece of content
- Cause your site to rank highly for queries relevant to your content, but not highest
- Cause your site to rank highest for your content
Well, that pretty much covers all the bases, doesn’t it? These are all the results I observed when looking at my own sites and the results of syndicating articles that originated on those sites. Basically, I can conclude that Google just doesn’t always get it right. And, Google doesn’t like to do anything with any sort of consistency. The last thing they want is for us SEOs to completely figure out their algorithm, because once that happens, the integrity of their search results will be destroyed as folks manipulate them all to hell.
The encouraging part was when I discovered that the backlinks from the syndicated content definitely helped my sites’ rankings for my target keywords. So there is definitely at least some value of backlinks originating from content which Google has labeled as “duplicate.”
So, the final question remains: Should I syndicate my content?
Let’s look at the benefits of doing so:
Benefits of syndicating your content:
- Get backlinks from lots of sites
- Expand your reach and brand awareness to highly-trafficked sites
- Get direct traffic via referrals from backlinks in your syndicated content
- Much cheaper way of getting backlinks than writing brand-new content (or re-writing existing content) for distribution/syndication
Drawbacks of syndicating your content:
- The sites to which you syndicate might actually outrank you for your own content if they have higher authority than your own site, even if you follow Google’s advice and include a backlink to the original source of the content
- Google might group the URL on which your content resides with the rest of the duplicates, hiding it from search engine results pages (effectively de-indexing it)
So, in the end, syndicating your content is risky. You can definitely get the best of both worlds if Google decides your site is the originator of the content, thereby rewarding your content with the top position in the search results and also getting all the juicy backlinks that play into your overall rankings for specific keywords. But if Google gets it wrong (and it does, quite often, contrary to what they might think), you risk having your content never rank for relevant search engine queries.
And this really worries me, because I’ve always held the opinion that there’s nothing someone else can do to harm the rankings of a particular Website. After analyzing these results, I fear I’ve found a loophole in my own argument: if someone else visits my Website, copies all my content and syndicates it around the Web, it’s possible that the sites to which my content was syndicated will actually rank higher for it than my own site. Google tries to address this problem here as well as in the Matt Cutts video:
In most cases a webmaster has no influence on third parties that scrape and redistribute content without the webmaster’s consent. We realize that this is not the fault of the affected webmaster, which in turn means that identical content showing up on several sites in itself is not inherently regarded as a violation of our webmaster guidelines. This simply leads to further processes with the intent of determining the original source of the content—something Google is quite good at, as in most cases the original content can be correctly identified, resulting in no negative effects for the site that originated the content.
Again, unfortunately I have to point out that in my own experience, repeatedly, I’ve seen my own content rank worse than the sites to which it was syndicated. So even though Google thinks it’s good at identifying the original source of the content, my data suggest otherwise. In time, we can only hope that Google improves this aspect of its algorithm; there’s certainly nothing more we can do as Webmasters. Instead, you just have to understand the benefits and drawbacks of syndication and decide whether you’re comfortable with taking on the risks of having Google wrongly identify ownership of your content.
Here are a couple tips to minimize the risk of Google getting it wrong (in theory):
- Always post new content to your own Website and then wait to syndicate it elsewhere until Google has crawled and indexed your content. You can check to see if a particular page has been indexed by performing a search query of your exact URL, in quotes. If the search returns the correct result (i.e., not zero results) then it has been indexed. Another neat trick you can try is to randomly select 10-12 consecutive words from your content and search for that string, again in quotes. You wouldn’t think it, but the likelihood that any 10-12 words in a specific sequence will appear elsewhere on the Web is extremely small. Try it now: copy and paste a random sentence from this paragraph into Google, surround it in quotes, and see how many results you get. You will probably only find this URL as a result, unless this article has been syndicated (this is also a great way to check out which sites have picked up your content when you syndicate it).
- Always include a backlink in your syndicated version to the original content source URL. Google says this is the way to do it right, but it’s still not a surefire thing. Nonetheless, it certainly can’t hurt.
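Both tips are easy to semi-automate. Here’s a minimal sketch in Python using only the standard library. The first helper builds a quoted exact-phrase query from a random run of consecutive words (I default to 11, but that’s just a number in the 10-12 range). The second parses the HTML of a syndicated copy and reports whether it links back to your original URL and whether it carries a robots noindex meta tag, the option Google mentions for syndication partners. The helper names and example URLs are hypothetical, and fetching the page is left to you.

```python
import random
from html.parser import HTMLParser


def exact_phrase_query(text, length=11, rng=random):
    """Pick `length` consecutive words from text and wrap them in quotes."""
    words = text.split()
    if len(words) <= length:
        phrase = " ".join(words)
    else:
        start = rng.randrange(len(words) - length + 1)
        phrase = " ".join(words[start:start + length])
    return f'"{phrase}"'


class _SyndicationParser(HTMLParser):
    """Collects link targets and notes any robots noindex meta tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.hrefs.append(attrs["href"])
        if tag == "meta" and (attrs.get("name") or "").lower() == "robots":
            if "noindex" in (attrs.get("content") or "").lower():
                self.noindex = True


def check_syndicated_copy(html, original_url):
    """Report whether a syndicated page links back and/or is noindexed.

    ASSUMPTION: URLs are compared as plain strings, ignoring only a
    trailing slash; redirects and tracking parameters are not handled.
    """
    parser = _SyndicationParser()
    parser.feed(html)
    target = original_url.rstrip("/")
    links_back = any(href.rstrip("/") == target for href in parser.hrefs)
    return {"links_back": links_back, "noindex": parser.noindex}


if __name__ == "__main__":
    copy_html = (
        '<html><head><meta name="robots" content="noindex, follow"></head>'
        '<body><p>Article text.</p>'
        '<a href="https://example.com/original-post/">Original</a></body></html>'
    )
    print(check_syndicated_copy(copy_html, "https://example.com/original-post"))
    print(exact_phrase_query("paste the full text of your article here " * 3))
```

Paste the quoted string into the search box by hand. As I said above, a link back is no guarantee that Google credits you as the source, so treat a passing check as a minimum and not as proof.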
What about taking Vanessa’s suggestion and re-writing your content before syndicating it?
This would definitely solve the problem of possibly getting your own content essentially de-indexed when Google wrongly attributes content ownership, but there are some major problems with it too:
- It’s really expensive if you have a lot of content. Think about how much time it would take you to rewrite each article you have. This post alone is over 4,000 words and took me 3+ hours to type! You could outsource the rewriting to a service like Human Rewriter but that will cost you around $4 per 500 words. That could get very expensive if you have a lot of content.
- You are still distributing content that is topically themed around the same keywords as your original content, so it’s not a stretch to think that the rewritten content would still outrank your original content for relevant search queries, especially on high-authority sites such as EzineArticles.
In the end, it all comes down to testing on a massive scale, getting solid data and making decisions based on that data. So here’s what I’m going to do. I’m going to run a huge test and then update this post with my results. At the beginning of the post I mentioned that I am soon launching a massive Website with tons of unique content. I’m going to syndicate it all, completely unedited, as far and wide as I possibly can. As I do so, I’ll monitor traffic sources to see what keywords people are using to find my content. Then, I’ll replicate those keyword queries in Google and see where my site ranks in the search results. This should be the definitive test for the merits of syndication.
Thanks for sticking with me through this post! Check back soon for updates.
Further reading on duplicate content:
Google Webmaster Central
Demystifying the Duplicate Content Penalty
Duplicate content due to scrapers
Ranking as the original source for content you syndicate
When Google gets duplicate content wrong
How a search engine determines duplicate content