Comscore Crawler
A crawler, also known as a spider or a bot, is the software Comscore uses to visit and access the content of webpages.
The Comscore crawler:
- Identifies itself,
- Only downloads the static, textual content,
- Honors the rules of a robots.txt,
- Doesn't execute JavaScript to generate ad impressions,
- Crawls at a slow rate by default.
FAQ
Here are answers to the most common questions. If you need to know more, please contact us.
Comscore's contextual content analysis enables advertising partners to determine the best matching campaign for a page's content.
When an ad is about to be served, the crawler visits the page and the page's content is contextually analyzed. How often a page is visited depends on many factors, such as the type of content, how often the content changes, and the number of ad elements, so the crawl frequency can differ from site to site.
Sites may also be crawled in a linear fashion to provide site-level analysis to advertising partners who are interested in a specific site.
The crawler identifies itself with the user-agent:
Mozilla/5.0 (compatible; proximic; +https://www.comscore.com/Web-Crawler)
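As a minimal sketch of how a site owner might recognize this user-agent in request logs, the following Python helper looks for the `proximic` token from the string above. The function name `is_comscore_crawler` is hypothetical, and a header match alone is only a hint, since any client can send an arbitrary User-Agent.

```python
# Token taken from the user-agent string quoted above.
CRAWLER_TOKEN = "proximic"


def is_comscore_crawler(user_agent: str) -> bool:
    """Return True if the User-Agent header contains the crawler token.

    Hypothetical helper: this is a case-insensitive substring check,
    not a verification of who actually sent the request.
    """
    return CRAWLER_TOKEN in user_agent.lower()


if __name__ == "__main__":
    ua = "Mozilla/5.0 (compatible; proximic; +https://www.comscore.com/Web-Crawler)"
    print(is_comscore_crawler(ua))
```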
Many premium publishers explicitly allow our crawler to access their sites. Publishers benefit from our analysis and gain deep insights into their inventory to optimize direct sales and to accurately target campaigns.
To whitelist our crawler please add a separate paragraph to the robots.txt like this:
User-agent: proximic
Disallow:
For those who need a more secure method of whitelisting, we can also add custom headers for requests to your site.
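To check locally that a robots.txt whitelist paragraph like the one above has the intended effect, one option is Python's standard `urllib.robotparser`. This is a sketch under assumptions: the surrounding `User-agent: *` block, the `otherbot` name, and the example.com URL are made up for illustration, and Python's parser may not interpret every rule exactly as the crawler does.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: everything is blocked by default,
# and the whitelist paragraph from above re-opens the site for proximic.
ROBOTS_TXT = """\
User-agent: *
Disallow: /

User-agent: proximic
Disallow:
"""


def can_fetch(robots_txt: str, agent: str, url: str) -> bool:
    """Parse robots.txt text and report whether `agent` may fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    page = "https://www.example.com/articles/1"  # illustrative URL
    print("proximic:", can_fetch(ROBOTS_TXT, "proximic", page))
    print("otherbot:", can_fetch(ROBOTS_TXT, "otherbot", page))
```

An empty `Disallow:` line means that nothing is disallowed for that user-agent, which is why the dedicated paragraph overrides the blanket rule.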
The crawler does not extract and store any source code, but only provides data about the publicly available content of the page, such as the content language, the content's rating (G, PG13, R) and relevant IAB categories of the content (e.g. "Real Estate::Buying/Selling Homes").
This analysis helps advertisers place topically relevant campaigns in a safe environment. Relevance drives CPM, which benefits publishers.
In general this should not happen. Unfortunately, some advertisers strip the URL parameters, which means a working URL like
www.forum.com/showthread.php?t=123
is rendered into something like this: www.forum.com/showthread.php?
If you want to prevent our crawler from visiting specific sections of your site, please add a separate paragraph to the robots.txt and specify the path you'd like to exclude:
User-agent: proximic
Disallow: /path/
Make sure that the robots.txt is in the correct location. It must be in the top directory, e.g. www.domain.com/robots.txt. Placing the file in a subdirectory won't have any effect. Furthermore please note that the IP addresses used by the crawler change from time to time and that it may take up to a day for changes in robots.txt to propagate across all systems.
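The path exclusion and the location rule above can be sanity-checked with a short Python sketch. The helper names `robots_url` and `is_allowed` and the sample page URLs are hypothetical, and the standard-library parser is only an approximation of how the crawler itself reads the file.

```python
from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser

# The exclusion paragraph from above.
ROBOTS_TXT = """\
User-agent: proximic
Disallow: /path/
"""


def robots_url(page_url: str) -> str:
    """Return the top-directory robots.txt URL for the host serving `page_url`."""
    return urljoin(page_url, "/robots.txt")


def is_allowed(robots_txt: str, agent: str, url: str) -> bool:
    """Parse robots.txt text and report whether `agent` may fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    print(robots_url("https://www.domain.com/path/page.html"))
    print(is_allowed(ROBOTS_TXT, "proximic", "https://www.domain.com/path/page.html"))
    print(is_allowed(ROBOTS_TXT, "proximic", "https://www.domain.com/other/page.html"))
```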
Our bot usually makes one request per second. However, you can control this rate by adding the Crawl-Delay directive to your robots.txt. A Crawl-Delay setting tells the bot how many seconds to wait between two requests.
User-agent: proximic
Crawl-Delay: 2
With the above setting, our bot will make no more than one request every 2 seconds.
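The effect of a Crawl-Delay setting can be illustrated with a small Python sketch. It reads the directive with the standard `urllib.robotparser` and lists the earliest times at which consecutive requests could start. The helper names and the one-second fallback constant are assumptions for illustration, based on the default rate stated above. This is not the crawler's actual scheduler.

```python
from urllib.robotparser import RobotFileParser

# Assumed fallback, based on the default rate of one request per second.
DEFAULT_DELAY_SECONDS = 1

ROBOTS_TXT = """\
User-agent: proximic
Crawl-Delay: 2
"""


def effective_delay(robots_txt: str, agent: str = "proximic") -> int:
    """Return the Crawl-Delay for `agent`, or the assumed default if none is set."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    delay = parser.crawl_delay(agent)
    return DEFAULT_DELAY_SECONDS if delay is None else delay


def earliest_request_times(count: int, delay_seconds: int) -> list:
    """Earliest start time (seconds from now) of each of `count` requests."""
    return [i * delay_seconds for i in range(count)]


if __name__ == "__main__":
    delay = effective_delay(ROBOTS_TXT)
    print(delay, earliest_request_times(4, delay))
```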