There are many reasons you might need to find all the URLs on a website, but your exact goal will determine what you're searching for. For example, you may want to:
Find every indexed URL to analyze issues like cannibalization or index bloat
Collect current and historic URLs Google has seen, especially for site migrations
Find all 404 URLs to recover from post-migration errors
In each scenario, a single tool won't give you everything you need. Unfortunately, Google Search Console isn't exhaustive, and a "site:example.com" search is limited and difficult to extract data from.
In this post, I'll walk you through some tools to build your URL list before deduplicating the data using a spreadsheet or Jupyter Notebook, depending on your website's size.
Old sitemaps and crawl exports
If you're looking for URLs that disappeared from the live site recently, there's a chance someone on your team may have saved a sitemap file or a crawl export before the changes were made. If you haven't already, check for these files; they can often provide what you need. But if you're reading this, you probably didn't get so lucky.
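If you do turn up an old sitemap, extracting its URLs takes only a few lines. Here's a minimal sketch in Python, assuming a standard sitemap.xml saved locally (the filename is a placeholder):

```python
import xml.etree.ElementTree as ET

# Standard sitemap namespace defined by sitemaps.org
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

tree = ET.parse("old-sitemap.xml")  # placeholder filename
urls = [
    loc.text.strip()
    for loc in tree.getroot().findall(".//sm:loc", NS)
    if loc.text
]
print(f"Recovered {len(urls)} URLs")
```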
Archive.org
Archive.org is an invaluable tool for SEO tasks, funded by donations. If you search for a domain and select the "URLs" option, you can access up to 10,000 listed URLs.
However, there are a few limitations:
URL limit: You can only retrieve up to 10,000 URLs, which is insufficient for larger sites.
Quality: Many URLs may be malformed or reference resource files (e.g., images or scripts).
No export option: There isn't a built-in way to export the list.
To bypass the lack of an export button, use a browser scraping plugin like Dataminer.io. However, these limitations mean Archive.org may not provide a complete solution for larger sites. Also, Archive.org doesn't indicate whether Google indexed a URL, but if Archive.org found it, there's a good chance Google did, too.
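If scraping the interface feels fragile, Archive.org's CDX API returns the same capture data in bulk. A rough sketch (the domain is a placeholder, and the same 10,000-row ceiling is set here via the limit parameter):

```python
import requests

# Wayback Machine CDX API: list captured URLs for a domain
resp = requests.get(
    "https://web.archive.org/cdx/search/cdx",
    params={
        "url": "example.com/*",   # placeholder domain
        "output": "json",
        "fl": "original",         # return only the original URL field
        "collapse": "urlkey",     # deduplicate repeated captures of one URL
        "limit": "10000",
    },
    timeout=60,
)
rows = resp.json()
urls = [row[0] for row in rows[1:]]  # first row is the header
print(len(urls))
```

The collapse=urlkey parameter folds repeated snapshots of the same page into one row, which keeps the list manageable.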
Moz Pro
While you might typically use a link index to find external sites linking to you, these tools also discover URLs on your site in the process.
How to use it:
Export your inbound links in Moz Pro to get a quick and easy list of target URLs from your site. If you're dealing with a massive website, consider using the Moz API to export data beyond what's manageable in Excel or Google Sheets.
It's important to note that Moz Pro doesn't confirm if URLs are indexed or discovered by Google. However, since most sites apply the same robots.txt rules to Moz's bots as they do to Google's, this method generally works well as a proxy for Googlebot's discoverability.
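If you take the API route, the request looks roughly like the sketch below. This assumes Moz's Links API v2; the credentials and target are placeholders, and it's worth double-checking field names against the current Moz documentation:

```python
import requests

# Rough sketch against Moz's Links API v2; verify the endpoint and
# parameters against the current documentation before relying on this.
ACCESS_ID = "your-access-id"   # placeholder credentials
SECRET_KEY = "your-secret-key"

resp = requests.post(
    "https://lsapi.seomoz.com/v2/links",
    auth=(ACCESS_ID, SECRET_KEY),
    json={
        "target": "example.com/",       # placeholder site
        "target_scope": "root_domain",  # links to anywhere on the domain
        "limit": 50,
    },
    timeout=60,
)
# Each returned link record includes the target URL on your site
target_urls = {link["target"] for link in resp.json().get("results", [])}
```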
Google Search Console
Google Search Console offers several valuable sources for building your list of URLs.
Links reports:
Similar to Moz Pro, the Links section provides exportable lists of target URLs. Unfortunately, these exports are capped at 1,000 URLs each. You can apply filters for specific pages, but since filters don't apply to the export, you might need to rely on browser scraping tools, limited to 500 filtered URLs at a time. Not ideal.
Performance → Search Results:
This export gives you a list of pages receiving search impressions. While the export is limited, you can use the Google Search Console API for larger datasets. There are also free Google Sheets plugins that simplify pulling more extensive data.
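For larger properties, the API's Search Analytics endpoint can page through every URL that received impressions. A minimal sketch, assuming a service account that has been granted access to the property (the credentials file, dates, and property name are placeholders):

```python
from google.oauth2 import service_account
from googleapiclient.discovery import build

# Authenticate with a service account that has read access to the property
creds = service_account.Credentials.from_service_account_file(
    "service-account.json",  # placeholder credentials file
    scopes=["https://www.googleapis.com/auth/webmasters.readonly"],
)
service = build("searchconsole", "v1", credentials=creds)

pages, start_row = [], 0
while True:
    body = {
        "startDate": "2024-01-01",  # placeholder date range
        "endDate": "2024-12-31",
        "dimensions": ["page"],
        "rowLimit": 25000,          # API maximum per request
        "startRow": start_row,
    }
    rows = (
        service.searchanalytics()
        .query(siteUrl="sc-domain:example.com", body=body)  # placeholder property
        .execute()
        .get("rows", [])
    )
    if not rows:
        break
    pages += [r["keys"][0] for r in rows]
    start_row += len(rows)
```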
Indexing → Pages report:
This section offers exports filtered by issue type, though these are also limited in scope.
Google Analytics
The Engagement → Pages and Screens default report in GA4 is an excellent source for collecting URLs, with a generous limit of 100,000 URLs.
Even better, you can apply filters to create different URL lists, effectively surpassing the 100k limit. For example, if you want to export only blog URLs, follow these steps:
Step 1: Add a segment to the report
Step 2: Click "Create a new segment."
Step 3: Define the segment with a narrower URL pattern, such as URLs containing /blog/
Note: URLs found in Google Analytics may not be discoverable by Googlebot or indexed by Google, but they offer valuable insights.
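If you'd rather script the same filter than click through the UI, the GA4 Data API can return pagePath values directly. A rough sketch (the property ID and date range are placeholders, and it assumes application default credentials are configured):

```python
from google.analytics.data_v1beta import BetaAnalyticsDataClient
from google.analytics.data_v1beta.types import (
    DateRange, Dimension, Filter, FilterExpression, Metric, RunReportRequest
)

client = BetaAnalyticsDataClient()  # uses application default credentials
request = RunReportRequest(
    property="properties/123456789",  # placeholder GA4 property ID
    dimensions=[Dimension(name="pagePath")],
    metrics=[Metric(name="screenPageViews")],
    date_ranges=[DateRange(start_date="2024-01-01", end_date="2024-12-31")],
    # Equivalent of the /blog/ segment from the UI steps above
    dimension_filter=FilterExpression(
        filter=Filter(
            field_name="pagePath",
            string_filter=Filter.StringFilter(
                value="/blog/",
                match_type=Filter.StringFilter.MatchType.CONTAINS,
            ),
        )
    ),
    limit=100000,
)
paths = [row.dimension_values[0].value for row in client.run_report(request).rows]
```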
Server log files
Server or CDN log files are perhaps the ultimate tool at your disposal. These logs capture an exhaustive list of every URL path queried by users, Googlebot, or other bots during the recorded period.
Considerations:
Data size: Log files can be massive, so many sites only retain the last two weeks of data.
Complexity: Analyzing log files can be challenging, but various tools are available to simplify the process.
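Even without a dedicated log analyzer, a short script gets you a usable URL list. A minimal sketch for the common Apache/Nginx "combined" log format (the filename is a placeholder; adjust the regex if your format differs):

```python
import re
from urllib.parse import urlsplit

# Match the request line inside a "combined" format log entry
LINE = re.compile(r'"(?:GET|POST|HEAD) (?P<path>\S+) HTTP/[^"]*"')

paths = set()
googlebot_paths = set()
with open("access.log") as fh:  # placeholder log file
    for line in fh:
        m = LINE.search(line)
        if not m:
            continue
        path = urlsplit(m.group("path")).path  # drop query strings
        paths.add(path)
        # Crude UA check; the string can be spoofed, so verify
        # via reverse DNS if the distinction matters
        if "Googlebot" in line:
            googlebot_paths.add(path)

print(len(paths), "unique paths,", len(googlebot_paths), "seen by Googlebot")
```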
Combine, and good luck
Once you've collected URLs from all these sources, it's time to combine them. If your site is small enough, use Excel or, for larger datasets, tools like Google Sheets or a Jupyter Notebook. Ensure all URLs are consistently formatted, then deduplicate the list.
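In a notebook, the combine-and-deduplicate step might look something like this sketch with pandas (the filenames and normalization rules are placeholders; adapt them to your exports):

```python
import pandas as pd

# Combine exports from each source; assume URLs are in the first column
frames = []
for f in ["archive_org.csv", "moz.csv", "gsc.csv", "ga4.csv", "logs.csv"]:
    df = pd.read_csv(f)
    frames.append(df.iloc[:, 0].rename("url"))
urls = pd.concat(frames).dropna()

# Light normalization: trim whitespace, unify the protocol, and strip
# trailing slashes so near-duplicates collapse together
urls = (
    urls.str.strip()
        .str.replace(r"^http://", "https://", regex=True)
        .str.rstrip("/")
)
urls.drop_duplicates().sort_values().to_csv("all_urls.csv", index=False)
```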
And voilà, you now have a comprehensive list of current, old, and archived URLs. Good luck!