How to Identify All URLs and Broken Links of a Website?

If you are curious to identify all the URLs of your website and want to do something with them, there is a way to get those URLs using "LinkChecker".

LinkChecker is a free, GPL-licensed website validator. "LinkChecker checks links in web documents or full websites. It runs on Python 2 systems, requiring Python 2.7.2 or later." So it can help us find all the URLs of a website and also report those URLs which are broken and not working.

To install LinkChecker on Ubuntu, run the command below,

$ sudo apt-get install linkchecker

We have installed it on Ubuntu 20.04; you can download it for other platforms if necessary.

Now, let's test it with another simple HTML website. [ You can change the URL to your own website's address. ]

$ linkchecker -v -F text/website-urls.txt <your-website-url>
INFO 2017-08-12 12:00:23,486 MainThread Checking intern URLs only; use --check-extern to check extern URLs.
1 thread active, 0 links queued, 0 links in 0 URLs checked, runtime 1 seconds
10 threads active, 53 links queued, 92 links in 12 URLs checked, runtime 6 seconds
10 threads active, 22 links queued, 183 links in 21 URLs checked, runtime 11 seconds
3 threads active, 0 links queued, 212 links in 31 URLs checked, runtime 16 seconds 

So, in the above command, "-v" prints all URLs in verbose mode, and "-F text/website-urls.txt" saves the output in "text" format to the file "website-urls.txt".

You can check the other command line parameters of "linkchecker" using,

$ linkchecker --help 

Now, let's try to identify which real HTML URLs exist on this website. The linkchecker text output for a single URL looks something like this,

URL `css/logo-nav.css'
Parent URL, line 19, col 5
Real URL
Check time 1.657 seconds
Size 2KB
Result Valid: 200 OK


URL `index.html'
Name `\n Byteslices Technologies\n '
Parent URL, line 58, col 17
Real URL
Check time 2.154 seconds
D/L time 0.038 seconds
Size 9KB
Result Valid: 200 OK

The section above shows two URLs, one CSS and one HTML. Since we want only the HTML ones, we will use the grep command on the log we collected, like below,

$ cat website-urls.txt | grep "Real URL" | grep html > only-html-links.txt 

The above command will extract only the HTML links and save them to another text file, "only-html-links.txt".
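As a minimal sketch of what that grep chain does (the sample lines below are invented, shaped like linkchecker's text report), only lines that both contain "Real URL" and mention html survive:

```shell
# Hypothetical sample lines mimicking linkchecker's text output
printf 'Real URL http://example.com/css/logo-nav.css\nReal URL http://example.com/index.html\n' \
  | grep "Real URL" | grep html
# → Real URL http://example.com/index.html
```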

Now, if you observe this file, it will have some duplicated lines which came from CSS-related URLs, so let's remove those duplicated lines using the "sort" and "uniq" commands as below,

$ sort only-html-links.txt | uniq
Real URL
Real URL
Real URL
Real URL
Real URL 
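A quick note on why "sort" comes before "uniq": uniq only collapses adjacent duplicate lines, so unsorted input can let duplicates slip through. A tiny demonstration with made-up lines:

```shell
# uniq alone misses non-adjacent duplicates
printf 'a\nb\na\n' | uniq          # → a b a (the repeated "a" survives)

# sorting first makes duplicates adjacent, so uniq can drop them
printf 'a\nb\na\n' | sort | uniq   # → a b
```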

So, now we have all the unique links with the html extension from this website. Next, let's remove the "Real URL" text from this file. To do this, we will save the output of the above command to a text file,

$ sort only-html-links.txt | uniq > only_uniq_links.txt 

Open the text file "only_uniq_links.txt" in gedit, use Find and Replace with "Real URL " as the Find text and nothing in the Replace field, and click "Replace All".
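If you prefer the command line over gedit, the same find-and-replace can be done with sed. This sketch recreates a small "only_uniq_links.txt" with hypothetical URLs, since your actual file will contain whatever linkchecker found:

```shell
# Hypothetical file contents; in practice this comes from the sort | uniq step above
printf 'Real URL http://example.com/index.html\nReal URL http://example.com/about.html\n' > only_uniq_links.txt

# Delete the leading "Real URL " label in place, leaving only the URLs
sed -i 's/^Real URL *//' only_uniq_links.txt

cat only_uniq_links.txt
```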

And… done. You get plain URLs like below in the "only_uniq_links.txt" text file.
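The whole extraction can also be collapsed into a single pipeline. This is a sketch under the same assumptions as above: "website-urls.txt" holds the linkchecker text report (here a small invented sample is created first so the snippet is self-contained):

```shell
# Hypothetical report; in practice website-urls.txt comes from the linkchecker run above
printf 'Real URL http://example.com/css/a.css\nReal URL http://example.com/index.html\nReal URL http://example.com/index.html\n' > website-urls.txt

# Filter to html links, de-duplicate (sort -u), and strip the "Real URL " label in one pass
grep "Real URL" website-urls.txt | grep html | sort -u | sed 's/^Real URL *//'
# → http://example.com/index.html
```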
