If you are curious to identify all the URLs of your website and want to do something with them, you can get those URLs using "LinkChecker".
"LinkChecker is a free, GPL licensed website validator. LinkChecker checks links in web documents or full websites. It runs on Python 2 systems, requiring Python 2.7.2 or later." So it can tell us all the URLs of a website and also report which URLs are broken and not working.
To install LinkChecker on Ubuntu, run:
$ sudo apt-get install linkchecker
We installed it on Ubuntu 20.04; if necessary, you can download packages for other platforms from http://ftp.debian.org/debian/pool/main/l/linkchecker/
Now, let's test it with another simple html website. [You can change the URL to your own website.]
$ linkchecker http://www.byteslices.com -v -F text/website-urls.txt
INFO 2017-08-12 12:00:23,486 MainThread Checking intern URLs only; use --check-extern to check extern URLs.
1 thread active,   0 links queued,   0 links in  0 URLs checked, runtime 1 seconds
10 threads active, 53 links queued,  92 links in 12 URLs checked, runtime 6 seconds
10 threads active, 22 links queued, 183 links in 21 URLs checked, runtime 11 seconds
3 threads active,   0 links queued, 212 links in 31 URLs checked, runtime 16 seconds
In the command above, "-v" (verbose mode) logs every checked URL, not just the broken ones, and "-F text/website-urls.txt" writes the output in "text" format to the file "website-urls.txt".
You can check the other command-line parameters of "linkchecker" with:
$ linkchecker --help
Now, let's identify which real html URLs exist on this website. The linkchecker text output for a single URL looks like this:
Parent URL http://www.byteslices.com, line 19, col 5
Real URL http://www.byteslices.com/css/logo-nav.css
Check time 1.657 seconds
Result Valid: 200 OK
Name `\n Byteslices Technologies\n '
Parent URL http://www.byteslices.com, line 58, col 17
Real URL http://www.byteslices.com/index.html
Check time 2.154 seconds
D/L time 0.038 seconds
Result Valid: 200 OK
This shows a section with two URLs, one css and one html. Since we only want the html ones, we run grep on the log we collected, like below:
$ cat website-urls.txt | grep "Real URL" | grep html > only-html-links.txt
The above command keeps only the html links and saves them to another text file, "only-html-links.txt".
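As an aside, awk can do the matching and extract just the URL in a single pass (this assumes the log lines have the form "Real URL http://...", as in the output above; "html-urls.txt" is an illustrative file name, and this shortcut skips the prefix-removal step described later):

```shell
# Keep lines that start with "Real URL" and mention .html,
# then print only the third whitespace-separated field (the URL itself).
awk '/^Real URL/ && /\.html/ { print $3 }' website-urls.txt > html-urls.txt
```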
Now if you inspect this file, you will notice some duplicate lines, since the same URL can be linked from several pages. Let's remove those duplicates using the "sort" and "uniq" commands, as below:
$ sort only-html-links.txt | uniq
Real URL http://www.byteslices.com/about.html
Real URL http://www.byteslices.com/contact.html
Real URL http://www.byteslices.com/embedded-iot.html
Real URL http://www.byteslices.com/index.html
Real URL http://www.byteslices.com/web-technologies.html
So, now we have all the unique links with the html extension from this website. Next, let's remove the "Real URL" text from each line. First, save the output of the above command to a text file:
$ sort only-html-links.txt | uniq > only_uniq_links.txt
Open the text file "only_uniq_links.txt" in gedit, use Find and Replace with "Real URL " as the Find string and an empty Replace string, and click "Replace All".
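If you prefer to stay on the command line instead of using gedit, sed can strip the prefix in one step (this assumes every line starts with exactly "Real URL " as shown above; "final-urls.txt" is just an illustrative name):

```shell
# Delete the leading "Real URL" plus any following spaces from every line.
sed 's/^Real URL *//' only_uniq_links.txt > final-urls.txt
```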
And... done! You get URLs like the below in the "only_uniq_links.txt" text file:
http://www.byteslices.com/about.html
http://www.byteslices.com/contact.html
http://www.byteslices.com/embedded-iot.html
http://www.byteslices.com/index.html
http://www.byteslices.com/web-technologies.html
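Finally, the whole post-processing (filter html links, strip the "Real URL" prefix, remove duplicates) can be collapsed into one pipeline over the linkchecker log. This is a sketch assuming the "Real URL <url>" log format shown earlier; "all-html-urls.txt" is an illustrative name:

```shell
grep "Real URL" website-urls.txt | grep html \
  | sed 's/^Real URL *//' \
  | sort -u > all-html-urls.txt
```

"sort -u" does the work of "sort | uniq" in a single command.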