How to Extract Links and Text from a Web Page with Lynx
Lynx is a text-only Web browser that can be used to extract the text and links from a Web page. Google recommended it in its SEO guidelines from the early days.
Because Lynx doesn’t render JavaScript, images, videos, or other non-text content, it shows roughly what a bot sees when it crawls the page: only the server-rendered HTML. That makes it a quick way to check which parts of a page’s content are server-rendered.
At the bottom of Lynx’s output will be a list of all the visible links on the page.
First, I’ll cover how to install the Lynx browser, and then I’ll go over its basic usage and Google’s recommendations.
I’ve also made a video that walks through the steps:
How to Install Lynx
Lynx is launched from a terminal. To check if it’s already installed, open a terminal and type this command:
lynx --version
If it’s installed, you should see output that is similar to this:
Lynx Version 2.8.9rel.1 (08 Jul 2018)
libwww-FM 2.14, SSL-MM 1.4.1, OpenSSL 1.1.1q, ncurses 5.7.20081102
Built on darwin21.1.0 (Oct 25 2021 04:30:19).
Copyrights held by the Lynx Developers Group,
the University of Kansas, CERN, and other contributors.
Distributed under the GNU General Public License (Version 2).
See https://lynx.invisible-island.net/ and the online help for more information.
See http://www.openssl.org/ for information about OpenSSL.
If it isn’t already installed, it’s easy to install on Linux, Mac, and Windows.
Linux
On Ubuntu, you can use the apt-get command:
sudo apt-get update
sudo apt-get install lynx
For other Linux distros, use the distro’s package manager to install the lynx package.
Mac
If you’re using a Mac, you can install Lynx with Homebrew:
brew install lynx
Windows
Lynx can be installed in WSL in the same way as on Ubuntu.
If you’re using a package manager like Scoop or Chocolatey, search for the lynx package.
Basic Lynx Usage
Here is the basic command to dump the text content and links from a Web page:
lynx --dump <url>
The <url> part should be replaced with an actual URL. Here’s an example:
lynx --dump https://example.com/
And here’s the output of the command:
Example Domain
This domain is for use in illustrative examples in documents. You may
use this domain in literature without prior coordination or asking for
permission.
[1]More information...
References
1. https://www.iana.org/domains/example
Notice the list of links at the bottom of the output. Example.com has only one link on the page, so there was only one URL in the list. If you try it on a URL whose page has more links, the list will be longer.
Here’s a screenshot of the output for the Hacker News homepage as an example:
Extracting a List of Links from a Web Page
There’s a cleaner way to extract links with Lynx.
- The --listonly option prints out only the list of links.
- The --nonumbers option prints the links without the reference numbers.
- The --display_charset=utf-8 option gets rid of weird characters in the output, if you run into problems with that.
Here’s an example command that combines all of those flags:
lynx --listonly \
--nonumbers \
--display_charset=utf-8 \
--dump https://www.nytimes.com/
(Note: the backslashes there allow the command to be split up onto multiple lines.)
Without the numbers, the output is just a plain list of URLs, which makes it easier to process the links with other scripts.
For example, you can use the pipe character (|) to send the output of Lynx into the grep command in order to print out only the lines that contain URLs:
lynx --listonly \
--nonumbers \
--display_charset=utf-8 \
--dump https://www.nytimes.com/ \
| grep "^http"
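To see what that grep pattern does without fetching anything, here’s a quick offline demonstration on some sample lines (the URLs are made up). The "^http" pattern keeps only lines that begin with http, which matches both http:// and https:// URLs:

```shell
# Sample lines standing in for lynx's link list (hypothetical URLs):
printf 'Visited links:\nhttps://a.example/page\n   an indented line\nhttp://b.example/\n' \
  | grep "^http"
# Only the two URL lines are printed:
# https://a.example/page
# http://b.example/
```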
If every line contains a URL, you can then sort them and filter for unique URLs like this:
lynx --listonly \
--nonumbers \
--display_charset=utf-8 \
--dump https://www.nytimes.com/ \
| grep "^http" \
| sort \
| uniq
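A small offline example (again with made-up URLs) shows why the sort comes first: uniq only removes adjacent duplicate lines, so sorting groups the duplicates together. On most systems, sort -u is an equivalent shorthand for sort | uniq.

```shell
# A duplicate URL appears twice, in unsorted order (hypothetical URLs):
printf 'https://b.example/\nhttps://a.example/\nhttps://b.example/\n' \
  | sort \
  | uniq
# Prints each URL once, in sorted order:
# https://a.example/
# https://b.example/
```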
To save the output in a file, you can use the > sign at the end:
lynx --listonly \
--nonumbers \
--display_charset=utf-8 \
--dump https://www.nytimes.com/ \
| grep "^http" \
| sort \
| uniq \
> links.txt
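As a quick offline illustration of the redirection (with made-up URLs standing in for the real pipeline's output):

```shell
# > sends stdout to a file instead of the terminal, creating or
# overwriting it. Hypothetical URLs stand in for the lynx output:
printf 'https://a.example/\nhttps://b.example/\n' > links.txt
cat links.txt   # shows the two saved URLs
```

If you want to add to an existing file instead of overwriting it, use >> rather than >.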
Creating a Reusable Script
If you have a long terminal command that you’ll use often, you can wrap it in a reusable shell function.
First, find your shell configuration file. It will often be called something like ~/.zshrc or ~/.bashrc. Paste this code at the bottom of that file:
extract_links () {
lynx --listonly \
--nonumbers \
--display_charset=utf-8 \
--dump "$1" \
| grep "^http" \
| sort \
| uniq
}
Then open a new terminal window and run the command like this:
extract_links https://www.nytimes.com/
It will then print the sorted, unique links from the Web page.
If you want to save that output into a file, use the > character and a filename:
extract_links https://www.nytimes.com/ > links.txt
Google’s Recommendations
Google recommends using Lynx as an SEO tool:
Use a text browser such as Lynx to examine your site, since many search engines see your site much as Lynx would. If features such as JavaScript, cookies, session IDs, frames, DHTML, or Flash keep you from seeing all of your site in a text browser, then search engine spiders may have trouble crawling your site.
The basic command to check a page is the first one mentioned above:
lynx --dump https://example.com/
To learn more about Lynx, try the tutorial on how to write a broken link checker with Lynx.
If you like Lynx you might also be interested in the curl tutorial.