How to Extract Links and Text from a Web Page with Lynx


Lynx is a text-only Web browser that can be used to extract the text and links from a Web page. Google has recommended it in its SEO guidelines since the early days of the search engine.

Lynx doesn’t run JavaScript or render images, videos, or other non-text content, so it gives you a rough idea of what a basic crawler sees when it fetches the page. Because Lynx only shows the server-rendered HTML, it’s a quick way to check which parts of the content are server-rendered and which depend on JavaScript.

At the bottom of Lynx’s output will be a list of all the visible links on the page.

First, I’ll cover how to install the Lynx browser, and then I’ll go over its basic usage and Google’s recommendations.

I’ve also made a video that walks through the steps.

How to Install Lynx

Lynx is launched from a terminal. To check if it’s already installed, open a terminal and type this command:

lynx --version

If it’s installed, you should see output that is similar to this:

Lynx Version 2.8.9rel.1 (08 Jul 2018)
libwww-FM 2.14, SSL-MM 1.4.1, OpenSSL 1.1.1q, ncurses 5.7.20081102
Built on darwin21.1.0 (Oct 25 2021 04:30:19).

Copyrights held by the Lynx Developers Group,
the University of Kansas, CERN, and other contributors.
Distributed under the GNU General Public License (Version 2).
See https://lynx.invisible-island.net/ and the online help for more information.

See http://www.openssl.org/ for information about OpenSSL.

If it isn’t already installed, it’s easy to install on Linux, Mac, and Windows.

Linux

On Ubuntu, you can use the apt-get command:

sudo apt-get update
sudo apt-get install lynx

For other Linux distros, use the distro’s package manager to install the lynx package.
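
For example, on Fedora (which uses the dnf package manager) the package is typically also named lynx, so the install command would look like this:

sudo dnf install lynx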

Mac

If you’re using Mac, you can install Lynx with Homebrew.
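
If Homebrew is already set up, the command is:

brew install lynx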

Windows

Lynx can be installed in WSL (Windows Subsystem for Linux) in the same way as for Ubuntu.

If you’re using a package manager like Scoop or Chocolatey, search for the lynx package.
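
For example, with Chocolatey the install command would look something like this, assuming a lynx package is available in the repository you’re using:

choco install lynx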

Basic Lynx Usage

Here is the basic command to dump the text content and links from a Web page:

lynx --dump <url>

The <url> part should be replaced with an actual URL. Here’s an example:

lynx --dump https://example.com/

And here’s the output of the command:

Example Domain

   This domain is for use in illustrative examples in documents. You may
   use this domain in literature without prior coordination or asking for
   permission.

   [1]More information...

References

   1. https://www.iana.org/domains/example

Notice the list of links at the bottom of the output. Example.com only has one link on the page, so there is only one URL in the list. If you try a URL with more links on the page, the list will be longer.

Here’s a screenshot of the output for the Hacker News homepage as an example:

Links from the homepage of Hacker News, scraped with Lynx browser

There’s a cleaner way to extract links with Lynx. The --listonly flag prints only the list of links, the --nonumbers flag removes the reference numbers in front of each link, and the --display_charset=utf-8 flag sets the character encoding of the output.

Here’s an example command that combines all of those flags:

lynx --listonly \
    --nonumbers \
    --display_charset=utf-8 \
    --dump https://www.nytimes.com/

(Note: the backslashes there allow the command to be split up onto multiple lines.)

Here’s what the output looks like without the link numbers:

Lynx output without link numbers

Removing the numbers makes it easier to process the links with other scripts.

For example, you can use the pipe character (|) to send the output of Lynx into the grep command, printing only the lines that start with http (i.e. the URLs):

lynx --listonly \
    --nonumbers \
    --display_charset=utf-8 \
    --dump https://www.nytimes.com/ \
    | grep "^http"

Since every remaining line now contains a URL, you can sort them and filter out duplicates like this:

lynx --listonly \
    --nonumbers \
    --display_charset=utf-8 \
    --dump https://www.nytimes.com/ \
    | grep "^http" \
    | sort \
    | uniq
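
As a side note, the sort | uniq combination at the end can be shortened to sort -u, which sorts and removes duplicate lines in a single step:

lynx --listonly \
    --nonumbers \
    --display_charset=utf-8 \
    --dump https://www.nytimes.com/ \
    | grep "^http" \
    | sort -u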

To save the output in a file, you can use the > sign at the end:

lynx --listonly \
    --nonumbers \
    --display_charset=utf-8 \
    --dump https://www.nytimes.com/ \
    | grep "^http" \
    | sort \
    | uniq \
    > links.txt
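
Once the links are in a file, you can do a quick sanity check with standard tools. For example, wc -l counts the lines in the file, which here is the number of unique links saved:

wc -l links.txt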

Creating a Reusable Script

If you have a long terminal command that you’ll use often, you can turn it into a reusable shell function.

First, find your shell configuration file. It’s usually called something like ~/.zshrc (for Zsh) or ~/.bashrc (for Bash). Paste this code at the bottom of that file:

extract_links () {
    # Dump only the list of links from the URL passed as the first argument,
    # keep the lines that start with http, then sort and de-duplicate them.
    lynx --listonly \
        --nonumbers \
        --display_charset=utf-8 \
        --dump "$1" \
        | grep "^http" \
        | sort \
        | uniq
}

Then open a new terminal window (or reload the configuration file with source ~/.zshrc or source ~/.bashrc) and run the function like this:

extract_links https://www.nytimes.com/

It will then print the sorted, unique links from the Web page:

Shell script function output

If you want to save that output into a file, use the > character and a filename:

extract_links https://www.nytimes.com/ > links.txt
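
Since the function just prints the links to standard output, it also composes with other commands. For example, you could pipe it through grep to keep only the links pointing to a particular host (the pattern here is just an illustration):

extract_links https://www.nytimes.com/ | grep "nytimes.com"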

Google’s Recommendations

Google recommends using Lynx as an SEO tool:

Use a text browser such as Lynx to examine your site, since many search engines see your site much as Lynx would. If features such as JavaScript, cookies, session IDs, frames, DHTML, or Flash keep you from seeing all of your site in a text browser, then search engine spiders may have trouble crawling your site.

The basic command to check a page is the first one mentioned above:

lynx --dump https://example.com/

To learn more about Lynx, try the tutorial on how to write a broken link checker with Lynx.

If you like Lynx you might also be interested in the curl tutorial.

Tagged with: SEO, Web Scraping, Lynx
