Wget-Commands.txt

#Spider Websites with Wget – 20 Practical Examples

Wget is extremely powerful, but like most other command-line programs, the plethora of options it supports can be intimidating to new users. What we have here is a collection of wget commands that you can use to accomplish common tasks, from downloading single files to mirroring entire websites. It will help if you can read through the wget manual, but for the busy souls, these commands are ready to execute.

1. Download a single file from the Internet
wget http://example.com/file.iso

2. Download a file but save it locally under a different name
wget --output-document=filename.html example.com

3. Download a file and save it in a specific folder
wget --directory-prefix=folder/subfolder example.com

4. Resume an interrupted download previously started by wget itself
wget --continue example.com/big.file.iso

5. Download a file but only if the version on the server is newer than your local copy
wget --continue --timestamping wordpress.org/latest.zip

6. Download multiple URLs with wget. Put the list of URLs in another text file on separate lines and pass it to wget (a sample list is sketched below).
wget --input-file=list-of-file-urls.txt
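
For illustration, list-of-file-urls.txt is just a plain text file with one URL per line; the entries below are placeholders:
http://example.com/files/archive-1.zip
http://example.com/files/archive-2.zip
http://example.com/images/photo.jpg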

7. Download a list of sequentially numbered files from a server
wget http://example.com/images/{1..20}.jpg
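
Note that the {1..20} part is brace expansion performed by the shell (bash), not by wget itself. As a sketch, if the server zero-pads its file names, a bash range with leading zeros requests 001.jpg through 020.jpg:
wget http://example.com/images/{001..020}.jpg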

8. Download a web page with all assets – like stylesheets and inline images – that are required to properly display the web page offline.
wget --page-requisites --span-hosts --convert-links --adjust-extension http://example.com/dir/file

Mirror websites with Wget

9. Download an entire website including all the linked pages and files
wget --execute robots=off --recursive --no-parent --continue --no-clobber http://example.com/

10. Download all the MP3 files from a sub directory
wget --level=1 --recursive --no-parent --accept mp3,MP3 http://example.com/mp3/

11. Download all images from a website in a common folder
wget --directory-prefix=files/pictures --no-directories --recursive --no-clobber --accept jpg,gif,png,jpeg http://example.com/images/

12. Download the PDF documents from a website through recursion but stay within specific domains.
wget --mirror --domains=abc.com,files.abc.com,docs.abc.com --accept=pdf http://abc.com/

13. Download all files from a website but exclude a few directories.
wget --recursive --no-clobber --no-parent --exclude-directories /forums,/support http://example.com
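
The counterpart of --accept is --reject; as a sketch, to download everything from the same placeholder site except images, list the unwanted extensions instead:
wget --recursive --no-clobber --no-parent --reject=jpg,jpeg,gif,png http://example.com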

Wget for Downloading Restricted Content

Wget can be used for downloading content from sites that are behind a login screen or ones that check for the HTTP referer and the User Agent strings of the bot to prevent screen scraping.

14. Download files from websites that check the User Agent and the HTTP Referer
wget --referer=http://google.com --user-agent="Mozilla/5.0 Firefox/4.0.1" http://nytimes.com

15. Download files from a password-protected site
wget --http-user=labnol --http-password=hello123 http://example.com/secret/file.zip

16. Fetch pages that are behind a login page. You need to replace user and password with the actual form fields, while the URL should point to the form submit (action) page. Cookies are enabled by default; the first command saves the session cookies and the second reuses them.
wget --save-cookies cookies.txt --keep-session-cookies --post-data 'user=labnol&password=123' http://example.com/login.php
wget --load-cookies cookies.txt --keep-session-cookies http://example.com/paywall

Retrieve File Details with wget

17. Find the size of a file without downloading it (look for Content-Length in the response, the size is in bytes)
wget --spider --server-response http://example.com/file.iso
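
Since wget prints the server response to standard error, a quick way to pull out just the size (a sketch for a Unix-like shell) is to filter it through grep:
wget --spider --server-response http://example.com/file.iso 2>&1 | grep -i Content-Length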

18. Download a file and display the content on screen without saving it locally.
wget --output-document=- --quiet google.com/humans.txt
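
The same command can be written with short options; -q is quiet and -O - writes to standard output, which makes it handy for piping into other tools:
wget -qO- google.com/humans.txt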

19. Know the last modified date of a web page (check the Last-Modified tag in the HTTP header).
wget --server-response --spider http://www.labnol.org/

20. Check the links on your website to ensure that they are working. The spider option will not save the pages locally.
wget --output-file=logfile.txt --recursive --spider http://example.com
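
After the crawl finishes, the log can be searched for failures; as a rough sketch (the exact log wording varies between wget versions), broken links can be spotted like this:
grep -B2 'broken link' logfile.txt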

Also see: Essential Linux Commands

Wget – How to be nice to the server?

The wget tool is essentially a spider that scrapes / leeches web pages, but some web hosts may block these spiders with the robots.txt files. Also, wget will not follow links on web pages that use the rel=nofollow attribute.

You can however force wget to ignore the robots.txt and the nofollow directives by adding the switch --execute robots=off to all your wget commands. If a web host is blocking wget requests by looking at the User Agent string, you can always fake that with the --user-agent=Mozilla switch.

The wget command will put additional strain on the site's server because it will continuously traverse the links and download files. A good scraper would therefore limit the retrieval rate and also include a wait period between consecutive fetch requests to reduce the server load.

wget --limit-rate=20k --wait=60 --random-wait --mirror example.com

In the above example, we have limited the download bandwidth rate to 20 KB/s and the wget utility will wait anywhere between 30 and 90 seconds before retrieving the next resource.

Finally, a little quiz. What do you think this wget command will do?
wget --span-hosts --level=inf --recursive dmoz.org
-------------------------------------------------------
How to Download and Save YouTube Videos With VLC
Open VLC and click "Open Media"
Click "Network" and paste the YouTube URL
If using Windows, select "Tools" and then "Codec Information"
Find the "Location" bar at the bottom and copy that URL
Paste that URL into your browser
Right click the video and select "Save Video As"
Name the file and save it to the desired location
--------------------------------------------------------
Continue an Incomplete Download
wget -c file

Mirror an Entire Website
wget -m http://example.com
--convert-links
--page-requisites
--no-parent
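
Put together, a typical mirroring invocation (a sketch using the same placeholder site) looks like this:
wget -m --convert-links --page-requisites --no-parent http://example.com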

Download an Entire Directory
wget -r ftp://example.com/folder

Download a List of Files at Once
wget -i download.txt

A Few More Tricks
wget http://website.com/files/file.zip

Basic startup options
-V, --version
-h, --help
-b, --background
-o logfile, --output-file=logfile
-F, --force-html
-q, --quiet
-B URL, --base=URL
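
For example, combining two of these, the following sketch starts a large download in the background and writes progress to a log file (file names are placeholders):
wget -b -o download.log http://example.com/big.file.iso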

-------------------------------------------------------
wget -i URL.txt
wget -r -l1 -A.bz2 http://aaa.com/directory
wget https://www.kernel.org/pub/linux/kernel/v3.0/linux-3.2.{1..15}.tar.bz2

C:\>wget -help
GNU Wget 1.21.2, a non-interactive network retriever.
Usage: wget [OPTION]... [URL]...

Mandatory arguments to long options are mandatory for short options too.

Startup:
  -V,  --version                   display the version of Wget and exit
  -h,  --help                      print this help
  -b,  --background                go to background after startup
  -e,  --execute=COMMAND           execute a `.wgetrc'-style command

Logging and input file:
  -o,  --output-file=FILE          log messages to FILE
  -a,  --append-output=FILE        append messages to FILE
  -d,  --debug                     print lots of debugging information
  -q,  --quiet                     quiet (no output)
  -v,  --verbose                   be verbose (this is the default)
  -nv, --no-verbose                turn off verboseness, without being quiet
       --report-speed=TYPE         output bandwidth as TYPE. TYPE can be bits
  -i,  --input-file=FILE           download URLs found in local or external FILE
       --input-metalink=FILE       download files covered in local Metalink FILE
  -F,  --force-html                treat input file as HTML
  -B,  --base=URL                  resolves HTML input-file links (-i -F) relative to URL
       --config=FILE               specify config file to use
       --no-config                 do not read any config file
       --rejected-log=FILE         log reasons for URL rejection to FILE

Download:
  -t,  --tries=NUMBER              set number of retries to NUMBER (0 unlimits)
       --retry-connrefused         retry even if connection is refused
       --retry-on-http-error=ERRORS   comma-separated list of HTTP errors to retry
  -O,  --output-document=FILE      write documents to FILE
  -nc, --no-clobber                skip downloads that would download to existing files (overwriting them)
       --no-netrc                  don't try to obtain credentials from .netrc
  -c,  --continue                  resume getting a partially-downloaded file
       --start-pos=OFFSET          start downloading from zero-based position OFFSET
       --progress=TYPE             select progress gauge type
       --show-progress             display the progress bar in any verbosity mode
  -N,  --timestamping              don't re-retrieve files unless newer than local
       --no-if-modified-since      don't use conditional if-modified-since get requests in timestamping mode
       --no-use-server-timestamps  don't set the local file's timestamp by the one on the server
  -S,  --server-response           print server response
       --spider                    don't download anything
  -T,  --timeout=SECONDS           set all timeout values to SECONDS
       --dns-servers=ADDRESSES     list of DNS servers to query (comma separated)
       --bind-dns-address=ADDRESS  bind DNS resolver to ADDRESS (hostname or IP) on local host
       --dns-timeout=SECS          set the DNS lookup timeout to SECS
       --connect-timeout=SECS      set the connect timeout to SECS
       --read-timeout=SECS         set the read timeout to SECS
  -w,  --wait=SECONDS              wait SECONDS between retrievals (applies if more then 1 URL is to be retrieved)
       --waitretry=SECONDS         wait 1..SECONDS between retries of a retrieval (applies if more then 1 URL is to be retrieved)
       --random-wait               wait from 0.5*WAIT...1.5*WAIT secs between retrievals (applies if more then 1 URL is to be retrieved)
       --no-proxy                  explicitly turn off proxy
  -Q,  --quota=NUMBER              set retrieval quota to NUMBER
       --bind-address=ADDRESS      bind to ADDRESS (hostname or IP) on local host
       --limit-rate=RATE           limit download rate to RATE
       --no-dns-cache              disable caching DNS lookups
       --restrict-file-names=OS    restrict chars in file names to ones OS allows
       --ignore-case               ignore case when matching files/directories
  -4,  --inet4-only                connect only to IPv4 addresses
  -6,  --inet6-only                connect only to IPv6 addresses
       --prefer-family=FAMILY      connect first to addresses of specified family, one of IPv6, IPv4, or none
       --user=USER                 set both ftp and http user to USER
       --password=PASS             set both ftp and http password to PASS
       --ask-password              prompt for passwords
       --use-askpass=COMMAND       specify credential handler for requesting username and password. If no COMMAND is specified the WGET_ASKPASS or the SSH_ASKPASS environment variable is used.
       --no-iri                    turn off IRI support
       --local-encoding=ENC        use ENC as the local encoding for IRIs
       --remote-encoding=ENC       use ENC as the default remote encoding
       --unlink                    remove file before clobber
       --keep-badhash              keep files with checksum mismatch (append .badhash)
       --metalink-index=NUMBER     Metalink application/metalink4+xml metaurl ordinal NUMBER
       --metalink-over-http        use Metalink metadata from HTTP response headers
       --preferred-location        preferred location for Metalink resources

Directories:
  -nd, --no-directories            don't create directories
  -x,  --force-directories         force creation of directories
  -nH, --no-host-directories       don't create host directories
       --protocol-directories      use protocol name in directories
  -P,  --directory-prefix=PREFIX   save files to PREFIX/..
       --cut-dirs=NUMBER           ignore NUMBER remote directory components

HTTP options:
       --http-user=USER            set http user to USER
       --http-password=PASS        set http password to PASS
       --no-cache                  disallow server-cached data
       --default-page=NAME         change the default page name (normally this is 'index.html'.)
  -E,  --adjust-extension          save HTML/CSS documents with proper extensions
       --ignore-length             ignore 'Content-Length' header field
       --header=STRING             insert STRING among the headers
       --compression=TYPE          choose compression, one of auto, gzip and none. (default: none)
       --max-redirect              maximum redirections allowed per page
       --proxy-user=USER           set USER as proxy username
       --proxy-password=PASS       set PASS as proxy password
       --referer=URL               include 'Referer: URL' header in HTTP request
       --save-headers              save the HTTP headers to file
  -U,  --user-agent=AGENT          identify as AGENT instead of Wget/VERSION
       --no-http-keep-alive        disable HTTP keep-alive (persistent connections)
       --no-cookies                don't use cookies
       --load-cookies=FILE         load cookies from FILE before session
       --save-cookies=FILE         save cookies to FILE after session
       --keep-session-cookies      load and save session (non-permanent) cookies
       --post-data=STRING          use the POST method; send STRING as the data
       --post-file=FILE            use the POST method; send contents of FILE
       --method=HTTPMethod         use method "HTTPMethod" in the request
       --body-data=STRING          send STRING as data. --method MUST be set
       --body-file=FILE            send contents of FILE. --method MUST be set
       --content-disposition       honor the Content-Disposition header when choosing local file names (EXPERIMENTAL)
       --content-on-error          output the received content on server errors
       --auth-no-challenge         send Basic HTTP authentication information without first waiting for the server's challenge

HTTPS (SSL/TLS) options:
       --secure-protocol=PR        choose secure protocol, one of auto, SSLv2, SSLv3, TLSv1, TLSv1_1, TLSv1_2 and PFS
       --https-only                only follow secure HTTPS links
       --no-check-certificate      don't validate the server's certificate
       --certificate=FILE          client certificate file
       --certificate-type=TYPE     client certificate type, PEM or DER
       --private-key=FILE          private key file
       --private-key-type=TYPE     private key type, PEM or DER
       --ca-certificate=FILE       file with the bundle of CAs
       --ca-directory=DIR          directory where hash list of CAs is stored
       --crl-file=FILE             file with bundle of CRLs
       --pinnedpubkey=FILE/HASHES  Public key (PEM/DER) file, or any number of base64 encoded sha256 hashes preceded by 'sha256//' and separated by ';', to verify peer against
       --random-file=FILE          file with random data for seeding the SSL PRNG
       --ciphers=STR               Set the priority string (GnuTLS) or cipher list string (OpenSSL) directly. Use with care. This option overrides --secure-protocol. The format and syntax of this string depend on the specific SSL/TLS engine.

HSTS options:
       --no-hsts                   disable HSTS
       --hsts-file                 path of HSTS database (will override default)

FTP options:
       --ftp-user=USER             set ftp user to USER
       --ftp-password=PASS         set ftp password to PASS
       --no-remove-listing         don't remove '.listing' files
       --no-glob                   turn off FTP file name globbing
       --no-passive-ftp            disable the "passive" transfer mode
       --preserve-permissions      preserve remote file permissions
       --retr-symlinks             when recursing, get linked-to files (not dir)

FTPS options:
       --ftps-implicit             use implicit FTPS (default port is 990)
       --ftps-resume-ssl           resume the SSL/TLS session started in the control connection when opening a data connection
       --ftps-clear-data-connection  cipher the control channel only; all the data will be in plaintext
       --ftps-fallback-to-ftp      fall back to FTP if FTPS is not supported in the target server

WARC options:
       --warc-file=FILENAME        save request/response data to a .warc.gz file
       --warc-header=STRING        insert STRING into the warcinfo record
       --warc-max-size=NUMBER      set maximum size of WARC files to NUMBER
       --warc-cdx                  write CDX index files
       --warc-dedup=FILENAME       do not store records listed in this CDX file
       --no-warc-compression       do not compress WARC files with GZIP
       --no-warc-digests           do not calculate SHA1 digests
       --no-warc-keep-log          do not store the log file in a WARC record
       --warc-tempdir=DIRECTORY    location for temporary files created by the WARC writer

Recursive download:
  -r,  --recursive                 specify recursive download
  -l,  --level=NUMBER              maximum recursion depth (inf or 0 for infinite)
       --delete-after              delete files locally after downloading them
  -k,  --convert-links             make links in downloaded HTML or CSS point to local files
       --convert-file-only         convert the file part of the URLs only (usually known as the basename)
       --backups=N                 before writing file X, rotate up to N backup files
  -K,  --backup-converted          before converting file X, back up as X.orig
  -m,  --mirror                    shortcut for -N -r -l inf --no-remove-listing
  -p,  --page-requisites           get all images, etc. needed to display HTML page
       --strict-comments           turn on strict (SGML) handling of HTML comments

Recursive accept/reject:
  -A,  --accept=LIST               comma-separated list of accepted extensions
  -R,  --reject=LIST               comma-separated list of rejected extensions
       --accept-regex=REGEX        regex matching accepted URLs
       --reject-regex=REGEX        regex matching rejected URLs
       --regex-type=TYPE           regex type (posix|pcre)
  -D,  --domains=LIST              comma-separated list of accepted domains
       --exclude-domains=LIST      comma-separated list of rejected domains
       --follow-ftp                follow FTP links from HTML documents
       --follow-tags=LIST          comma-separated list of followed HTML tags
       --ignore-tags=LIST          comma-separated list of ignored HTML tags
  -H,  --span-hosts                go to foreign hosts when recursive
  -L,  --relative                  follow relative links only
  -I,  --include-directories=LIST  list of allowed directories
       --trust-server-names        use the name specified by the redirection URL's last component
  -X,  --exclude-directories=LIST  list of excluded directories
  -np, --no-parent                 don't ascend to the parent directory

Email bug reports, questions, discussions to <bug-wget@gnu.org>
and/or open issues at https://savannah.gnu.org/bugs/?func=additem&group=wget.