- Spider Websites with Wget – 20 Practical Examples
- Wget is extremely powerful, but like most other command-line programs, the
- plethora of options it supports can be intimidating to new users. What we
- have here is a collection of wget commands that you can use to accomplish
- common tasks, from downloading single files to mirroring entire websites.
- It helps to read through the wget manual, but for the busy souls, these
- commands are ready to execute.
- 1. Download a single file from the Internet
- wget http://example.com/file.iso
- 2. Download a file but save it locally under a different name
- wget --output-document=filename.html example.com
- 3. Download a file and save it in a specific folder
- wget --directory-prefix=folder/subfolder example.com
- 4. Resume an interrupted download previously started by wget itself
- wget --continue example.com/big.file.iso
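- If the first attempt is interrupted, re-running the same URL with --continue
- picks up where the partial file left off. A minimal sketch (it assumes the
- server supports byte-range requests and the partial file is still in the
- current directory):
- wget http://example.com/big.file.iso        # interrupted partway through
- wget --continue http://example.com/big.file.iso   # resumes the partial file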
- 5. Download a file, but only if the version on the server is newer than your
- local copy
- wget --timestamping wordpress.org/latest.zip
- 6. Download multiple URLs with wget. Put the list of URLs in a text file,
- one per line, and pass the file to wget, as in the sketch below.
- wget --input-file=list-of-file-urls.txt
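- As a minimal sketch, the list file is nothing more than plain URLs, one per
- line (the file name and URLs here are placeholders):
- cat > list-of-file-urls.txt <<'EOF'
- http://example.com/file1.iso
- http://example.com/file2.iso
- EOF
- wget --input-file=list-of-file-urls.txt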
- 7. Download a list of sequentially numbered files from a server
- wget http://example.com/images/{1..20}.jpg
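- Note that the {1..20} range is expanded by the shell (bash or zsh), not by
- wget; wget simply receives twenty separate URLs. A zero-padded variant,
- assuming the server names its files 01.jpg through 20.jpg, would be:
- wget http://example.com/images/{01..20}.jpg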
- 8. Download a web page with all assets – like stylesheets and inline images
- – that are required to properly display the web page offline.
- wget --page-requisites --span-hosts --convert-links --adjust-extension http://example.com/dir/file
- Mirror websites with Wget
- 9. Download an entire website including all the linked pages and files
- wget --execute robots=off --recursive --no-parent --continue --no-clobber http://example.com/
- 10. Download all the MP3 files from a subdirectory
- wget --level=1 --recursive --no-parent --accept mp3,MP3 http://example.com/mp3/
- 11. Download all images from a website into a common folder
- wget --directory-prefix=files/pictures --no-directories --recursive --no-clobber --accept jpg,gif,png,jpeg http://example.com/images/
- 12. Download the PDF documents from a website recursively, but stay within
- specific domains.
- wget --mirror --domains=abc.com,files.abc.com,docs.abc.com --accept=pdf http://abc.com/
- 13. Download all files from a website but exclude a few directories.
- wget --recursive --no-clobber --no-parent --exclude-directories=/forums,/support http://example.com
- Wget for Downloading Restricted Content
- Wget can be used to download content from sites that are behind a login
- screen, or from sites that check the HTTP Referer and User-Agent strings of
- the client to prevent screen scraping.
- 14. Download files from websites that check the User Agent and the HTTP
- Referer
- wget --referer=http://google.com --user-agent="Mozilla/5.0 Firefox/4.0.1" http://nytimes.com
- 15. Download files from a password-protected site
- wget --http-user=labnol --http-password=hello123 http://example.com/secret/file.zip
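- A slightly safer variant, as a sketch: --ask-password prompts for the
- password interactively so it does not end up in your shell history or in the
- process list (the username and URL are the same placeholders as above).
- wget --http-user=labnol --ask-password http://example.com/secret/file.zip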
- 16. Fetch pages that are behind a login page. Replace user and password with
- the site's actual form field names, and point the URL at the form's submit
- (action) page. The first command logs in and saves the session cookies; the
- second reuses them to fetch the protected page.
- wget --save-cookies cookies.txt --keep-session-cookies --post-data 'user=labnol&password=123' http://example.com/login.php
- wget --load-cookies cookies.txt --keep-session-cookies http://example.com/paywall
- Retrieve File Details with wget
- 17. Find the size of a file without downloading it (look for Content-Length
- in the response; the size is in bytes)
- wget --spider --server-response http://example.com/file.iso
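- Since wget writes its log, including the server response headers, to stderr,
- one way to pull out just the size is to pipe it through grep (a sketch,
- assuming the server sends a Content-Length header):
- wget --spider --server-response http://example.com/file.iso 2>&1 | grep -i 'Content-Length'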
- 18. Download a file and display the content on screen without saving it
- locally.
- wget --output-document=- --quiet google.com/humans.txt
- 19. Know the last modified date of a web page (check the Last-Modified
- header in the HTTP response).
- wget --server-response --spider http://www.labnol.org/
- 20. Check the links on your website to ensure that they are working. The
- spider option will not save the pages locally.
- wget --output-file=logfile.txt --recursive --spider http://example.com
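- The results land in logfile.txt rather than on screen. One quick way to scan
- the log afterwards (a sketch, assuming wget's default English log messages,
- which flag dead URLs as broken links):
- grep -i -B 2 'broken link' logfile.txt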
- Wget – How to be nice to the server?
- The wget tool is essentially a spider that scrapes/leeches web pages, but
- some web hosts may block such spiders with their robots.txt file. Also, wget
- will not follow links on web pages that use the rel=nofollow attribute.
- You can, however, force wget to ignore the robots.txt and nofollow
- directives by adding the switch --execute robots=off to your wget commands.
- If a web host is blocking wget requests by looking at the User-Agent string,
- you can always fake that with the --user-agent=Mozilla switch.
- The wget command will put additional strain on the site’s server because it
- will continuously traverse the links and download files. A good scraper
- would therefore limit the retrieval rate and also include a wait period
- between consecutive fetch requests to reduce the server load.
- wget --limit-rate=20k --wait=60 --random-wait --mirror example.com
- In the above example, we have limited the download rate to 20 KB/s, and wget
- will wait anywhere between 30 and 90 seconds (0.5 to 1.5 times the --wait
- value, thanks to --random-wait) before retrieving the next resource.
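- Putting this section together, a polite but complete mirror might look like
- the sketch below (example.com and the Mozilla user-agent string are
- placeholders; drop --execute robots=off if you intend to honor the site's
- robots.txt):
- wget --mirror --limit-rate=20k --wait=60 --random-wait --user-agent="Mozilla/5.0" --execute robots=off http://example.com/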
- Finally, a little quiz. What do you think this wget command will do?
- wget --span-hosts --level=inf --recursive dmoz.org
- -------------------------------------------------------
- How to Download and Save YouTube Videos With VLC
- Open VLC and click "Open Media"
- Click "Network" and paste YouTube URL
- If using Windows, select "Tools" and then "Codec Information"
- Find the "Location" bar at the bottom and copy that URL
- Paste that URL into your browser
- Right click the video and select "Save Video As"
- Name the file and save to desired location
- --------------------------------------------------------
- Continue an Incomplete Download
- wget -c file
- Mirror an Entire Website
- wget -m http://example.com
- --convert-links
- --page-requisites
- --no-parent
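- Combining the mirror shortcut with the three options listed above gives a
- single command (a sketch; -k, -p and -np are the short forms of
- --convert-links, --page-requisites and --no-parent):
- wget -m -k -p -np http://example.com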
- Download an Entire Directory
- wget -r ftp://example.com/folder
- Download a List of Files at Once
- wget -i download.txt
- A Few More Tricks
- wget http://website.com/files/file.zip
- Basic startup options
- -V, --version
- -h, --help
- -b, --background
- -o logfile, --output-file=logfile
- -F, --force-html
- -q, --quiet
- -B URL, --base=URL
- -------------------------------------------------------
- wget -i URL.txt
- wget -r -l1 -A.bz2 http://aaa.com/directory
- wget https://www.kernel.org/pub/linux/kernel/v3.0/linux-3.2.{1..15}.tar.bz2
- C:\>wget --help
- GNU Wget 1.21.2, a non-interactive network retriever.
- Usage: wget [OPTION]... [URL]...
- Mandatory arguments to long options are mandatory for short options too.
- Startup:
- -V, --version display the version of Wget and exit
- -h, --help print this help
- -b, --background go to background after startup
- -e, --execute=COMMAND execute a `.wgetrc'-style command
- Logging and input file:
- -o, --output-file=FILE log messages to FILE
- -a, --append-output=FILE append messages to FILE
- -d, --debug print lots of debugging information
- -q, --quiet quiet (no output)
- -v, --verbose be verbose (this is the default)
- -nv, --no-verbose turn off verboseness, without being quiet
- --report-speed=TYPE output bandwidth as TYPE. TYPE can be
- bits
- -i, --input-file=FILE download URLs found in local or external
- FILE
- --input-metalink=FILE download files covered in local Metalink
- FILE
- -F, --force-html treat input file as HTML
- -B, --base=URL resolves HTML input-file links (-i -F)
- relative to URL
- --config=FILE specify config file to use
- --no-config do not read any config file
- --rejected-log=FILE log reasons for URL rejection to FILE
- Download:
- -t, --tries=NUMBER set number of retries to NUMBER (0
- unlimits)
- --retry-connrefused retry even if connection is refused
- --retry-on-http-error=ERRORS comma-separated list of HTTP errors
- to retry
- -O, --output-document=FILE write documents to FILE
- -nc, --no-clobber skip downloads that would download to
- existing files (overwriting them)
- --no-netrc don't try to obtain credentials from
- .netrc
- -c, --continue resume getting a partially-downloaded
- file
- --start-pos=OFFSET start downloading from zero-based
- position OFFSET
- --progress=TYPE select progress gauge type
- --show-progress display the progress bar in any verbosity
- mode
- -N, --timestamping don't re-retrieve files unless newer than
- local
- --no-if-modified-since don't use conditional if-modified-since
- get
- requests in timestamping mode
- --no-use-server-timestamps don't set the local file's timestamp by
- the one on the server
- -S, --server-response print server response
- --spider don't download anything
- -T, --timeout=SECONDS set all timeout values to SECONDS
- --dns-servers=ADDRESSES list of DNS servers to query (comma
- separated)
- --bind-dns-address=ADDRESS bind DNS resolver to ADDRESS (hostname or
- IP) on local host
- --dns-timeout=SECS set the DNS lookup timeout to SECS
- --connect-timeout=SECS set the connect timeout to SECS
- --read-timeout=SECS set the read timeout to SECS
- -w, --wait=SECONDS wait SECONDS between retrievals
- (applies if more then 1 URL is to be
- retrieved)
- --waitretry=SECONDS wait 1..SECONDS between retries of a
- retrieval
- (applies if more then 1 URL is to be
- retrieved)
- --random-wait wait from 0.5*WAIT...1.5*WAIT secs
- between retrievals
- (applies if more then 1 URL is to be
- retrieved)
- --no-proxy explicitly turn off proxy
- -Q, --quota=NUMBER set retrieval quota to NUMBER
- --bind-address=ADDRESS bind to ADDRESS (hostname or IP) on local
- host
- --limit-rate=RATE limit download rate to RATE
- --no-dns-cache disable caching DNS lookups
- --restrict-file-names=OS restrict chars in file names to ones OS
- allows
- --ignore-case ignore case when matching
- files/directories
- -4, --inet4-only connect only to IPv4 addresses
- -6, --inet6-only connect only to IPv6 addresses
- --prefer-family=FAMILY connect first to addresses of specified
- family,
- one of IPv6, IPv4, or none
- --user=USER set both ftp and http user to USER
- --password=PASS set both ftp and http password to PASS
- --ask-password prompt for passwords
- --use-askpass=COMMAND specify credential handler for requesting
- username and password. If no COMMAND is
- specified the WGET_ASKPASS or the SSH_ASKPASS
- environment variable is used.
- --no-iri turn off IRI support
- --local-encoding=ENC use ENC as the local encoding for IRIs
- --remote-encoding=ENC use ENC as the default remote encoding
- --unlink remove file before clobber
- --keep-badhash keep files with checksum mismatch (append
- .badhash)
- --metalink-index=NUMBER Metalink application/metalink4+xml
- metaurl ordinal NUMBER
- --metalink-over-http use Metalink metadata from HTTP response
- headers
- --preferred-location preferred location for Metalink resources
- Directories:
- -nd, --no-directories don't create directories
- -x, --force-directories force creation of directories
- -nH, --no-host-directories don't create host directories
- --protocol-directories use protocol name in directories
- -P, --directory-prefix=PREFIX save files to PREFIX/..
- --cut-dirs=NUMBER ignore NUMBER remote directory components
- HTTP options:
- --http-user=USER set http user to USER
- --http-password=PASS set http password to PASS
- --no-cache disallow server-cached data
- --default-page=NAME change the default page name (normally
- this is 'index.html'.)
- -E, --adjust-extension save HTML/CSS documents with proper
- extensions
- --ignore-length ignore 'Content-Length' header field
- --header=STRING insert STRING among the headers
- --compression=TYPE choose compression, one of auto, gzip and
- none. (default: none)
- --max-redirect maximum redirections allowed per page
- --proxy-user=USER set USER as proxy username
- --proxy-password=PASS set PASS as proxy password
- --referer=URL include 'Referer: URL' header in HTTP
- request
- --save-headers save the HTTP headers to file
- -U, --user-agent=AGENT identify as AGENT instead of Wget/VERSION
- --no-http-keep-alive disable HTTP keep-alive (persistent
- connections)
- --no-cookies don't use cookies
- --load-cookies=FILE load cookies from FILE before session
- --save-cookies=FILE save cookies to FILE after session
- --keep-session-cookies load and save session (non-permanent)
- cookies
- --post-data=STRING use the POST method; send STRING as the
- data
- --post-file=FILE use the POST method; send contents of
- FILE
- --method=HTTPMethod use method "HTTPMethod" in the request
- --body-data=STRING send STRING as data. --method MUST be set
- --body-file=FILE send contents of FILE. --method MUST be
- set
- --content-disposition honor the Content-Disposition header when
- choosing local file names
- (EXPERIMENTAL)
- --content-on-error output the received content on server
- errors
- --auth-no-challenge send Basic HTTP authentication
- information
- without first waiting for the server's
- challenge
- HTTPS (SSL/TLS) options:
- --secure-protocol=PR choose secure protocol, one of auto,
- SSLv2,
- SSLv3, TLSv1, TLSv1_1, TLSv1_2 and PFS
- --https-only only follow secure HTTPS links
- --no-check-certificate don't validate the server's certificate
- --certificate=FILE client certificate file
- --certificate-type=TYPE client certificate type, PEM or DER
- --private-key=FILE private key file
- --private-key-type=TYPE private key type, PEM or DER
- --ca-certificate=FILE file with the bundle of CAs
- --ca-directory=DIR directory where hash list of CAs is
- stored
- --crl-file=FILE file with bundle of CRLs
- --pinnedpubkey=FILE/HASHES Public key (PEM/DER) file, or any number
- of base64 encoded sha256 hashes preceded by
- 'sha256//' and separated by ';', to verify
- peer against
- --random-file=FILE file with random data for seeding the SSL
- PRNG
- --ciphers=STR Set the priority string (GnuTLS) or cipher
- list string (OpenSSL) directly.
- Use with care. This option overrides --secure-protocol.
- The format and syntax of this string
- depend on the specific SSL/TLS engine.
- HSTS options:
- --no-hsts disable HSTS
- --hsts-file path of HSTS database (will override
- default)
- FTP options:
- --ftp-user=USER set ftp user to USER
- --ftp-password=PASS set ftp password to PASS
- --no-remove-listing don't remove '.listing' files
- --no-glob turn off FTP file name globbing
- --no-passive-ftp disable the "passive" transfer mode
- --preserve-permissions preserve remote file permissions
- --retr-symlinks when recursing, get linked-to files (not
- dir)
- FTPS options:
- --ftps-implicit use implicit FTPS (default port is
- 990)
- --ftps-resume-ssl resume the SSL/TLS session started in
- the control connection when
- opening a data connection
- --ftps-clear-data-connection cipher the control channel only; all
- the data will be in plaintext
- --ftps-fallback-to-ftp fall back to FTP if FTPS is not
- supported in the target server
- WARC options:
- --warc-file=FILENAME save request/response data to a .warc.gz
- file
- --warc-header=STRING insert STRING into the warcinfo record
- --warc-max-size=NUMBER set maximum size of WARC files to NUMBER
- --warc-cdx write CDX index files
- --warc-dedup=FILENAME do not store records listed in this CDX
- file
- --no-warc-compression do not compress WARC files with GZIP
- --no-warc-digests do not calculate SHA1 digests
- --no-warc-keep-log do not store the log file in a WARC
- record
- --warc-tempdir=DIRECTORY location for temporary files created by
- the
- WARC writer
- Recursive download:
- -r, --recursive specify recursive download
- -l, --level=NUMBER maximum recursion depth (inf or 0 for
- infinite)
- --delete-after delete files locally after downloading
- them
- -k, --convert-links make links in downloaded HTML or CSS
- point to
- local files
- --convert-file-only convert the file part of the URLs only
- (usually known as the basename)
- --backups=N before writing file X, rotate up to N
- backup files
- -K, --backup-converted before converting file X, back up as
- X.orig
- -m, --mirror shortcut for -N -r -l inf --no-remove-listing
- -p, --page-requisites get all images, etc. needed to display
- HTML page
- --strict-comments turn on strict (SGML) handling of HTML
- comments
- Recursive accept/reject:
- -A, --accept=LIST comma-separated list of accepted
- extensions
- -R, --reject=LIST comma-separated list of rejected
- extensions
- --accept-regex=REGEX regex matching accepted URLs
- --reject-regex=REGEX regex matching rejected URLs
- --regex-type=TYPE regex type (posix|pcre)
- -D, --domains=LIST comma-separated list of accepted domains
- --exclude-domains=LIST comma-separated list of rejected domains
- --follow-ftp follow FTP links from HTML documents
- --follow-tags=LIST comma-separated list of followed HTML
- tags
- --ignore-tags=LIST comma-separated list of ignored HTML tags
- -H, --span-hosts go to foreign hosts when recursive
- -L, --relative follow relative links only
- -I, --include-directories=LIST list of allowed directories
- --trust-server-names use the name specified by the redirection
- URL's last component
- -X, --exclude-directories=LIST list of excluded directories
- -np, --no-parent don't ascend to the parent directory
- Email bug reports, questions, discussions to <bug-wget@gnu.org>
- and/or open issues at https://savannah.gnu.org/bugs/?func=additem&group=wget.