Wget-Commands.txt

#Spider Websites with Wget – 20 Practical Examples

Wget is extremely powerful, but like most other command-line programs, the plethora of options it supports can be intimidating to new users. What we have here is a collection of wget commands that you can use to accomplish common tasks, from downloading single files to mirroring entire websites. It will help if you can read through the wget manual, but for the busy souls, these commands are ready to execute.

1. Download a single file from the Internet
wget http://example.com/file.iso

2. Download a file but save it locally under a different name
wget --output-document=filename.html example.com

3. Download a file and save it in a specific folder
wget --directory-prefix=folder/subfolder example.com

4. Resume an interrupted download previously started by wget itself
wget --continue example.com/big.file.iso

5. Download a file but only if the version on the server is newer than your local copy
wget --continue --timestamping wordpress.org/latest.zip

6. Download multiple URLs with wget. Put the list of URLs in another text file on separate lines and pass it to wget (a sample list is sketched below).
wget --input-file=list-of-file-urls.txt
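
For illustration, list-of-file-urls.txt is just a plain text file with one URL per line; the entries below are placeholders:
http://example.com/files/archive-1.zip
http://example.com/files/archive-2.zip
http://example.com/images/photo.jpg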

7. Download a list of sequentially numbered files from a server
wget http://example.com/images/{1..20}.jpg
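
Note that the {1..20} part is brace expansion performed by the shell (bash), not by wget itself. As a sketch, if the server zero-pads its file names, a bash range with leading zeros requests 001.jpg through 020.jpg:
wget http://example.com/images/{001..020}.jpg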

8. Download a web page with all assets – like stylesheets and inline images – that are required to properly display the web page offline.
wget --page-requisites --span-hosts --convert-links --adjust-extension http://example.com/dir/file

Mirror websites with Wget

9. Download an entire website including all the linked pages and files
wget --execute robots=off --recursive --no-parent --continue --no-clobber http://example.com/

10. Download all the MP3 files from a sub directory
wget --level=1 --recursive --no-parent --accept mp3,MP3 http://example.com/mp3/

11. Download all images from a website in a common folder
wget --directory-prefix=files/pictures --no-directories --recursive --no-clobber --accept jpg,gif,png,jpeg http://example.com/images/

12. Download the PDF documents from a website through recursion but stay within specific domains.
wget --mirror --domains=abc.com,files.abc.com,docs.abc.com --accept=pdf http://abc.com/

13. Download all files from a website but exclude a few directories.
wget --recursive --no-clobber --no-parent --exclude-directories /forums,/support http://example.com
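
The counterpart of --accept is --reject; as a sketch, to download everything from the same placeholder site except images, list the unwanted extensions instead:
wget --recursive --no-clobber --no-parent --reject=jpg,jpeg,gif,png http://example.com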

Wget for Downloading Restricted Content

Wget can be used for downloading content from sites that are behind a login screen or ones that check for the HTTP referer and the User Agent strings of the bot to prevent screen scraping.

14. Download files from websites that check the User Agent and the HTTP Referer
wget --referer=http://google.com --user-agent="Mozilla/5.0 Firefox/4.0.1" http://nytimes.com

15. Download files from a password-protected site
wget --http-user=labnol --http-password=hello123 http://example.com/secret/file.zip

16. Fetch pages that are behind a login page. You need to replace user and password with the actual form fields, while the URL should point to the form submit (action) page. Cookies are enabled by default; the first command saves the session cookies and the second reuses them.
wget --save-cookies cookies.txt --keep-session-cookies --post-data 'user=labnol&password=123' http://example.com/login.php
wget --load-cookies cookies.txt --keep-session-cookies http://example.com/paywall

Retrieve File Details with wget

17. Find the size of a file without downloading it (look for Content-Length in the response, the size is in bytes)
wget --spider --server-response http://example.com/file.iso
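
Since wget prints the server response to standard error, a quick way to pull out just the size (a sketch for a Unix-like shell) is to filter it through grep:
wget --spider --server-response http://example.com/file.iso 2>&1 | grep -i Content-Length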

18. Download a file and display the content on screen without saving it locally.
wget --output-document=- --quiet google.com/humans.txt
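
The same command can be written with short options; -q is quiet and -O - writes to standard output, which makes it handy for piping into other tools:
wget -qO- google.com/humans.txt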

19. Know the last modified date of a web page (check the Last-Modified tag in the HTTP header).
wget --server-response --spider http://www.labnol.org/

20. Check the links on your website to ensure that they are working. The spider option will not save the pages locally.
wget --output-file=logfile.txt --recursive --spider http://example.com
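
After the crawl finishes, the log can be searched for failures; as a rough sketch (the exact log wording varies between wget versions), broken links can be spotted like this:
grep -B2 'broken link' logfile.txt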

Also see: Essential Linux Commands

Wget – How to be nice to the server?

The wget tool is essentially a spider that scrapes / leeches web pages, but some web hosts may block these spiders with the robots.txt files. Also, wget will not follow links on web pages that use the rel=nofollow attribute.

You can however force wget to ignore the robots.txt and the nofollow directives by adding the switch --execute robots=off to all your wget commands. If a web host is blocking wget requests by looking at the User Agent string, you can always fake that with the --user-agent=Mozilla switch.

The wget command will put additional strain on the site's server because it will continuously traverse the links and download files. A good scraper would therefore limit the retrieval rate and also include a wait period between consecutive fetch requests to reduce the server load.

wget --limit-rate=20k --wait=60 --random-wait --mirror example.com

In the above example, we have limited the download bandwidth rate to 20 KB/s and the wget utility will wait anywhere between 30 and 90 seconds before retrieving the next resource.

Finally, a little quiz. What do you think this wget command will do?
wget --span-hosts --level=inf --recursive dmoz.org
-------------------------------------------------------
How to Download and Save YouTube Videos With VLC
Open VLC and click "Open Media"
Click "Network" and paste the YouTube URL
If using Windows, select "Tools" and then "Codec Information"
Find the "Location" bar at the bottom and copy that URL
Paste that URL into your browser
Right click the video and select "Save Video As"
Name the file and save it to the desired location
--------------------------------------------------------
Continue an Incomplete Download
wget -c file

Mirror an Entire Website
wget -m http://example.com
--convert-links
--page-requisites
--no-parent
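
Put together, a typical mirroring invocation (a sketch using the same placeholder site) looks like this:
wget -m --convert-links --page-requisites --no-parent http://example.com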

Download an Entire Directory
wget -r ftp://example.com/folder

Download a List of Files at Once
wget -i download.txt

A Few More Tricks
wget http://website.com/files/file.zip

Basic startup options
-V, --version
-h, --help
-b, --background
-o logfile, --output-file=logfile
-F, --force-html
-q, --quiet
-B URL, --base=URL
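
For example, combining two of these, the following sketch starts a large download in the background and writes progress to a log file (file names are placeholders):
wget -b -o download.log http://example.com/big.file.iso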

-------------------------------------------------------
wget -i URL.txt
wget -r -l1 -A.bz2 http://aaa.com/directory
wget https://www.kernel.org/pub/linux/kernel/v3.0/linux-3.2.{1..15}.tar.bz2

C:\>wget -help
GNU Wget 1.21.2, a non-interactive network retriever.
Usage: wget [OPTION]... [URL]...

Mandatory arguments to long options are mandatory for short options too.

Startup:
  -V,  --version                   display the version of Wget and exit
  -h,  --help                      print this help
  -b,  --background                go to background after startup
  -e,  --execute=COMMAND           execute a `.wgetrc'-style command

Logging and input file:
  -o,  --output-file=FILE          log messages to FILE
  -a,  --append-output=FILE        append messages to FILE
  -d,  --debug                     print lots of debugging information
  -q,  --quiet                     quiet (no output)
  -v,  --verbose                   be verbose (this is the default)
  -nv, --no-verbose                turn off verboseness, without being quiet
       --report-speed=TYPE         output bandwidth as TYPE. TYPE can be bits
  -i,  --input-file=FILE           download URLs found in local or external FILE
       --input-metalink=FILE       download files covered in local Metalink FILE
  -F,  --force-html                treat input file as HTML
  -B,  --base=URL                  resolves HTML input-file links (-i -F) relative to URL
       --config=FILE               specify config file to use
       --no-config                 do not read any config file
       --rejected-log=FILE         log reasons for URL rejection to FILE

Download:
  -t,  --tries=NUMBER              set number of retries to NUMBER (0 unlimits)
       --retry-connrefused         retry even if connection is refused
       --retry-on-http-error=ERRORS   comma-separated list of HTTP errors to retry
  -O,  --output-document=FILE      write documents to FILE
  -nc, --no-clobber                skip downloads that would download to existing files (overwriting them)
       --no-netrc                  don't try to obtain credentials from .netrc
  -c,  --continue                  resume getting a partially-downloaded file
       --start-pos=OFFSET          start downloading from zero-based position OFFSET
       --progress=TYPE             select progress gauge type
       --show-progress             display the progress bar in any verbosity mode
  -N,  --timestamping              don't re-retrieve files unless newer than local
       --no-if-modified-since      don't use conditional if-modified-since get requests in timestamping mode
       --no-use-server-timestamps  don't set the local file's timestamp by the one on the server
  -S,  --server-response           print server response
       --spider                    don't download anything
  -T,  --timeout=SECONDS           set all timeout values to SECONDS
       --dns-servers=ADDRESSES     list of DNS servers to query (comma separated)
       --bind-dns-address=ADDRESS  bind DNS resolver to ADDRESS (hostname or IP) on local host
       --dns-timeout=SECS          set the DNS lookup timeout to SECS
       --connect-timeout=SECS      set the connect timeout to SECS
       --read-timeout=SECS         set the read timeout to SECS
  -w,  --wait=SECONDS              wait SECONDS between retrievals (applies if more then 1 URL is to be retrieved)
       --waitretry=SECONDS         wait 1..SECONDS between retries of a retrieval (applies if more then 1 URL is to be retrieved)
       --random-wait               wait from 0.5*WAIT...1.5*WAIT secs between retrievals (applies if more then 1 URL is to be retrieved)
       --no-proxy                  explicitly turn off proxy
  -Q,  --quota=NUMBER              set retrieval quota to NUMBER
       --bind-address=ADDRESS      bind to ADDRESS (hostname or IP) on local host
       --limit-rate=RATE           limit download rate to RATE
       --no-dns-cache              disable caching DNS lookups
       --restrict-file-names=OS    restrict chars in file names to ones OS allows
       --ignore-case               ignore case when matching files/directories
  -4,  --inet4-only                connect only to IPv4 addresses
  -6,  --inet6-only                connect only to IPv6 addresses
       --prefer-family=FAMILY      connect first to addresses of specified family, one of IPv6, IPv4, or none
       --user=USER                 set both ftp and http user to USER
       --password=PASS             set both ftp and http password to PASS
       --ask-password              prompt for passwords
       --use-askpass=COMMAND       specify credential handler for requesting username and password. If no COMMAND is specified the WGET_ASKPASS or the SSH_ASKPASS environment variable is used.
       --no-iri                    turn off IRI support
       --local-encoding=ENC        use ENC as the local encoding for IRIs
       --remote-encoding=ENC       use ENC as the default remote encoding
       --unlink                    remove file before clobber
       --keep-badhash              keep files with checksum mismatch (append .badhash)
       --metalink-index=NUMBER     Metalink application/metalink4+xml metaurl ordinal NUMBER
       --metalink-over-http        use Metalink metadata from HTTP response headers
       --preferred-location        preferred location for Metalink resources

Directories:
  -nd, --no-directories            don't create directories
  -x,  --force-directories         force creation of directories
  -nH, --no-host-directories       don't create host directories
       --protocol-directories      use protocol name in directories
  -P,  --directory-prefix=PREFIX   save files to PREFIX/..
       --cut-dirs=NUMBER           ignore NUMBER remote directory components

HTTP options:
       --http-user=USER            set http user to USER
       --http-password=PASS        set http password to PASS
       --no-cache                  disallow server-cached data
       --default-page=NAME         change the default page name (normally this is 'index.html'.)
  -E,  --adjust-extension          save HTML/CSS documents with proper extensions
       --ignore-length             ignore 'Content-Length' header field
       --header=STRING             insert STRING among the headers
       --compression=TYPE          choose compression, one of auto, gzip and none. (default: none)
       --max-redirect              maximum redirections allowed per page
       --proxy-user=USER           set USER as proxy username
       --proxy-password=PASS       set PASS as proxy password
       --referer=URL               include 'Referer: URL' header in HTTP request
       --save-headers              save the HTTP headers to file
  -U,  --user-agent=AGENT          identify as AGENT instead of Wget/VERSION
       --no-http-keep-alive        disable HTTP keep-alive (persistent connections)
       --no-cookies                don't use cookies
       --load-cookies=FILE         load cookies from FILE before session
       --save-cookies=FILE         save cookies to FILE after session
       --keep-session-cookies      load and save session (non-permanent) cookies
       --post-data=STRING          use the POST method; send STRING as the data
       --post-file=FILE            use the POST method; send contents of FILE
       --method=HTTPMethod         use method "HTTPMethod" in the request
       --body-data=STRING          send STRING as data. --method MUST be set
       --body-file=FILE            send contents of FILE. --method MUST be set
       --content-disposition       honor the Content-Disposition header when choosing local file names (EXPERIMENTAL)
       --content-on-error          output the received content on server errors
       --auth-no-challenge         send Basic HTTP authentication information without first waiting for the server's challenge

HTTPS (SSL/TLS) options:
       --secure-protocol=PR        choose secure protocol, one of auto, SSLv2, SSLv3, TLSv1, TLSv1_1, TLSv1_2 and PFS
       --https-only                only follow secure HTTPS links
       --no-check-certificate      don't validate the server's certificate
       --certificate=FILE          client certificate file
       --certificate-type=TYPE     client certificate type, PEM or DER
       --private-key=FILE          private key file
       --private-key-type=TYPE     private key type, PEM or DER
       --ca-certificate=FILE       file with the bundle of CAs
       --ca-directory=DIR          directory where hash list of CAs is stored
       --crl-file=FILE             file with bundle of CRLs
       --pinnedpubkey=FILE/HASHES  Public key (PEM/DER) file, or any number of base64 encoded sha256 hashes preceded by 'sha256//' and separated by ';', to verify peer against
       --random-file=FILE          file with random data for seeding the SSL PRNG
       --ciphers=STR               Set the priority string (GnuTLS) or cipher list string (OpenSSL) directly. Use with care. This option overrides --secure-protocol. The format and syntax of this string depend on the specific SSL/TLS engine.

HSTS options:
       --no-hsts                   disable HSTS
       --hsts-file                 path of HSTS database (will override default)

FTP options:
       --ftp-user=USER             set ftp user to USER
       --ftp-password=PASS         set ftp password to PASS
       --no-remove-listing         don't remove '.listing' files
       --no-glob                   turn off FTP file name globbing
       --no-passive-ftp            disable the "passive" transfer mode
       --preserve-permissions      preserve remote file permissions
       --retr-symlinks             when recursing, get linked-to files (not dir)

FTPS options:
       --ftps-implicit             use implicit FTPS (default port is 990)
       --ftps-resume-ssl           resume the SSL/TLS session started in the control connection when opening a data connection
       --ftps-clear-data-connection  cipher the control channel only; all the data will be in plaintext
       --ftps-fallback-to-ftp      fall back to FTP if FTPS is not supported in the target server

WARC options:
       --warc-file=FILENAME        save request/response data to a .warc.gz file
       --warc-header=STRING        insert STRING into the warcinfo record
       --warc-max-size=NUMBER      set maximum size of WARC files to NUMBER
       --warc-cdx                  write CDX index files
       --warc-dedup=FILENAME       do not store records listed in this CDX file
       --no-warc-compression       do not compress WARC files with GZIP
       --no-warc-digests           do not calculate SHA1 digests
       --no-warc-keep-log          do not store the log file in a WARC record
       --warc-tempdir=DIRECTORY    location for temporary files created by the WARC writer

Recursive download:
  -r,  --recursive                 specify recursive download
  -l,  --level=NUMBER              maximum recursion depth (inf or 0 for infinite)
       --delete-after              delete files locally after downloading them
  -k,  --convert-links             make links in downloaded HTML or CSS point to local files
       --convert-file-only         convert the file part of the URLs only (usually known as the basename)
       --backups=N                 before writing file X, rotate up to N backup files
  -K,  --backup-converted          before converting file X, back up as X.orig
  -m,  --mirror                    shortcut for -N -r -l inf --no-remove-listing
  -p,  --page-requisites           get all images, etc. needed to display HTML page
       --strict-comments           turn on strict (SGML) handling of HTML comments

Recursive accept/reject:
  -A,  --accept=LIST               comma-separated list of accepted extensions
  -R,  --reject=LIST               comma-separated list of rejected extensions
       --accept-regex=REGEX        regex matching accepted URLs
       --reject-regex=REGEX        regex matching rejected URLs
       --regex-type=TYPE           regex type (posix|pcre)
  -D,  --domains=LIST              comma-separated list of accepted domains
       --exclude-domains=LIST      comma-separated list of rejected domains
       --follow-ftp                follow FTP links from HTML documents
       --follow-tags=LIST          comma-separated list of followed HTML tags
       --ignore-tags=LIST          comma-separated list of ignored HTML tags
  -H,  --span-hosts                go to foreign hosts when recursive
  -L,  --relative                  follow relative links only
  -I,  --include-directories=LIST  list of allowed directories
       --trust-server-names        use the name specified by the redirection URL's last component
  -X,  --exclude-directories=LIST  list of excluded directories
  -np, --no-parent                 don't ascend to the parent directory

Email bug reports, questions, discussions to <bug-wget@gnu.org>
and/or open issues at https://savannah.gnu.org/bugs/?func=additem&group=wget.