- https://www.youtube.com/watch?v=eIWFnNz8mF4&t=217s
- A concurrent scraper written in Go (@golang)
- How it works:
- You call the binary (iterscraper) with a URL template like "http://foo.com/%d", where '%d' is a placeholder that gets replaced by an ID, e.g. 'http://foo.com/1' up to 'http://foo.com/9'. The '-concurrency' flag sets how many goroutines run at the same time, and '-output' sets where the results are written. The '-nameQuery', '-addressQuery', and '-emailQuery' flags are the CSS selectors used to find what we are looking for (name, address, e-mail) on each page (e.g. 'http://foo.com/1').
- A basic package for scraping information from websites whose URLs contain an incrementing integer. Information is retrieved from HTML5 elements and output as a CSV.
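A hypothetical invocation might look like the sketch below. The selector values (`.name`, `.address`, `.email`) and the ID-range flags are made-up placeholders, not confirmed flags of the real binary; only '-concurrency', '-output', '-nameQuery', '-addressQuery', '-emailQuery' and the '%d' URL pattern are named in the notes above.

```shell
# Hypothetical invocation -- selectors and range flags are assumptions:
#   iterscraper "http://foo.com/%d" \
#     -concurrency 5 -output out.csv \
#     -nameQuery ".name" -addressQuery ".address" -emailQuery ".email"

# The '%d' placeholder expands like a printf verb, one URL per ID:
printf 'http://foo.com/%d\n' 1 9
```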
- 1. Fetch the code:
- go get github.com/philipithomas/iterscraper
- 2. Go to the code
- cd $GOPATH/src/github.com/philipithomas/iterscraper
- 3. Create a new branch
- git checkout -b work
- 4. Open VSCode
- code .
- main.go
- -------
- * Defines all the flags
- * Parses those flags
- * Uses a WaitGroup and channels to coordinate the different parts of the work
- * There are three parts:
- * emitTasks --> generates every single task that we need to do. Each task is a URL with an ID,
- and is sent to the 'taskChan' channel.
- * scrape --> a worker that receives tasks from taskChan, fetches and parses the URL, extracts
- whatever we are looking for, and sends the results to the 'dataChan' channel.
- * writeSites --> writes all the output to a CSV file
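The three-part pipeline above can be sketched as a small runnable Go program. The type and function names mirror the ones in the notes (emitTasks, scrape, writeSites, taskChan, dataChan), but the bodies are simplified stand-ins: the "scraped" value is faked so the sketch runs offline, and writeSites collects results in memory instead of writing CSV.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// task and result are illustrative stand-ins for iterscraper's real types.
type task struct {
	id  int
	url string
}

type result struct {
	id   int
	name string // a real result would also carry address and e-mail
}

// emitTasks generates one task per ID and closes the channel when done.
func emitTasks(n int, taskChan chan<- task) {
	for i := 1; i <= n; i++ {
		taskChan <- task{id: i, url: fmt.Sprintf("http://foo.com/%d", i)}
	}
	close(taskChan)
}

// scrape is one worker: it drains taskChan and sends results to dataChan.
// The real worker would fetch t.url and apply the CSS selectors; here the
// extracted value is faked so the sketch stays self-contained.
func scrape(taskChan <-chan task, dataChan chan<- result, wg *sync.WaitGroup) {
	defer wg.Done()
	for t := range taskChan {
		dataChan <- result{id: t.id, name: fmt.Sprintf("name-%d", t.id)}
	}
}

// writeSites collects every result (the real version writes CSV rows).
func writeSites(dataChan <-chan result, done chan<- []result) {
	var all []result
	for r := range dataChan {
		all = append(all, r)
	}
	sort.Slice(all, func(i, j int) bool { return all[i].id < all[j].id })
	done <- all
}

// run wires the three parts together with channels and a WaitGroup.
func run(concurrency, n int) []result {
	taskChan := make(chan task)
	dataChan := make(chan result)
	done := make(chan []result)

	go emitTasks(n, taskChan)
	go writeSites(dataChan, done)

	var wg sync.WaitGroup
	wg.Add(concurrency)
	for i := 0; i < concurrency; i++ {
		go scrape(taskChan, dataChan, &wg)
	}
	wg.Wait()
	close(dataChan) // every worker has finished, so no more results will arrive

	return <-done
}

func main() {
	for _, r := range run(3, 5) {
		fmt.Println(r.id, r.name)
	}
}
```

Closing taskChan ends every worker's range loop, the WaitGroup tells us when all workers are done, and only then is dataChan closed so writeSites can finish.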