# Using golang
## To crawl for images
### Rationale
For a project I'm working on I need a few hundred images, and I want to minimise the laborious manual work required to select relevant ones. Golang seems built for this kind of task. I might do similar things in the future, so I want something fairly generic.
### Thoughts before coding starts
To make this tool more useful for everyone, including future me, it should be controlled with command line flags. I believe small tools are better like this, while larger, more persistent applications like web servers are better off with config files. In keeping with the inherited tradition of command line tools, it should also produce any output on stdout and any errors on stderr.
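As a sketch of that convention, fatal errors could go through a small helper that writes to stderr and exits non-zero, keeping stdout clean for real output; the `die` helper and the usage string are placeholders of mine:

```go
package main

import (
	"fmt"
	"os"
)

// die prints a fatal error on stderr and exits non-zero,
// so stdout stays reserved for normal output.
func die(format string, args ...any) {
	fmt.Fprintf(os.Stderr, format+"\n", args...)
	os.Exit(1)
}

func main() {
	if len(os.Args) < 2 {
		die("usage: gimgcrawl -s <start URI> [options]")
	}
	fmt.Println("starting crawl...") // normal output -> stdout
}
```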
Seeing as the tool will save files (to a directory passed to it as an argument), it should follow a set format for file names. I think the name must include the keyword the image was found with and a uid. I have decided on this:
`{keyword}-{uid}.jpg`
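A sketch of a helper producing names in that format; the random-hex uid is my own assumption, any collision-resistant id scheme would do:

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
	"path/filepath"
)

// imageFileName builds "{keyword}-{uid}.jpg" inside dir,
// using 8 random bytes rendered as hex for the uid.
func imageFileName(dir, keyword string) (string, error) {
	buf := make([]byte, 8)
	if _, err := rand.Read(buf); err != nil {
		return "", err
	}
	uid := hex.EncodeToString(buf)
	return filepath.Join(dir, fmt.Sprintf("%s-%s.jpg", keyword, uid)), nil
}

func main() {
	name, err := imageFileName("images", "cat")
	if err != nil {
		panic(err)
	}
	fmt.Println(name) // e.g. images/cat-1a2b3c4d5e6f7a8b.jpg
}
```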
I also think such a tool should be able to limit its run duration and stop itself gracefully. There are two reasonable ways to limit the runtime: by the number of images downloaded, and by elapsed time. Both seem rather simple to implement, so I'm going to do both. I am sure you can think of circumstances where each could be useful. In fact the user could set both, and the tool would stop at the first limit it reaches.
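A minimal sketch of that dual limit, using a `context` deadline for the time limit and a plain counter for the image limit; the function shape and the hard-coded limits are placeholders:

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// crawl stops at whichever limit is hit first: the deadline on ctx
// (the time limit) or maxImages downloads. Zero maxImages means
// "no count limit", mirroring the planned flag defaults.
func crawl(ctx context.Context, maxImages int) {
	downloaded := 0
	for {
		select {
		case <-ctx.Done():
			fmt.Println("time limit reached, stopping")
			return
		default:
		}
		time.Sleep(10 * time.Millisecond) // stand-in for fetching one image
		downloaded++
		if maxImages > 0 && downloaded >= maxImages {
			fmt.Println("image limit reached, stopping")
			return
		}
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()
	crawl(ctx, 100)
}
```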
Like most modern unix-style download tools, it should continually output some information on its progress. Time estimation may also fit into this.
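One way that progress line might look, in the style of wget or curl, writing to stderr so stdout stays clean; the ETA arithmetic here is a naive assumption (linear extrapolation from the average time per image):

```go
package main

import (
	"fmt"
	"os"
	"time"
)

// reportProgress overwrites a single status line via "\r".
// Writing to stderr keeps stdout free for machine-readable output.
func reportProgress(done, total int, started time.Time) {
	elapsed := time.Since(started)
	var eta time.Duration
	if done > 0 {
		eta = elapsed / time.Duration(done) * time.Duration(total-done)
	}
	fmt.Fprintf(os.Stderr, "\r%d/%d images  elapsed %s  eta %s",
		done, total, elapsed.Round(time.Second), eta.Round(time.Second))
}

func main() {
	start := time.Now()
	for i := 1; i <= 5; i++ {
		time.Sleep(200 * time.Millisecond) // stand-in for a download
		reportProgress(i, 5, start)
	}
	fmt.Fprintln(os.Stderr)
}
```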
To ensure the tool doesn't crawl anywhere it shouldn't, it should read and respect relevant entries in the robots.txt file commonly found in the root directory of websites.
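A deliberately minimal, stdlib-only reading of robots.txt as a sketch: it only collects `Disallow` prefixes under `User-agent: *`, ignoring `Allow` rules, wildcards, and per-agent sections, and treats a missing file as no restrictions. A real crawler would want a proper parser.

```go
package main

import (
	"bufio"
	"fmt"
	"net/http"
	"strings"
)

// disallowedPaths fetches robots.txt from the site root and returns
// the Disallow prefixes that apply to all agents ("User-agent: *").
func disallowedPaths(site string) ([]string, error) {
	resp, err := http.Get(strings.TrimRight(site, "/") + "/robots.txt")
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, nil // no robots.txt: assume no restrictions
	}

	var rules []string
	applies := false
	sc := bufio.NewScanner(resp.Body)
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		switch {
		case strings.HasPrefix(line, "User-agent:"):
			agent := strings.TrimSpace(strings.TrimPrefix(line, "User-agent:"))
			applies = agent == "*"
		case applies && strings.HasPrefix(line, "Disallow:"):
			if p := strings.TrimSpace(strings.TrimPrefix(line, "Disallow:")); p != "" {
				rules = append(rules, p)
			}
		}
	}
	return rules, sc.Err()
}

// allowed reports whether path escapes every Disallow prefix.
func allowed(path string, rules []string) bool {
	for _, r := range rules {
		if strings.HasPrefix(path, r) {
			return false
		}
	}
	return true
}

func main() {
	rules, err := disallowedPaths("https://example.com")
	if err != nil {
		fmt.Println("could not read robots.txt:", err)
		return
	}
	fmt.Println("/images allowed:", allowed("/images", rules))
}
```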
### What should this tool be called?
This may not seem like a relevant question to answer, but I need something to call the repository, so a name would be nice. I have chosen the name gimgcrawl. This name describes what it is:
1. It is general
2. It is a utility for crawling for images
### What licence should it be?
Like the previous question, this may seem irrelevant, but this is a tool, so people must be able to use it. I could just use a three-clause BSD licence, but then the community wouldn't necessarily gain from improvements made by third parties. It may be off-putting to some, but the GPL ensures that derivative work is contributed back to the community. This is why the project will be GPLv3.
### Manual
#### Flags
- `--help` # lists all options
- `-d {output directory}` # save images to this directory (default: working directory)
- `-t {time in seconds}` # limit execution duration by time (default: no limit)
- `-n {number of images}` # limit execution by number of images downloaded (default: no limit)
- `-s {URI}` # place to start crawling from (required)
- `-k {keywords}` # space-separated list of keywords; phrases can be wrapped in double quotes, e.g. "key phrase", and literal double quotes escaped with \ (default: download all images)
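These could map onto Go's standard `flag` package roughly as in the sketch below; note that `flag` accepts both `-` and `--` prefixes and responds to `--help` automatically when no such flag is defined.

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

func main() {
	// Names and defaults follow the manual above; 0 / "" mean
	// "no limit" or "not set".
	dir := flag.String("d", ".", "directory to save images into")
	seconds := flag.Int("t", 0, "time limit in seconds (0 = no limit)")
	maxImages := flag.Int("n", 0, "maximum number of images (0 = no limit)")
	start := flag.String("s", "", "URI to start crawling from (required)")
	keywords := flag.String("k", "", `space-separated keywords; quote phrases like "key phrase"`)
	flag.Parse()

	if *start == "" {
		fmt.Fprintln(os.Stderr, "gimgcrawl: -s is required")
		flag.Usage()
		os.Exit(2)
	}
	fmt.Printf("crawling %s for %q into %s (limits: %ds, %d images)\n",
		*start, *keywords, *dir, *seconds, *maxImages)
}
```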