pcwizz

Sep 15th, 2014
# Using golang
## To crawl for images

### Rationale

For a project I'm working on I need a couple of hundred images. I also want to minimise the laborious manual work required to select relevant images. Golang seems well suited to this kind of task. I might do similar things in the future, so I want something fairly generic.

### Thoughts before coding starts

To make this tool more useful for everyone, future me included, it should be controlled with command line flags. I believe small tools are better like this, while larger, more persistent applications like web servers are better off with config files. In keeping with the Unix tradition that command line flags come from, it should also write any output to stdout and any errors to stderr.

Seeing as the tool will save files (to a directory passed to it as an argument), it should follow a set format for file names. I think the name must include the keyword the image was found with and a unique ID (uid). I have decided on this:

`{keyword}-{uid}.jpg`

I also think such a tool should be able to limit how long it runs and stop itself gracefully. There are two reasonable ways to limit a run: the first is by the number of images downloaded, and the second is by elapsed time. Both seem rather simple to implement, so I'm going to do both. I am sure you can think of circumstances where each of these would be useful. In fact, the user could set both and the tool would stop at the first limit it reaches, with the time limit acting like a timeout.

Like most modern Unix-style download tools, it should continually output some information on its progress. An estimate of the time remaining may also fit into this.

To ensure the tool doesn't crawl anywhere it shouldn't, it should be able to read the robots.txt file commonly found in the root directory of websites and act on the relevant entries.
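
To give an idea of what acting on robots.txt involves, here is a deliberately simplified sketch. It only collects the `Disallow` prefixes from groups addressed to all crawlers (`User-agent: *`) and ignores `Allow` lines, wildcards and crawler-specific groups, so a real tool would need more than this. The function names are hypothetical.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// disallowedPaths returns the Disallow prefixes from the groups of a
// robots.txt body that are addressed to all crawlers ("User-agent: *").
func disallowedPaths(robots string) []string {
	var rules []string
	applies := false   // does the current group apply to us?
	prevAgent := false // was the previous rule line a User-agent line?
	sc := bufio.NewScanner(strings.NewReader(robots))
	for sc.Scan() {
		line := sc.Text()
		if i := strings.Index(line, "#"); i >= 0 {
			line = line[:i] // drop comments
		}
		key, value, ok := strings.Cut(line, ":")
		if !ok {
			continue // blank or malformed line
		}
		key = strings.ToLower(strings.TrimSpace(key))
		value = strings.TrimSpace(value)
		isAgent := key == "user-agent"
		switch {
		case isAgent:
			if !prevAgent {
				applies = false // a new group starts here
			}
			if value == "*" {
				applies = true
			}
		case key == "disallow" && applies && value != "":
			rules = append(rules, value)
		}
		prevAgent = isAgent
	}
	return rules
}

// allowed reports whether path is free of every disallowed prefix.
func allowed(path string, rules []string) bool {
	for _, prefix := range rules {
		if strings.HasPrefix(path, prefix) {
			return false
		}
	}
	return true
}

func main() {
	rules := disallowedPaths("User-agent: *\nDisallow: /private/\n")
	fmt.Println(allowed("/images/cat.jpg", rules))  // true
	fmt.Println(allowed("/private/cat.jpg", rules)) // false
}
```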

### What should this tool be called?

This may not seem like a relevant question to answer, but I need something to name the repository, so a name would be nice. I have chosen the name gimgcrawl. This name describes what it is:

1. It is general
2. It is a utility for crawling for images