09/04/2019
So you want to build your own collection of boxes bounding pixels-picturing-objects. First of all, you need to gather a series of such groups of pixels (aka images). You can easily do this by means of any of the notorious tubes. Go find videos (the more the merrier) showing the sought objects; for example, if you are interested in detecting low-res musical instruments, try
$ youtube-dl -f 'best[ext=mp4]' -o ./video.mp4 'https://www.youtube.com/watch?v=s2YiJ13MRUE'
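Since more videos means more frames, it can pay to fetch a whole list of URLs in one go. Below is a minimal sketch, not a definitive recipe: it assumes a hypothetical `urls.txt` with one URL per line and a hypothetical `./videos` output directory, and wraps youtube-dl's `-a` (batch file) option in a small shell function whose name, `fetch_videos`, is made up for illustration. Naming each file after its video id (the `%(id)s` output-template field) keeps downloads from overwriting each other.

```shell
#!/bin/sh
# Hypothetical helper: download every video listed in a text file
# (one URL per line) into OUT_DIR, one MP4 per video id.
# Usage: fetch_videos URL_LIST [OUT_DIR]
fetch_videos() {
  url_list=$1
  out_dir=${2:-./videos}   # assumed default location, change at will
  mkdir -p "$out_dir"
  # -a reads the URLs from a file; %(id)s and %(ext)s are filled in
  # by youtube-dl, so each video gets its own file name.
  # The format selector is quoted so shells like zsh do not glob it.
  youtube-dl -f 'best[ext=mp4]' -o "$out_dir/%(id)s.%(ext)s" -a "$url_list"
}
```

With the list in place, `fetch_videos urls.txt` drops every clip into `./videos`.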
Then it's time to butcher the video into frames. You do not need to chop it finely, though: one image per second will do (-r 1).
$ mkdir -p imgs
$ ffmpeg -i video.mp4 -r 1 -f image2 ./imgs/image-%07d.png
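With several videos on disk, the same chopping can be done in a loop. This is a minimal sketch, assuming the clips are MP4 files sitting in a `./videos` directory (a hypothetical location, as is the function name `extract_frames`). It keeps the `-r 1` rate by default and prefixes each frame with the name of its source video, so frame 1 of one clip does not clobber frame 1 of another.

```shell
#!/bin/sh
# Hypothetical helper: chop every MP4 in VIDEO_DIR into PNG frames at
# FPS frames per second, written to IMG_DIR as <video-name>-NNNNNNN.png.
# Usage: extract_frames [VIDEO_DIR] [IMG_DIR] [FPS]
extract_frames() {
  video_dir=${1:-./videos}
  img_dir=${2:-./imgs}
  fps=${3:-1}
  mkdir -p "$img_dir"            # ffmpeg will not create the output directory
  for video in "$video_dir"/*.mp4; do
    [ -e "$video" ] || continue  # the glob matched nothing: no videos here
    name=$(basename "$video" .mp4)
    # -nostdin keeps ffmpeg from reading stdin for interactive commands;
    # -loglevel error hides the usual progress chatter.
    ffmpeg -nostdin -loglevel error -i "$video" -r "$fps" -f image2 \
      "$img_dir/$name-%07d.png"
  done
}
```

Running `extract_frames ./videos ./imgs 1` should leave roughly one PNG per second of footage in `./imgs`, e.g. `clip-0000001.png` onwards for a hypothetical `clip.mp4`.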
Now the fun begins: fire up your trusty annotation tool and go for it.
Tools of the trade: