So you want to build your collection of boxes bounding pixels-picturing-objects.
First of all, you need to gather a series of such groups of pixels (aka images).
You can easily accomplish this by means of any of the notorious tubes.
Go find videos (the more the merrier) showing the sought objects; for example,
in case you are interested in detecting low-res musical instruments, try

$ youtube-dl -f 'best[ext=mp4]' -o ./video.mp4 <video-url>
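
Since the more the merrier, you will probably want to grab several videos in one go. Here is a minimal sketch, assuming you keep one URL per line in a file (urls.txt is a made-up name, not something from this page). It relies on youtube-dl's -a (batch file) option and its %(id)s output template, so each video lands in its own .mp4 instead of overwriting video.mp4. The YOUTUBE_DL variable is a hypothetical convenience for swapping the binary, or for a dry run with echo.

```shell
#!/bin/sh
# Download every URL listed in a text file, one mp4 per video.
# Assumption: the list file holds one video URL per line.

download_videos() {
    list=$1
    # Bail out early if the list is missing or unreadable.
    [ -r "$list" ] || { echo "cannot read $list" >&2; return 1; }
    # -a reads the URLs from the file; %(id)s names each file after its video id.
    # YOUTUBE_DL is a hypothetical override (set it to `echo` for a dry run).
    ${YOUTUBE_DL:-youtube-dl} -a "$list" -f 'best[ext=mp4]' -o './%(id)s.%(ext)s'
}

# Only runs when a urls.txt actually exists in the current directory.
if [ -f urls.txt ]; then
    download_videos urls.txt
fi
```

The format selector is quoted on purpose: unquoted square brackets are glob characters, and some shells will refuse to run the command when nothing matches them.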

Then it's time to butcher the video into frames.
You do not need to chop it finely tho; 1 image per second will do (-r 1).

$ ffmpeg -i video.mp4 -r 1 -f image2 ./imgs/image-%07d.png
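
If you ended up with more than one video, the same chopping can be looped. What follows is a rough sketch, not the one true way: it assumes the .mp4 files sit in the current directory and gives each video its own subdirectory under ./imgs (that layout is my assumption). The FFMPEG variable is a hypothetical hook for swapping the binary, or for a dry run with echo.

```shell
#!/bin/sh
# Extract one frame per second from every .mp4 in the current directory.
# Assumption: each video gets its own folder, ./imgs/<video name>/.

extract_frames() {
    video=$1
    name=$(basename "$video" .mp4)
    out_dir=./imgs/$name
    # ffmpeg does not create missing output directories, so make it first.
    mkdir -p "$out_dir"
    # Same flags as the single-video command: -r 1 keeps one image per second.
    # FFMPEG is a hypothetical override (set it to `echo` for a dry run).
    ${FFMPEG:-ffmpeg} -i "$video" -r 1 -f image2 "$out_dir/image-%07d.png"
}

for video in ./*.mp4; do
    [ -e "$video" ] || continue   # the glob matched nothing, skip
    extract_frames "$video"
done
```

Keeping the frames of each video in a separate folder avoids one video's image-0000001.png clobbering another's.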

Now the fun begins: fire up your trusty annotation tool and go for it.

[image: falling protocol droid]

Tools of the trade:
