Trained YOLO for 150 epochs on SC dataset.
- No more false positives of bottles/mallets predicted on background images *in the validation predictions*
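The training run above can be sketched with the Ultralytics YOLO API. This is a sketch, not the exact command used: the `yolov8n.pt` checkpoint and the `sc_dataset/data.yaml` path are placeholder assumptions.

```python
# Sketch of the 150-epoch run, assuming the Ultralytics YOLO API.
# Checkpoint and dataset path are hypothetical placeholders.
from ultralytics import YOLO

model = YOLO("yolov8n.pt")  # small pretrained checkpoint as a starting point
model.train(data="sc_dataset/data.yaml", epochs=150, imgsz=640)
metrics = model.val()  # validation predictions and confusion matrix are saved in runs/
```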
- Found some other teams' public datasets on Roboflow (such as this one).

Improving the quality of mallet example images for textual inversion.
- The Diffusers script picks a caption for each training step with random.choice from the 'imagenet_templates_small' list
- Modified it so the prompt labels correspond to each specific image
- Need to ensure example images are high quality, show minimal orientation differences, and ideally have varied, possibly complex backgrounds so the learned token is more robust
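The per-image prompt change can be sketched like this. The filenames and caption templates are hypothetical examples; the stock script instead draws a template with random.choice(imagenet_templates_small).

```python
# Sketch of per-image prompts for the Diffusers textual-inversion script.
# Filenames and templates below are hypothetical stand-ins.
PER_IMAGE_PROMPTS = {
    "mallet_front.jpg": "a photo of a {} on a workbench",
    "mallet_side.jpg": "a side view of a {} on grass",
    "mallet_held.jpg": "a person holding a {}",
}

def prompt_for(filename: str, placeholder_token: str = "<orange-mallet>") -> str:
    """Return the caption for one training image, with the learned token filled in."""
    template = PER_IMAGE_PROMPTS.get(filename, "a photo of a {}")
    return template.format(placeholder_token)
```

Inside the training dataset's `__getitem__`, this replaces the random template draw with a lookup keyed on the image being loaded.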
Used Hugging Face's example script to fine-tune Stable Diffusion 1.5 with new <wide-mouth-bottle> and <orange-mallet> tokens.
- Initial training run on the wide-mouth bottle went very poorly: the example images had large backgrounds, the model struggled with the "nalgene" text on the bottle, and the bottle's shape/color were inconsistent across examples
- Second run with the orange mallet went better, but there is still a lot of room for improvement
- Stable Diffusion 3 and 3.5 use a transformer instead of a U-Net, which requires some modification of the Hugging Face example script
- Stable Diffusion 2 uses a U-Net, but online consensus seems to be that 1.5 is actually mostly better and easier to prompt
- To-do: experiment with modifying the prompt templates and using example images with a smaller mallet
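Once a run finishes, the learned token can be used for generation. A minimal inference sketch with Diffusers, assuming placeholder paths: `load_textual_inversion` reads the learned_embeds file that the training script writes out.

```python
# Sketch: generate with the learned <orange-mallet> token in SD 1.5.
# Paths and the prompt are placeholders; needs a GPU for reasonable speed.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
# learned_embeds.safetensors is produced by the textual-inversion training script
pipe.load_textual_inversion(
    "textual_inversion_mallet/learned_embeds.safetensors", token="<orange-mallet>"
)
image = pipe("a photo of a <orange-mallet> on a rocky trail").images[0]
image.save("mallet_sample.png")
```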
Trained YOLO on original Roboflow dataset to get a baseline.
- The confusion matrix shows it is very bad at recognizing background images without an object
- Very likely because only 60 of the 2,000 images belong to the null class
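The imbalance is easy to verify from the label files, since YOLO-format datasets mark a background-only image with an empty (or absent) label file. A sketch, with a hypothetical labels directory:

```python
# Sketch: count background-only (null-class) images in a YOLO-format dataset.
# An empty .txt label file means the image has no annotated objects.
from pathlib import Path

def count_null_images(labels_dir: str) -> tuple[int, int]:
    """Return (null_count, total_count) over the .txt label files."""
    files = sorted(Path(labels_dir).glob("*.txt"))
    null = sum(1 for f in files if not f.read_text().strip())
    return null, len(files)
```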
Started work on generating synthetic data with the following pipeline:
- Use Stable Diffusion to generate a larger dataset with images in wide variety of contexts
- Use Hugging Face's textual inversion API to add the wide-mouth bottle and mallet as custom tokens
- Chose textual inversion over LoRA and DreamBooth because the embeddings are small files, training can run locally, and bottles and mallets are already common objects
- Use Grounding DINO to automatically annotate each image
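The auto-annotation step needs the detector's pixel boxes converted to YOLO's normalized `cx cy w h` format. A sketch of that conversion, assuming boxes in the `{xmin, ymin, xmax, ymax}` dict shape that the Transformers zero-shot-object-detection pipeline (e.g. with a Grounding DINO checkpoint) returns:

```python
# Sketch: turn one detector box (pixel coords) into a YOLO label line.
# YOLO expects: class_id center_x center_y width height, all normalized to [0, 1].
def to_yolo_line(box: dict, class_id: int, img_w: int, img_h: int) -> str:
    cx = (box["xmin"] + box["xmax"]) / 2 / img_w
    cy = (box["ymin"] + box["ymax"]) / 2 / img_h
    w = (box["xmax"] - box["xmin"]) / img_w
    h = (box["ymax"] - box["ymin"]) / img_h
    return f"{class_id} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}"
```

One line per detection, written to a `.txt` file with the same stem as the image, gives a dataset YOLO can train on directly.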