To reduce and prevent the damage to the natural environment caused by illegal land reclamation and mineral excavation, ETS-YOLO, a small-target monitoring and recognition algorithm, is proposed for detecting various types of engineering vehicles in complex environments from images captured by cameras mounted on high towers. First, the backbone feature extraction network of YOLOv5s is replaced with the EfficientViT network to improve attention diversity and significantly reduce the number of model parameters. Second, a small-target detection layer is added to strengthen the network's extraction of shallow feature information and thereby improve small-target detection performance. Finally, the original non-maximum suppression (NMS) function is replaced with the soft non-maximum suppression algorithm (soft-NMS) so that occluded and overlapping targets are recognized more reliably. The experimental results show that the improved model achieves a mean average precision (mAP) of 93.3%, a parameter count of 5.90 M, and a detection speed of 52 f/s. Compared with the YOLOv5s model, the mAP is improved by 2.6% and the parameter count is reduced by 16.1%.
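The soft-NMS step can be illustrated with a minimal sketch. It assumes the Gaussian score-decay variant of soft-NMS and boxes in `[x1, y1, x2, y2]` format. The function names, `sigma`, and `score_thresh` are hypothetical choices for illustration and are not taken from the ETS-YOLO implementation.

```python
import numpy as np


def iou_one_to_many(box, boxes):
    """IoU between one box and an array of boxes, all as [x1, y1, x2, y2]."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter + 1e-9)


def soft_nms(boxes, scores, sigma=0.5, score_thresh=0.001):
    """Gaussian soft-NMS (assumed variant): decay overlapping scores
    instead of discarding the boxes outright.

    Returns the kept box indices in selection order and their final scores.
    """
    boxes = np.asarray(boxes, dtype=float)
    scores = np.asarray(scores, dtype=float).copy()
    remaining = np.arange(len(scores))
    keep, kept_scores = [], []
    while remaining.size > 0:
        # Select the highest-scoring box among those still under consideration.
        best = np.argmax(scores[remaining])
        i = remaining[best]
        keep.append(int(i))
        kept_scores.append(float(scores[i]))
        remaining = np.delete(remaining, best)
        if remaining.size == 0:
            break
        # Down-weight the rest according to their overlap with the selected box.
        ious = iou_one_to_many(boxes[i], boxes[remaining])
        scores[remaining] *= np.exp(-(ious ** 2) / sigma)
        # Drop boxes whose score has decayed below the threshold.
        remaining = remaining[scores[remaining] > score_thresh]
    return keep, kept_scores


if __name__ == "__main__":
    demo_boxes = [[0, 0, 10, 10], [1, 1, 11, 11], [50, 50, 60, 60]]
    demo_scores = [0.9, 0.8, 0.7]
    print(soft_nms(demo_boxes, demo_scores))
```

In the demo, the second box overlaps the first with an IoU of roughly 0.68. Hard NMS with a typical IoU threshold of 0.5 would discard it, whereas this sketch keeps it with a reduced score, which is why soft-NMS tends to help when vehicles partly occlude one another.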