Blind Spots in Autonomous Vision: Evaluating the Robustness of the YOLOv8 Model in Adverse Weather

Trevor Kwan
Faculty of Science, Capilano University
Instructors: Derek Howell, Eunice Chin
Dec. 9, 2025

Abstract

This study investigates the performance of the You Only Look Once (YOLO) object detection algorithm for real-time traffic light detection under challenging real-world conditions. Specifically, the study quantifies the degradation in mean average precision (mAP) and measures the average inference speed (in frames per second) across three adverse weather classifications (rain, snow, and fog), comparing each against a clear-sky baseline dataset. By combining large, diverse datasets that are publicly available online, a YOLO variant is trained and then evaluated against a separate dataset of traffic light images and videos. The findings provide benchmarks for evaluating the reliability of upcoming technological systems such as advanced driving assistance and autonomous vehicle systems.

1. Introduction

The rapid evolution of Autonomous Vehicle (AV) technology promises to fundamentally reshape modern transportation, offering potential solutions to traffic congestion, carbon emissions, and the high incidence of human-error-induced accidents. Chougule et al. (2024) state that as the automotive industry transitions to full automation, the reliability of the vehicle's perception system becomes the single most critical determinant of safety. Among the myriad tasks an autonomous agent must perform, accurate and instantaneous traffic light detection is paramount. Unlike static obstacles or lane markings, traffic lights are dynamic regulatory mechanisms that dictate the flow of right-of-way; a failure to strictly adhere to their signals can result in catastrophic, high-speed collisions at intersections. Consequently, the development of robust computer vision systems capable of interpreting these signals with near-perfect accuracy is a non-negotiable requirement for the safe deployment of AVs on public roads.

In recent years, deep learning algorithms, specifically Convolutional Neural Networks (CNNs), have emerged as the standard architecture for visual perception in AVs. Architectures such as You Only Look Once (YOLO) have revolutionized real-time object detection by treating detection as a single regression problem, allowing for inference speeds that rival human reaction times. In controlled, "ideal" environments characterized by clear skies, balanced illumination, and standard infrastructure, these models have achieved detection rates that approach, and in some metrics exceed, human performance. However, the operational design domain of a real-world vehicle is rarely ideal. It is subject to the chaotic entropy of the open environment, where severe weather (rain, snow, fog), extreme lighting variability (sun glare, urban night-glow), and non-standard traffic infrastructure introduce significant visual noise. The discrepancy between model performance in sterile training environments and chaotic real-world scenarios represents a critical "safety gap" that current research must address.

The progression of traffic light detection technology has been intrinsically linked to advancements in hardware acceleration. The advent of high-performance Graphics Processing Units (GPUs) has fundamentally altered the feasibility of deploying deep neural networks in mobile agents.
As noted by Zhang et al. (2010), image processing algorithms are computationally expensive and highly parallelizable; the GPU's architecture allows for the simultaneous calculation of thousands of pixel matrices, enabling the vehicle to process high-resolution video feeds in milliseconds. This computational throughput supports complex architectures that often combine the spatial pattern recognition of CNNs with the temporal memory of Recurrent Neural Networks (RNNs). While CNNs are the primary engine for feature extraction, identifying the shapes, colors, and edges that constitute a "traffic light", RNNs provide the necessary temporal context. The state of a traffic light is a temporal phenomenon; a yellow light is not merely a colored bulb but a sequential warning following a green signal and preceding a red one. By retaining information from previous frames, RNN modules allow the AI to maintain "object permanence" and contextual reasoning, smoothing out momentary flickers or occlusions that might otherwise confuse a single-frame detector.

Despite these architectural advancements, a profound epistemological challenge remains: the "Black Box" nature of deep learning. As highlighted by Wang et al. (2020) and Park and Yang (2019), CNNs function as opaque non-linear approximators. We can observe the input (pixel data) and the output (classification), but the internal logic, the millions of weight adjustments that lead the model to label a cluster of pixels as a "Red Light", remains largely inaccessible. This lack of transparency poses a significant hurdle for safety verification. Unlike white-box algorithms, where the decision tree is transparent and auditable, deep learning models rarely offer post-hoc explanations for their failures (Li et al., 2022). If a YOLO model fails to detect a red light in a snowstorm, it cannot tell engineers why it failed, whether it mistook the snow for noise or the light's edges were too blurred to trigger a filter. Therefore, without clear insight into the model's internal reasoning, the scientific community must rely on rigorous, empirical "stress testing" of inputs and outputs. We must treat the model's accuracy as the sole justification for its process, necessitating exhaustive benchmarks across every conceivable adverse condition.

This reliance on empirical benchmarking reveals the ultimate enemy of current computer vision: adverse weather. While the human eye is remarkably adaptable, utilizing context and biological high-dynamic-range capabilities to see through obscurants, computer vision systems are brittle when strictly reliant on RGB camera data. Furthermore, these environmental stressors often exacerbate the inherent limitations of the sensors themselves. In nighttime conditions, the dynamic range of standard cameras is tested by the glare of streetlamps and oncoming headlights, which can wash out the color of traffic signals. Conversely, in low-light scenarios, the sensor gain must be increased, introducing digital noise that the CNN may misinterpret as texture or objects. When these lighting challenges are combined with atmospheric particulates like fog or rain, the signal-to-noise ratio drops precipitously. Munir et al. (2025) suggest that standard YOLO models, while efficient, may lack the feature granularity required to distinguish small, distant traffic lights amidst this environmental noise. This has led to the proposal of feature fusion techniques and architectural enhancements that merge high-resolution spatial data from the shallow layers of the network with the rich semantic data of the deep layers. By fusing these features, the model may theoretically retain enough fine-grained detail to detect a traffic light's edges even when the semantic understanding is clouded by fog or rain.
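To make the idea concrete, the sketch below shows one common fusion pattern: a deep, semantically rich feature map is upsampled and concatenated with a shallow, high-resolution one before a final convolution blends the two. It is a minimal conceptual sketch written in PyTorch, not the mechanism used inside YOLOv8; the module name, channel counts, and feature-map sizes are illustrative assumptions.

```python
# Minimal sketch of shallow/deep feature fusion (FPN-style); channel counts
# and spatial sizes are illustrative assumptions, not YOLOv8's configuration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleFusionHead(nn.Module):
    def __init__(self, shallow_ch=128, deep_ch=512, out_ch=128):
        super().__init__()
        self.reduce = nn.Conv2d(deep_ch, out_ch, kernel_size=1)            # shrink deep channels
        self.blend = nn.Conv2d(shallow_ch + out_ch, out_ch, 3, padding=1)  # merge after concat

    def forward(self, shallow_feat, deep_feat):
        # Upsample the coarse, semantic map to the shallow map's resolution
        deep_up = F.interpolate(self.reduce(deep_feat),
                                size=shallow_feat.shape[-2:], mode="nearest")
        # Concatenate fine-grained edge/color detail with semantic context
        return self.blend(torch.cat([shallow_feat, deep_up], dim=1))

# Example: fuse an 80x80 shallow map with a 20x20 deep map (batch of 1)
shallow = torch.randn(1, 128, 80, 80)
deep = torch.randn(1, 512, 20, 20)
print(SimpleFusionHead()(shallow, deep).shape)  # torch.Size([1, 128, 80, 80])
```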
This study aims to empirically quantify the degradation of the YOLO object detection algorithm under these challenging real-world conditions. Moving beyond the "Black Box" limitation, we seek to map the external failure points of the model. Understanding exactly how different weather patterns disrupt computer vision allows engineers to build more robust safeguards. Until these blind spots in the AI's perception are mapped, quantified, and mitigated, the promise of a fully autonomous future remains suspended behind a veil of uncertainty. This research contributes to that necessary mapping, providing a benchmark for the reliability of upcoming advanced driving assistance systems (ADAS) and ensuring that the "eyes" of the future vehicle are sharp enough to navigate the storm.

2. Research Questions

This research aims to answer the following question: Does the YOLOv8 model perform worse in adverse weather (rain, snow, fog) in terms of Mean Average Precision (mAP) and inference speed? If so, which specific environmental condition (Rain, Snow, or Fog) causes the most severe drop in Mean Average Precision (mAP)?

Hypothesis

Null Hypothesis: There is no significant difference in the Mean Average Precision (mAP) of the YOLO model between clear, rainy, snowy, and foggy conditions.

Alternative Hypothesis: There is a significant difference in the Mean Average Precision (mAP) of the YOLO model between clear, rainy, snowy, and foggy conditions.

Predictions

Based on the optical limitations of standard RGB camera sensors and the architectural dependencies of Convolutional Neural Networks (CNNs), I predict that the YOLO model will demonstrate a statistically significant degradation in Mean Average Precision (mAP) across all adverse weather cohorts when compared to the baseline control (Clear Skies). This prediction is grounded in the understanding that YOLO models, like most CNNs, rely heavily on two specific visual features to identify objects: sharp edge gradients to define spatial boundaries and distinct chromatic signatures to classify the state of the light. Adverse weather systematically corrupts these features in unique ways.

3. Methodology

This study employs a quantitative experimental design to evaluate the robustness of the You Only Look Once (YOLO) object detection algorithm when subjected to environmental stressors. The methodology focuses on benchmarking the degradation of detection performance across strictly classified meteorological scenarios. See Figure 0 for a roadmap of the methodology.

Figure 0) Roadmap for methodology

3.1 Dataset Curation and Preprocessing

To ensure the model establishes a generalized understanding of traffic lights, a primary training dataset ("Dataset A") was aggregated from diverse open-source benchmarks (1,051 images publicly available online). This dataset contains images of traffic lights under nominal conditions to establish a baseline of "ideal" performance. Of these, 900 images were used for training and 151 images were used for validation.
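A minimal sketch of how such a split is typically wired into YOLOv8 training with the Ultralytics API is shown below; the directory layout, file names, and class indices are assumptions for illustration, while the class list, epoch count, and image size follow Sections 3.1 and 3.2.

```python
# Hypothetical sketch: dataset configuration and training call for Dataset A.
from pathlib import Path
from ultralytics import YOLO

# data.yaml describing the assumed layout of Dataset A (900 train / 151 val images)
data_yaml = """
path: datasets/traffic_lights   # assumed dataset root
train: images/train             # 900 clear-weather training images
val: images/val                 # 151 clear-weather validation images
names:
  0: red
  1: green
  2: yellow
  3: undefined
"""
Path("data.yaml").write_text(data_yaml)

# Train a YOLOv8 model with the settings described in Section 3.2
model = YOLO("yolov8n.pt")      # the specific YOLOv8 variant here is an assumption
model.train(data="data.yaml", epochs=100, imgsz=640)
```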
For the testing phase ("Dataset B"), a separate, strictly stratified dataset was curated. This dataset is divided into three distinct environmental cohorts to isolate the impact of weather variables:

● Cohort 1 (Rain): 99 images featuring rain streaks, wet road reflections, and droplet occlusion.
● Cohort 2 (Snow): 99 images featuring falling snowflakes and white-out conditions.
● Cohort 3 (Fog): 99 images featuring dense atmospheric scattering and low contrast.

All images were preprocessed to a standard resolution of 640 pixels. Data augmentation was applied only during the training phase to prevent overfitting; the testing cohorts remain unaugmented to represent authentic driving conditions.

3.2 Experimental Procedure

The YOLOv8 model was trained on the custom dataset at 640-pixel resolution over the course of 100 epochs. During training, the model's predictions on each image are compared against manually labelled data for every traffic light in that image. The manual labels record the colour of each light (red, green, yellow, or undefined). "Undefined" denotes a traffic light that is not clearly red, green, or yellow but still warrants its own classification; examples include traffic lights viewed from the side or back, or lights malfunctioning in a way that does not fit the three prior classes based on visual information alone. Afterwards, five pieces of information are derived from every traffic light label: the classification (red, green, yellow, or undefined), the x-axis position in the image, the y-axis position in the image, and the x-axis and y-axis dimensions (width and height) of the traffic light's bounding box.

The experiment measures the Mean Average Precision (mAP), which considers recall, precision, and intersection over union (IoU). Recall quantifies the model's ability to find all relevant targets: it measures the percentage of actual, ground-truth traffic lights in the dataset that were successfully detected by the model. Precision evaluates the accuracy of the model's positive predictions: it calculates the proportion of detected objects that were correctly classified (e.g., ensuring a prediction labeled "Red Light" is indeed a red light, rather than background noise or a different signal). Lastly, Intersection over Union (IoU) serves as the metric for localization accuracy. It calculates the ratio of the overlapping area to the combined area of the model's predicted bounding box and the manually annotated ground-truth box.
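The sketch below works through these label fields and the IoU calculation, assuming labels stored in the normalized YOLO text format (class, x-centre, y-centre, width, height); the example label values and the shifted prediction are hypothetical.

```python
# Hypothetical example of one YOLO-format label line and a basic IoU computation.

def parse_label(line):
    """Parse 'class x_center y_center width height' (coordinates normalized to [0, 1])."""
    cls, xc, yc, w, h = line.split()
    return int(cls), float(xc), float(yc), float(w), float(h)

def to_corners(xc, yc, w, h):
    """Convert a centre/size box to (x1, y1, x2, y2) corners."""
    return xc - w / 2, yc - h / 2, xc + w / 2, yc + h / 2

def iou(box_a, box_b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

# Ground truth: a red light (class 0) near the top of the frame (hypothetical values)
cls, xc, yc, w, h = parse_label("0 0.52 0.18 0.03 0.07")
gt = to_corners(xc, yc, w, h)
# A slightly shifted hypothetical prediction for the same light
pred = to_corners(0.525, 0.185, 0.03, 0.07)
print(f"IoU = {iou(gt, pred):.2f}")  # detections with IoU >= 0.5 count toward mAP@50
```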
4. Results

This section presents the quantitative performance of the YOLOv8 model across four environmental cohorts: a Control group (Base Model) and three adverse weather scenarios (Snow, Fog, and Rain). The analysis focuses on the degradation of Mean Average Precision (mAP@50), class-specific recall, and inference latency to evaluate the model's robustness.

4.1 Overall Performance Degradation

As hypothesized, the model demonstrated a statistically significant drop in detection accuracy when introduced to adverse weather conditions. As shown in Figure 1, under ideal conditions characterized by clear skies the Base Model achieved a Mean Average Precision (mAP@50) of 0.599, establishing the upper bound of the system's capabilities. However, the introduction of adverse weather variables precipitated a performance collapse exceeding 60% across all testing cohorts. As illustrated in the performance data, Snow proved to be the most manageable of the adverse conditions, maintaining a mAP of 0.217. In contrast, Rain recorded the lowest overall precision score of 0.129, falling even below Fog at 0.158. This hierarchy suggests a divergence in failure modes: while fog makes objects difficult to visually locate, rain introduces significant false positives through surface reflections, which severely penalizes the precision metric.

Figure 1) Mean Average Precision (mAP) comparison between different weather conditions

4.2 Possible Reasons for Degradation

Snow presents a dual threat of occlusion from falling flakes and "whitewashing," where accumulation lowers contrast and obscures the distinct edges of the traffic light housing. Similarly, fog alters the image physics through Mie scattering, acting as a "low-pass filter" that blurs edges and desaturates the distinct chromatic signatures required for detection. However, rain arguably presents the most severe and chaotic challenge for models like YOLO. Unlike conditions that simply obscure data, rain introduces active distortions: as seen in Figure 2, droplets on the lens refract light, while wet pavement creates mirror-like reflections that may generate "false positive" light sources, creating a complex environment of phantom signals that is significantly harder to interpret than passive obscuration.

Figure 2) Wet pavement creates mirror-like reflections

4.3 The Small Object Dilemma

A primary driver of the critically low mAP scores across all adverse categories is the distinct spatial character of the dataset. The label distribution analysis (Figure 3, bottom right) reveals that the vast majority of traffic lights occupy a negligible pixel area relative to the full image frame, often constituting less than a few percent of the total visual data. For Convolutional Neural Networks (CNNs) like YOLO, this presents a "vanishing feature" problem. In clear skies, the sharp, high-contrast edges of the traffic light housing allow the model to retain these few pixels through the down-sampling layers. However, when rain streaks or atmospheric scattering from fog corrupt these already scarce pixels, the feature signature is effectively erased. The model becomes unable to distinguish the signal of the traffic light from the visual noise of the weather, leading to the severe drops in accuracy observed in the results.

Figure 3) Top left - Frequencies of different classifications; Bottom left - Relative positions of traffic lights in the training dataset; Bottom right - Relative dimensions of traffic lights in the training dataset

4.4 Class-Specific Analysis and Recall Patterns

The class-specific heatmap (Figure 4) and the recall comparison reveal distinct failure patterns between the different weather types. Across all conditions, the model consistently detected Green Lights (0.389 mAP in Snow) with higher accuracy than Red Lights (0.320 mAP in Snow). This discrepancy may be attributed to the typically higher luminosity of green LEDs or to dataset bias, as the instance counts differed slightly between the classes. A near-total failure was observed in the detection of Yellow Lights, which dropped to a mAP of 0.018 in snowy conditions. This failure is directly correlated with the severe data imbalance identified in the dataset, where Yellow Lights comprised only 58 instances compared to over 600 for Red Lights. Consequently, the model lacked sufficient training examples to learn to separate the features of a yellow light from similar-colored environmental noise, such as streetlamps or white snow.
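This kind of imbalance is straightforward to audit before training. The sketch below, which assumes labels in YOLO text format under a hypothetical labels/train directory, produces the per-class instance counts of the kind summarized in the top-left panel of Figure 3.

```python
# Hypothetical sketch: count labelled instances per class to audit dataset imbalance.
from collections import Counter
from pathlib import Path

CLASS_NAMES = {0: "red", 1: "green", 2: "yellow", 3: "undefined"}  # assumed class ids
counts = Counter()

# Assumed location of the YOLO-format label files (one .txt file per training image)
for label_file in Path("datasets/traffic_lights/labels/train").glob("*.txt"):
    for line in label_file.read_text().splitlines():
        if line.strip():
            counts[int(line.split()[0])] += 1  # first field of each row is the class id

for class_id, n in sorted(counts.items()):
    print(f"{CLASS_NAMES.get(class_id, class_id)}: {n} instances")
```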
Crucially, the recall data highlights the specific optical nature of fog. While Rain resulted in the lowest overall precision, Fog resulted in the lowest recall score of 0.184, as seen in Figure 5. This confirms the hypothesis that fog acts as a low-pass blur filter, causing the model to miss the object entirely (false negatives), whereas rain allows the object to be seen but confuses the classification with visual noise.

Figure 4) Class-specific performance heatmap for mAP

Figure 5) Recall comparison between different weather conditions

4.5 Inference Latency

Despite the significant drop in accuracy, the model maintained real-time efficiency suitable for autonomous driving. As shown in Figure 6, inference times remained negligible across all cohorts, recording 4.0 ms for Snow, 5.0 ms for Fog, and 5.6 ms for Rain. The slight increase in latency for Rain and Fog suggests that the Non-Maximum Suppression post-processing step required additional time to filter through a higher volume of candidate boxes, likely caused by false positives from reflections or noise artifacts.

Figure 6) Inference speed between different weather conditions

5. Discussion

The primary objective of this study was to evaluate the robustness of YOLO-based traffic light detection under varying meteorological conditions. While performance drops were anticipated across all adverse weather scenarios, the results indicate a hierarchy of difficulty. Rain proved to be the most detrimental condition for model accuracy, resulting in a 0.470 drop in Mean Average Precision (mAP) compared to the clear-weather baseline. This degradation exceeded the losses observed in the snow (0.382 drop) and fog (0.441 drop) scenarios relative to the same baseline.

The distinct performance gap between rain and the other conditions can be attributed to the nature of the visual noise introduced. Snow and fog primarily act as passive obstacles. As observed, fog functions as a low-pass filter, reducing high-frequency details, while snow introduces mechanical occlusion. In these cases, the model typically fails via false negatives; it simply cannot see the traffic light. Rain, conversely, introduces active interference. The wet pavement creates mirror-like surface reflections that mimic the chromatic properties of traffic lights. Simultaneously, droplets on the lens refract incoming light. These phenomena generate high-confidence false positives, where the model incorrectly identifies a reflection on the road or a flare on the lens as a traffic signal. For an autonomous vehicle, a false positive (detecting a green light where there is none) is arguably more hazardous than a false negative (failing to detect a light and defaulting to a safety stop). This suggests that YOLO models, which rely heavily on spatial consistency, are particularly vulnerable to the geometric distortions unique to rainy environments.

It is necessary to acknowledge the limitations of this study. First, the dataset relied on publicly available, free online images, which may lack the image quality that would enhance the training process. Second, the rain condition varies widely in reality, from light drizzle to torrential downpour; our study grouped these into a single category, potentially masking the specific threshold at which detection fails. Finally, the study utilized a YOLO model trained on a custom dataset with standard data augmentation; specific weather-based augmentation techniques (such as Generative Adversarial Networks to simulate rain) were not applied, which might have mitigated the observed losses.
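To illustrate what even a basic form of weather-based augmentation could look like, the sketch below alpha-blends a grey haze over a training image to approximate fog; it is a deliberately crude stand-in for the GAN-based simulation mentioned above, and the file paths are hypothetical placeholders.

```python
# Hypothetical sketch of a simple synthetic fog augmentation (alpha-blended haze).
# A crude approximation for illustration only, not a GAN-based weather simulator.
import cv2
import numpy as np

def add_fog(image, intensity=0.4):
    """Blend a uniform light-grey haze into the image; intensity in [0, 1]."""
    haze = np.full_like(image, 220)  # light-grey haze layer, same shape as the image
    return cv2.addWeighted(image, 1.0 - intensity, haze, intensity, 0)

image = cv2.imread("datasets/traffic_lights/images/train/example.jpg")  # placeholder path
if image is not None:
    cv2.imwrite("example_foggy.jpg", add_fog(image))
```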
Future work should prioritize addressing the false-positive paradox caused by rain reflections. One promising avenue is the integration of polarizing filters or post-processing algorithms designed specifically to suppress specular highlights on wet roads. Additionally, moving beyond unimodal reliance on RGB cameras to a sensor-fusion approach incorporating thermal imaging or radar could provide the redundancy necessary to distinguish a physical traffic light from its optical reflection on wet pavement.

6. Conclusion

This study provides a critical assessment of the limitations inherent in current computer vision systems for autonomous vehicles. The results strongly support the hypothesis that the YOLO model's detection capabilities are significantly compromised by adverse weather conditions, challenging the assumption of "all-weather" autonomy. The hierarchy of difficulty for precision placed Rain as the most difficult condition, followed by Fog and then Snow; however, for pure detection (recall), Fog proved to be the most challenging environment. While the model demonstrated high precision in the Control (Clear Sky) scenario, a statistically significant degradation in Mean Average Precision (mAP) was observed across all adverse conditions.

The hierarchy of difficulty identified in this study reveals that while fog and snow introduce noise that lowers confidence, rain represents the most critical failure point. These findings suggest that relying solely on camera-based YOLO models is insufficient for autonomous driving in variable climates. The study concludes that visual data must be augmented with sensor fusion technologies, such as LiDAR or radar, which are less susceptible to optical interference, to ensure safety redundancy. Future research should also focus on "de-hazing" preprocessing algorithms that can artificially restore contrast to video feeds before they reach the object detection network.

References

Azam, S., Montaha, S., Fahim, K. U., Rafid, A. K. M. R. H., Mukta, M. S. H., & Jonkman, M. (2023). Using feature maps to unpack the CNN "Black box" theory with two medical datasets of different modality. Intelligent Systems with Applications, 18, 200233. https://doi.org/10.1016/j.iswa.2023.200233

Bu, Y., Ye, H., Tie, Z., Chen, Y., & Zhang, D. (2024). OD-YOLO: Robust small object detection model in remote sensing image with a novel multi-scale feature fusion. Sensors, 24(11), 3596. https://doi.org/10.3390/s24113596

Das, S., Tariq, A., Santos, T., et al. (2023). Recurrent Neural Networks (RNNs): Architectures, training tricks, and introduction to influential research. In O. Colliot (Ed.), Machine Learning for Brain Disorders (Chapter 4). Humana. https://doi.org/10.1007/978-1-0716-3195-9_4

Kampffmeyer, M., Salberg, A.-B., & Jenssen, R. (2016). Semantic segmentation of small objects and modeling of uncertainty in urban remote sensing images using deep convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops (pp. 1-9). https://www.cv-foundation.org/openaccess/content_cvpr_2016_workshops/w19/html/Kampffmeyer_Semantic_Segmentation_of_CVPR_2016_paper.html

Kong, Y., Wang, Z., Nie, Y., Zhou, T., Zohren, S., Liang, Y., Sun, P., & Wen, Q. (2024). Unlocking the power of LSTM for long term time series forecasting. arXiv. https://arxiv.org/html/2408.10006v1
Li, S., Zhao, X., Stankovic, L., & Mandic, D. (2022). Demystifying CNNs for images by matched filters (pp. 1-10). https://doi.org/10.1016/j.ymben.2022.05.007

Meyer, J., Becker, H., Bösch, P. M., & Axhausen, K. W. (2017). Autonomous vehicles: The next jump in accessibilities? Research in Transportation Economics, 62, 80-91. https://doi.org/10.1016/j.retrec.2017.03.005

Mungoli, N. (2023). Adaptive ensemble learning: Boosting model performance through intelligent feature fusion in deep neural networks. arXiv. https://doi.org/10.48550/arXiv.2304.02653

Park, Y., & Yang, H. S. (2019). Convolutional neural network based on an extreme learning machine for image classification. Neurocomputing, 339, 66-76. https://doi.org/10.1016/j.neucom.2018.12.080

Pettigrew, S. (2017). Why public health should embrace the autonomous car. Australian and New Zealand Journal of Public Health, 41(1), 5-7. https://doi.org/10.1111/1753-6405.12588

Shariff, A., Bonnefon, J.-F., & Rahwan, I. (2017). Psychological roadblocks to the adoption of self-driving vehicles. Nature Human Behaviour, 1(10), 694-696. https://doi.org/10.1038/s41562-017-0202-6

Wang, B., Ma, R., Kuang, J., & Zhang, Y. (2020). How decisions are made in brains: Unpack "Black Box" of CNN with Ms. Pac-Man video game. IEEE Access, 8, 142446-142458. https://doi.org/10.1109/ACCESS.2020.3013645

Yamashita, R., Nishio, M., Do, R. K. G., & Togashi, K. (2018). Convolutional neural networks: An overview and application in radiology. Insights into Imaging, 9(4), 611-629. https://doi.org/10.1007/s13244-018-0639-9

Zhang, N., Chen, Y., & Wang, J. (2010). Image parallel processing based on GPU. In 2010 2nd International Conference on Advanced Computer Control (pp. 367-370). https://doi.org/10.1109/ICACC.2010.5486836

Zhou, Q., Zhang, D., Liu, H., & He, Y. (2024). KCS-YOLO: An improved algorithm for traffic light detection under low visibility conditions. Machines, 12(8), 557. https://doi.org/10.3390/machines12080557