This project applies Supervised Machine Learning (Regression) to predict the total minutes of delay for airline flights. By analyzing historical flight data, we identify key drivers of delays, specifically focusing on flight volume and airline carriers.
The model reaches an R-Squared of about 0.95 by using One-Hot Encoding to handle categorical airline data and Linear Regression to quantify the relationship between flight traffic and delay time.
- Data Cleaning: Handled missing values and filtered dataset to relevant operational metrics.
- Feature Engineering: Applied One-Hot Encoding to convert categorical carrier text data (e.g., "Southwest", "Delta") into numerical format for the model.
- Predictive Modeling: Built a Linear Regression model to predict continuous delay times.
- Performance Analysis: Evaluated model using R-Squared and Mean Absolute Error (MAE).
- Business Intelligence: Visualized seasonality, airport bottlenecks, and airline efficiency metrics.
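
The encoding and modeling steps above can be sketched roughly as follows. This is a minimal illustration on synthetic data, not the project's actual notebook: the column names `carrier`, `arr_flights`, and `arr_delay`, the "United" carrier, the per-carrier offsets, and the 75/25 train/test split are all assumptions made for the example.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Tiny synthetic stand-in for the cleaned dataset (column names are assumed).
df = pd.DataFrame({
    "carrier": ["Southwest", "Delta", "United"] * 8,
    "arr_flights": [100 + 25 * i for i in range(24)],
})
# Synthetic target: delay grows with volume, plus a fixed per-carrier offset.
offset = df["carrier"].map({"Southwest": 500.0, "Delta": 0.0, "United": 250.0})
df["arr_delay"] = 12.0 * df["arr_flights"] + offset

# One-Hot Encoding: one 0/1 column per carrier, alongside the numeric volume.
X = pd.get_dummies(df[["arr_flights", "carrier"]], columns=["carrier"], dtype=float)
y = df["arr_delay"]

# Hold out a quarter of the rows so the metrics are computed on unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)

r2 = r2_score(y_test, pred)
mae = mean_absolute_error(y_test, pred)
print(f"R-squared: {r2:.3f}, MAE: {mae:.1f} minutes")
```

Because the synthetic target is exactly linear in the encoded features, the fit here is near perfect; real data will score lower.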
We analyzed average delay minutes across all 12 months to find the best travel windows.
- Summer Peak (June/July): A sharp spike in delays occurs here, likely driven by high vacation travel volume and summer thunderstorms.
- Holiday Surge (December): A secondary peak occurs in December, correlating with winter holiday travel and snowstorms.
- The "Golden Window" (Sept/Nov): The data shows a significant dip in delays during autumn, making this the most efficient time of year to fly.
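
A monthly seasonality view like the one described above could be computed with a simple group-by. The sketch below uses made-up numbers and assumed column names (`month`, `arr_delay`); it only illustrates the aggregation, not the project's actual figures.

```python
import pandas as pd

# Synthetic rows; `month` and `arr_delay` are assumed column names.
df = pd.DataFrame({
    "month": [6, 6, 7, 9, 9, 12],
    "arr_delay": [900.0, 1100.0, 1200.0, 300.0, 500.0, 800.0],
})

# Average delay minutes per calendar month, in month order.
monthly_avg = df.groupby("month")["arr_delay"].mean().sort_index()

best_month = monthly_avg.idxmin()   # month with the lowest average delay
worst_month = monthly_avg.idxmax()  # month with the highest average delay
print(monthly_avg)
```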
We identified the top 10 airports contributing the most to national delays.
- The Hub Effect: Major hubs like Atlanta (ATL), Chicago (ORD), and Dallas (DFW) consistently rank highest due to massive traffic volume.
- Geographic Factors: Coastal hubs like San Francisco (SFO) often appear due to weather volatility (fog) and congested airspace.
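
One way to produce such a ranking is to sum delay minutes per airport and keep the largest totals. The values and the column names (`airport`, `arr_delay`) below are illustrative assumptions, not data from the project.

```python
import pandas as pd

# Synthetic rows; `airport` and `arr_delay` are assumed column names.
df = pd.DataFrame({
    "airport": ["ATL", "ORD", "ATL", "SFO", "DFW", "ORD"],
    "arr_delay": [5000.0, 4000.0, 3000.0, 1500.0, 3500.0, 2500.0],
})

# Total delay minutes per airport, keeping at most the 10 largest.
top_airports = df.groupby("airport")["arr_delay"].sum().nlargest(10)
print(top_airports)
```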
Total delay minutes can be misleading because large airlines naturally have more delays. We normalized the data to calculate Average Delay Minutes Per Flight.
- Result: This metric reveals which airlines are truly inefficient on a per-flight basis, separating those that simply fly often from those that manage operations poorly.
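
The normalization can be sketched as total delay minutes divided by total flights for each carrier. The carriers "Big Air" and "Small Air" are hypothetical, as are the column names `carrier`, `arr_flights`, and `arr_delay`; the example is built so that the larger airline has more total delay but the smaller one is worse per flight.

```python
import pandas as pd

# Hypothetical carriers and assumed column names, for illustration only.
df = pd.DataFrame({
    "carrier": ["Big Air", "Big Air", "Small Air"],
    "arr_flights": [1000, 1000, 100],
    "arr_delay": [10000.0, 10000.0, 3000.0],
})

# Sum first, then divide, so each flight carries equal weight.
totals = df.groupby("carrier")[["arr_flights", "arr_delay"]].sum()
totals["delay_per_flight"] = totals["arr_delay"] / totals["arr_flights"]

# Worst per-flight performers first.
ranked = totals.sort_values("delay_per_flight", ascending=False)
print(ranked)
```

Summing before dividing (rather than averaging per-row ratios) avoids giving low-traffic rows the same weight as high-traffic ones.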
The primary model (including Carrier data) demonstrates an exceptionally strong fit.
| Metric | Score | Interpretation |
|---|---|---|
| R-Squared (R²) | 0.95 (95.3%) | The model explains 95% of the variance in flight delays. This confirms a strong linear relationship: More Flights + Specific Carriers = More Delays. |
| Mean Absolute Error (MAE) | ~951 minutes | On a monthly aggregate scale where delays often exceed 100,000 minutes, this represents an error rate of <1%. |
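
For reference, the two metrics in the table can be written out by hand. The sketch below implements the standard definitions with NumPy and reproduces the arithmetic behind the "<1%" remark; the sample `y_true` and `y_pred` values are invented for illustration and are not the project's predictions.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Share of variance explained: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def mean_absolute_error(y_true, y_pred):
    """Average absolute gap between actual and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))

# Invented monthly totals (minutes), purely to exercise the functions.
y_true = [100_000, 120_000, 90_000]
y_pred = [101_000, 119_000, 90_500]
r2 = r_squared(y_true, y_pred)
mae = mean_absolute_error(y_true, y_pred)

# The reported MAE of ~951 minutes against a 100,000-minute scale.
relative_error = 951 / 100_000
print(f"R-squared: {r2:.4f}, MAE: {mae:.1f}, relative error: {relative_error:.2%}")
```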
To validate our choice of a Linear Model, we tested the hypothesis that flight delays grow faster than linearly with traffic (e.g., congestion collapse). We ran a separate test comparing a Standard Linear Model against a Polynomial Model (Degree 2) using only flight volume data.
| Model | R-Squared | Verdict |
|---|---|---|
| Linear (Volume Only) | 0.64 | Winner. The relationship implies strictly linear growth. |
| Polynomial (Degree 2) | 0.60 | Loser. Adding complexity introduced noise and reduced accuracy. |
Critical Findings:
- Linearity: Delays accumulate steadily. Adding curves (polynomials) did not improve the prediction.
- The "Carrier Effect": When we removed Carrier data for this specific test, R-Squared dropped from 0.95 to 0.64. This suggests that roughly 30% of delay variance is explained by who is flying (operational efficiency), not just how much they are flying.
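
A comparison of this kind might look like the sketch below, which fits a straight line and a degree-2 polynomial to the same volume-only feature and scores both on held-out rows. The data is synthetic, and the held-out split, the random seed, and the noise level are assumptions; the source does not state how its R-Squared values were computed.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic volume-only data: a linear trend plus noise (assumed, not real).
rng = np.random.default_rng(0)
flights = rng.uniform(100, 1000, size=200)
delay = 12.0 * flights + rng.normal(0, 1500, size=200)

X = flights.reshape(-1, 1)
X_train, X_test, y_train, y_test = train_test_split(
    X, delay, test_size=0.25, random_state=0
)

# Model 1: plain linear regression on flight volume.
linear = LinearRegression().fit(X_train, y_train)

# Model 2: same regression after adding a squared volume term.
poly = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    LinearRegression(),
).fit(X_train, y_train)

linear_r2 = r2_score(y_test, linear.predict(X_test))
poly_r2 = r2_score(y_test, poly.predict(X_test))
print(f"Linear R-squared: {linear_r2:.3f}")
print(f"Polynomial (degree 2) R-squared: {poly_r2:.3f}")
```

Scoring on held-out rows matters here: on the training data itself, a degree-2 model can never score below the linear one, so a lower polynomial score only shows up when the extra term fits noise that does not generalize.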
- Python: Core programming language.
- Pandas: For data manipulation and aggregation.
- Scikit-Learn: For training the Linear Regression model.
- Seaborn / Matplotlib: For visualizing delay distributions and correlations.
- Clone the repository: `git clone https://github.com/nellocoder/flight-delay-prediction.git`