Skip to content

aav-antonov/DEA

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

75 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Data Envelopment Analysis (DEA) Computational Library

This repository provides a Python library for applying Data Envelopment Analysis (DEA) to real-world data. It includes several classes to compute various tasks related to DEA:

  1. Dea()
    File: libDEA/dea_instance.py
    Solves a standard single DEA instance.

  2. DeaMultiprocessing()
    File: libDEA/dea_multiprocessing.py
    Solves multiple DEA instances in parallel using multiprocessing.

  3. DeaLargeScale()
    File: libDEA/dea_largescale.py
    Optimizes performance for large-scale cases.

  4. DeaProfile()
    File: libDEA/dea_profile.py
    Visualizes the efficiency surface.

The package uses the linear solver from the ortools library. It is free to use, provided you comply with the terms and conditions of ortools.

For more details, see my PhD-related paper: DEA.pdf

Installation

git clone --branch  main https://github.com/aav-antonov/DEA.git
cd DEA
pip install . e

Usage

Below snippets of code how you can use libDEA explaining also input data:

#imports
from libDEA.dea_multiprocessing import DeaMultiprocessing
from libDEA.dea_largescale import DeaLargeScale
from libDEA.dea_profile import DeaProfile

#input X and Y
# generate 100 random DMUs with 3 inputs (what DMU is consuming) and 2 ouputs (what DMU is producing): 
X = np.random.uniform(0, 10, size=(3, 100))
Y = np.random.uniform(0, 10, size=(2, 100))



#set up DeaMultiprocessing()
DEAMP = DeaMultiprocessing()
DEAMP.set_DEA(X, Y, q_type="x")
q = DEAMP.run(X, Y, q_type="x")

# q is of size 100 - efficiency (0 <= q <=1) for each DMU (X,Y)

#set up  DeaLargeScale() more computationally efficient , can handle cases with > 10 000 DMUs  
DEALS = DeaLargeScale()
q = DEALS.run(X, Y, q_type="x")



Test DeaLargeScale()

In the DEA folder, you can benchmark and test the performance and correctness of the two DEA implementations by running:

python test_benchmark.py

This script benchmarks and tests both accuracy and computational efficiency for:

  • DeaMultiprocessing: Base method that computes efficiency for each unit directly using multiprocessing.
  • DeaLargeScale: Optimized version designed for large-scale data and improved performance.

Random datasets of varying sizes are generated. Both methods are executed, results are compared for accuracy, and computation time is measured.

All tests were run on a machine with 4 CPU cores.

Execution Time Summary

m fX_k fY_k DeaMultiprocessing (s) DeaLargeScale (s)
250 5 3 5.6306 5.8992
500 5 3 21.8980 14.5271
1000 5 3 86.2211 37.7303
2000 5 3 351.1777 105.9509
4000 5 3 1500.0* 240.7928
8000 5 3 6200.0* 607.6985
  • * Extrapolated values for DeaMultiprocessing (based on observed scaling from smaller dataset runs).

Data Envelopment Analysis (DEA)

DEA evaluates the relative efficiency of a set of decision-making units (DMUs) by analyzing their input/output combinations. Each DMU is represented by a vector of inputs $x$ and outputs $y$. For multiple DMUs, inputs and outputs are organized into matrices $X$ and $Y$.

The classical input-oriented DEA efficiency score for a DMU $o$ (where $o = 1, \ldots, n$) is computed by solving the following linear program:

$$ \begin{align*} \text{Minimize}\quad & \theta_o \\\ \text{Subject to}\quad & \sum_{j=1}^n \lambda_j x_{ij} \leq \theta x_{io},\quad \forall i = 1, \ldots, m \\\ & \sum_{j=1}^n \lambda_j y_{rj} \geq y_{ro},\quad \forall r = 1, \ldots, s \\\ & \lambda_j \geq 0, \quad \forall j = 1, \ldots, n \end{align*} $$

where:

  • $x_{ij}$: the $i$-th input for DMU $j$,
  • $y_{rj}$: the $r$-th output for DMU $j$,
  • $m$: number of input variables,
  • $s$: number of output variables,
  • $n$: number of DMUs,
  • $\lambda_j$: weights for constructing a reference DMU,
  • $\theta_o$: efficiency score for DMU $o$ ($\theta \leq 1$; $\theta = 1$ means efficient).

The solution $\theta_o$ is the efficiency of DMU $o$. A DMU is considered efficient if $\theta_o = 1$, and inefficient if $\theta_o &lt; 1$ compared to the rest of the dataset.

Complexity of the Problem

Given input and output matrices $X$ and $Y$, the above linear program must be solved separately for each DMU. The computational time required to solve these problems commonly grows super-quadratically with the number of DMUs ($n$), because both the number of linear programs and the size of each program (determined by the number of DMUs and variables) increase as $n$ increases. Specifically, the size of each individual linear program grows with $n$ (the number of DMUs appears in both constraints and variables), so the total computational time in average case increases faster than $O(n^2)$—often approaching cubic growth for large datasets.

As a result, evaluating DEA efficiency for datasets with more than 1,000 DMUs, even with a moderate number of inputs (e.g., $m = 5$) and outputs (e.g., $s = 3$), can require hours of computation—even when using parallel processing on multiple CPU cores.

Improving Computational Performance

A standard way to reduce computational time in DEA is to exploit the fact that the efficiency of each DMU can be determined using only the set of efficient DMUs, rather than the full matrices $X$ and $Y$. In practice, for fixed input and output dimensions, the number of efficient DMUs tends to saturate as the dataset grows—adding more DMUs usually results in only a few additional efficient units. Therefore, a common strategy is to first identify the efficient set (referred to as the full_base), which is typically much smaller than the total number of DMUs, and then compute the efficiency of all other DMUs using only this set. The DeaLargeScale class implements this strategy to achieve significant computational improvements.

Steps in DeaLargeScale

Base Candidate Selection via Ratios
Calculate efficiency-related ratios for each column (DMU) and select preliminary candidates.
Let the candidate set be denoted as $B_0$.

Base Extension (Addbase)
For each DMU in $B_0$, solve the DEA model (e.g., linear programming) using the reference sets $X[B_0]$ and $Y[B_0]$. Retain those DMUs whose computed efficiency scores are greater than or equal to $1$.
Let this refined set be denoted as $B_1$.

Base Refinement (Rebase)
For each DMU in $B_1$, solve DEA using $X[B_1]$ and $Y[B_1]$ as reference sets, and retain only the efficient DMUs, yielding set $B_2$. At this stage, $B_2$ should contain the complete subset of efficient DMUs from the original $X$ and $Y$. This step ensures removal of any inefficient DMUs from $B_1$.

Final Compute
For each DMU in the original matrices $X$ and $Y$, solve DEA using the reference sets $X[B_2]$ and $Y[B_2]$. Thus, we still solve $n$ linear programs (one per DMU), but the size of each problem is significantly reduced compared to using the full set.

Visualizing Multi-Dimensional Efficient Frontier with libDEA

Data Envelopment Analysis (DEA) is a powerful methodology for constructing production functions based on empirical observations of Decision Making Units (DMUs) performance. The libDEA library provides specialized tools to visualize the efficient frontier through various projections.

Understanding Efficient Frontier Visualization

The efficient frontier represents the optimal performance boundary where DMUs operate at maximum efficiency. Since production processes often involve multiple inputs and outputs, libDEA offers two primary visualization approaches:

  • Output-Input Visualization (y-x profile): Shows the relationship between a specific output and input.
  • Input-Input Visualization (x-x profile): Compares two different inputs while holding outputs constant.

Test DeaProfile()

In the DEA folder, you can run test script to see plots produced by DeaProfile() class:

python test_profile.py

Usage:


from libDEA.dea_profile import DeaProfile

#input X and Y
# generate 100 random DMUs with 3 inputs (what DMU is consuming) and 2 ouputs (what DMU is producing): 
X = np.random.uniform(0, 10, size=(3, 100))
Y = np.random.uniform(0, 10, size=(2, 100))

#set up DeaProfile() to visualise a given DMU and Efficient frontier (Production function)
DP = DeaProfile()
DP.get_base(X, Y, q_type="x")

dmu_index = 1
print(f"Selecting DMU with index {dmu_index} for profiling")

x, y = X[:, dmu_index], Y[:, dmu_index]   

# Example of plotting y(x) profile
DP.get_yx_profile(x, y, file_output="plots/plot_yx.png")

# Example of plotting x(x) profile for different input pairs
DP.get_xx_profile(x, y, 0, 1, file_output="plot_xx")  # see plot_xx_0_1.png
DP.get_xx_profile(x, y, 1, 2, file_output="plot_xx")  # see plot_xx_1_2.png
DP.get_xx_profile(x, y, 0, 2, file_output="plot_xx")  # see plot_xx_0_2.png

The output plots would lool something like this:

y(x) plots

y(x) profile

xx plots

x0x1 profile x1x2 profile x0x2 profile

About

Implementation of Data Envelopment Analysis for Large-Scale Computations: This project demonstrates large-scale linear programming modeling.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages