Sparse data structures in Python
Imagine you have a 2-D matrix with hundreds of millions of elements, where only a few of them are non-zero. Storing such a matrix with a conventional dense approach wastes a lot of space on zeros.
Sparse data structures allow us to store only the non-zero values, assuming the rest are zeros. This approach saves a lot of memory and computing time. You often encounter such matrices in NLP and machine learning tasks.
In Python, sparse data structures are implemented in the scipy.sparse module, which is mostly based on regular NumPy arrays.
>>> import numpy as np
>>> from scipy.sparse import csc_matrix
>>> csc_matrix((3, 4), dtype=np.int8).toarray()
array([[0, 0, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 0]], dtype=int8)
Let's create a random sparse matrix and compare its size to an identical regular one:
from scipy.sparse import random
def get_sparse_size(matrix):
    # get the size of a sparse matrix in KiB
    return int((matrix.data.nbytes + matrix.indptr.nbytes + matrix.indices.nbytes) / 1024.)
# create a sparse matrix, 1000 x 100000
sparse_mat = random(10 ** 3, 10 ** 5, format='csr')
# get size of a sparse matrix
sparse_size = get_sparse_size(sparse_mat)
# convert sparse matrix to a regular matrix and get its size
regular_size = sparse_mat.toarray().nbytes / 1024.
>>> print("The size of sparse matrix is %s KiB" % sparse_size)
The size of sparse matrix is 11722 KiB
>>> print("The size of regular matrix is %s KiB" % regular_size)
The size of regular matrix is 781250.0 KiB
>>> print("Data compression ratio is %s" % (regular_size / sparse_size))
Data compression ratio is 66.6481829039413
Sparse matrix types in scipy
There are many ways to represent a sparse matrix; SciPy provides seven of them:
- Block Sparse Row matrix (BSR)
- Coordinate list matrix (COO)
- Compressed Sparse Column matrix (CSC)
- Compressed Sparse Row matrix (CSR)
- Sparse matrix with DIAgonal storage (DIA)
- Dictionary Of Keys based sparse matrix (DOK)
- Row-based linked list sparse matrix (LIL)
Each format has its pros and cons, so it is important to know the differences between them.
Dictionary of keys (DOK)
Dictionary of keys (dok_matrix in scipy) is the easiest way to implement a sparse matrix. As the name suggests, it's based on a dictionary, in which the keys are tuples representing indices, i.e. tuple(row, column).
>>> import numpy as np
>>> from scipy.sparse import random
>>> np.random.seed(10)
>>>
>>> matrix = random(3, 3, format='dok', density=0.4)
>>> matrix[1, 1] = 33
>>> # dict[tuple(row, column)] = value
... matrix[(2, 1)] = 10
>>>
>>> matrix.toarray()
array([[ 0.49850701,  0.        ,  0.        ],
       [ 0.22479665, 33.        ,  0.        ],
       [ 0.        , 10.        ,  0.        ]])
>>> dict(matrix)
{(1, 0): 0.22479664553084766, (0, 0): 0.49850701230259042, (1, 1): 33.0, (2, 1): 10.0}
Advantages:
- efficient access to individual items (O(1) on average)
- fast construction
- flexible structure
- can be efficiently converted to COO format (see the sketch below)
Disadvantages:
- very slow iteration in lexicographical order (due to the random order of keys)
- slow arithmetic
- slow slicing
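For illustration, a minimal sketch of the typical DOK workflow: build the matrix cell by cell, then convert it once construction is done (the shape and values here are arbitrary):
import numpy as np
from scipy.sparse import dok_matrix

# incremental construction: every assignment is just a dict insert
matrix = dok_matrix((5, 5), dtype=np.float64)
matrix[0, 0] = 1.0
matrix[2, 3] = 7.5
matrix[4, 1] = -2.0

# convert once construction is finished
coo = matrix.tocoo()
csr = matrix.tocsr()
print(coo.row, coo.col, coo.data)
print(csr.toarray())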
List of list (LIL)
A row-based format (lil_matrix in scipy), which uses two NumPy arrays with regular Python lists inside them. The rows array stores the column indices of occupied cells, whereas the data array stores the corresponding values.
>>> import numpy as np
>>> from scipy.sparse import random
>>>
>>> np.random.seed(10)
>>>
>>> matrix = random(3, 3, format='lil', density=0.6)
>>> matrix.toarray()
array([[ 0. , 0.08833981, 0.16911084],
[ 0. , 0. , 0. ],
[ 0.19806286, 0.76053071, 0.22479665]])
>>> matrix.data
array([[0.08833981417401027, 0.16911083656253545], [],
       [0.19806286475962398, 0.7605307121989587, 0.22479664553084766]], dtype=object)
>>> matrix.rows
array([[1, 2], [], [0, 1, 2]], dtype=object)
A simplified algorithm of retrieving a single item:
def get_item(row_index, column_index, matrix):
    # values and column indices stored for this row
    row_values = matrix.data[row_index]
    row_indices = matrix.rows[row_index]
    if column_index in row_indices:
        # position of the requested column inside the row's list
        value_index = row_indices.index(column_index)
        return row_values[value_index]
    else:
        # the cell is not stored, so it is zero
        return 0

print(get_item(2, 2, matrix))   # 0.22479664553084766
print(matrix[2, 2])             # the same lookup via scipy's indexing
Advantages:
- fast incremental construction (see the sketch after this list)
Disadvantages:
- slow arithmetic
- slow column slicing
- memory greedy
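As a sketch of that "fast incremental construction" point, the usual LIL pattern is to fill the matrix row by row and then switch to CSR for the actual computation (the shape and values here are arbitrary):
import numpy as np
from scipy.sparse import lil_matrix

# incremental construction: LIL appends into per-row Python lists
matrix = lil_matrix((4, 6))
matrix[0, 1] = 3.0
matrix[2, 0] = 1.5
matrix[2, 4] = 2.5
matrix[3, 5] = -1.0

print(matrix.rows)   # per-row lists of occupied column indices
print(matrix.data)   # per-row lists of the corresponding values

# switch to CSR before doing heavy arithmetic or slicing
csr = matrix.tocsr()
print(csr.toarray())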
Coordinate list (COO)
In scipy, the COO format (coo_matrix) uses three arrays: for every non-zero value, there is an entry in each of them.
>>> import numpy as np
>>> from scipy.sparse import random
>>> np.random.seed(11)
>>>
>>> matrix = random(3, 3, format='coo', density=0.35)
>>>
>>> matrix.toarray()
array([[ 0.72493393, 0. , 0. ],
[ 0.4202036 , 0. , 0.4854271 ],
[ 0. , 0. , 0. ]])
The data array stores all non-zero values, whereas row and col store the corresponding indices for these values.
data | row | col |
---|---|---|
0.72493393 | 0 | 0 |
0.4202036 | 1 | 0 |
0.4854271 | 1 | 2 |
To find a specific value in the matrix, you need to iterate over both index arrays, which makes access slow compared to other formats.
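The same matrix can be built directly from those three arrays by passing a (data, (row, col)) triple to coo_matrix; a minimal sketch using the values from the table above:
import numpy as np
from scipy.sparse import coo_matrix

# the triplets from the table above
data = np.array([0.72493393, 0.4202036, 0.4854271])
row = np.array([0, 1, 1])
col = np.array([0, 0, 2])

matrix = coo_matrix((data, (row, col)), shape=(3, 3))
print(matrix.toarray())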
Advantages:
- fast incremental construction
- fast item-wise arithmetic
- fast conversion to CSR/CSC formats
- allows duplicate entries (see the sketch after this list)
Disadvantages:
- slow access
- slow arithmetic
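One detail worth a small sketch: duplicate (row, col) pairs are allowed in COO, and they are summed together when the matrix is converted to CSR/CSC (or when sum_duplicates() is called); the values below are arbitrary:
import numpy as np
from scipy.sparse import coo_matrix

# two entries target the same cell (0, 0)
data = np.array([1.0, 2.0, 5.0])
row = np.array([0, 0, 1])
col = np.array([0, 0, 2])

matrix = coo_matrix((data, (row, col)), shape=(2, 3))
# on conversion the duplicates at (0, 0) are summed into 3.0
print(matrix.tocsr().toarray())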
Compressed Sparse format
Compressed sparse row (CSR) and compressed sparse column (CSC) are the most widely known and used formats. They are mainly used for write-once-read-many tasks.
Internally, CSR is based on three numpy arrays:
- data is an array which contains all non-zero entries in row-major order.
- indptr points to row starts (i.e., tells where each row begins).
- indices is an array of column indices (i.e., tells us which cells have non-zero values).
import numpy as np
from scipy.sparse import random
np.random.seed(10)
# Generate a random binary sparse matrix
matrix = random(5, 5, format='csr', density=0.25)
# Replace all non-zero values with their index number
matrix.data[:] = np.arange(1, matrix.data.shape[0]+1)
We can access and modify these arrays:
>>> print(matrix.toarray())
[[ 1.  0.  0.  2.  0.]
 [ 0.  0.  0.  3.  0.]
 [ 0.  0.  0.  4.  0.]
 [ 0.  0.  0.  0.  0.]
 [ 5.  6.  0.  0.  0.]]
>>> print(matrix.data)
[ 1.  2.  3.  4.  5.  6.]
>>> print(matrix.indices)
[0 3 3 3 0 1]
>>> print(matrix.indptr)
[0 2 3 4 4 6]
A simplified algorithm of item indexing looks as follows:
def get_item(row_index, column_index, matrix):
    # Get the values stored for this row
    row_start = matrix.indptr[row_index]
    row_end = matrix.indptr[row_index + 1]
    row_values = matrix.data[row_start:row_end]

    # Column indices of occupied cells in this row
    row_indices = list(matrix.indices[row_start:row_end])

    # Find the positional index for the requested column
    if column_index in row_indices:
        value_index = row_indices.index(column_index)
        return row_values[value_index]
    else:
        # non-zero value is not found
        return 0
>>> get_item(0, 3, matrix)
2.0
>>> get_item(1, 3, matrix)
3.0
Advantages:
- efficient item access and slicing
- fast arithmetic
- fast vector products (see the sketch after this list)
Disadvantages:
- expensive editing
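To illustrate the "fast vector products" point above, here is a rough sketch of a matrix-vector product on a large CSR matrix; the sizes and density are arbitrary, the point is that only the stored non-zero entries are touched:
import numpy as np
from scipy.sparse import random

np.random.seed(10)

# a 10000 x 10000 matrix with ~0.1% non-zero cells
matrix = random(10 ** 4, 10 ** 4, format='csr', density=0.001)
vector = np.random.rand(10 ** 4)

# the product iterates only over the stored values,
# not over all 10**8 cells of the dense equivalent
result = matrix.dot(vector)
print(result.shape)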
The Compressed Sparse Column (CSC) format is almost identical, except that values are indexed first by column in column-major order. Usually, CSC is used when there are more rows than columns; conversely, CSR works better for a 'wide' matrix.
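A short sketch of that difference: the same random matrix in CSR and CSC form, where row slicing is the cheap operation for CSR and column slicing for CSC (the sizes are arbitrary):
import numpy as np
from scipy.sparse import random

np.random.seed(10)

csr = random(1000, 50, format='csr', density=0.1)
csc = csr.tocsc()   # same values, column-major compressed layout

row = csr[10, :]    # cheap: one indptr lookup selects the row's slice
col = csc[:, 3]     # cheap for the same reason, but on columns
print(row.shape, col.shape)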
Block Sparse Row matrix (BSR) and DIAgonal storages
The diagonal storage format (dia_matrix in scipy) is used when you need to store diagonal matrices. In scipy, the implementation is not limited to the main diagonal only: all diagonals are stored using two arrays, one for the data and one for the diagonal offsets.
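A minimal sketch of the DIA layout: dia_matrix takes a data array with one row per stored diagonal and an offsets array giving each diagonal's position relative to the main one (the values here are arbitrary):
import numpy as np
from scipy.sparse import dia_matrix

# one row of `data` per diagonal; offset 0 is the main diagonal,
# offset 1 the first superdiagonal (entries that fall outside the matrix are ignored)
data = np.array([[1, 2, 3, 4],
                 [5, 6, 7, 8]])
offsets = np.array([0, 1])

matrix = dia_matrix((data, offsets), shape=(4, 4))
print(matrix.toarray())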
The block sparse row format is very similar to CSR, except that it stores regular patterns of blocks (small dense squares) which contain mostly non-zero data.
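As a closing sketch, one way to get a BSR matrix is to convert an existing CSR matrix and pick a block size (the 2x2 block size here is an arbitrary choice, and it must divide the matrix shape evenly):
import numpy as np
from scipy.sparse import random

np.random.seed(10)

# an 8 x 8 CSR matrix converted to BSR with 2x2 blocks
csr = random(8, 8, format='csr', density=0.3)
bsr = csr.tobsr(blocksize=(2, 2))

print(bsr.blocksize)     # (2, 2)
print(bsr.data.shape)    # (number_of_blocks, 2, 2)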