# Sparse data structures in Python

Imagine you have a 2-D matrix with hundreds of million elements, where only a few of them contain non-zero values. When storing such a matrix using conventional approach, we would waste a lot of space for zeros.

Sparse data structures allow us to store only non-zero values assuming the rest of them are zeros. This approach saves a lot of memory and computing time. In fact, you can often encounter such matrices when working with NLP or machine learning tasks.

In Python, sparse data structures are implemented in scipy.sparse module, which mostly based on regular `numpy`

arrays.

>>> import numpy as np >>> from scipy.sparse import csc_matrix >>> csc_matrix((3, 4), dtype=np.int8).toarray() array([[0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]], dtype=int8)

Let's create a random sparse matrix and compare its size to an identical regular one:

from scipy.sparse import random def get_sparse_size(matrix): # get size of a sparse matrix return int((matrix.data.nbytes + matrix.indptr.nbytes + matrix.indices.nbytes) / 1024.) # create a sparse matrix, 1000 x 100000 sparse_mat = random(10 ** 3, 10 ** 5, format='csr') # get size of a sparse matrix sparse_size = get_sparse_size(sparse_mat) # convert sparse matrix to a regular matrix and get its size regular_size = sparse_mat.toarray().nbytes / 1024.

>>> print("The size of sparse matrix is %s KiB" % sparse_size) The size of sparse matrix is 11722 KiB >>> print("The size of regular matrix is %s KiB" % regular_size) The size of regular matrix is 781250.0 KiB >>> print("Data compression ratio is %s" % (regular_size / sparse_size)) Data compression ratio is 66.6481829039413

## Sparse matrix types in scipy

There are many ways to represent a sparse matrix, Scipy provides seven of them:

- Block Sparse Row matrix (BSR)
- Coordinate list matrix (COO)
- Compressed Sparse Column matrix (CSC)
- Compressed Sparse Row matrix (CSR)
- Sparse matrix with DIAgonal storage (DIA)
- Dictionary Of Keys based sparse matrix (DOK)
- Row-based linked list sparse matrix (LIL)

Each format has its pros and cons, so it is important to know about the difference between them.

### Dictionary of keys (DOK)

Dictionary of keys (`dok_matrix`

in scipy) is the easiest way to implement a sparse matrix. As the name suggests, it's based on a dictionary, in which the keys are tuples representing indices, i.e. `tuple(row, column)`

.

>>> import numpy as np >>> from scipy.sparse import random >>> np.random.seed(10) >>> >>> matrix = random(3, 3, format='dok', density=0.4) >>> matrix[1, 1] = 33 >>> # dict[tuple(row, column)] = value ... matrix[(2, 1)] = 10 >>> >>> matrix.toarray() [[ 0.49850701 0. 0. ] [ 0.22479665 33. 0. ] [ 0. 10. 0. ]] >>> dict(matrix) {(1, 0): 0.22479664553084766, (0, 0): 0.49850701230259042, (1, 1): 33.0, (2, 1): 10.0}

Advantages:

- efficient access to individual items (O(1) on average)
- fast construction
- flexible structure
- can be efficiently converted to COO format

Disadvantages:

- very slow iteration in lexicographical order (due to the random order of keys)
- slow arithmetics
- slow slicing

### List of list (LIL)

A row-based format (`lil_matrix`

in scipy), which uses two numpy arrays with regular Python lists inside them. The `rows`

array stores information about occupied cells, whereas the `data`

array stores corresponding values.

>>> import numpy as np >>> from scipy.sparse import random >>> >>> np.random.seed(10) >>> >>> matrix = random(3, 3, format='lil', density=0.6) >>> matrix.toarray() array([[ 0. , 0.08833981, 0.16911084], [ 0. , 0. , 0. ], [ 0.19806286, 0.76053071, 0.22479665]]) >>> matrix.data array([[0.08833981417401027, 0.16911083656253545], [], [0.19806286475962398, 0.7605307121989587, 0.22479664553084766]], dtype=object) >>> matrix.rows array([[1, 2], [], [0, 1, 2]], dtype=object)

A simplified algorithm of retrieving a single item:

def get_item(row_index, column_index, matrix): row_values = matrix.data[row_index] row_indices = matrix.rows[row_index] value_index = row_indices.index(column_index) if value_index >= 0: return row_values[value_index] else: return 0 print(matrix[2, 2])

Advantages:

- fast incremental construction

Disadvantages:

- slow arithmetics
- slow column slicing
- memory greedy

### Coordinate list (COO)

In scipy, a COO (`coo_matrix`

) format uses three arrays, for every non-zero value, there is an entry in all of them.

>>> import numpy as np >>> from scipy.sparse import random >>> np.random.seed(11) >>> >>> matrix = random(3, 3, format='coo', density=0.35) >>> >>> matrix.toarray() array([[ 0.72493393, 0. , 0. ], [ 0.4202036 , 0. , 0.4854271 ], [ 0. , 0. , 0. ]])

The `data`

array is storing all non-zero values, whereas `row`

and `col`

are storing corresponding indices for these values.

data | row | col |
---|---|---|

0.72493393 | 0 | 0 |

0.4202036 | 1 | 0 |

0.4854271 | 1 | 2 |

To find a specific value in the matrix, you need to iterate over both index arrays, which makes accessing slow when comparing to other formats.

Advantages:

- fast incremental construction
- fast item-wise arithmetics
- fast conversion to CSR/CSC formats
- allows duplicate items

Disadvantages:

- slow access
- slow arithmetics

### Compressed Sparse format

Compressed sparse row (CSR) and compressed sparse column (CSC) are widely known and most used formats. Mainly, they are used for write-once-read-many tasks.

Internally, CSR is based on three numpy arrays:

`data`

is an array which contains all non-zero entries in the row-major order.`indptr`

points to row starts (i.e., tells where each row begins).`indices`

is array of column indices (i.e., tells us which cells have non-zero values)

import numpy as np from scipy.sparse import random np.random.seed(10) # Generate a random binary sparse matrix matrix = random(5, 5, format='csr', density=0.25) # Replace all non zero values with index number matrix.data[:] = np.arange(1, matrix.data.shape[0]+1)

We can access and modify these arrays:

>>> print(matrix.toarray()) [[ 1. 0. 0. 2. 0.] [ 0. 0. 0. 3. 0.] [ 0. 0. 0. 4. 0.] [ 0. 0. 0. 0. 0.] [ 5. 6. 0. 0. 0.]] >>> matrix.data [ 1. 2. 3. 4. 5. 6.] >>> matrix.indices [0 3 3 3 0 1] >>> matrix.indptr [0 2 3 4 4 6]

A simplified algorithm of item indexing looks as follows:

def get_item(row_index, column_index, matrix): # Get row values row_start = matrix.indptr[row_index] row_end = matrix.indptr[row_index + 1] row_values = matrix.data[row_start:row_end] # Get column indices of occupied values index_start = matrix.indptr[row_index] index_end = matrix.indptr[row_index + 1] # contains indices of occupied cells at a specific row row_indices = list(matrix.indices[index_start:index_end]) # Find a positional index for a specific column index value_index = row_indices.index(column_index) if value_index >= 0: return row_values[value_index] else: # non-zero value is not found return 0

>>> get_item(0, 3, matrix) 2.0 >>> get_item(1, 3, matrix) 3.0

Advantages:

- efficient item access and slicing
- fast arithmetics
- fast vector products

Disadvantages:

- expensive editing

The Compressed Sparse Column (CSC) format is almost identical, except that values are indexed first by column with a column-major order. Usually, the CSC is used when there are more rows than columns. On the contrary, the CSR works better for a 'wide' format.

### Block Sparse Row matrix (BSR) and DIAgonal storages

The diagonal storage (`dia_matrix`

is scipy) is used when you need to store diagonal matrices. In scipy, the implementation is not limited to main diagonal only. All diagonals are stored using two arrays, one for data and one for diagonal offsets.

The block sparse row format is very similar to CSR, except it stores regular patterns of blocks (squares) which contain mostly non-zero data.

Want a monthly digest of these blog posts?