Sparse data structures in Python

written on December 29, 2017, in Python

Imagine you have a 2-D matrix with hundreds of million elements, where only a few of them contain non-zero values. When storing such a matrix using conventional approach, we would waste a lot of space for zeros.

Sparse data structures allow us to store only non-zero values assuming the rest of them are zeros. This approach saves a lot of memory and computing time. In fact, you can often encounter such matrices when working with NLP or machine learning tasks.

In Python, sparse data structures are implemented in scipy.sparse module, which mostly based on regular numpy arrays.

>>> import numpy as np
>>> from scipy.sparse import csc_matrix
>>> csc_matrix((3, 4), dtype=np.int8).toarray()
array([[0, 0, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 0]], dtype=int8)

Let's create a random sparse matrix and compare its size to an identical regular one:

from scipy.sparse import random

def get_sparse_size(matrix):
    # get size of a sparse matrix
    return int((matrix.data.nbytes + matrix.indptr.nbytes + matrix.indices.nbytes) / 1024.)

# create a sparse matrix, 1000 x 100000
sparse_mat = random(10 ** 3, 10 ** 5, format='csr')

# get size of a sparse matrix
sparse_size = get_sparse_size(sparse_mat)

# convert sparse matrix to a regular matrix and get its size
regular_size = sparse_mat.toarray().nbytes / 1024.
>>> print("The size of sparse matrix is %s KiB" % sparse_size)
The size of sparse matrix is 11722 KiB
>>> print("The size of regular matrix is %s KiB" % regular_size)
The size of regular matrix is 781250.0 KiB
>>> print("Data compression ratio is %s" % (regular_size / sparse_size))
Data compression ratio is 66.6481829039413

Sparse matrix types in scipy

There are many ways to represent a sparse matrix, Scipy provides seven of them:

  • Block Sparse Row matrix (BSR)
  • Coordinate list matrix (COO)
  • Compressed Sparse Column matrix (CSC)
  • Compressed Sparse Row matrix (CSR)
  • Sparse matrix with DIAgonal storage (DIA)
  • Dictionary Of Keys based sparse matrix (DOK)
  • Row-based linked list sparse matrix (LIL)

Each format has its pros and cons, so it is important to know about the difference between them.

Dictionary of keys (DOK)

Dictionary of keys (dok_matrix in scipy) is the easiest way to implement a sparse matrix. As the name suggests, it's based on a dictionary, in which the keys are tuples representing indices, i.e. tuple(row, column).

>>> import numpy as np
>>> from scipy.sparse import random
>>> np.random.seed(10)
>>>
>>> matrix = random(3, 3, format='dok', density=0.4)
>>> matrix[1, 1] = 33
>>> # dict[tuple(row, column)] = value
... matrix[(2, 1)] = 10
>>>
>>> matrix.toarray()
[[  0.49850701   0.           0.        ]
 [  0.22479665  33.           0.        ]
 [  0.          10.           0.        ]]
>>> dict(matrix)
{(1, 0): 0.22479664553084766, (0, 0): 0.49850701230259042, (1, 1): 33.0, (2, 1): 10.0}

Advantages:

  • efficient access to individual items (O(1) on average)
  • fast construction
  • flexible structure
  • can be efficiently converted to COO format

Disadvantages:

  • very slow iteration in lexicographical order (due to the random order of keys)
  • slow arithmetics
  • slow slicing

List of list (LIL)

A row-based format (lil_matrix in scipy), which uses two numpy arrays with regular Python lists inside them. The rows array stores information about occupied cells, whereas the data array stores corresponding values.

>>> import numpy as np
>>> from scipy.sparse import random
>>>
>>> np.random.seed(10)
>>>
>>> matrix = random(3, 3, format='lil', density=0.6)
>>> matrix.toarray()
array([[ 0.        ,  0.08833981,  0.16911084],
       [ 0.        ,  0.        ,  0.        ],
       [ 0.19806286,  0.76053071,  0.22479665]])
>>> matrix.data
array([[0.08833981417401027, 0.16911083656253545], [],
       [0.19806286475962398, 0.7605307121989587, 0.22479664553084766]], dtype=object)
>>> matrix.rows
array([[1, 2], [], [0, 1, 2]], dtype=object)

A simplified algorithm of retrieving a single item:

def get_item(row_index, column_index, matrix):
    row_values = matrix.data[row_index]
    row_indices = matrix.rows[row_index]

    value_index = row_indices.index(column_index)
    if value_index >= 0:
        return row_values[value_index]
    else:
        return 0

print(matrix[2, 2])

Advantages:

  • fast incremental construction

Disadvantages:

  • slow arithmetics
  • slow column slicing
  • memory greedy

Coordinate list (COO)

In scipy, a COO (coo_matrix) format uses three arrays, for every non-zero value, there is an entry in all of them.

>>> import numpy as np
>>> from scipy.sparse import random
>>> np.random.seed(11)
>>>
>>> matrix = random(3, 3, format='coo', density=0.35)
>>>
>>> matrix.toarray()
array([[ 0.72493393,  0.        ,  0.        ],
       [ 0.4202036 ,  0.        ,  0.4854271 ],
       [ 0.        ,  0.        ,  0.        ]])

The data array is storing all non-zero values, whereas row and col are storing corresponding indices for these values.


datarowcol
0.7249339300
0.420203610
0.485427112


To find a specific value in the matrix, you need to iterate over both index arrays, which makes accessing slow when comparing to other formats.

Advantages:

  • fast incremental construction
  • fast item-wise arithmetics
  • fast conversion to CSR/CSC formats
  • allows duplicate items

Disadvantages:

  • slow access
  • slow arithmetics

Compressed Sparse format

Compressed sparse row (CSR) and compressed sparse column (CSC) are widely known and most used formats. Mainly, they are used for write-once-read-many tasks.

Internally, CSR is based on three numpy arrays:

  • data is an array which contains all non-zero entries in the row-major order.
  • indptr points to row starts (i.e., tells where each row begins).
  • indices is array of column indices (i.e., tells us which cells have non-zero values)
import numpy as np
from scipy.sparse import random

np.random.seed(10)

# Generate a random binary sparse matrix
matrix = random(5, 5, format='csr', density=0.25)
# Replace all non zero values with index number
matrix.data[:] = np.arange(1, matrix.data.shape[0]+1)

We can access and modify these arrays:

>>> print(matrix.toarray())
[[ 1.  0.  0.  2.  0.]
 [ 0.  0.  0.  3.  0.]
 [ 0.  0.  0.  4.  0.]
 [ 0.  0.  0.  0.  0.]
 [ 5.  6.  0.  0.  0.]]

>>> matrix.data
[ 1.  2.  3.  4.  5.  6.]
>>> matrix.indices
[0 3 3 3 0 1]
>>> matrix.indptr
[0 2 3 4 4 6]

A simplified algorithm of item indexing looks as follows:

def get_item(row_index, column_index, matrix):
    # Get row values
    row_start = matrix.indptr[row_index]
    row_end = matrix.indptr[row_index + 1]
    row_values = matrix.data[row_start:row_end]

    # Get column indices of occupied values
    index_start = matrix.indptr[row_index]
    index_end = matrix.indptr[row_index + 1]

    # contains indices of occupied cells at a specific row
    row_indices = list(matrix.indices[index_start:index_end])

    # Find a positional index for a specific column index
    value_index = row_indices.index(column_index)

    if value_index >= 0:
        return row_values[value_index]
    else:
        # non-zero value is not found
        return 0
>>> get_item(0, 3, matrix)
2.0
>>> get_item(1, 3, matrix)
3.0

Advantages:

  • efficient item access and slicing
  • fast arithmetics
  • fast vector products

Disadvantages:

  • expensive editing

The Compressed Sparse Column (CSC) format is almost identical, except that values are indexed first by column with a column-major order. Usually, the CSC is used when there are more rows than columns. On the contrary, the CSR works better for a 'wide' format.

Block Sparse Row matrix (BSR) and DIAgonal storages

The diagonal storage (dia_matrix is scipy) is used when you need to store diagonal matrices. In scipy, the implementation is not limited to main diagonal only. All diagonals are stored using two arrays, one for data and one for diagonal offsets.

The block sparse row format is very similar to CSR, except it stores regular patterns of blocks (squares) which contain mostly non-zero data.

Category: python , tagged as: scipy, advanced python