What is a Sparse Matrix? Imagine you have a two-dimensional data set with 10 rows and 10 columns such that each element contains a value. We can also call such data as matrix, in this example it is a dense 10 x 10 matrix. Now imagine, you have a 10 x 10 matrix with only very few elements of the matrix is non-zero. In that case, storing the data in such a two dimensional data structure is waste of space. When the dimension of such data is large, it becomes almost impossible to use/store.
Sparse matrices are memory efficient data structures that enable us store large matrices with very few non-zero elements aka sparse matrices. In addition to efficient storage, sparse matrix data structure also allows us to perform complex matrix computations. The ability to do such computations is incredibly powerful in a variety of data science problems. Learning to work with Sparse matrix, a large matrix or 2d-array with a lot elements being zero, can be extremely handy.
Python’s SciPy library has a lot of options for creating, storing, and operating with Sparse matrices. There are 7 different types of sparse matrices available.
- bsr_matrix: Block Sparse Row matrix
- coo_matrix: COOrdinate format matrix
- csc_matrix: Compressed Sparse Column matrix
- csr_matrix: Compressed Sparse Row matrix
- dia_matrix: Sparse matrix with DIAgonal storage
- dok_matrix: Dictionary Of Keys based sparse matrix.
- lil_matrix: Row-based linked list sparse matrix
How to Choose the Right Sparse Matrix?
Each of these sparse matrix are efficient and fast for specific operations. For example, if you want to construct a new sparse matrix from scratch lil_matrix or dok_matrix are efficient. However, arithmetic operations on matrices are not efficient. coo_matrix has similar properties; good for creating sparse matrix, but bad for operations.
If you are interested in matrix operations, like multiplication or inversion either CSC or CSR sparse matrix format is more suitable/efficient. Due to the nature of the data structure, csc_matrix has faster/efficient column slicing, while csr_matrix has faster row slicing.
In this post, we will see a few simple examples of creating sparse matrix and using them in Python. Let us get started with loading the necessary packages/modules upfront. We will be using SciPy’s sparse module for the sparse matrices.
Choosing the right sparse matrix depends on the application. Typically, you may have to use multiple sparse matrix formats to get the job done. SciPy’s sparse module has really nice functions to convert one sparse matrix type to another.
# import sparse module from SciPy package from scipy import sparse # import uniform module to create random numbers from scipy.stats import uniform # import NumPy import numpy as np
How to Create COO sparse matrix?
One of the more intuitive matrices is COOordinate sparse matrix. One can create COO sparse matrix fast. We basically need the co-ordinates of non-zero elements in the sparse matrix.
To create a coo_matrix we need 3 one-dimensional numpy arrays. The first array represents the row indices, the second array represents column indices and the third array represents non-zero data in the element. The row and column indices specify the location of non-zero element and the data array specifies the actual non-zero data in it.
Let us create a sparse matrix in COO format using simple example. Let us first create 3 numpy arrays needed to create COO sparse matrix.
# row indices row_ind = np.array([0, 1, 1, 3, 4]) # column indices col_ind = np.array([0, 2, 4, 3, 4]) # data to be stored in COO sparse matrix data = np.array([1, 2, 3, 4, 5], dtype=float)
We can use sparse.coo_matrix to create sparse matrix in COO format. It takes data and the row and column index tuple as arguments.
# create COO sparse matrix from three arrays >mat_coo = sparse.coo_matrix((data, (row_ind, col_ind))) # print coo_matrix >print(mat_coo) (0, 0) 1.0 (1, 2) 2.0 (1, 4) 3.0 (3, 3) 4.0 (4, 4) 5.0
coo_matrix has lots of useful functions including function to convert coo_matrix to other sparse matrices and also to dense matrix. Here is a function toarray to see the 2d-array of the sparse matrix that we just created.
>print(mat_coo.toarray()) [[ 1. 0. 0. 0. 0.] [ 0. 0. 2. 0. 3.] [ 0. 0. 0. 0. 0.] [ 0. 0. 0. 4. 0.] [ 0. 0. 0. 0. 5.]]
Here is an example to convert coo matrix to CSC sparse matrix.
>print(mat_coo.tocsc()) (0, 0) 1.0 (1, 2) 2.0 (3, 3) 4.0 (1, 4) 3.0 (4, 4) 5.0
Note the order of data stored in CSC format is different from COO sparse matrix.
How to Create Sparse Matrix from a Dense (full) Matrix?
In the above example of creating COO matrix we had our data in nice format and thus we could use it create COO sparse matrix. Often, we start with a full matrix as input. Here is an example showing how to use existing 2d-array/matrix to create a sparse matrix.
This time we will create csr_matrix sparse matrix. And we will also create the full matrix using random numbers from uniform distribution in SciPy.stats.
Let us first set a seed for random number generation, so that we can reproduce the same random numbers. We will use SciPy.stats module to create a toy sparse matrix with just 4 rows and 4 columns.
We will first create uniform random numbers from 0 to 2 in a 1d NumPy array. And then use reshape function to make it a 2d-numpy array i.e. a matrix.
np.random.seed(seed=42) data = uniform.rvs(size=16, loc = 0, scale=2) data = np.reshape(data, (4, 4)) data
We can see that we have created 4×4 2d-array with uniform random numbers.
array([[ 0.74908024, 1.90142861, 1.46398788, 1.19731697], [ 0.31203728, 0.31198904, 0.11616722, 1.73235229], [ 1.20223002, 1.41614516, 0.04116899, 1.9398197 ], [ 1.66488528, 0.42467822, 0.36364993, 0.36680902]])
Let us convert this full matrix into a sparse matrix. Let us first make some of the elements of matrix zero. Here any element with values less than 1 will be assigned to 0. Now half the elements of this matrix are zero.
# make elements with value less < 1 to zero data[data < 1] = 0
We can see that elements with value less than 1 is zero now.
>data array([[ 0. , 1.90142861, 1.46398788, 1.19731697], [ 0. , 0. , 0. , 1.73235229], [ 1.20223002, 1.41614516, 0. , 1.9398197 ], [ 1.66488528, 0. , 0. , 0. ]])
Let us convert this full matrix with zeroes to sparse matrix using sparse module in SciPy. As you just saw, SciPy has multiple options for sparse matrices. We will be using csr_matrix, where csr stands for Compressed Sparse Row.
data_csr = sparse.csr_matrix(data)
We can also print the small sparse matrix to see how the data is stored.
print(data_csr) (0, 1) 1.90142861282 (0, 2) 1.46398788362 (0, 3) 1.19731696839 (1, 3) 1.73235229155 (2, 0) 1.20223002349 (2, 1) 1.41614515559 (2, 3) 1.93981970432 (3, 0) 1.6648852816
We can see that in the csr sparse matrix , we have only nonzero elements. Also the elements are stored row wise, leaving any zero element. The toy example showed how to create sparse matrix from a full matrix in Python.
How much space do we gain by storing a big sparse matrix in SciPy.sparse?
One of the real uses of sparse matrix is the huge space reduction to store sparse matrices. Let us create a bigger full matrix using uniform random numbers.
np.random.seed(seed=42) data = uniform.rvs(size=1000000, loc = 0, scale=2) data = np.reshape(data, (10000, 100))
Let us make the matrix sparse by making certain elements zero. As before, we make any element whose value is less than 1 to 0. We can use nbytes function in NumPy to get the number of bytes and get the size of the matrix in MB.
>data[data < 1] = 0 >data_size = data.nbytes/(1024**2) >print('Size of full matrix with zeros: '+ '%3.2f' %data_size + ' MB') Size of full matrix with zeros: 76.29 MB
We can see the size of full matrix of size 1 Million elements with half of them with values zero is about 80 MB.
>data_csr = sparse.csr_matrix(data) >data_csr_size = data_csr.data.size/(1024**2) >print('Size of sparse csr_matrix: '+ '%3.2f' %data_csr_size + ' MB') Size of sparse csr_matrix: 4.77 MB
With the use of sparse matrix, the size of the data in the sparse matrix is just about 5MB, a huge reduction is space. This is mainly due efficient data structure to store only the non-zero elements.