You're reading from Graph Machine Learning Learn about the latest advancements in graph data to build robust machine learning models

Product type Paperback

Published in Jul 2025

Publisher Packt

ISBN-13 9781803248066

Length 434 pages

Edition 2nd Edition

Languages

Python

Tools

PyTorch

Concepts

Machine Learning

Authors (3):

Aldo Marzullo

Enrico Deusebio

Claudio Stamile

Table of Contents (20) Chapters

Preface

1. Part 1: Introduction to Graph Machine Learning

2. Getting Started with Graphs

3. Graph Machine Learning

4. Neural Networks and Graphs

5. Part 2: Machine Learning on Graphs

6. Unsupervised Graph Learning

7. Supervised Graph Learning

8. Solving Common Graph-Based Machine Learning Problems

9. Part 3: Practical Applications of Graph Machine Learning

10. Social Network Graphs

11. Text Analytics and Natural Language Processing Using Graphs

12. Graph Analysis for Credit Card Transactions

13. Building a Data-Driven Graph-Powered Application

14. Part 4: Advanced topics in Graph Machine Learning

15. Temporal Graph Machine Learning

16. GraphML and LLMs

17. Novel Trends on Graphs

18. Index

19. Other Books You May Enjoy

Data resources for network analysis

Digitalization has profoundly changed our lives: today, almost any activity, person, or process generates data, providing a huge amount of information that can be explored, analyzed, and used to support data-driven decision-making. A few decades ago, it was hard to find datasets ready to be used to develop or test new algorithms. Today, by contrast, plenty of repositories provide datasets, some of them fairly large, that can be downloaded and analyzed. These repositories, where people can share datasets, also serve as benchmarks on which algorithms can be applied, validated, and compared with each other.

In this section, we will briefly go through some of the main repositories and file formats used in network science, to give you the tools needed to import datasets of different sizes to analyze and experiment with.

In such repositories, you will find network datasets coming from some of the common areas of network science, such as social networks, biochemistry, dynamic networks, documents, co-authoring and citation networks, and networks arising from financial transactions. In Part 3, Practical Applications of Graph Machine Learning, we will discuss some of the most common types of networks (social networks, graphs that arise when processing a corpus of documents, and financial networks) and analyze them more thoroughly by applying the techniques and algorithms described in Part 2, Machine Learning on Graphs.

networkx also ships with some basic, very small networks that are generally used to explain algorithms and basic measures; they are listed at https://networkx.org/documentation/stable/reference/generators.html#module-networkx.generators.social. For larger datasets, refer to the repositories we present next.
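As a quick illustration, the following minimal sketch loads one of these built-in graphs, Zachary's karate club, using the generator from the module linked above. The variable name karate is only illustrative.

```python
import networkx as nx

# Zachary's karate club: one of the small social networks bundled with networkx
karate = nx.karate_club_graph()

print(karate.number_of_nodes())  # 34
print(karate.number_of_edges())  # 78
# Each node carries a "club" attribute naming the faction the member joined
print(karate.nodes[0]["club"])   # Mr. Hi
```

Graphs of this size are convenient for trying out a measure or a plot before running it on a dataset with thousands of nodes.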

Network Repository

Network Repository (http://networkrepository.com/) is one of the largest repositories of network data, with several thousand different networks and with users and donations from all over the world, including top-tier academic institutions. If a network dataset is freely available, chances are that you will find it there. Datasets are classified into about 30 domains, including biology, economics, citations, social network data, industrial applications (energy, road), and many others. Besides providing the data, the website also offers a tool for interactive visualization, exploration, and comparison of datasets, and we suggest you check it out and explore it.

The data in Network Repository is generally available in the Matrix Market Exchange Format (MTX). MTX is a file format for specifying dense or sparse matrices, real or complex, via human-readable text files (American Standard Code for Information Interchange, or ASCII). For more details, please refer to http://math.nist.gov/MatrixMarket/formats.html#MMformat.

A file in MTX format can be easily read in Python using SciPy. Some of the files we downloaded from Network Repository seemed slightly corrupted and required a minimal fix before they could be read (we observed this on a macOS 10.15.2 system). To fix them, just make sure the header of the file complies with the format specification, that is, it starts with a double % and has no spaces at the beginning of the line, as in the following line:

%%MatrixMarket matrix coordinate pattern symmetric

Matrices should be in coordinate format. In this case, the header also specifies an unweighted, undirected graph (as indicated by pattern and symmetric). Some files have comments after the first header line, each preceded by a single %.
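The header fix described above can be scripted. The following is a minimal sketch rather than a complete validator: the helper name fix_mtx_header is hypothetical, and it assumes the only damage is leading spaces or a wrong number of % signs on the first line.

```python
def fix_mtx_header(lines):
    """Return a copy of `lines` with the first line normalized to an MTX banner.

    Only the banner is touched: leading spaces are dropped and the line is
    made to start with exactly two % signs. Other lines are returned as-is.
    """
    lines = list(lines)
    if not lines:
        return lines
    # Text that follows any leading spaces and % signs
    body = lines[0].lstrip().lstrip("%").lstrip()
    if body.lower().startswith("matrixmarket"):
        lines[0] = "%%" + body
    return lines


broken = [" %MatrixMarket matrix coordinate pattern symmetric", "3 3 2", "2 1", "3 2"]
print(fix_mtx_header(broken)[0])
# %%MatrixMarket matrix coordinate pattern symmetric
```

If the first line is not recognizable as a banner, the function leaves the file untouched instead of guessing.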

As an example, we consider the Astro Physics (ASTRO-PH) collaboration network. The graph is generated using all the scientific papers available from the e-print arXiv repository published in the Astrophysics category in the period from January 1993 to April 2003. The network is built by connecting (via undirected edges) all the authors that co-authored a publication, thus resulting in a clique that includes all authors of a given paper. The code to generate the graph can be seen here:

import networkx as nx
from scipy.io import mmread

adj_matrix = mmread("ca-AstroPh.mtx")
# from_scipy_sparse_matrix was removed in NetworkX 3.0
graph = nx.from_scipy_sparse_array(adj_matrix)

The dataset has 17,903 nodes, connected by 196,072 edges. Visualizing so many nodes cannot be done easily, and even if we were to do it, it might not be very informative, as understanding the underlying structure would not be very easy with so much information. However, we can get some insights by looking at specific subgraphs, as we will do next.
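Before moving on, it can be useful to check that a file loads with the expected size. The sketch below does this on a tiny, made-up MTX file written to a temporary directory, so that it runs without downloading anything. The toy content and the names toy_mtx and toy_graph are assumptions for illustration, and nx.from_scipy_sparse_array requires a recent NetworkX release.

```python
import os
import tempfile

import networkx as nx
from scipy.io import mmread

# A tiny stand-in for ca-AstroPh.mtx: 4 nodes, 3 undirected edges.
# Only the lower triangle is stored, as the "symmetric" qualifier requires.
toy_mtx = """%%MatrixMarket matrix coordinate pattern symmetric
% comment lines start with a single %
4 4 3
2 1
3 1
4 3
"""

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.mtx")
    with open(path, "w") as handle:
        handle.write(toy_mtx)
    adj_matrix = mmread(path)

toy_graph = nx.from_scipy_sparse_array(adj_matrix)
print(toy_graph.number_of_nodes(), toy_graph.number_of_edges())  # 4 3
```

Note that MTX indices start at 1, whereas the nodes of the resulting graph are numbered from 0, so the entry "2 1" becomes the edge between nodes 1 and 0.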

First, we can start by computing some basic properties we described earlier and put them into a pandas DataFrame for our convenience to later use, sort, and analyze. The code to accomplish this is illustrated in the following snippet (it may require several minutes to complete):

import pandas as pd

stats = pd.DataFrame({
    "centrality": nx.betweenness_centrality(graph),
    "C_i": nx.clustering(graph),
    # Convert the DegreeView to a dict so pandas can align it by node ID
    "degree": dict(nx.degree(graph))
})

We can easily find out that the node with the largest degree is the one with ID 6933, which has 503 neighbors (surely a very popular and important scientist in astrophysics!). Its neighbors can be collected as follows:

neighbors = list(graph.neighbors(6933))
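One way to locate such a node programmatically is to call idxmax on the degree column. The sketch below applies the idea to the small built-in karate club graph as a stand-in for the ASTRO-PH network, so it runs in a fraction of a second. The names toy, toy_stats, and hub are illustrative only.

```python
import networkx as nx
import pandas as pd

toy = nx.karate_club_graph()  # small built-in graph used as a stand-in
toy_stats = pd.DataFrame({
    "centrality": nx.betweenness_centrality(toy),
    "C_i": nx.clustering(toy),
    "degree": dict(toy.degree()),
})

hub = toy_stats["degree"].idxmax()        # node ID with the most neighbors
print(hub, toy_stats.loc[hub, "degree"])  # 33 17
```

The same two lines, applied to the stats DataFrame built earlier, return the ID and degree of the most connected author.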

Of course, plotting its full ego network (the node together with all its neighbors) would still be messy. One way to produce subgraphs that can be plotted is to sample its neighbors (for example, with a 0.1 ratio) in three different ways: at random (taking the first entries after sorting by node index, which is effectively arbitrary with respect to the graph structure), by selecting the most central neighbors, or by selecting the neighbors with the largest C_i values. The code to accomplish this is shown in the following snippet:

sampling = 0.1 # this represents 10% of the neighbors
nTop = round(len(neighbors)*sampling)
idx = {
    "random": stats.loc[neighbors].sort_index().index[:nTop],
    "centrality": stats.loc[neighbors]\
         .sort_values("centrality", ascending=False)\
         .index[:nTop],
    "C_i": stats.loc[neighbors]\
         .sort_values("C_i", ascending=False)\
         .index[:nTop]
}

We can then define a simple function for extracting and plotting a subgraph that includes only the nodes related to certain indices, as shown in the following code snippet:

def plotSubgraph(graph, indices, center = 6933):
    nx.draw_kamada_kawai(
        nx.subgraph(graph, list(indices) + [center])
    )

Using the function above, we can plot the different subgraphs. Each subgraph will be obtained by filtering the ego network using three different criteria, based on random sampling, centrality, and the clustering coefficient. An example is provided here:

plotSubgraph(graph, idx["random"])

In Figure 1.23, we compare these results; the other two networks are obtained by changing the key to centrality and C_i. The random sample seems to show some emerging structure with separate communities. The graph with the most central nodes shows an almost fully connected network, possibly made up of full professors and influential figures in astrophysics who publish on multiple topics and collaborate frequently with each other. Finally, the last representation, obtained by selecting the nodes with the highest clustering coefficient, highlights some specific communities, possibly connected with a specific topic. These nodes might not have a large centrality, but they represent specific topics very well. You can see examples of the ego subgraph here:

Figure 1.23: Examples of the ego subgraph for the node that has the largest degree in the ASTRO-PH dataset. Neighbors are sampled with ratio=0.1: random sampling (left); nodes with largest betweenness centrality (center); nodes with largest clustering coefficient (right)

An alternative to plotting the network with NetworkX is to use the Gephi software, which allows for fast filtering and visualization of graphs. To do so, we first need to export the data in Graph Exchange XML Format (GEXF), a file format that can be imported in Gephi, as follows:

nx.write_gexf(graph, "ca-AstroPh.gexf")

Once data is imported in Gephi, with a few filters (by centrality or degree) and some computations (modularity), you can easily do plots as nice as the one shown in Figure 1.24, where nodes have been colored using modularity in order to highlight clusters. Coloring also allows us to easily spot nodes that connect the different communities and that therefore have large betweenness.

Figure 1.24: Example of a visualization of the ASTRO-PH dataset with Gephi. Nodes are filtered by degree centrality and colored by modularity class; node sizes are proportional to the degree

Some of the datasets in Network Repository may also be available in the EDGE file format (for instance, the citation networks). The EDGE file format differs slightly from the MTX file format, although it represents the same information. Probably the easiest way to import such files into NetworkX is to convert them by simply rewriting their header. Take, for instance, the Digital Bibliography and Library Project (DBLP) citation network.

The header of the file in this case reads:

% asym unweighted
% 49743 12591 12591

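This can be easily converted to comply with the MTX file format by replacing these lines with the following ones. Note that the EDGE header lists the number of edges first, whereas the MTX size line lists the number of rows, the number of columns, and then the number of entries: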

%%MatrixMarket matrix coordinate pattern general
12591 12591 49743

Then, you can use the import functions described previously.
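The header rewrite described above can also be scripted. The following is a minimal sketch that assumes exactly the two-line header layout shown for the DBLP file and handles only the asym unweighted case. The function name edge_header_to_mtx is hypothetical.

```python
def edge_header_to_mtx(lines):
    """Rewrite a two-line EDGE header into an MTX banner plus size line.

    Assumes the layout shown above: "% asym unweighted" followed by
    "% <edges> <rows> <cols>". Lines are expected without trailing newlines.
    """
    kind = lines[0].lstrip("% ").split()
    if kind != ["asym", "unweighted"]:
        raise ValueError(f"unsupported EDGE header: {lines[0]!r}")
    n_edges, n_rows, n_cols = lines[1].lstrip("% ").split()
    banner = "%%MatrixMarket matrix coordinate pattern general"
    # MTX size line order: rows, columns, number of entries
    return [banner, f"{n_rows} {n_cols} {n_edges}"] + list(lines[2:])


edge_lines = ["% asym unweighted", "% 49743 12591 12591", "1 2", "1 3"]
for line in edge_header_to_mtx(edge_lines)[:2]:
    print(line)
# %%MatrixMarket matrix coordinate pattern general
# 12591 12591 49743
```

Raising an error for any other header keeps the sketch from silently producing a file with the wrong symmetry or weight settings.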

Stanford Large Network Dataset Collection

Another valuable source of network datasets is the website of the Stanford Network Analysis Platform (SNAP) (https://snap.stanford.edu/index.html). SNAP is a general-purpose network analysis library written to handle even fairly large graphs, with hundreds of millions of nodes and billions of edges. It is written in C++ to achieve top computational performance, but it also features Python interfaces so that it can be imported and used in native Python applications.

Although networkx is currently the main library to study networks in Python, SNAP or other libraries (more on this shortly) can be orders of magnitude faster than networkx, and they may be used in place of networkx for tasks that require higher performance. On the SNAP website, you will find a specific web page for Biomedical Network Datasets (https://snap.stanford.edu/biodata/index.html), besides other more general networks (https://snap.stanford.edu/data/index.html), covering similar domains and datasets as Network Repository, described previously.

Data is generally provided as a text file containing a list of edges. Such files can be read with networkx in one line of code, using the following command:

g = nx.read_edgelist("amazon0302.txt")

Some graphs come with extra information beyond the edges. This extra information is included in the dataset archive as a separate file, for example, a file with node metadata that is related to the graph via the node ID.
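The following sketch shows how the reading options and the metadata join might look in practice. It uses a tiny, made-up edge list and made-up node titles written to a temporary directory, since the layout of the metadata file differs from dataset to dataset. The names toy_edges and toy_metadata are assumptions for illustration.

```python
import os
import tempfile

import networkx as nx

# Toy stand-in for a SNAP edge list: "#" comment lines, tab-separated node IDs
toy_edges = "# FromNodeId\tToNodeId\n0\t1\n0\t2\n1\t2\n"
# Hypothetical node metadata, as it might be parsed from a companion file
toy_metadata = {0: {"title": "A"}, 1: {"title": "B"}, 2: {"title": "C"}}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_edges.txt")
    with open(path, "w") as handle:
        handle.write(toy_edges)
    g = nx.read_edgelist(
        path,
        comments="#",             # skip header lines that start with #
        create_using=nx.DiGraph,  # keep the direction of each edge
        nodetype=int,             # parse node IDs as integers, not strings
    )

nx.set_node_attributes(g, toy_metadata)  # join the metadata on the node ID
print(g.number_of_edges(), g.nodes[1]["title"])  # 3 B
```

By default, read_edgelist builds an undirected graph with string node IDs, so passing create_using and nodetype is worth considering for directed datasets whose metadata is keyed by integer ID.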

Graphs can also be read directly using the SNAP library and its interface via Python. If you have a working version of SNAP on your local machine, you can easily read the data as follows:

from snap import LoadEdgeList, PNGraph
graph = LoadEdgeList(PNGraph, "amazon0302.txt", 0, 1, '\t')

Keep in mind that, at this point, you will have an instance of a PNGraph object of the SNAP library, and you can’t directly use NetworkX functionalities on this object. If you want to use some NetworkX functions, you first need to convert the PNGraph object to a networkx object.

You can do that by creating a new graph and adding nodes and edges from PNGraph by using the networkx functionalities we have seen before.
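A minimal sketch of such a conversion is shown below. It assumes that the SNAP graph exposes Nodes() and Edges() iterators whose items provide GetId(), GetSrcNId(), and GetDstNId(), as in the snap.py interface; the function name snap_to_networkx is hypothetical, and node or edge attributes are not copied.

```python
import networkx as nx


def snap_to_networkx(snap_graph, directed=True):
    """Copy nodes and edges from a SNAP graph into a new networkx graph.

    Assumes `snap_graph` offers Nodes() and Edges() iterators with
    GetId(), GetSrcNId(), and GetDstNId() accessors.
    """
    nx_graph = nx.DiGraph() if directed else nx.Graph()
    # Add nodes first, so that isolated nodes are not lost
    nx_graph.add_nodes_from(node.GetId() for node in snap_graph.Nodes())
    nx_graph.add_edges_from(
        (edge.GetSrcNId(), edge.GetDstNId()) for edge in snap_graph.Edges()
    )
    return nx_graph
```

With the PNGraph loaded earlier, a call such as snap_to_networkx(graph) would return a networkx DiGraph; for an undirected SNAP graph, pass directed=False.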

Open Graph Benchmark

This is the most recent update (dated May 2020) in the graph benchmark landscape, and this repository is expected to gain increasing importance and support in the coming years. The Open Graph Benchmark (OGB) was created to address one specific issue: current benchmarks are too small, compared to real applications, to be useful for ML advances. On the one hand, some of the models developed on small datasets turn out to be unable to scale to large datasets, which makes them unsuitable for real-world applications. On the other hand, large datasets allow us to increase the capacity (complexity) of the models used in ML tasks and to explore new algorithmic solutions (such as neural networks) that benefit from a large sample size to be trained effectively, allowing us to achieve very high performance. The datasets belong to diverse domains and are grouped into three size categories (small, medium, and large), where the small graphs, despite their name, already have more than 100,000 nodes and/or more than 1 million edges. At the other end, large graphs have more than 100 million nodes and more than 1 billion edges, facilitating the development of scalable models.

Besides the datasets, the OGB also provides, in a Kaggle fashion, an end-to-end ML pipeline that standardizes the data loading, experimental setup, and model evaluation. OGB creates a platform to compare and evaluate models against each other, publishing a leaderboard that allows tracking of the performance evolution and advancements on specific tasks of node, edge, and graph property prediction. For more details on the datasets and the OGB project, please refer to the following paper by Hu et al. (2021): https://arxiv.org/pdf/2005.00687.pdf.