Graph Datasets

A collection of datasets with an underlying graph structure. Some of these datasets contain a single graph, which can be accessed with dataset[:] or dataset[1]. Others contain many graphs, accessed through dataset[i]. Graphs are represented by the MLDatasets.Graph and MLDatasets.HeteroGraph types.

Index

MLDatasets.AQSOL
MLDatasets.ChickenPox
MLDatasets.CiteSeer
MLDatasets.Cora
MLDatasets.Graph
MLDatasets.HeteroGraph
MLDatasets.KarateClub
MLDatasets.METRLA
MLDatasets.MovieLens
MLDatasets.OGBDataset
MLDatasets.OrganicMaterialsDB
MLDatasets.PEMSBAY
MLDatasets.PolBlogs
MLDatasets.PubMed
MLDatasets.Reddit
MLDatasets.TUDataset
MLDatasets.TemporalBrains
MLDatasets.WindMillEnergy

Documentation

MLDatasets.Graph — Type

Graph(; kws...)

A type that represents a graph and can also store node and edge data. It does not distinguish between directed and undirected graphs: an undirected graph is stored with each edge present in both directions. Nodes are indexed in 1:num_nodes.

Graph datasets in MLDatasets.jl contain one or more Graph or HeteroGraph objects.

Keyword Arguments

num_nodes: the number of nodes. If omitted, it is inferred from edge_index.
edge_index: a tuple containing two vectors with length equal to the number of edges. The first vector contains the list of the source nodes of each edge, the second the target nodes. Defaults to (Int[], Int[]).
node_data: node-related data. Can be nothing, a named tuple of arrays or a dictionary of arrays. The arrays' last dimension size should be equal to the number of nodes. Default nothing.
edge_data: edge-related data. Can be nothing, a named tuple of arrays or a dictionary of arrays. The arrays' last dimension size should be equal to the number of edges. Default nothing.
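
As a minimal sketch of how these keyword arguments fit together, the snippet below builds a small directed 4-node cycle by hand. The field names features and weights are illustrative choices, not names required by the library; the only constraint taken from the description above is that the last dimension of each array matches the number of nodes or edges.

```julia
using MLDatasets: Graph

# A directed 4-node cycle: 1->2, 2->3, 3->4, 4->1.
s = [1, 2, 3, 4]   # source node of each edge
t = [2, 3, 4, 1]   # target node of each edge

g = Graph(
    num_nodes = 4,
    edge_index = (s, t),
    # `features` is an illustrative name: 3 values per node, last dimension = 4 nodes.
    node_data = (features = rand(Float32, 3, 4),),
    # `weights` is an illustrative name: one value per edge, last dimension = 4 edges.
    edge_data = (weights = rand(Float32, 4),),
)

g.num_edges                    # 4, the length of the edge_index vectors
size(g.node_data.features)     # (3, 4)
```

Named tuples keep field access short (g.node_data.features); a dictionary of arrays is the documented alternative when the keys are only known at run time.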

Examples

All graph datasets in MLDatasets.jl contain Graph or HeteroGraph objects:

julia> using MLDatasets: Cora

julia> d = Cora() # the Cora dataset
dataset Cora:
  metadata    =>    Dict{String, Any} with 3 entries
  graphs      =>    1-element Vector{Graph}

julia> d[1]
Graph:
  num_nodes   =>    2708
  num_edges   =>    10556
  edge_index  =>    ("10556-element Vector{Int64}", "10556-element Vector{Int64}")
  node_data   =>    (features = "1433×2708 Matrix{Float32}", targets = "2708-element Vector{Int64}", train_mask = "2708-element BitVector with 140 trues", val_mask = "2708-element BitVector with 500 trues", test_mask = "2708-element BitVector with 1000 trues")
  edge_data   =>    nothing
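
Building on the output above, the following sketch shows one way to pull the training subset out of the Cora graph using the fields listed in node_data. It assumes the dataset can be downloaded on first use and that node_data is returned as a named tuple, as the printed summary suggests.

```julia
using MLDatasets: Cora

d = Cora()    # downloads the data on first use
g = d[1]      # the single graph in the dataset

x = g.node_data.features       # feature matrix, one column per node
y = g.node_data.targets        # one class label per node
mask = g.node_data.train_mask  # BitVector selecting the training nodes

# Boolean indexing keeps only the columns (nodes) where the mask is true.
x_train = x[:, mask]
y_train = y[mask]

size(x_train), length(y_train)
```

The same pattern applies to val_mask and test_mask.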

Let's see how to convert a Graphs.jl graph to an MLDatasets.Graph and vice versa:

import Graphs, MLDatasets

## From Graphs.jl to MLDatasets.Graph

# From a directed graph
g = Graphs.erdos_renyi(10, 20, is_directed=true)
s = [e.src for e in Graphs.edges(g)]
t = [e.dst for e in Graphs.edges(g)]
mlg = MLDatasets.Graph(num_nodes=10, edge_index=(s, t))

# From an undirected graph
g = Graphs.erdos_renyi(10, 20, is_directed=false)
s = [e.src for e in Graphs.edges(g)]
t = [e.dst for e in Graphs.edges(g)]
s, t = [s; t], [t; s] # adding reverse edges
mlg = MLDatasets.Graph(num_nodes=10, edge_index=(s, t))

## From MLDatasets.Graph to Graphs.jl
s, t = mlg.edge_index
g = Graphs.DiGraph(mlg.num_nodes)
for (i, j) in zip(s, t)
    Graphs.add_edge!(g, i, j)
end
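
The conversion above always produces a directed graph. When the MLDatasets.Graph is known to encode an undirected graph (every edge stored in both directions), a possible variant is to rebuild a Graphs.SimpleGraph instead. This is a sketch under that assumption, using a hand-written triangle so the result is easy to check:

```julia
import Graphs, MLDatasets

# An undirected triangle stored the MLDatasets way: each edge in both directions.
s = [1, 2, 3, 2, 3, 1]
t = [2, 3, 1, 1, 2, 3]
mlg = MLDatasets.Graph(num_nodes = 3, edge_index = (s, t))

# Rebuild an undirected Graphs.jl graph. `add_edge!` returns false and leaves
# the graph unchanged when the edge already exists, so reverse copies are skipped.
g = Graphs.SimpleGraph(mlg.num_nodes)
for (i, j) in zip(mlg.edge_index...)
    Graphs.add_edge!(g, i, j)
end

Graphs.ne(g)    # 3 undirected edges, half of mlg.num_edges
```

If the stored edge list is not symmetric, this variant silently treats every edge as undirected, so the DiGraph conversion remains the safer default.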

Filename	Description
structures.xyz	12500 crystal structures. Use the first 10000 as training examples and the remaining 2500 as test set.
bandgaps.csv	12500 DFT band gaps corresponding to structures.xyz
CODids.csv	12500 COD ids cross referencing the Crystallographic Open Database (in the same order as structures.xyz)