Graph Datasets

A collection of datasets with an underlying graph structure. Some of these datasets contain a single graph, which can be accessed with dataset[:] or dataset[1]. Others contain many graphs, accessed through dataset[i]. Graphs are represented by the MLDatasets.Graph and MLDatasets.HeteroGraph types.

Documentation

MLDatasets.Graph - Type
Graph(; kws...)

A type that represents a graph and can also store node and edge data. It doesn't distinguish between directed and undirected graphs; an undirected graph is therefore stored with each edge in both directions. Nodes are indexed in 1:num_nodes.

Graph datasets in MLDatasets.jl contain one or more Graph or HeteroGraph objects.

Keyword Arguments

  • num_nodes: the number of nodes. If omitted, it is inferred from edge_index.
  • edge_index: a tuple containing two vectors with length equal to the number of edges. The first vector contains the source node of each edge, the second the target node. Defaults to (Int[], Int[]).
  • node_data: node-related data. Can be nothing, a named tuple of arrays or a dictionary of arrays. The arrays' last dimension size should be equal to the number of nodes. Defaults to nothing.
  • edge_data: edge-related data. Can be nothing, a named tuple of arrays or a dictionary of arrays. The arrays' last dimension size should be equal to the number of edges. Defaults to nothing.
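
The conventions above can be illustrated without loading the package. The following is a minimal plain-Julia sketch, not the library's implementation: it stores an undirected edge list in both directions and infers the node count as the largest index appearing in edge_index. The helper infer_num_nodes is hypothetical, and taking the maximum index is an assumption about how the inference behaves.

```julia
# Undirected edges 1-2, 2-3 and 3-4, each listed once.
src = [1, 2, 3]
dst = [2, 3, 4]

# Store every undirected edge in both directions.
s, t = [src; dst], [dst; src]
edge_index = (s, t)

# Hypothetical helper: take the largest node index as the node count.
# (Assumption: isolated nodes with a higher index would be missed.)
infer_num_nodes(ei) = max(maximum(ei[1]; init = 0), maximum(ei[2]; init = 0))

num_nodes = infer_num_nodes(edge_index)
num_edges = length(s)

# Node features keep the node axis as the last dimension.
features = rand(Float32, 5, num_nodes)
@assert size(features, ndims(features)) == num_nodes
```

Passing num_nodes explicitly avoids the caveat noted in the comment when the highest-numbered nodes have no edges.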

Examples

All graph datasets in MLDatasets.jl contain Graph or HeteroGraph objects:

julia> using MLDatasets: Cora

julia> d = Cora() # the Cora dataset
dataset Cora:
  metadata    =>    Dict{String, Any} with 3 entries
  graphs      =>    1-element Vector{Graph}

julia> d[1]
Graph:
  num_nodes   =>    2708
  num_edges   =>    10556
  edge_index  =>    ("10556-element Vector{Int64}", "10556-element Vector{Int64}")
  node_data   =>    (features = "1433×2708 Matrix{Float32}", targets = "2708-element Vector{Int64}", train_mask = "2708-element BitVector with 140 trues", val_mask = "2708-element BitVector with 500 trues", test_mask = "2708-element BitVector with 1000 trues")
  edge_data   =>    nothing

Let's see how to convert a Graphs.jl graph to an MLDatasets.Graph and vice versa:

import Graphs, MLDatasets

## From Graphs.jl to MLDatasets.Graph

# From a directed graph
g = Graphs.erdos_renyi(10, 20, is_directed=true)
s = [e.src for e in Graphs.edges(g)]
t = [e.dst for e in Graphs.edges(g)]
mlg = MLDatasets.Graph(num_nodes=10, edge_index=(s, t))

# From an undirected graph
g = Graphs.erdos_renyi(10, 20, is_directed=false)
s = [e.src for e in Graphs.edges(g)]
t = [e.dst for e in Graphs.edges(g)]
s, t = [s; t], [t; s] # adding reverse edges
mlg = MLDatasets.Graph(num_nodes=10, edge_index=(s, t))

## From MLDatasets.Graph to Graphs.jl
s, t = mlg.edge_index
g = Graphs.DiGraph(mlg.num_nodes)
for (i, j) in zip(s, t)
    Graphs.add_edge!(g, i, j)
end
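
The loop above always builds a Graphs.DiGraph, so a graph that was stored with edges in both directions comes back with two arcs per undirected edge. As a hedged sketch in plain Julia (the toy edge_index below is illustrative, not taken from a dataset), reverse pairs can be collapsed into a single undirected edge first:

```julia
# Toy edge_index of an undirected path 1-2-3-4, stored in both directions.
s = [1, 2, 3, 2, 3, 4]
t = [2, 3, 4, 1, 2, 3]

# Normalize each pair to (smaller, larger) and drop the duplicates.
undirected_edges = unique((min(i, j), max(i, j)) for (i, j) in zip(s, t))
```

Each remaining pair could then be added to a Graphs.SimpleGraph with Graphs.add_edge! instead of to a DiGraph.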

MLDatasets.HeteroGraph - Type
HeteroGraph(; kws...)

HeteroGraph represents heterogeneous graphs.

Unlike Graph, a HeteroGraph can have nodes of different types, each carrying its own kind of information.

Edges in a HeteroGraph are defined by relations. A relation is a tuple (src_node_type, edge_type, target_node_type), where edge_type names the relation between the source and target nodes. Edges between nodes of the same type are possible.

A HeteroGraph can be directed or undirected. It doesn't distinguish between the two; an undirected graph is therefore stored with each edge in both directions. Nodes are indexed in 1:num_nodes.

Keyword Arguments

  • num_nodes: Dictionary containing the number of nodes for each node type. If omitted, it is inferred from edge_indices.
  • num_edges: Dictionary containing the number of edges for each relation.
  • edge_indices: Dictionary containing the edge_index for each edge relation. An edge_index is a tuple containing two vectors with length equal to the number of edges for the relation. The first vector contains the list of the source nodes of each edge, the second contains the target nodes.
  • node_data: node-related data. Can be nothing or a dictionary of dictionaries of arrays. Data for a specific node type can be accessed using node_data[node_type]. The arrays' last dimension size should be equal to the number of nodes. Defaults to nothing.
  • edge_data: edge-related data. Can be nothing or a dictionary of dictionaries of arrays. Data for a specific edge type can be accessed using edge_data[edge_type]. The arrays' last dimension size should be equal to the number of edges. Defaults to nothing.
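
To make the dictionary layout described above concrete, here is a small sketch using plain Julia Dicts rather than the HeteroGraph constructor. The relation names and indices are made up for illustration, and inferring node counts from the largest index per node type is an assumption about how the inference could work.

```julia
# Relations are (src_node_type, edge_type, target_node_type) tuples.
edge_indices = Dict(
    ("user", "rating", "movie") => ([1, 1, 2, 3], [1, 2, 2, 3]),
    ("user", "tag", "movie")    => ([2, 3], [1, 3]),
)

# Number of edges per relation: the length of either index vector.
num_edges = Dict(rel => length(first(idx)) for (rel, idx) in edge_indices)

# Node counts per node type, taken here as the largest index seen for that type.
num_nodes = Dict{String, Int}()
for ((src_type, _, dst_type), (s, t)) in edge_indices
    num_nodes[src_type] = max(get(num_nodes, src_type, 0), maximum(s; init = 0))
    num_nodes[dst_type] = max(get(num_nodes, dst_type, 0), maximum(t; init = 0))
end
```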

MLDatasets.Cora - Type
Cora()

The Cora citation network dataset from Ref. [1]. Nodes represent documents and edges represent citation links. Each node has a predefined feature vector with 1433 dimensions. The dataset is designed for the node classification task: predicting the category of a given paper. The dataset is retrieved from Ref. [2].

Statistics

  • Nodes: 2708
  • Edges: 10556
  • Number of Classes: 7
  • Label split:
    • Train: 140
    • Val: 500
    • Test: 1000

The split is the one used in the original paper [1] and doesn't consider all nodes.
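
As a sketch of how such masks are typically applied, the snippet below uses small synthetic arrays with the same layout as Cora's node_data (features × nodes, one target per node, Boolean masks). The sizes and values are placeholders, not the real data.

```julia
# Synthetic stand-ins mirroring the layout of Cora's node_data (sizes reduced).
num_features, num_nodes = 4, 10
features = rand(Float32, num_features, num_nodes)   # features × nodes
targets = rand(1:7, num_nodes)                       # one class label per node
train_mask = falses(num_nodes); train_mask[1:3] .= true
test_mask = falses(num_nodes); test_mask[8:10] .= true

# Boolean masks select nodes along the last dimension.
x_train, y_train = features[:, train_mask], targets[train_mask]
x_test, y_test = features[:, test_mask], targets[test_mask]

# Nodes covered by no mask take no part in the split.
unused = count(.!(train_mask .| test_mask))
```

With the real dataset, the same indexing would apply to the arrays stored in the graph's node_data.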

References

[1]: Deep Gaussian Embedding of Graphs: Unsupervised Inductive Learning via Ranking

[2]: Planetoid

MLDatasets.KarateClub - Type
KarateClub()

Zachary's karate club dataset, which originally appeared in Ref. [1].

The network contains 34 nodes (members of the karate club). The nodes are connected by 78 undirected and unweighted edges. The edges indicate if the two members interacted outside the club.

The node labels indicate which community or which club each member belongs to. The club-based labels follow the original dataset in Ref. [1]. The community labels are obtained by modularity-based clustering following Ref. [2]. The data is retrieved from Refs. [3] and [4]. One node per unique label is used as training data.
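
The "one node per unique label" rule can be sketched as follows. The labels are a toy vector, and picking the first node of each label is an assumption made for illustration; the dataset may select a different representative.

```julia
labels = [1, 1, 2, 3, 2, 4, 3, 4]   # toy labels, not the real dataset

# Mark the first node encountered for each distinct label as a training node.
train_mask = falses(length(labels))
for c in unique(labels)
    train_mask[findfirst(==(c), labels)] = true
end

train_nodes = findall(train_mask)
```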

References

[1]: An Information Flow Model for Conflict and Fission in Small Groups

[2]: Semi-supervised Classification with Graph Convolutional Networks

[3]: PyTorch Geometric Karate Club Dataset

[4]: NetworkX Zachary's Karate Club Dataset

MLDatasets.MovieLens - Type
MovieLens(name; dir=nothing)

Datasets from the MovieLens website collected and maintained by GroupLens. The MovieLens datasets are presented in a Graph format. For license and usage restrictions please refer to the Readme.md of the datasets.

There are 6 versions of the MovieLens datasets currently supported: "100k", "1m", "10m", "20m", "25m", "latest-small". The 100k and 1m datasets contain movie data and rating data along with demographic data. Starting from the 10m dataset, MovieLens datasets no longer contain the demographic data. These datasets contain movie data, rating data, and tag information.

The 20m and 25m datasets additionally contain genome tag scores. Each movie in these datasets contains tag relevance scores for every tag.

Each dataset contains a heterogeneous graph with two kinds of nodes, movie and user. A rating is represented by an edge between them: (user, rating, movie). The 20m, 25m, and latest-small datasets also contain tag nodes and edges of type (user, tag, movie) and optionally (movie, score, tag).

Examples

MovieLens 100K dataset

julia> data = MovieLens("100k")
MovieLens 100k:
  metadata    =>    Dict{String, Any} with 2 entries
  graphs      =>    1-element Vector{MLDatasets.HeteroGraph}

julia> metadata = data.metadata
Dict{String, Any} with 2 entries:
  "genre_labels"      => ["Unknown", "Action", "Adventure", "Animation", "Children", "Comedy", "Crime", "Documentary", "Drama", "Fa…
  "movie_id_to_title" => Dict(1144=>"Quiet Room, The (1996)", 1175=>"Hugo Pool (1997)", 719=>"Canadian Bacon (1994)", 1546=>"Shadow…

julia> g = data[:]
Heterogeneous Graph:
  node_types    =>    2-element Vector{String}
  edge_types    =>    1-element Vector{Tuple{String, String, String}}
  num_nodes     =>    Dict{String, Int64} with 2 entries
  num_edges     =>    Dict{Tuple{String, String, String}, Int64} with 1 entry
  edge_indices  =>    Dict{Tuple{String, String, String}, Tuple{Vector{Int64}, Vector{Int64}}} with 1 entry
  node_data     =>    Dict{String, Dict} with 2 entries
  edge_data     =>    Dict{Tuple{String, String, String}, Dict} with 1 entry

# Access the user information
julia> user_data = g.node_data["user"]
Dict{Symbol, AbstractVector} with 4 entries:
  :age        => [24, 53, 23, 24, 33, 42, 57, 36, 29, 53  …  61, 42, 24, 48, 38, 26, 32, 20, 48, 22]
  :occupation => ["technician", "other", "writer", "technician", "other", "executive", "administrator", "administrator", "student",…
  :zipcode    => ["85711", "94043", "32067", "43537", "15213", "98101", "91344", "05201", "01002", "90703"  …  "22902", "66221", "3…
  :gender     => Bool[1, 0, 1, 1, 0, 1, 1, 1, 1, 1  …  1, 1, 1, 1, 0, 0, 1, 1, 0, 1]

# Access rating information
julia> g.edge_data[("user", "rating", "movie")]
Dict{Symbol, Vector} with 2 entries:
  :timestamp => [881250949, 891717742, 878887116, 880606923, 886397596, 884182806, 881171488, 891628467, 886324817, 883603013  …  8…
  :rating    => Float16[3.0, 3.0, 1.0, 2.0, 1.0, 4.0, 2.0, 5.0, 3.0, 3.0  …  4.0, 4.0, 3.0, 2.0, 3.0, 3.0, 5.0, 1.0, 2.0, 3.0]

MovieLens 20m dataset

julia> data = MovieLens("20m")
MovieLens 20m:
  metadata    =>    Dict{String, Any} with 4 entries
  graphs      =>    1-element Vector{MLDatasets.HeteroGraph}

# There is only one graph in the MovieLens dataset
julia> g = data[1]
Heterogeneous Graph:
  node_types    =>    3-element Vector{String}
  edge_types    =>    3-element Vector{Tuple{String, String, String}}
  num_nodes     =>    Dict{String, Int64} with 3 entries
  num_edges     =>    Dict{Tuple{String, String, String}, Int64} with 3 entries
  edge_indices  =>    Dict{Tuple{String, String, String}, Tuple{Vector{Int64}, Vector{Int64}}} with 3 entries
  node_data     =>    Dict{String, Dict} with 0 entries
  edge_data     =>    Dict{Tuple{String, String, String}, Dict} with 3 entries

# Besides rating movies, users assign tags to movies, and there are genome scores for movie-tag pairs
julia> g.edge_indices
Dict{Tuple{String, String, String}, Tuple{Vector{Int64}, Vector{Int64}}} with 3 entries:
  ("movie", "score", "tag")   => ([1, 1, 1, 1, 1, 1, 1, 1, 1, 1  …  131170, 131170, 131170, 131170, 131170, 131170, 131170, 131170,…
  ("user", "tag", "movie")    => ([18, 65, 65, 65, 65, 65, 65, 65, 65, 65  …  3489, 7045, 7045, 7164, 7164, 55999, 55999, 55999, 55…
  ("user", "rating", "movie") => ([1, 1, 1, 1, 1, 1, 1, 1, 1, 1  …  60816, 61160, 65682, 66762, 68319, 68954, 69526, 69644, 70286, …

# Access the rating
julia> g.edge_data[("user", "rating", "movie")]
Dict{Symbol, Vector} with 2 entries:
  :timestamp => [1112486027, 1112484676, 1112484819, 1112484727, 1112484580, 1094785740, 1094785734, 1112485573, 1112484940, 111248…
  :rating    => Float16[3.5, 3.5, 3.5, 3.5, 3.5, 3.5, 4.0, 4.0, 4.0, 4.0  …  4.5, 4.0, 4.5, 4.5, 4.5, 4.5, 4.5, 3.0, 5.0, 2.5]

# Access the movie-tag scores
julia> score = g.edge_data[("movie", "score", "tag")][:score]
23419536-element Vector{Float64}:
 0.025000000000000022
 0.025000000000000022
 0.057750000000000024
 ⋮
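
The edge index and edge data of a relation line up position by position, so per-movie statistics can be computed by walking both together. The sketch below uses toy stand-ins for g.edge_indices[("user", "rating", "movie")] and the matching :rating vector; the values are invented for illustration.

```julia
# Toy stand-ins for one relation's edge index and its :rating edge data.
user_idx = [1, 1, 2, 3]                     # source (user) node of each edge
movie_idx = [1, 2, 2, 2]                    # target (movie) node of each edge
rating = Float16[4.0, 3.0, 5.0, 1.0]        # one rating per edge
num_movies = 2

# Accumulate the rating sum and count for each movie, then divide.
sums = zeros(Float32, num_movies)
counts = zeros(Int, num_movies)
for (m, r) in zip(movie_idx, rating)
    sums[m] += r
    counts[m] += 1
end
mean_rating = sums ./ counts   # NaN for a movie that has no ratings
```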

References

[1] GroupLens Website

[2] TensorFlow MovieLens Implementation

[3] Jesse Vig, Shilad Sen, and John Riedl. 2012. The Tag Genome: Encoding Community Knowledge to Support Novel Interaction. ACM Trans. Interact. Intell. Syst. 2, 3, Article 13 (September 2012), 44 pages. https://doi.org/10.1145/2362394.2362395.

[4] F. Maxwell Harper and Joseph A. Konstan. 2015. The MovieLens Datasets: History and Context. ACM Trans. Interact. Intell. Syst. 5, 4, Article 19 (January 2016), 19 pages. https://doi.org/10.1145/2827872

MLDatasets.OGBDataset - Type
OGBDataset(name; dir=nothing)

The collection of datasets from the Open Graph Benchmark: Datasets for Machine Learning on Graphs paper.

name is the name of one of the datasets (listed here) available for node prediction, edge prediction, or graph prediction tasks.

Examples

Node prediction tasks

julia> data = OGBDataset("ogbn-arxiv")
OGBDataset ogbn-arxiv:
  metadata    =>    Dict{String, Any} with 17 entries
  graphs      =>    1-element Vector{MLDatasets.Graph}
  graph_data  =>    nothing

julia> data[:]
Graph:
  num_nodes   =>    169343
  num_edges   =>    1166243
  edge_index  =>    ("1166243-element Vector{Int64}", "1166243-element Vector{Int64}")
  node_data   =>    (val_mask = "29799-trues BitVector", test_mask = "48603-trues BitVector", year = "169343-element Vector{Int64}", features = "128×169343 Matrix{Float32}", label = "169343-element Vector{Int64}", train_mask = "90941-trues BitVector")
  edge_data   =>    nothing

julia> data.metadata
Dict{String, Any} with 17 entries:
  "download_name"         => "arxiv"
  "num classes"           => 40
  "num tasks"             => 1
  "binary"                => false
  "url"                   => "http://snap.stanford.edu/ogb/data/nodeproppred/arxiv.zip"
  "additional node files" => ["node_year"]
  "is hetero"             => false
  "task level"            => "node"
  ⋮                       => ⋮

julia> data = OGBDataset("ogbn-mag")
OGBDataset ogbn-mag:
  metadata    =>    Dict{String, Any} with 17 entries
  graphs      =>    1-element Vector{MLDatasets.HeteroGraph}
  graph_data  =>    nothing

julia> data[:]
Heterogeneous Graph:
  num_nodes     =>    Dict{String, Int64} with 4 entries
  num_edges     =>    Dict{Tuple{String, String, String}, Int64} with 4 entries
  edge_indices  =>    Dict{Tuple{String, String, String}, Tuple{Vector{Int64}, Vector{Int64}}} with 4 entries
  node_data     =>    (year = "Dict{String, Vector{Float32}} with 1 entry", features = "Dict{String, Matrix{Float32}} with 1 entry", label = "Dict{String, Vector{Int64}} with 1 entry")
  edge_data     =>    (reltype = "Dict{Tuple{String, String, String}, Vector{Float32}} with 4 entries",)

Edge prediction task

julia> data = OGBDataset("ogbl-collab")
OGBDataset ogbl-collab:
  metadata    =>    Dict{String, Any} with 15 entries
  graphs      =>    1-element Vector{MLDatasets.Graph}
  graph_data  =>    nothing

julia> data[:]
Graph:
  num_nodes   =>    235868
  num_edges   =>    2358104
  edge_index  =>    ("2358104-element Vector{Int64}", "2358104-element Vector{Int64}")
  node_data   =>    (features = "128×235868 Matrix{Float32}",)
  edge_data   =>    (year = "2×1179052 Matrix{Int64}", weight = "2×1179052 Matrix{Int64}")

Graph prediction task

julia> data = OGBDataset("ogbg-molhiv")
OGBDataset ogbg-molhiv:
  metadata    =>    Dict{String, Any} with 17 entries
  graphs      =>    41127-element Vector{MLDatasets.Graph}
  graph_data  =>    (labels = "41127-element Vector{Int64}", train_mask = "32901-trues BitVector", val_mask = "4113-trues BitVector", test_mask = "4113-trues BitVector")

julia> data[1]
(graphs = Graph(19, 40), labels = 0)

MLDatasets.OrganicMaterialsDB - Type
OrganicMaterialsDB(; split=:train, dir=nothing)

The OMDB-GAP1 v1.1 dataset from the Organic Materials Database (OMDB) of bulk organic crystals.

The dataset has to be manually downloaded from https://omdb.mathub.io/dataset, then unzipped and its file content placed in the OrganicMaterialsDB folder.

The dataset contains the following files:

  • structures.xyz: 12500 crystal structures. Use the first 10000 as training examples and the remaining 2500 as the test set.
  • bandgaps.csv: 12500 DFT band gaps corresponding to structures.xyz.
  • CODids.csv: 12500 COD ids cross-referencing the Crystallography Open Database (in the same order as structures.xyz).
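
For arrays loaded manually from these files, the recommended split amounts to two index ranges. In this sketch the bandgaps vector is filled with placeholder values rather than read from bandgaps.csv, since the file layout is not described here.

```julia
num_structures = 12_500
train_idx = 1:10_000                  # first 10000 structures for training
test_idx = 10_001:num_structures      # remaining 2500 structures for testing

# Placeholder values standing in for the contents of bandgaps.csv.
bandgaps = rand(Float32, num_structures)
train_gaps, test_gaps = bandgaps[train_idx], bandgaps[test_idx]
```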

Please cite the paper introducing this dataset: https://arxiv.org/abs/1810.12814

MLDatasets.Reddit - Type
Reddit(; full=true, dir=nothing)

The Reddit dataset was introduced in Ref. [1]. It is a graph dataset of Reddit posts made in September 2014. The dataset contains a single post-to-post graph, connecting posts if the same user comments on both. The node label is one of the 41 communities, or "subreddits", that a post belongs to. The dataset contains 232,965 posts. The first 20 days are used for training and the remaining days for testing (with 30% used for validation). Each node is represented by a 602-dimensional word vector.

Use full=false to load only a subsample of the dataset.

References

[1]: Inductive Representation Learning on Large Graphs

[2]: Benchmarks on the Reddit Dataset

MLDatasets.TUDataset - Type
TUDataset(name; dir=nothing)

A variety of graph benchmark datasets, e.g. "QM9", "IMDB-BINARY", "REDDIT-BINARY" or "PROTEINS", collected by TU Dortmund University. Retrieves the dataset name from the TUDataset collection, where name is any of the datasets available here.

A TUDataset object can be indexed to retrieve a specific graph or a subset of graphs.

See here for an in-depth description of the format.

Usage Example

julia> data = TUDataset("PROTEINS")
dataset TUDataset:
  name        =>    PROTEINS
  metadata    =>    Dict{String, Any} with 1 entry
  graphs      =>    1113-element Vector{MLDatasets.Graph}
  graph_data  =>    (targets = "1113-element Vector{Int64}",)
  num_nodes   =>    43471
  num_edges   =>    162088
  num_graphs  =>    1113

julia> data[1]
(graphs = Graph(42, 162), targets = 1)

MLDatasets.METRLA - Type
METRLA(; num_timesteps_in::Int = 12, num_timesteps_out::Int=12, dir=nothing, normalize = true)

The METR-LA dataset from the Diffusion Convolutional Recurrent Neural Network: Data-Driven Traffic Forecasting paper.

METRLA is a graph with 207 nodes representing traffic sensors in Los Angeles.

The edge weights w are contained as a feature array in edge_data and represent the distance between the sensors.

The node features are the traffic speed and the time of the measurements collected by the sensors, divided into num_timesteps_in time steps.

The target values are the traffic speed of the measurements collected by the sensors, divided into num_timesteps_out time steps.

The normalize flag indicates whether the data are normalized using Z-score normalization.
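
The role of num_timesteps_in and num_timesteps_out can be illustrated with a sliding window over a toy speed series. This is a sketch under the assumption that consecutive windows are shifted by one time step; it is not taken from the library's code.

```julia
# Toy speed series for 3 sensors over 10 time steps (sensors × time).
speed = Float32.(reshape(1:30, 3, 10))
num_timesteps_in, num_timesteps_out = 4, 2

window = num_timesteps_in + num_timesteps_out
num_samples = size(speed, 2) - window + 1

# Each sample pairs num_timesteps_in past steps with the next num_timesteps_out steps.
inputs = [speed[:, i:(i + num_timesteps_in - 1)] for i in 1:num_samples]
targets = [speed[:, (i + num_timesteps_in):(i + window - 1)] for i in 1:num_samples]
```

With the default value of 12 for both keywords, each sample would span 24 consecutive time steps.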

MLDatasets.PEMSBAY - Type
PEMSBAY(; num_timesteps_in::Int = 12, num_timesteps_out::Int=12, dir=nothing, normalize = true)

The PEMS-BAY dataset described in the Diffusion Convolutional Recurrent Neural Network: Data-Driven Traffic Forecasting paper. It is collected by California Transportation Agencies (CalTrans) Performance Measurement System (PeMS).

PEMSBAY is a graph with 325 nodes representing traffic sensors in the Bay Area.

The edge weights w are contained as a feature array in edge_data and represent the distance between the sensors.

The node features are the traffic speed and the time of the measurements collected by the sensors, divided into num_timesteps_in time steps.

The target values are the traffic speed of the measurements collected by the sensors, divided into num_timesteps_out time steps.

The normalize flag indicates whether the data are normalized using Z-score normalization.
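
Z-score normalization subtracts the mean and divides by the standard deviation. The sketch below applies it to a toy speed matrix; computing the statistics over the whole array (rather than per sensor) is an assumption, and the values are invented.

```julia
using Statistics: mean, std

# Toy traffic speeds (sensors × time).
x = Float32[60 62 58 64; 30 35 25 30]

# Z-score: zero mean and unit standard deviation after the transform.
μ, σ = mean(x), std(x)
x_norm = (x .- μ) ./ σ

# Keeping μ and σ allows predictions to be mapped back to the original scale.
x_back = x_norm .* σ .+ μ
```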

MLDatasets.TemporalBrains - Type
TemporalBrains(; dir = nothing, threshold_value = 0.6)

The TemporalBrains dataset contains a collection of temporal brain networks (as TemporalSnapshotsGraphs) of 1000 subjects obtained from resting-state fMRI data from the Human Connectome Project (HCP).

The number of nodes is fixed for each of the 27 snapshots at 102, while the edges change over time.

For each Graph snapshot, the feature of a node represents the average activation of the node during that snapshot and is contained in the snapshot's node_data.

Each TemporalSnapshotsGraph has a label representing the subject's gender ("M" for male and "F" for female) and age range (22-25, 26-30, 31-35 and 36+), contained as a named tuple in graph_data.

The threshold_value is used to binarize the edge weights and is set to 0.6 by default.
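
Binarizing by a threshold can be sketched as follows on a toy weight matrix. Two choices here are assumptions made for illustration: weights equal to the threshold are kept, and self-loops on the diagonal are dropped.

```julia
threshold_value = 0.6

# Toy symmetric connectivity weights between 3 regions.
W = [1.0 0.7 0.2;
     0.7 1.0 0.6;
     0.2 0.6 1.0]

# Keep an edge wherever the weight reaches the threshold (assumption: >=).
A = W .>= threshold_value
for i in axes(A, 1)
    A[i, i] = false          # assumption: drop self-loops
end

# Convert the binary adjacency matrix to an edge_index-style pair of vectors.
edges = [(i, j) for j in axes(A, 2) for i in axes(A, 1) if A[i, j]]
s, t = first.(edges), last.(edges)
```

A higher threshold_value yields sparser snapshots, since fewer weights pass the comparison.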
