Graph Datasets
A collection of datasets with an underlying graph structure. Some of these datasets contain a single graph, which can be accessed with dataset[:] or dataset[1]. Others contain many graphs, accessed through dataset[i]. Graphs are represented by the MLDatasets.Graph and MLDatasets.HeteroGraph types.
Index
MLDatasets.AQSOL
MLDatasets.ChickenPox
MLDatasets.CiteSeer
MLDatasets.Cora
MLDatasets.Graph
MLDatasets.HeteroGraph
MLDatasets.KarateClub
MLDatasets.METRLA
MLDatasets.MovieLens
MLDatasets.OGBDataset
MLDatasets.OrganicMaterialsDB
MLDatasets.PEMSBAY
MLDatasets.PolBlogs
MLDatasets.PubMed
MLDatasets.Reddit
MLDatasets.TUDataset
MLDatasets.TemporalBrains
MLDatasets.WindMillEnergy
Documentation
MLDatasets.Graph
— Type
Graph(; kws...)
A type that represents a graph and that can also store node and edge data. It doesn't distinguish between directed and undirected graphs; therefore, for undirected graphs, it stores edges in both directions. Nodes are indexed in 1:num_nodes.
Graph datasets in MLDatasets.jl contain one or more Graph or HeteroGraph objects.
Keyword Arguments
- num_nodes: the number of nodes. If omitted, it is inferred from edge_index.
- edge_index: a tuple containing two vectors with length equal to the number of edges. The first vector contains the source nodes of each edge, the second the target nodes. Defaults to (Int[], Int[]).
- node_data: node-related data. Can be nothing, a named tuple of arrays, or a dictionary of arrays. The arrays' last dimension size should be equal to the number of nodes. Defaults to nothing.
- edge_data: edge-related data. Can be nothing, a named tuple of arrays, or a dictionary of arrays. The arrays' last dimension size should be equal to the number of edges. Defaults to nothing.
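The keyword arguments above can be combined as in the following minimal sketch. It builds a toy three-node path graph and relies only on the keyword constructor and the fields shown in this section; the feature sizes and the names features, targets, and weights are illustrative choices, not something the library requires.

```julia
import MLDatasets

# A toy undirected path graph 1 - 2 - 3, stored with edges in both directions.
s = [1, 2, 2, 3]   # source node of each edge
t = [2, 1, 3, 2]   # target node of each edge

# Illustrative data: the last dimension matches the number of nodes (3) or edges (4).
node_data = (features = rand(Float32, 5, 3), targets = [1, 2, 1])
edge_data = (weights = ones(Float32, 4),)

# num_nodes is omitted on purpose, so it is inferred from edge_index.
g = MLDatasets.Graph(edge_index = (s, t), node_data = node_data, edge_data = edge_data)

println((g.num_nodes, g.num_edges))
println(size(g.node_data.features))
```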
Examples
All graph datasets in MLDatasets.jl contain Graph or HeteroGraph objects:
julia> using MLDatasets: Cora
julia> d = Cora() # the Cora dataset
dataset Cora:
metadata => Dict{String, Any} with 3 entries
graphs => 1-element Vector{Graph}
julia> d[1]
Graph:
num_nodes => 2708
num_edges => 10556
edge_index => ("10556-element Vector{Int64}", "10556-element Vector{Int64}")
node_data => (features = "1433×2708 Matrix{Float32}", targets = "2708-element Vector{Int64}", train_mask = "2708-element BitVector with 140 trues", val_mask = "2708-element BitVector with 500 trues", test_mask = "2708-element BitVector with 1000 trues")
edge_data => nothing
Let's see how to convert a Graphs.jl graph to an MLDatasets.Graph and vice versa:
import Graphs, MLDatasets
## From Graphs.jl to MLDatasets.Graph
# From a directed graph
g = Graphs.erdos_renyi(10, 20, is_directed=true)
s = [e.src for e in Graphs.edges(g)]
t = [e.dst for e in Graphs.edges(g)]
mlg = MLDatasets.Graph(num_nodes=10, edge_index=(s, t))
# From an undirected graph
g = Graphs.erdos_renyi(10, 20, is_directed=false)
s = [e.src for e in Graphs.edges(g)]
t = [e.dst for e in Graphs.edges(g)]
s, t = [s; t], [t; s] # adding reverse edges
mlg = MLDatasets.Graph(num_nodes=10, edge_index=(s, t))
## From MLDatasets.Graph to Graphs.jl
s, t = mlg.edge_index
g = Graphs.DiGraph(mlg.num_nodes)
for (i, j) in zip(s, t)
    Graphs.add_edge!(g, i, j)
end
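Because an undirected graph is stored with each edge in both directions, it can be useful to check whether a given edge_index follows that convention. The helper below is a hypothetical, plain-Julia sketch (has_reverse_edges is not part of MLDatasets.jl) that operates directly on the two index vectors.

```julia
# Return true if every edge (i, j) in the edge index has a matching reverse edge (j, i).
function has_reverse_edges(s::AbstractVector{<:Integer}, t::AbstractVector{<:Integer})
    length(s) == length(t) ||
        throw(ArgumentError("source and target vectors must have equal length"))
    edges = Set(zip(s, t))
    return all((j, i) in edges for (i, j) in edges)
end

println(has_reverse_edges([1, 2, 2, 3], [2, 1, 3, 2]))  # path 1 - 2 - 3 stored both ways
println(has_reverse_edges([1, 2], [2, 3]))              # directed edges only
```

With a dataset graph, the same check would be called as has_reverse_edges(mlg.edge_index...).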
MLDatasets.HeteroGraph
— Type
HeteroGraph(; kws...)
HeteroGraph represents heterogeneous graphs.
Unlike Graph, a HeteroGraph can have different types of nodes. Each node type carries its own kind of information.
Edges in a HeteroGraph are defined by relations. A relation is a tuple (src_node_type, edge_type, target_node_type), where edge_type represents the relation between the source and target nodes. Edges between nodes of the same type are possible.
A HeteroGraph can be directed or undirected. The type doesn't distinguish between the two; therefore, for undirected graphs, it stores edges in both directions. Nodes are indexed in 1:num_nodes.
Keyword Arguments
- num_nodes: Dictionary containing the number of nodes for each node type. If omitted, it is inferred from edge_indices.
- num_edges: Dictionary containing the number of edges for each relation.
- edge_indices: Dictionary containing the edge_index for each edge relation. An edge_index is a tuple containing two vectors with length equal to the number of edges for the relation. The first vector contains the source nodes of each edge, the second contains the target nodes.
- node_data: node-related data. Can be nothing or a dictionary of dictionaries of arrays. Data of a specific node type can be accessed using node_data[node_type]. The arrays' last dimension size should be equal to the number of nodes of that type. Defaults to nothing.
- edge_data: edge-related data. Can be nothing or a dictionary of dictionaries of arrays. Data of a specific edge type can be accessed using edge_data[edge_type]. The arrays' last dimension size should be equal to the number of edges of that type. Defaults to nothing.
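The relation-keyed layout described above can be illustrated without the library. The sketch below uses plain dictionaries; the helper infer_num_nodes and the "follows" relation are hypothetical, and the inference rule (largest index seen per node type) is an assumption about how such counts could be derived, not a description of what HeteroGraph does internally.

```julia
# Relations are keyed by (source node type, edge type, target node type).
Relation = Tuple{String, String, String}

edge_indices = Dict{Relation, Tuple{Vector{Int}, Vector{Int}}}(
    ("user", "rating", "movie") => ([1, 1, 2, 3], [1, 2, 2, 4]),
    ("user", "follows", "user") => ([1, 2], [3, 1]),  # same node type on both ends
)

# Infer the number of nodes of each type as the largest index seen for that type.
function infer_num_nodes(edge_indices)
    num_nodes = Dict{String, Int}()
    for ((src_type, _, dst_type), (s, t)) in edge_indices
        isempty(s) && continue
        num_nodes[src_type] = max(get(num_nodes, src_type, 0), maximum(s))
        num_nodes[dst_type] = max(get(num_nodes, dst_type, 0), maximum(t))
    end
    return num_nodes
end

num_nodes = infer_num_nodes(edge_indices)
num_edges = Dict(rel => length(s) for (rel, (s, _)) in edge_indices)

println(num_nodes)
println(num_edges)
```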
MLDatasets.AQSOL
— Type
AQSOL(; split=:train, dir=nothing)
The AQSOL (Aqueous Solubility) dataset from the paper Graph Neural Network for Predicting Aqueous Solubility of Organic Molecules.
The dataset contains 9,882 graphs representing small organic molecules. Each graph represents a molecule, where nodes correspond to atoms and edges to bonds. The node features represent the atomic number, and the edge features represent the bond type. The target is the aqueous solubility of the molecule, measured in mol/L.
Arguments
- split: Which split of the dataset to load. Can be one of :train, :val, or :test. Defaults to :train.
- dir: Directory in which the dataset is stored.
Examples
julia> using MLDatasets
julia> data = AQSOL()
dataset AQSOL:
split => :train
metadata => Dict{String, Any} with 1 entry
graphs => 7985-element Vector{MLDatasets.Graph}
julia> length(data)
7985
julia> g = data[1]
Graph:
num_nodes => 23
num_edges => 42
edge_index => ("42-element Vector{Int64}", "42-element Vector{Int64}")
node_data => (features = "23-element Vector{Int64}",)
edge_data => (features = "42-element Vector{Int64}",)
julia> g.num_nodes
23
julia> g.node_data.features
23-element Vector{Int64}:
0
1
1
⋮
1
1
1
julia> g.edge_index
([2, 3, 3, 4, 4, 5, 5, 6, 6, 7 … 18, 19, 19, 20, 20, 21, 20, 22, 20, 23], [3, 2, 4, 3, 5, 4, 6, 5, 7, 6 … 19, 18, 20, 19, 21, 20, 22, 20, 23, 20])
MLDatasets.ChickenPox
— Type
ChickenPox(; normalize = true, num_timesteps_in = 8, num_timesteps_out = 8, dir = nothing)
The ChickenPox dataset contains county-level chickenpox cases in Hungary between 2004 and 2014.
ChickenPox is composed of a graph with nodes representing counties and edges representing neighboring counties, and a metadata dictionary containing the correspondence between the node indices and the county names.
The node features are the number of weekly chickenpox cases in each county. They are represented as an array of arrays of size (1, num_nodes, num_timesteps_in). The target values are the number of weekly chickenpox cases in each county. They are represented as an array of arrays of size (1, num_nodes, num_timesteps_out). In both cases, two consecutive arrays are shifted by one time step.
The dataset was taken from the PyTorch Geometric Temporal repository, and more information about the dataset can be found in the paper "Chickenpox Cases in Hungary: a Benchmark Dataset for Spatiotemporal Signal Processing with Graph Neural Networks".
Keyword Arguments
- normalize::Bool: Whether to normalize the data using Z-score normalization. Default is true.
- num_timesteps_in::Int: The number of time steps, in this case the number of weeks, for the input features. Default is 8.
- num_timesteps_out::Int: The number of time steps, in this case the number of weeks, for the target values. Default is 8.
- dir::String: The directory to save the dataset. Default is nothing.
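The windowed layout described above can be sketched in plain Julia. The function sliding_windows below is hypothetical and not part of MLDatasets.jl. It assumes a raw signal stored as a num_nodes × num_steps matrix and, as a further assumption, that each target window starts right after its input window; the library's actual preprocessing may differ.

```julia
# Split a (num_nodes × num_steps) signal into input/target windows shaped
# (1, num_nodes, window_length), with consecutive windows shifted by one time step.
function sliding_windows(x::AbstractMatrix, num_in::Int, num_out::Int)
    num_nodes, num_steps = size(x)
    num_windows = num_steps - num_in - num_out + 1
    num_windows > 0 || throw(ArgumentError("time series is too short"))
    features = [reshape(x[:, i:(i + num_in - 1)], 1, num_nodes, num_in)
                for i in 1:num_windows]
    targets = [reshape(x[:, (i + num_in):(i + num_in + num_out - 1)], 1, num_nodes, num_out)
               for i in 1:num_windows]
    return features, targets
end

# Synthetic data: 20 nodes and 30 time steps.
x = Float32.(reshape(1:(20 * 30), 20, 30))
features, targets = sliding_windows(x, 8, 8)

println(size(features[1]))  # one input window
println(length(features))   # number of windows
```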
Examples
julia> using JSON3 # import JSON3
julia> dataset = ChickenPox()
dataset ChickenPox:
metadata => Dict{Symbol, Any} with 20 entries
graphs => 1-element Vector{MLDatasets.Graph}
julia> dataset.graphs[1].num_nodes # 20 counties
20
julia> size(dataset.graphs[1].node_data.features[1])
(1, 20, 8)
julia> dataset.metadata[:BUDAPEST] # Node 5 corresponds to Budapest county
5
MLDatasets.CiteSeer
— Type
CiteSeer(; dir=nothing)
The CiteSeer citation network dataset from Ref. [1]. Nodes represent documents and edges represent citation links. The dataset is designed for the node classification task: predicting the category of each paper. The dataset is retrieved from Ref. [2].
References
[1]: Deep Gaussian Embedding of Graphs: Unsupervised Inductive Learning via Ranking
[2]: Planetoid
MLDatasets.Cora
— Type
Cora()
The Cora citation network dataset from Ref. [1]. Nodes represent documents and edges represent citation links. Each node has a predefined feature vector with 1433 dimensions. The dataset is designed for the node classification task: predicting the category of each paper. The dataset is retrieved from Ref. [2].
Statistics
- Nodes: 2708
- Edges: 10556
- Number of Classes: 7
- Label split:
- Train: 140
- Val: 500
- Test: 1000
The split is the one used in the original paper [1] and doesn't consider all nodes.
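A minimal sketch of how such masks select the labeled nodes follows. It uses synthetic stand-ins with the same shapes as the Cora node data shown earlier (features, targets, train_mask, val_mask, test_mask), so nothing is downloaded. The positions of the true entries in each mask are an assumption made for illustration only.

```julia
# Synthetic stand-ins with the same layout as the Cora node data.
num_nodes, num_features, num_classes = 2708, 1433, 7
features = rand(Float32, num_features, num_nodes)
targets = rand(1:num_classes, num_nodes)

# Assumed mask positions; only the counts (140 / 500 / 1000) come from the statistics above.
train_mask = falses(num_nodes); train_mask[1:140] .= true
val_mask = falses(num_nodes); val_mask[141:640] .= true
test_mask = falses(num_nodes); test_mask[(end - 999):end] .= true

# Nodes live on the last dimension, so the mask indexes columns of the feature matrix.
x_train = features[:, train_mask]
y_train = targets[train_mask]

# Nodes that belong to no split.
num_unused = count(.!(train_mask .| val_mask .| test_mask))

println(size(x_train))
println(num_unused)
```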
References
[1]: Deep Gaussian Embedding of Graphs: Unsupervised Inductive Learning via Ranking
[2]: Planetoid
MLDatasets.KarateClub
— Type
KarateClub()
Zachary's karate club dataset, which originally appeared in Ref [1].
The network contains 34 nodes (members of the karate club). The nodes are connected by 78 undirected and unweighted edges. The edges indicate whether two members interacted outside the club.
The node labels indicate which community or karate club the member belongs to. The club-based labels are as per the original dataset in Ref [1]. The community labels are obtained by modularity-based clustering following Ref [2]. The data is retrieved from Ref [3] and [4]. One node per unique label is used as training data.
References
[1]: An Information Flow Model for Conflict and Fission in Small Groups
[2]: Semi-supervised Classification with Graph Convolutional Networks
MLDatasets.METRLA
— Type
METRLA(; num_timesteps_in::Int = 12, num_timesteps_out::Int = 12, dir = nothing, normalize = true)
The METR-LA dataset from the Diffusion Convolutional Recurrent Neural Network: Data-Driven Traffic Forecasting paper.
METRLA is a graph with 207 nodes representing traffic sensors in Los Angeles.
The edge weights w are contained as a feature array in edge_data and represent the distance between the sensors.
The node features are the traffic speed and the time of the measurements collected by the sensors, divided into num_timesteps_in time steps.
The target values are the traffic speed of the measurements collected by the sensors, divided into num_timesteps_out time steps.
The normalize flag indicates whether the data are normalized using Z-score normalization.
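For reference, Z-score normalization subtracts a mean and divides by a standard deviation. The sketch below is a hypothetical helper (zscore_rows is not part of MLDatasets.jl) that normalizes each sensor separately over time. Whether the library computes its statistics per sensor or over the whole array is not stated here, so treat that choice as an assumption.

```julia
using Statistics: mean, std

# Z-score normalize each row (one sensor) of a (num_nodes × num_steps) matrix.
# The statistics are returned so that predictions can be mapped back to the original scale.
# A sensor with constant readings has zero standard deviation and would need special handling.
function zscore_rows(x::AbstractMatrix)
    μ = mean(x; dims = 2)
    σ = std(x; dims = 2)
    return (x .- μ) ./ σ, μ, σ
end

# Synthetic speeds: 2 sensors, 4 time steps.
speeds = Float32[60 62 58 64; 30 35 25 30]
normalized, μ, σ = zscore_rows(speeds)
restored = normalized .* σ .+ μ

println(vec(μ))
println(restored ≈ speeds)
```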
MLDatasets.MovieLens
— Type
MovieLens(name; dir=nothing)
Datasets from the MovieLens website collected and maintained by GroupLens. The MovieLens datasets are presented in a graph format. For license and usage restrictions, please refer to the Readme.md of the datasets.
There are 6 versions of the MovieLens datasets currently supported: "100k", "1m", "10m", "20m", "25m", "latest-small". The 100k and 1m datasets contain movie data and rating data along with demographic data. Starting from the 10m dataset, MovieLens datasets no longer contain the demographic data. These datasets contain movie data, rating data, and tag information.
The 20m and 25m datasets additionally contain genome tag scores. Each movie in these datasets contains tag relevance scores for every tag.
Each dataset contains a heterogeneous graph with two kinds of nodes, movie and user. The rating is represented by an edge between them: (user, rating, movie). The 20m, 25m, and latest-small datasets also contain tag nodes and edges of type (user, tag, movie) and optionally (movie, score, tag).
Examples
MovieLens 100K dataset
julia> data = MovieLens("100k")
MovieLens 100k:
metadata => Dict{String, Any} with 2 entries
graphs => 1-element Vector{MLDatasets.HeteroGraph}
julia> metadata = data.metadata
Dict{String, Any} with 2 entries:
"genre_labels" => ["Unknown", "Action", "Adventure", "Animation", "Children", "Comedy", "Crime", "Documentary", "Drama", "Fa…
"movie_id_to_title" => Dict(1144=>"Quiet Room, The (1996)", 1175=>"Hugo Pool (1997)", 719=>"Canadian Bacon (1994)", 1546=>"Shadow…
julia> g = data[:]
Heterogeneous Graph:
node_types => 2-element Vector{String}
edge_types => 1-element Vector{Tuple{String, String, String}}
num_nodes => Dict{String, Int64} with 2 entries
num_edges => Dict{Tuple{String, String, String}, Int64} with 1 entry
edge_indices => Dict{Tuple{String, String, String}, Tuple{Vector{Int64}, Vector{Int64}}} with 1 entry
node_data => Dict{String, Dict} with 2 entries
edge_data => Dict{Tuple{String, String, String}, Dict} with 1 entry
# Access the user information
julia> user_data = g.node_data["user"]
Dict{Symbol, AbstractVector} with 4 entries:
:age => [24, 53, 23, 24, 33, 42, 57, 36, 29, 53 … 61, 42, 24, 48, 38, 26, 32, 20, 48, 22]
:occupation => ["technician", "other", "writer", "technician", "other", "executive", "administrator", "administrator", "student",…
:zipcode => ["85711", "94043", "32067", "43537", "15213", "98101", "91344", "05201", "01002", "90703" … "22902", "66221", "3…
:gender => Bool[1, 0, 1, 1, 0, 1, 1, 1, 1, 1 … 1, 1, 1, 1, 0, 0, 1, 1, 0, 1]
# Access rating information
julia> g.edge_data[("user", "rating", "movie")]
Dict{Symbol, Vector} with 2 entries:
:timestamp => [881250949, 891717742, 878887116, 880606923, 886397596, 884182806, 881171488, 891628467, 886324817, 883603013 … 8…
:rating => Float16[3.0, 3.0, 1.0, 2.0, 1.0, 4.0, 2.0, 5.0, 3.0, 3.0 … 4.0, 4.0, 3.0, 2.0, 3.0, 3.0, 5.0, 1.0, 2.0, 3.0]
MovieLens 20m dataset
julia> data = MovieLens("20m")
MovieLens 20m:
metadata => Dict{String, Any} with 4 entries
graphs => 1-element Vector{MLDatasets.HeteroGraph}
# There is only 1 graph in MovieLens dataset
julia> g = data[1]
Heterogeneous Graph:
node_types => 3-element Vector{String}
edge_types => 3-element Vector{Tuple{String, String, String}}
num_nodes => Dict{String, Int64} with 3 entries
num_edges => Dict{Tuple{String, String, String}, Int64} with 3 entries
edge_indices => Dict{Tuple{String, String, String}, Tuple{Vector{Int64}, Vector{Int64}}} with 3 entries
node_data => Dict{String, Dict} with 0 entries
edge_data => Dict{Tuple{String, String, String}, Dict} with 3 entries
# Apart from user rating a movie, a user assigns tag to movies and there are genome-scores for movie-tag pairs
julia> g.edge_indices
Dict{Tuple{String, String, String}, Tuple{Vector{Int64}, Vector{Int64}}} with 3 entries:
("movie", "score", "tag") => ([1, 1, 1, 1, 1, 1, 1, 1, 1, 1 … 131170, 131170, 131170, 131170, 131170, 131170, 131170, 131170,…
("user", "tag", "movie") => ([18, 65, 65, 65, 65, 65, 65, 65, 65, 65 … 3489, 7045, 7045, 7164, 7164, 55999, 55999, 55999, 55…
("user", "rating", "movie") => ([1, 1, 1, 1, 1, 1, 1, 1, 1, 1 … 60816, 61160, 65682, 66762, 68319, 68954, 69526, 69644, 70286, …
# Access the rating
julia> g.edge_data[("user", "rating", "movie")]
Dict{Symbol, Vector} with 2 entries:
:timestamp => [1112486027, 1112484676, 1112484819, 1112484727, 1112484580, 1094785740, 1094785734, 1112485573, 1112484940, 111248…
:rating => Float16[3.5, 3.5, 3.5, 3.5, 3.5, 3.5, 4.0, 4.0, 4.0, 4.0 … 4.5, 4.0, 4.5, 4.5, 4.5, 4.5, 4.5, 3.0, 5.0, 2.5]
# Access the movie-tag scores
julia> score = g.edge_data[("movie", "score", "tag")][:score]
23419536-element Vector{Float64}:
0.025000000000000022
0.025000000000000022
0.057750000000000024
⋮
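Edge data is aligned with the edge index of its relation, so per-node statistics can be computed by iterating over both together. The sketch below uses toy vectors in place of the real data; with a loaded dataset, movies would be the second vector of g.edge_indices[("user", "rating", "movie")] and rating the :rating entry of the matching edge_data. The helper mean_rating_per_movie is hypothetical.

```julia
# Toy (user, rating, movie) relation: sources are users, targets are movies.
users = [1, 1, 2, 3, 3]
movies = [1, 2, 2, 1, 3]
rating = Float16[3.0, 4.0, 5.0, 1.0, 2.5]

# Average the ratings received by each movie; movies without ratings get NaN.
function mean_rating_per_movie(movies, rating, num_movies)
    totals = zeros(Float64, num_movies)
    counts = zeros(Int, num_movies)
    for (m, r) in zip(movies, rating)
        totals[m] += r
        counts[m] += 1
    end
    return [c == 0 ? NaN : s / c for (s, c) in zip(totals, counts)]
end

println(mean_rating_per_movie(movies, rating, 3))
```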
References
[2] TensorFlow MovieLens Implementation
[3] Jesse Vig, Shilad Sen, and John Riedl. 2012. The Tag Genome: Encoding Community Knowledge to Support Novel Interaction. ACM Trans. Interact. Intell. Syst. 2, 3, Article 13 (September 2012), 44 pages. https://doi.org/10.1145/2362394.2362395.
[4] F. Maxwell Harper and Joseph A. Konstan. 2015. The MovieLens Datasets: History and Context. ACM Trans. Interact. Intell. Syst. 5, 4, Article 19 (January 2016), 19 pages. https://doi.org/10.1145/2827872
MLDatasets.OGBDataset
— Type
OGBDataset(name; dir=nothing)
The collection of datasets from the Open Graph Benchmark: Datasets for Machine Learning on Graphs paper.
name is the name of one of the datasets (listed here) available for node prediction, edge prediction, or graph prediction tasks.
Examples
Node prediction tasks
julia> data = OGBDataset("ogbn-arxiv")
OGBDataset ogbn-arxiv:
metadata => Dict{String, Any} with 17 entries
graphs => 1-element Vector{MLDatasets.Graph}
graph_data => nothing
julia> data[:]
Graph:
num_nodes => 169343
num_edges => 1166243
edge_index => ("1166243-element Vector{Int64}", "1166243-element Vector{Int64}")
node_data => (val_mask = "29799-trues BitVector", test_mask = "48603-trues BitVector", year = "169343-element Vector{Int64}", features = "128×169343 Matrix{Float32}", label = "169343-element Vector{Int64}", train_mask = "90941-trues BitVector")
edge_data => nothing
julia> data.metadata
Dict{String, Any} with 17 entries:
"download_name" => "arxiv"
"num classes" => 40
"num tasks" => 1
"binary" => false
"url" => "http://snap.stanford.edu/ogb/data/nodeproppred/arxiv.zip"
"additional node files" => ["node_year"]
"is hetero" => false
"task level" => "node"
⋮ => ⋮
julia> data = OGBDataset("ogbn-mag")
OGBDataset ogbn-mag:
metadata => Dict{String, Any} with 17 entries
graphs => 1-element Vector{MLDatasets.HeteroGraph}
graph_data => nothing
julia> data[:]
Heterogeneous Graph:
num_nodes => Dict{String, Int64} with 4 entries
num_edges => Dict{Tuple{String, String, String}, Int64} with 4 entries
edge_indices => Dict{Tuple{String, String, String}, Tuple{Vector{Int64}, Vector{Int64}}} with 4 entries
node_data => (year = "Dict{String, Vector{Float32}} with 1 entry", features = "Dict{String, Matrix{Float32}} with 1 entry", label = "Dict{String, Vector{Int64}} with 1 entry")
edge_data => (reltype = "Dict{Tuple{String, String, String}, Vector{Float32}} with 4 entries",)
Edge prediction task
julia> data = OGBDataset("ogbl-collab")
OGBDataset ogbl-collab:
metadata => Dict{String, Any} with 15 entries
graphs => 1-element Vector{MLDatasets.Graph}
graph_data => nothing
julia> data[:]
Graph:
num_nodes => 235868
num_edges => 2358104
edge_index => ("2358104-element Vector{Int64}", "2358104-element Vector{Int64}")
node_data => (features = "128×235868 Matrix{Float32}",)
edge_data => (year = "2×1179052 Matrix{Int64}", weight = "2×1179052 Matrix{Int64}")
Graph prediction task
julia> data = OGBDataset("ogbg-molhiv")
OGBDataset ogbg-molhiv:
metadata => Dict{String, Any} with 17 entries
graphs => 41127-element Vector{MLDatasets.Graph}
graph_data => (labels = "41127-element Vector{Int64}", train_mask = "32901-trues BitVector", val_mask = "4113-trues BitVector", test_mask = "4113-trues BitVector")
julia> data[1]
(graphs = Graph(19, 40), labels = 0)
MLDatasets.OrganicMaterialsDB
— Type
OrganicMaterialsDB(; split=:train, dir=nothing)
The OMDB-GAP1 v1.1 dataset from the Organic Materials Database (OMDB) of bulk organic crystals.
The dataset has to be manually downloaded from https://omdb.mathub.io/dataset, then unzipped and its file content placed in the OrganicMaterialsDB folder.
The dataset contains the following files:
Filename | Description |
---|---|
structures.xyz | 12500 crystal structures. Use the first 10000 as training examples and the remaining 2500 as test set. |
bandgaps.csv | 12500 DFT band gaps corresponding to structures.xyz |
CODids.csv | 12500 COD ids cross referencing the Crystallographic Open Database (in the same order as structures.xyz) |
Please cite the paper introducing this dataset: https://arxiv.org/abs/1810.12814
MLDatasets.PEMSBAY
— Type
PEMSBAY(; num_timesteps_in::Int = 12, num_timesteps_out::Int = 12, dir = nothing, normalize = true)
The PEMS-BAY dataset described in the Diffusion Convolutional Recurrent Neural Network: Data-Driven Traffic Forecasting paper. It is collected by the California Transportation Agencies (CalTrans) Performance Measurement System (PeMS).
PEMSBAY is a graph with 325 nodes representing traffic sensors in the Bay Area.
The edge weights w are contained as a feature array in edge_data and represent the distance between the sensors.
The node features are the traffic speed and the time of the measurements collected by the sensors, divided into num_timesteps_in time steps.
The target values are the traffic speed of the measurements collected by the sensors, divided into num_timesteps_out time steps.
The normalize flag indicates whether the data are normalized using Z-score normalization.
MLDatasets.PolBlogs
— Type
PolBlogs(; dir=nothing)
The Political Blogs dataset from the paper The Political Blogosphere and the 2004 US Election: Divided they Blog.
PolBlogs is a graph with 1,490 vertices (representing political blogs) and 19,025 edges (links between blogs).
The links are automatically extracted from a crawl of the front page of each blog.
Each vertex has a label indicating the political leaning of the blog: liberal or conservative.
MLDatasets.PubMed
— Type
PubMed(; dir=nothing, reverse_edges=true)
The PubMed citation network dataset from Ref. [1]. Nodes represent documents and edges represent citation links. The dataset is designed for the node classification task: predicting the category of each paper. The dataset is retrieved from Ref. [2].
References
[1]: Deep Gaussian Embedding of Graphs: Unsupervised Inductive Learning via Ranking
[2]: Planetoid
MLDatasets.Reddit
— Type
Reddit(; full=true, dir=nothing)
The Reddit dataset was introduced in Ref [1]. It is a graph dataset of Reddit posts made in September 2014. The dataset contains a single post-to-post graph, connecting posts if the same user comments on both. The node label is one of the 41 communities, or "subreddits", that a post belongs to. The dataset contains 232,965 posts. The first 20 days are used for training and the remaining days for testing (with 30% used for validation). Each node is represented by a 602-dimensional word vector.
Use full=false to load only a subsample of the dataset.
References
MLDatasets.TemporalBrains
— Type
TemporalBrains(; dir = nothing, threshold_value = 0.6)
The TemporalBrains dataset contains a collection of temporal brain networks (as TemporalSnapshotsGraphs) of 1000 subjects obtained from resting-state fMRI data from the Human Connectome Project (HCP).
The number of nodes is fixed at 102 for each of the 27 snapshots, while the edges change over time.
For each Graph snapshot, the feature of a node represents the average activation of the node during that snapshot and is contained in Graph.node_data.
Each TemporalSnapshotsGraph has a label representing the subject's gender ("M" for male and "F" for female) and age range (22-25, 26-30, 31-35 and 36+), contained as a named tuple in graph_data.
The threshold_value is used to binarize the edge weights and is set to 0.6 by default.
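Binarizing edge weights with a threshold can be sketched as follows. The helper threshold_edges is hypothetical and not part of MLDatasets.jl. It assumes a dense, symmetric weight matrix, keeps an edge only when the weight is strictly greater than the threshold, and drops self-loops; the dataset's own rule may differ in those details.

```julia
# Turn a dense weight matrix into an edge index, keeping only weights above the threshold.
function threshold_edges(w::AbstractMatrix{<:Real}, threshold::Real)
    s, t = Int[], Int[]
    for j in axes(w, 2), i in axes(w, 1)
        # Skip the diagonal so that no self-loops are created.
        if i != j && w[i, j] > threshold
            push!(s, i)
            push!(t, j)
        end
    end
    return s, t
end

# Toy symmetric weights for 3 nodes.
w = [1.0 0.7 0.2;
     0.7 1.0 0.65;
     0.2 0.65 1.0]
s, t = threshold_edges(w, 0.6)

println((s, t))
```

Since the toy matrix is symmetric, every retained edge appears in both directions, which matches how Graph stores undirected graphs.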
MLDatasets.TUDataset
— Type
TUDataset(name; dir=nothing)
A variety of graph benchmark datasets, e.g. "QM9", "IMDB-BINARY", "REDDIT-BINARY" or "PROTEINS", collected by TU Dortmund University. Retrieves the dataset name from the TUDataset collection, where name is any of the datasets available here.
A TUDataset object can be indexed to retrieve a specific graph or a subset of graphs.
See here for an in-depth description of the format.
Usage Example
julia> data = TUDataset("PROTEINS")
dataset TUDataset:
name => PROTEINS
metadata => Dict{String, Any} with 1 entry
graphs => 1113-element Vector{MLDatasets.Graph}
graph_data => (targets = "1113-element Vector{Int64}",)
num_nodes => 43471
num_edges => 162088
num_graphs => 1113
julia> data[1]
(graphs = Graph(42, 162), targets = 1)
MLDatasets.WindMillEnergy
— Type
WindMillEnergy(; size, normalize=true, num_timesteps_in=8, num_timesteps_out=8, dir=nothing)
The WindMillEnergy dataset contains the hourly energy output of windmills from a European country over more than 2 years.
WindMillEnergy is a graph with nodes representing windmills. The edge weights represent the strength of the relationship between the windmills. The number of nodes is fixed and depends on the size of the dataset: 11 for small, 26 for medium, and 319 for large.
The node features and targets are the hourly energy output of the windmills. They are represented as an array of arrays of size (1, num_nodes, num_timesteps_in). In both cases, two consecutive arrays are shifted by one time step.
Keyword Arguments
- size::String: The size of the dataset; can be small, medium, or large.
- normalize::Bool: Whether to normalize the data using Z-score normalization. Default is true.
- num_timesteps_in::Int: The number of time steps, in this case the number of hours, for the input features. Default is 8.
- num_timesteps_out::Int: The number of time steps, in this case the number of hours, for the target values. Default is 8.
- dir::String: The directory to save the dataset. Default is nothing.
Examples
julia> using JSON3
julia> dataset = WindMillEnergy(;size= "small");
julia> dataset.graphs[1]
Graph:
num_nodes => 11
num_edges => 121
edge_index => ("121-element Vector{Int64}", "121-element Vector{Int64}")
node_data => (features = "17456-element Vector{Any}", targets = "17456-element Vector{Any}")
edge_data => 121-element Vector{Float32}
julia> size(dataset.graphs[1].node_data.features[1])
(1, 11, 8)