Example¶
We have three datasets:
- Job queues describing what jobs ran on what nodes over what time ranges
- Data describing which nodes reside on which racks
- Collected FLOPs at different points in time for each node
Now, lets say we want to analyze how the FLOPs of different jobs are distributed over a rack. This requires integrating all three datasets in a non-trivial way. With ScrubJay, we specify the data columns that we want and ask ScrubJay to generate solutions containing them.
First, we load a DataSpace which describes the dimensions and datasets we are using.
val dataSpace = DataSpace.fromJsonFile("jobAnalysis.sj")
Then, we create a query target, which is just the schema of the dataset that we want, and use it to create a query in the created dataspace.
val querySchema = ScrubJaySchema(Array(
ScrubJayField(domain = true, dimension = "rack"),
ScrubJayField(domain = false, dimension = "flops")
))
val query = Query(dataSpace, querySchema)
todo: Queries may be generated using the ScrubJay Query Language.
We find solutions to the query (there may be multiple) using:
val solutions = query.solutions
This gives us all solutions as a lazily-evaluated iterator.
A solution is a DatasetID, which describes the resulting dataset and how to derive it.
To derive a solution as a Spark DataFrame, run the realize
function on it, specifying the dimension space used
to create the query.
For example, to derive the first solution:
val realizedDataFrame = solutions.head.realize(dataSpace.dimensions)
We can also see how the solution was derived in a DAG, using toAsciiGraphString
:
DatasetID.toAsciiGraphString(solution)
which produces:
┌──────────────────────┐
│ CSVDataset │
│domain:job:identifier,│
│ domain:node:list, │
│ value:time:seconds, │
│ domain:time:range │
└───────────┬──────────┘
│
┌──────────────────────────┘
│
v
┌───────────────────────┐ ┌──────────────────────────┐
│ ExplodeDiscreteRange │ ┌───────────────────────┐ │ CSVDataset │
│domain:job:identifier, │ │ CSVDataset │ │ domain:node:identifier, │
│domain:node:identifier,│ │domain:node:identifier,│ │domain:time:datetimestamp,│
│ value:time:seconds, │ │domain:rack:identifier │ │ value:flops:count │
│ domain:time:range │ └───────────┬───────────┘ └───────┬──────────────────┘
└───────────────────┬───┘ │ │
│ └────────────┐ │
│ │ │
v v v
┌─────────────────────────┐ ┌──────────────────────────┐
│ ExplodeContinuousRange │ │ NaturalJoin │
│ domain:job:identifier, │ │ domain:node:identifier, │
│ domain:node:identifier, │ │ domain:rack:identifier, │
│ value:time:seconds, │ │domain:time:datetimestamp,│
│domain:time:datetimestamp│ │ value:flops:count │
└────────────┬────────────┘ └┬─────────────────────────┘
│ │
└─────────────┐ │
│ │
v v
┌──────────────────────────┐
│ InterpolationJoin │
│ domain:job:identifier, │
│ domain:node:identifier, │
│domain:time:datetimestamp,│
│ domain:rack:identifier, │
│ value:time:seconds, │
│ value:flops:count │
└──────────────────────────┘