This is part 2 of a multi-part series describing a “side project” that a small team of us (Jean Pierre (JP), Alan Rajapa, Massarrah Tannous, and I) have been working on for the past couple of years. You’ll note that in part 1 of the series, I referred to a similar concept as “Zero Touch Storage Provisioning”. The reason for the name change is that along the way, we figured out we were trying to provision WAAAY more than just storage, so we changed the name to “Zero Touch Infrastructure Provisioning” (ZTIP). Before we begin, if you’d like to get an idea of the overall concept, as well as see a snapshot of where we were in the journey about 18 months ago, please see this video that was put together by Massarrah and JP. We (ok, I) like quirky names, so please do not hold the name of our controller, “Orchestration System for Infrastructure Management” (OSIM), against them. :-)

## Underlays and overlays

Our work rests on the idea that Infrastructure as a Service (IaaS) can be logically broken down into at least two layers: the IaaS overlay and the IaaS underlay.

### The IaaS Overlay

Most of you are probably very familiar with the IaaS Overlay and the IaaS Overlay Management and Orchestration (M&O) software used to control it. A couple of examples would be VMware vRealize Automation, OpenStack, and whatever Amazon uses to orchestrate Amazon Web Services (AWS), more specifically EC2. Based on the work we’ve done and the research we’ve seen performed by others in the industry, I believe the IaaS Overlay is all about the well-known axiom “Abstract, Pool, Automate”. Judging by the solutions I see available for use in the Enterprise, I think many others would say the same.

### The IaaS Underlay

Most of you are probably not as familiar with the IaaS Underlay and the IaaS Underlay M&O software used to control it.
I would LOVE to provide examples, but the reason we’ve been working in this space is that we haven’t found a solution (suitable for on-premises use in the Enterprise) that does everything we need. And by everything, I mean everything in the red (dashed) box shown below.

### The IaaS Underlay explained

The diagram above can be broken down into:
An aside: You might ask “what happened to the ‘Storage’ column you were showing in the previous blog post?”, and that’s a phenomenally interesting story that will have to wait for my “post-retirement” book. That said, the removal of the storage column is one reason for the name change to ZTIP. The other primary reason is that we’ve been focusing on Hyper-converged solutions for full stack automation. This is because the concept of automation is something that traditional enterprises still seem to be evaluating, whereas the HCI community seems to have embraced it fully.

For the remainder of this blog series, I’ll explain each of the layers (rows) in the above diagram, starting from the bottom and working my way to the top. Before I continue, I’d like to share an observation that was made during the course of our work. Essentially, we noticed that the lower we went in the stack, the harder it was to automate. I’ll provide more detail about this when I get to the mapping layer case study, but I think this is a big reason why so few people have attempted to fully automate the IaaS underlay.

### The Physical Layer

Although everything ultimately runs on physical resources, I don’t consider the physical configuration of the components to be within the domain of the IaaS Underlay M&O controller. That said, we should at least mention the fact that before any of these components can be configured, each of them will need to be Racked, Cabled and Powered (R,C,P). This is a process that will be performed by a person, at least until the singularity, and at that point Robots will be people too (and even they will probably be asking “isn’t there some way we can automate this?”).

### Bootstrap – Node Creation

Once the nodes have been Racked, Cabled and Powered, a body of work comes into play that I’ll refer to as Composable Systems.
The basic idea is that you will eventually be able to dynamically select CPU, Memory, Storage, GPUs, etc. from pools of resources and then instantiate a “virtual” bare metal server that has exactly the right resources for your application. It’s an area that is still in its infancy, but this blog post by Dell’s Bill Dawkins contains some great additional information. Because this area is still so new, I don’t currently include it when I talk about the IaaS underlay. That said, once a “Server Builder” API is available, it would make sense to include it.

### Bootstrap – Inventory

Today, the lowest layer of the IaaS Underlay is the Bootstrap Inventory layer, and the first bit of configuration that will need to be done in this layer is to configure the network.

#### Network Configuration (Auto-config Leaf/Spine + gather LLDP)

As will become clear as we move up the IaaS stack, there are all kinds of causality dilemmas (chicken-or-the-egg scenarios) when trying to bootstrap infrastructure, and many of them can be solved by understanding how the elements you are trying to configure are related to one another, or, put another way, how they are interconnected. I refer to these interconnectivity details as “topology information”, and to properly understand the topology, I believe it makes sense to use the network as the source of truth (h/t to Nick Ciarleglio from Arista Networks for this insight). However, before we can understand the topology, we first need to configure the network elements that will be providing the connectivity, and hence we have our first causality dilemma (e.g., how do we configure the network if we don’t know exactly what it will be used for?). One approach is to configure the network in stages, and the first stage is something I’ve been referring to as “IP Fabric formation”. IP Fabric formation is basically just a way of saying we are going to configure the switches so that they have basic connectivity between themselves.
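To make the IP Fabric formation idea concrete, here is a minimal sketch of one small piece of it: allocating /31 point-to-point subnets for every spine-to-leaf link so each switch pair can establish basic IP connectivity. The device names, the two-spine/three-leaf shape, and the 10.0.0.0/24 transit block are illustrative assumptions, not anything prescribed by our controller.

```python
import ipaddress

def fabric_links(spines, leaves, p2p_block="10.0.0.0/24"):
    """Allocate a /31 point-to-point subnet for every spine-leaf link.

    Returns {(spine, leaf): (spine_ip, leaf_ip)}. The transit block and
    device names are hypothetical; a real fabric would also need BGP or
    an IGP configured on top of this addressing.
    """
    subnets = ipaddress.ip_network(p2p_block).subnets(new_prefix=31)
    links = {}
    for spine in spines:
        for leaf in leaves:
            net = next(subnets)
            a, b = net.hosts()  # a /31 has exactly two usable addresses
            links[(spine, leaf)] = (str(a), str(b))
    return links

links = fabric_links(["spine1", "spine2"], ["leaf1", "leaf2", "leaf3"])
```

A real implementation would push these addresses to the switches via their management APIs; the sketch only shows the deterministic allocation step, which is what makes the fabric reproducible from a description of the topology.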
With regard to the IP Fabric formation process itself, there seem to be three primary ways to accomplish this task:
We’ve done it all three ways:
#### End device discovery (ID+INV Advertise LLDP)

Once the switches have a basic configuration on them and basic connectivity established between them, you can do a couple of very interesting things:
So with the above in mind, let’s look at an example that describes (at a high level) some of the work we’ve done in this space.

### Network Configuration Example

The following configuration of Compute and Network resources will be used throughout this blog post series. This configuration consists of:
#### Overall Assumptions
#### Phase 1: IP Fabric formation

##### Phase 1 assumptions
##### Topology discovery
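Topology discovery boils down to reading each switch’s LLDP neighbor table and separating switch-to-switch fabric links from attached end devices. A minimal sketch of that classification step follows; the record shape and device names are illustrative assumptions, since real LLDP data would arrive via each switch’s management API in a vendor-specific format.

```python
def classify_neighbors(lldp_records, switch_names):
    """Split LLDP neighbor records into switch-to-switch fabric links
    and end device (host) attachments.

    Each record is assumed to look like:
      {"local_switch": "leaf1", "local_port": "eth7",
       "remote_system": "node-a", "remote_port": "eno1"}
    This shape is hypothetical, chosen only to illustrate the idea of
    using the network as the source of truth for topology.
    """
    fabric_links, end_devices = [], []
    for rec in lldp_records:
        if rec["remote_system"] in switch_names:
            fabric_links.append(rec)
        else:
            end_devices.append(rec)
    return fabric_links, end_devices

records = [
    {"local_switch": "leaf1", "local_port": "eth1",
     "remote_system": "spine1", "remote_port": "eth1"},
    {"local_switch": "leaf1", "local_port": "eth7",
     "remote_system": "node-a", "remote_port": "eno1"},
]
fabric, hosts = classify_neighbors(records, {"spine1", "spine2", "leaf1", "leaf2"})
```

The resulting end device list is exactly what the inventory layer needs: it tells you which switch ports have hosts behind them before those hosts have been configured at all.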
##### IP Fabric configuration
Note: Initially, all end device interfaces (e.g., eth7 and above) could be put into a default VLAN (e.g., 4001) for the purposes of PXE boot and inventory. This allows the hosts to obtain an IP address, PXE boot, and then perform inventory.
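The default-VLAN step in the note above could be sketched as a small config renderer: given the end device ports, emit access-port stanzas that drop them all into the PXE/inventory VLAN. The CLI dialect shown is a generic, EOS-like illustration and the port names are assumptions; any real deployment would render whatever syntax its switches actually speak.

```python
def pxe_vlan_config(ports, vlan=4001):
    """Render access-port stanzas placing each end device interface
    into the default PXE/inventory VLAN. Syntax is illustrative, not
    any one vendor's exact dialect."""
    lines = [f"vlan {vlan}", "   name pxe-inventory"]
    for port in ports:
        lines += [
            f"interface {port}",
            "   switchport mode access",
            f"   switchport access vlan {vlan}",
        ]
    return "\n".join(lines)

# e.g., eth7 through eth9 on a hypothetical leaf switch
config = pxe_vlan_config([f"eth{n}" for n in range(7, 10)])
```

Generating the stanza per port (rather than hand-editing switch configs) is what keeps this step “zero touch”: the port list comes straight from the LLDP-discovered topology.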
At this point, the IP Fabric has been formed and the compute resources should be able to PXE boot and download the RackHD microkernel to start the inventory process. We will assume that once the inventory process has completed, the capabilities of each node are as shown below. Note that each rack contains homogeneous node types, but this won’t typically be the case. Also note that “GPU” indicates that the node contains GPUs, while “Storage” and “Compute” indicate that the nodes are either Storage “heavy” or Compute “heavy”. The remainder of the network configuration process (e.g., Slice creation) will be handled as workloads are on-boarded and will be discussed in the next blog post.

Thanks for reading!
