During my 3 years as a Network Engineer working on the core infrastructure of a major telco, I had the opportunity to automate operations impacting hundreds of physical network devices from 3 different vendors (Arista, Cisco, Juniper) using 2 different automation tools (Ansible, Python). The most important parts aren't the practical and technical items presented here but the time saved as a result and the human bonds built along the way.
I never thought much about QoS until I realized that 99.9% availability leaves room for only 8.76 hours of downtime per year, and that the services relying on the infrastructure can be critical: T-Mobile lost 20k emergency 911 calls and 250M regular calls during a 12-hour outage in 2020!
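The arithmetic behind that figure is easy to check. The snippet below is a minimal sketch, assuming a 365-day year; the function name is made up for illustration.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, assuming a non-leap year


def downtime_hours_per_year(availability_percent):
    """Return the yearly downtime budget, in hours, for a given availability."""
    return HOURS_PER_YEAR * (100 - availability_percent) / 100


# Print the budget for a few common availability targets
for availability in (99.0, 99.9, 99.99, 99.999):
    hours = downtime_hours_per_year(availability)
    print(f"{availability}% available -> {hours:.2f} h of downtime per year")
```

Each extra "nine" divides the budget by ten, which is why small percentages hide large operational differences.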
I was fortunate enough to work with software engineers who helped me write Ansible playbooks, and using Git was inevitable in order to version, share and merge code.
I was also fortunate enough to work with security experts who helped me understand the many risks associated with applying 100 security rules to a network infrastructure, as well as with the automation I was tasked with.
I avoided disasters by automating recon before action: I had to allow only SSH or HTTPS connections without blocking the console port if it was the only working access, I had to prevent Smurf attacks without blocking directed IP broadcast packets if someone relied on them, etc.
I used the company's AWX to simplify Ansible usage (check mode, credentials, git path, inventory, verbosity, forks) and wrote a custom playbook per security rule, associated with one job template, to apply to hundreds of devices whose names and IP addresses were stored in one inventory file.
One downside of having multiple vendors in an infrastructure is the need to write a code variant for each of them.
I wrote custom Ansible playbooks for Juniper devices with their specific Commit and Rollback feature.
- control:
    - control_command:
        - show: show configuration | match system | display set
      state:
        - present
      control_regex:
        - system services ssh
        - system services ssh protocol-version v2
        - system login retry-options tries-before-disconnect 3
      commands:
        - edit
        - set system services ssh
        - set system services ssh protocol-version v2
        # retry-options belongs to "system login" in Junos, matching the control_regex above
        - set system login retry-options tries-before-disconnect 3
        - commit comment "Safe Access using Ansible"
      commands_rollback:
        - edit
        - rollback 0
        - commit comment "Rollback - Safe Access using Ansible failed"
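One way to read the structure above: the output of the control command is matched against every control_regex entry, and the commands are sent only when at least one pattern is missing. The sketch below mirrors that check in plain Python. It is an assumption about the logic, not the playbook itself, and rule_is_applied, plan and the sample outputs are hypothetical.

```python
import re


def rule_is_applied(show_output, control_regex):
    """Return True when every expected pattern appears in the device output."""
    return all(re.search(pattern, show_output) for pattern in control_regex)


def plan(show_output, rule):
    """Return the commands to send: none when the rule already holds."""
    if rule_is_applied(show_output, rule["control_regex"]):
        return []
    return rule["commands"]


# Reduced version of the rule, keeping only the SSH lines
rule = {
    "control_regex": [
        "system services ssh",
        "system services ssh protocol-version v2",
    ],
    "commands": [
        "edit",
        "set system services ssh",
        "set system services ssh protocol-version v2",
        'commit comment "Safe Access using Ansible"',
    ],
}

# Hypothetical "display set" outputs from two devices
compliant = "set system services ssh protocol-version v2\n"
non_compliant = "set system services ssh\n"

print(plan(compliant, rule))      # nothing to send
print(plan(non_compliant, rule))  # the full command list
```

Checking before acting keeps the run idempotent: a device that already complies is never touched, so there is nothing to roll back on it.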
I took part in the migration of 300 network switches from a Juniper QFabric infrastructure to a spine-leaf Arista IP Fabric infrastructure that took 1 year to accomplish instead of 2 thanks to automation.
Choosing the new infrastructure required an RfQ followed by several interview rounds with several vendors to clarify technical (ZTP, QoS, DoS prevention) and financial (CapEx, OpEx, Support, RMA) matters.
We used the devices handling the least critical traffic as pilots, knowing that critical traffic would require migrating at night and on weekends, hence the need to automate tasks.
I discovered the importance of communication, as every operation had to be announced in advance and a detailed report of what happened was expected afterwards.
Automation didn't remove work but changed its nature: I went from running commands in order to gather information and configure devices to only monitoring graphs, while external communication remained unchanged.
The key part of the migration was to lose as few packets as possible. Packet loss can happen every time the master of a pair of devices changes, hence the need to be able to revert to a working state quickly.
I was involved in testing every automation iteration and was responsible for automating the configuration of the OoB switches using Ansible.
--- dict.yml ---
- name: OoBSwitch01
  ip: 1.2.3.4
  interfaces:
    - id: ge-0/0/1
      desc: AristaSwitch01
    - id: ge-0/0/2
      desc: AristaSwitch02
--- configuration.yml ---
- name: Configure OoB Switches
  hosts: all
  gather_facts: False
  connection: netconf
  vars:
    oob: "{{ lookup('file', 'dict.yml') | from_yaml }}" # Loads data from dict
  tasks:
    [...]
    - name: Configure OoB Switches' Interfaces
      junos_l2_interface:
        name: "{{ item.1.id }}"
        description: "{{ item.1.desc }}"
        access_vlan: vlanadmin
      loop: "{{ oob|subelements('interfaces') }}" # Loop through interfaces ...
      when: inventory_hostname == item.0.name # ... only if the hostnames match
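The loop relies on the subelements filter, which pairs each switch (item.0) with each of its interfaces (item.1). The plain-Python sketch below reproduces that pairing and the hostname condition to show which items each host ends up configuring. The subelements function here is a simplified stand-in for the Ansible filter, and the second switch is a hypothetical entry added for illustration.

```python
def subelements(items, key):
    """Pair each parent dict with each entry of its `key` list."""
    return [(parent, child) for parent in items for child in parent[key]]


oob = [
    {"name": "OoBSwitch01", "ip": "1.2.3.4", "interfaces": [
        {"id": "ge-0/0/1", "desc": "AristaSwitch01"},
        {"id": "ge-0/0/2", "desc": "AristaSwitch02"},
    ]},
    # Hypothetical second switch, not part of the original dict.yml
    {"name": "OoBSwitch02", "ip": "1.2.3.5", "interfaces": [
        {"id": "ge-0/0/1", "desc": "AristaSwitch03"},
    ]},
]


def interfaces_for(inventory_hostname, data):
    """Keep only the pairs whose parent matches the current host (the `when` clause)."""
    return [
        (child["id"], child["desc"])
        for parent, child in subelements(data, "interfaces")
        if parent["name"] == inventory_hostname
    ]


for host in ("OoBSwitch01", "OoBSwitch02"):
    print(host, interfaces_for(host, oob))
```

Every host iterates over the full list of pairs and skips the ones that belong to other switches, which is simple to write but grows with the size of the dictionary.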
I was tasked with verifying a parameter on every device, changing its value if it wasn't right, and returning a global status as well as an operations history.
I crashed one of my team's bastion hosts by starting too many SSH connections in parallel without taking its capacity into account.
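A minimal sketch of how the number of parallel sessions could be capped. It uses a thread pool from the standard library rather than multiprocessing, and a stand-in function instead of a real SSH session, so it runs anywhere. MAX_SESSIONS, check_device and the device names are assumptions for illustration; the real limit has to come from measuring the bastion host.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

MAX_SESSIONS = 5  # assumed bastion capacity, to be measured on the real host

active = 0  # sessions currently open
peak = 0    # highest number of simultaneous sessions observed
lock = threading.Lock()


def check_device(name):
    """Stand-in for an SSH session; it only records how many run at once."""
    global active, peak
    with lock:
        active += 1
        peak = max(peak, active)
    time.sleep(0.01)  # simulated work on the device
    with lock:
        active -= 1
    return name, "OK"


devices = [f"device{n:03d}" for n in range(50)]  # hypothetical inventory

# The pool never runs more than MAX_SESSIONS workers at the same time
with ThreadPoolExecutor(max_workers=MAX_SESSIONS) as pool:
    results = dict(pool.map(check_device, devices))

print(f"{len(results)} devices checked, peak concurrency {peak}")
```

The pool size is the only knob: the inventory can grow to hundreds of devices while the load on the jump host stays bounded.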
I discovered the diversity of errors: the only allowed session is in use (human), the physical link is down (physical), the device is in a tainted state (software).
I had no influence until I created a simple pie chart showing the overall state of the infrastructure.
# Gather information and start multiprocessing:
# import libraries
import getpass
import multiprocessing
import time

import yaml
from jnpr.junos import Device
from jnpr.junos.utils.config import Config

# example values: keep the pool small enough for the bastion host
WORKERS = 10
WAIT_SECONDS = 30

# collect login credentials
username = input("Username:")
password = getpass.getpass()
# write in a global dict while multiprocessing
# (workers see these globals through the fork start method, e.g. on Linux)
resources = multiprocessing.Manager().dict()
# import the inventory, skipping empty lines
with open("inventory.txt", "r") as f:
    devices_list = [line.strip() for line in f if line.strip()]
# load the template variables: template_vars expects a dict,
# assuming here that config_vars.txt holds YAML key/value pairs
with open("config_vars.txt", "r") as f:
    config_vars = yaml.safe_load(f)
# start your function WORKERS times in parallel
# (custom_function, shown below, must be defined before this point in the script)
p = multiprocessing.Pool(WORKERS)
p.map(custom_function, devices_list)
p.close()
p.join()
# store the results
with open("results.txt", "w") as f:
    for device, status in resources.items():
        f.write("%s %s\n" % (device, status))

# Analyze and modify a parameter in a custom function:
def custom_function(host):
    name = host  # fall back to the inventory entry if the connection fails
    device = None
    try:
        # connect to the device using SSH
        device = Device(host=host, user=username, password=password, port=22).open()
        name = device.facts["hostname"]
        # probe the current state, using a public NTP in this example
        status = device.cli("show ntp status")
        # compare it to the expected state
        if "expected_config" in status:
            resources[name] = "OK"
        # apply changes
        else:
            with Config(device) as change:
                change.load(template_path="new_config.txt", template_vars=config_vars, format="set", merge=True)
                change.commit(comment="ntp update script")
            # wait WAIT_SECONDS then re-compare
            time.sleep(WAIT_SECONDS)
            status = device.cli("show ntp status")
            if "expected_config" in status:
                resources[name] = "Changed"
            else:
                resources[name] = "KO"
    # handle errors (session in use, link down, tainted device, ...)
    except Exception:
        resources[name] = "Error"
    finally:
        # always release the session, since a device may allow only one
        if device is not None:
            device.close()
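The pie chart mentioned earlier only needs the share of each status. As a rough sketch, the lines written to results.txt can be counted with the standard library and then fed to any plotting tool. The summarize function and the sample lines are hypothetical, and the plotting step is left out to keep the example self-contained.

```python
from collections import Counter


def summarize(lines):
    """Count statuses from 'device status' lines and return their shares in percent."""
    statuses = [line.split()[-1] for line in lines if line.strip()]
    counts = Counter(statuses)
    total = sum(counts.values())
    return {status: round(100 * n / total, 1) for status, n in counts.items()}


# Hypothetical content of results.txt, one "device status" pair per line
sample = ["sw01 OK", "sw02 OK", "sw03 Changed", "sw04 Error"]

shares = summarize(sample)
for status, share in sorted(shares.items()):
    print(f"{status:<8}{share:>6}%")
```

Reducing hundreds of result lines to four numbers is what makes the state of the infrastructure readable at a glance for people outside the team.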