Systematising troubleshooting of disputes in network

Received Jun 2, 2020; Revised Oct 29, 2020; Accepted Nov 27, 2020

With growing network sizes and pervasive virtualization, configuring and managing network devices is becoming more difficult. Software Defined Networking (SDN) is a way to address these problems. Application Centric Infrastructure (ACI) is Cisco’s SDN solution, with centralized automation and policy-driven application profiles. When there is a bug in the network or a problem with its expected functionality, ACI cases are opened with the Technical Assistance Centre (TAC) for troubleshooting. Engineers currently troubleshoot ACI cases manually: using the Command Line Interface (CLI), they trace the events triggered by policy pushes through logs generated at different stages of the ACI and on the different servers responsible for them. This is a tedious, time-consuming task that is prone to manual errors. This paper describes a way to automate the entire ACI troubleshooting process with a user-friendly GUI that shows all the information needed for troubleshooting by extracting the relevant information at every layer. Because it relies on FSM models, the proposed solution is not limited to ACI and can be extended to other areas that involve CLI-based log analysis to extract relevant information.


INTRODUCTION
Software Defined Networking (SDN) is a networking approach that separates the data plane from the control plane. This separation makes the data plane devices directly programmable by the control plane and abstracts the underlying infrastructure for network and application services. Application Centric Infrastructure (ACI) is Cisco's SDN solution. While troubleshooting or solving an ACI-related case, Distributed Management Environment (DME) logs are considered for audit. ACI is adopted in networking because it offers simplified automation through an application-driven policy model, real-time centralized visibility, open software flexibility, scalable performance and multi-tenancy in hardware. However, the complexity involved in solving an ACI-related case is reflected in the logs themselves, which are usually the outputs of Managed Objects (MOs) arranged systematically in files and folders. Suppose a problem started at a certain time and is associated with a configuration made by the person who configured the network. The pushed configuration is not localized to just one device; it is pushed to every other node that the change might affect, including nodes provisioned for redundancy. To identify the problem exactly, four kinds of files associated with ACI policy pushes are usually considered: Access logs, Nginx, Policydistr and Policymgr. These four logs describe how the configuration was distributed [1][2][3][4]. However, each log is the output of different managed objects running asynchronously, including the redundant ones. As systems become more intelligent and complex, the output of each process also becomes increasingly complex. Logs that are hard to read and inherently difficult are not only a problem for the customer; they also increase traffic at TAC centers, because even a minor problem may produce incomprehensible log output.
With this project, device configuration pushes on ACI products can be visualized in a GUI, without having to look for data in a Command Line Interface (CLI). The project is built on scalable and reliable methodologies such as finite state machines (FSMs) and JSON representation, so that it can be extended to networking technologies other than ACI wherever log parsing and visualization are needed. The aim is to incorporate the project into the app section of the APIC GUI itself, so that the time spent downloading and decrypting the logs can be reduced.

SOFTWARE DEFINED NETWORKING AND PERFORMANCE OF ACI
Necessity and features of SDN
Growing technological demands require ever more connected devices. As a result, networks and the internet are expanding at an increasing rate. With such large networks, it is hard to configure, control and manage network devices. The Software Defined Networking (SDN) architecture addresses this by separating the control and data forwarding capabilities of networking devices. In traditional networks, these capabilities were vertically stacked on the devices, which limited flexibility [5][6][7][8]. SDN not only helps in configuring devices through software, but also helps in maintaining and managing the network itself when a device misbehaves or a fault arises, since all control plane capabilities are vested in the centralized controller.
To facilitate this, SDN makes use of two types of Application Programming Interfaces (APIs): Southbound APIs and Northbound APIs. The Southbound APIs provide the instruction sets for the functionality of the forwarding devices. They also ensure smooth communication between control plane devices and forwarding devices by defining the communication protocol between them, and hence are important elements in separating the data plane and control plane capabilities of the devices. The Northbound APIs, in contrast, are used by application developers to build applications relevant to the network. In most cases, northbound APIs program the forwarding devices by abstracting the lower-level instruction set used by the southbound APIs [9][10][11][12][13]. In addition, network hypervisors are responsible for the layer III functionality of the SDN, allowing various virtual machines to share a common hardware resource.

Features of Application Centric Infrastructure
ACI combines a full mesh spine-and-leaf topology with finely tuned routing and several protocols that allow the network to be abstracted from the physical infrastructure. This flexible and redundant network enables hosts to be grouped so that communication policies can be applied to define which devices can communicate with each other. Even though ACI uses a completely different communication model from the outside network, the entire ACI topology appears as a regular switch to outside devices. At a very basic level, ACI can be thought of as a Clos network of Nexus 9000 Series Switches with an Application Policy Infrastructure Controller (APIC) management platform. In a Clos network, every lower tier switch is connected to every upper tier switch, creating a redundant, high performance mesh where traffic is distributed between links [14][15][16][17][18].
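The Clos wiring rule described above can be illustrated with a minimal Python sketch. The switch names, the function names `build_leaf_spine` and `paths_between`, and the fabric size are assumptions chosen for illustration; they are not part of ACI or of the tool described in this paper.

```python
from itertools import product


def build_leaf_spine(num_spines, num_leaves):
    """Return the set of (leaf, spine) links of a two-tier Clos fabric."""
    spines = [f"spine{i}" for i in range(1, num_spines + 1)]
    leaves = [f"leaf{i}" for i in range(1, num_leaves + 1)]
    # Every lower tier (leaf) switch connects to every upper tier (spine) switch.
    return set(product(leaves, spines))


def paths_between(links, src_leaf, dst_leaf):
    """List the leaf -> spine -> leaf paths available between two leaves."""
    spines_src = {spine for leaf, spine in links if leaf == src_leaf}
    spines_dst = {spine for leaf, spine in links if leaf == dst_leaf}
    return sorted((src_leaf, spine, dst_leaf) for spine in spines_src & spines_dst)


# Hypothetical fabric: 2 spines and 4 leaves give 2 * 4 = 8 links.
links = build_leaf_spine(num_spines=2, num_leaves=4)
paths = paths_between(links, "leaf1", "leaf3")
print(len(links))
print(paths)
```

Under these assumptions, any two leaves are joined by one path per spine, which is the redundancy that lets traffic be distributed between links.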
The APIC, or network management platform, provides customers with a single place from which to manage the network. Like the Unified Computing System (UCS), the APIC uses abstraction to ease the configuration burden, while pushing down profiles that are tailored to the applications that will be instantiated on the individual switches. In doing so, the ACI solution becomes much more than a Clos network of Nexus 9000 Series Switches with a management platform [19][20][21]. Instead, it becomes a fully integrated, open system. ACI also supports the same access tools that are available on individual Nexus Series Switches.

MANAGED OBJECTS AND DISTRIBUTED MANAGED INFORMATION TREE FOR TROUBLESHOOTING
In ACI, classes are used to abstract and model everything, including policies, resources, topology, events and faults. Each object is an instance of a certain class and inherits its properties and methods. Together, all objects store the configuration and run-time data of the ACI fabric. Managed objects (MOs) are organized as follows: everything is an object, and objects are hierarchically organized. The Distributed Managed Information Tree (DMIT) contains comprehensive system information such as discovered components, system configuration and operational status, including statistics and faults. A single logical DMIT is presented to the user through the REST interface on any APIC. Internally, the DMIT is split into various services and shards across the APICs. In the APIC MIT, everything is fully defined as an object. These definitions include the entity itself and a full description of the entity, including its configuration, state, run-time data, description, reference objects and lifecycle position. Because the tree is distributed across the entire APIC cluster, it may be referred to as the DMIT. The objects are organized hierarchically, creating logical object containers. Every object has a parent, except the root object at the top of the tree. Each object has a class, which describes the type of object, such as a port or a network path. Packages identify the functional areas to which the objects belong.
The DMIT also contains comprehensive system information about discovered components, system configuration, and operational status including statistics and faults. The unified RESTful API automatically delegates a request to the corresponding components in the MIT. Every component and attribute is represented as an object in the Cisco APIC MIT, and every object can be manipulated via REST.
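The hierarchical organization of managed objects can be sketched in a few lines of Python. This is a simplified in-memory model, not the APIC object model or its REST API: the `ManagedObject` class and the `find` helper are hypothetical, and the class names, relative names and attribute in the usage lines are only loosely modeled on ACI naming and should be treated as illustrative.

```python
class ManagedObject:
    """A node in a toy management information tree (hypothetical model)."""

    def __init__(self, class_name, rn, parent=None, **attrs):
        self.class_name = class_name  # type of object, e.g. a port or a path
        self.rn = rn                  # name relative to the parent
        self.parent = parent          # None only for the root object
        self.attrs = attrs            # configuration and run-time data
        self.children = {}
        if parent is not None:
            parent.children[rn] = self

    @property
    def dn(self):
        """Full name built by walking from the root down to this object."""
        if self.parent is None:
            return self.rn
        return f"{self.parent.dn}/{self.rn}"


def find(root, dn):
    """Resolve a full name to an object by descending the tree, or None."""
    parts = dn.split("/")
    if parts[0] != root.rn:
        return None
    node = root
    for rn in parts[1:]:
        node = node.children.get(rn)
        if node is None:
            return None
    return node


# Illustrative objects; names and the attribute are assumptions.
root = ManagedObject("polUni", "uni")
tenant = ManagedObject("fvTenant", "tn-demo", parent=root)
bd = ManagedObject("fvBD", "BD-web", parent=tenant, unkMacUcastAct="flood")
print(bd.dn)
print(find(root, "uni/tn-demo/BD-web").attrs)
```

In this sketch every object except the root has exactly one parent, so a full name identifies one object, which is what allows a request addressed by name to be delegated to the right part of the tree.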

RESULTS AND INFERENCE
The steps involved in automating the troubleshooting process vary with the affected node, the timestamp of the event, the changeset and so on. For the changeset "conversion from proxy to flood", the steps are to find the timestamp from the aaaModLR server, pull out the POST URL from the JSON file, and collect the message ID, transaction ID and so on, where each extraction from the files depends on the previously collected data. Hence, a Finite State Machine (FSM) is built in which these procedures are defined depending on the previous triggers.
These FSMs play an important role in automating the entire procedure. Figure 1 shows a typical FSM for data extraction from the Nginx server, where the inputs are the HTTP method, the timestamp and the affected node, and the output is the Message ID, which is later used for the policy distributor server. These FSMs, in the form of a JSON file, can be imported directly into the Python script responsible for log parsing. To create this FSM and the FSMs related to other configuration pushes, a web UI was developed in which one can define the FSM states, add events and add the relevant callbacks that define the set of operations which must follow one another on a trigger [22][23][24][25].
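The dependency between extraction steps can be sketched as a small state machine in Python. The log lines below are invented placeholders and do not reproduce real ACI log formats; the state names, handler functions and field names are likewise assumptions made for illustration.

```python
import re

# Placeholder log lines: invented formats, NOT real ACI log output.
AUDIT_LOG = ["2020-06-02T10:15:00 change=proxy-to-flood node=101"]
NGINX_LOG = [
    "2020-06-02T10:14:59 GET /api/x msg_id=7",
    "2020-06-02T10:15:00 POST /api/y msg_id=42",
]
DIST_LOG = ["msg_id=41 txn_id=900", "msg_id=42 txn_id=901"]


def from_audit(ctx):
    """Step 1: find the timestamp of the changeset in the audit records."""
    for line in AUDIT_LOG:
        if ctx["changeset"] in line:
            ctx["timestamp"] = line.split()[0]
            return "NGINX"
    return "FAILED"


def from_nginx(ctx):
    """Step 2: use that timestamp to find the POST URL and message ID."""
    for line in NGINX_LOG:
        m = re.match(r"(\S+) POST (\S+) msg_id=(\d+)", line)
        if m and m.group(1) == ctx["timestamp"]:
            ctx["post_url"], ctx["msg_id"] = m.group(2), m.group(3)
            return "DIST"
    return "FAILED"


def from_dist(ctx):
    """Step 3: use the message ID to find the transaction ID."""
    for line in DIST_LOG:
        m = re.match(r"msg_id=(\d+) txn_id=(\d+)", line)
        if m and m.group(1) == ctx["msg_id"]:
            ctx["txn_id"] = m.group(2)
            return "DONE"
    return "FAILED"


HANDLERS = {"AUDIT": from_audit, "NGINX": from_nginx, "DIST": from_dist}


def run(changeset):
    """Drive the machine until it reaches a state that has no handler."""
    ctx = {"changeset": changeset}
    state = "AUDIT"
    while state in HANDLERS:
        state = HANDLERS[state](ctx)
    ctx["final_state"] = state
    return ctx


result = run("proxy-to-flood")
print(result)
```

Each handler reads only what earlier handlers stored in `ctx`, so a missing record stops the chain in a `FAILED` state instead of producing a partial result.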
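How an FSM stored as JSON might be loaded and driven from Python can be sketched as follows. The JSON schema (`initial`, `transitions`, `callback`), the `JsonFsm` class, the callback names and the sample records are all assumptions made for illustration; the paper does not specify the actual file layout produced by the web UI.

```python
import json

# Hypothetical FSM definition; the schema is an assumption, not the tool's format.
FSM_JSON = """
{
  "initial": "idle",
  "transitions": [
    {"from": "idle", "event": "start", "to": "filtering", "callback": "filter_lines"},
    {"from": "filtering", "event": "matched", "to": "done", "callback": "emit_message_id"},
    {"from": "filtering", "event": "no_match", "to": "failed", "callback": null}
  ]
}
"""


class JsonFsm:
    """Table-driven FSM whose transitions come from a parsed JSON document."""

    def __init__(self, spec, callbacks):
        self.state = spec["initial"]
        self.callbacks = callbacks
        self.table = {
            (t["from"], t["event"]): (t["to"], t["callback"])
            for t in spec["transitions"]
        }

    def fire(self, event, ctx):
        """Apply one event; return the next event named by the callback, if any."""
        key = (self.state, event)
        if key not in self.table:
            raise ValueError(f"event {event!r} not allowed in state {self.state!r}")
        self.state, cb_name = self.table[key]
        if cb_name is None:
            return None
        return self.callbacks[cb_name](ctx)


def filter_lines(ctx):
    """Keep records matching the HTTP method, timestamp and affected node."""
    wanted = (ctx["method"], ctx["timestamp"], ctx["node"])
    ctx["hits"] = [
        r for r in ctx["records"]
        if (r["method"], r["timestamp"], r["node"]) == wanted
    ]
    return "matched" if ctx["hits"] else "no_match"


def emit_message_id(ctx):
    """Store the message ID of the first matching record as the output."""
    ctx["message_id"] = ctx["hits"][0]["message_id"]
    return None


CALLBACKS = {"filter_lines": filter_lines, "emit_message_id": emit_message_id}


def run(ctx):
    fsm = JsonFsm(json.loads(FSM_JSON), CALLBACKS)
    event = "start"
    while event is not None:
        event = fsm.fire(event, ctx)
    return fsm.state


# Invented sample records standing in for parsed log entries.
records = [
    {"method": "GET", "timestamp": "t1", "node": "101", "message_id": "7"},
    {"method": "POST", "timestamp": "t2", "node": "101", "message_id": "42"},
]
ctx = {"method": "POST", "timestamp": "t2", "node": "101", "records": records}
final_state = run(ctx)
print(final_state, ctx.get("message_id"))
```

Keeping the transitions in data and the callbacks in a registry means a new configuration push can be supported by supplying another JSON file, without changing the driver loop.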

The required information at the different stages of an ACI policy push can be viewed easily, because the data flow path is clearly defined and the relevant information and keywords are clearly displayed. Traditionally, engineers had to grep for the required data across many showtech files and folders using the Command Line Interface (CLI). In the GUI, hovering over a node displays the relevant information for that stage. The JSON files corresponding to that stage can also be viewed by zooming into the node.
From these results, it can be inferred that the entire ACI troubleshooting process can be automated with a user-friendly GUI that shows all the information needed for troubleshooting by extracting the relevant information at every layer of the ACI. The information needed is not the same for all cases; this is addressed by using finite state machines that cover all the possible cases and define the data flow path for each of them. Because it relies on FSM models, the proposed solution is not limited to ACI and can be extended to other areas that involve CLI-based log analysis to extract relevant information. Different FSMs can be designed to suit different applications, and a JSON file is finally generated that holds all the relevant information in the form of a dictionary. All the required data can then be viewed in a readable format by uploading this JSON file to the GUI rather than collecting it manually. Two notable features of this automation tool are that it reduces the time spent on troubleshooting while adding value for the ACI team, and that it is scalable to other technologies.

CONCLUSION
The ACI architecture ensures agility, programmability, better performance and reduced complexity. It is based on the application centric model, in which application requirements decide the network and not the other way around. As a result, it is a promising direction for future network topologies: individual physical devices need not be configured explicitly, because the configuration is applied automatically whenever a new policy is pushed to the Application Policy Infrastructure Controller. This saves a great amount of time and avoids the errors that may occur through manual configuration. This work is an attempt to save engineers even more time when troubleshooting issues raised by configuration changes or policy pushes made to the ACI fabric. Solving an ACI case generally takes approximately two weeks of grepping through all the logs for audit. With the automation tool described above, this process is expected to become faster and more reliable.