Traffic Dictator v1.5 Release Notes

Summary

Traffic Dictator version 1.5 was released on 08.05.2025. This article describes the changes in the new version.

New feature: Controller Redundancy

It is now possible to synchronize parts of the configuration between two instances of TD. SR-TE does not require any state sync, and config synchronization can also be achieved without this feature – e.g. by configuring a network automation tool to push the same config to multiple TD instances.

However, if the operator prefers to configure TD manually via the CLI or GUI, this new feature is useful: any SR-TE config change on one controller is automatically replicated to the other.

As of 1.5, the following config sections are synchronized:

  • traffic-eng affinities
  • traffic-eng policies
  • traffic-eng peer-groups
  • traffic-eng explicit-paths

New config commands:

management redundancy
   !
   key 
   role [master|backup]
   neighbor <ipv4|ipv6>

Typical redundancy designs

BGP-LU and PCEP require a session with each router where traffic engineering policies are pushed. Therefore, each router will have a session with each controller and will receive two copies of each SR-TE policy. Any SR-TE config changes will be synchronized between the master and backup TD instances.

With BGP-SRTE, it is possible to configure sessions between TD and a route reflector, which will propagate SR-TE policies to the other routers. There is no need to have a BGP session with each router. The master and backup TD instances are configured with different SR-TE distinguishers (the “router general” section is not synced, so the controllers can have different config there), and the route reflector will receive and propagate two different BGP-SRTE NLRI. Therefore, if one of the controllers fails, the router will already have an SR-TE policy from the second controller, so there will be no interruption.
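
To illustrate why this works, here is a minimal Python sketch, assuming the route reflector keys SR-TE policies on (distinguisher, color, endpoint) as in the BGP SR-TE NLRI; the backup distinguisher value below is made up:

# Conceptual model only: the route reflector keeps one entry per NLRI key,
# so the same policy advertised with two different distinguishers stays
# installed twice - once per controller.
rr_table = {}

def receive_srte_nlri(distinguisher, color, endpoint, controller):
    rr_table[(distinguisher, color, endpoint)] = controller

def withdraw_from(controller):
    # Remove everything learned from a failed controller.
    for key in [k for k, src in rr_table.items() if src == controller]:
        del rr_table[key]

# Master and backup advertise the same policy (color 4, same endpoint),
# but with different distinguishers, so both NLRI coexist on the RR.
receive_srte_nlri(16777220, 4, "10.100.20.105", "TD-master")
receive_srte_nlri(16777221, 4, "10.100.20.105", "TD-backup")   # backup value is made up

withdraw_from("TD-master")       # master fails
print(rr_table)                  # the backup's copy is still there - no interruption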

Redundancy config example

Master config:

management redundancy
   key my_redundancy_key
   role master
   neighbor 172.17.0.2

Backup config:

management redundancy
   key my_redundancy_key
   role backup
   neighbor 172.17.0.1

How config sync works

The backup instance initiates a connection on TCP port 2011 to the master instance. Initially config_version is set to 0, and after any redundancy config change, config_version is reset to 0. When config_version is 0 on both ends, the backup TD deletes the syncable config sections (see above) and receives the config from the master. Any change in a syncable config section (either on master or backup) is synchronized to the other peer, incrementing config_version on each change.
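
A rough Python model of this flow (conceptual only; the names and the omitted transport details are made up, and the real implementation may differ):

# Conceptual model of the sync flow described above (not TD's code).
SYNCABLE_SECTIONS = ["affinities", "policies", "peer-groups", "explicit-paths"]

class Instance:
    def __init__(self, role):
        self.role = role
        self.config_version = 0   # reset to 0 after any redundancy config change
        self.config = {section: {} for section in SYNCABLE_SECTIONS}

def initial_sync(master, backup):
    # The backup connects to the master on tcp/2011 (transport omitted here).
    if master.config_version == 0 and backup.config_version == 0:
        # Backup wipes its syncable sections and takes a full copy from the master.
        backup.config = {s: dict(master.config[s]) for s in SYNCABLE_SECTIONS}

def apply_change(local, peer, section, name, body):
    # A change to a syncable section on either peer is replicated to the
    # other peer, and config_version is incremented on each change.
    local.config[section][name] = body
    peer.config[section][name] = body
    local.config_version += 1
    peer.config_version += 1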

Failure scenarios

  1. Backup TD fails: when the backup comes up again, it will sync config from the master.
  2. Master TD fails: when the master comes up again, its config_version will be 0 while the backup's config_version is higher, so the master will sync config from the backup.
  3. Split brain: when the session comes up again, both the backup and the master will have a config_version other than 0. In this case, the backup will sync config from the master, effectively deleting all config changes made on the backup since the split brain occurred. Therefore, it is recommended to make config changes only on the master. These rules are summarized in the sketch below.
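
The three scenarios reduce to a single decision when the session (re)establishes; a sketch under the rules above (not TD's code):

def sync_direction(master_version, backup_version):
    # Which instance pulls config from the other when the redundancy
    # session comes up.
    if backup_version == 0:
        return "backup syncs from master"    # scenario 1, or initial sync
    if master_version == 0:
        return "master syncs from backup"    # scenario 2: master restarted
    # Both non-zero (split brain): master wins, backup changes are lost.
    return "backup syncs from master"        # scenario 3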

In future versions, config changes on the backup may be locked while the redundancy session is active. For now they are allowed, but not recommended.

Verification and troubleshooting

One TD must be configured as master and the other as backup, and communication on tcp/2011 must be allowed between the two controllers.

The redundancy key must match on both instances (this protects against misconfiguration).

Both TD instances must have the same software version. During a software upgrade, it is OK to break redundancy – it is not a critical element; it only synchronizes config changes. There is no state sync between TD instances.
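
If the redundancy session does not establish, a quick generic check is to verify that tcp/2011 is reachable from the backup host. This is plain Python, not a TD command; replace the address with your master's IP:

import socket

# Generic reachability check for the redundancy port.
try:
    with socket.create_connection(("172.17.0.1", 2011), timeout=5):
        print("tcp/2011 reachable")
except OSError as err:
    print(f"tcp/2011 not reachable: {err}")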

Verify redundancy status:

TD1#sh redundancy
Redundancy session statistics

  Role:                          backup              
  Key:                           my_redundancy_key   
  Neighbor IP:                   172.17.0.1          
  Redundancy server running:     True                
  Config version:                1                   
  Command server queue:          0                   
  Greenthreads available:        996                 
  Config changes queued:         0                   
  Running sessions count:        1                   
  Running sessions:              ['172.17.0.1']      

Redundancy neighbor is 172.17.0.1, local IP 172.17.0.2
  Redundancy version 15
  Last read 0:00:06, last write 0:00:22
  Hold time is 120, keepalive interval is 30 seconds
  Hold timer is active, time left 0:01:54
  Keepalive timer is active, time left 0:00:08
  Connect timer is inactive
  Idle hold timer is inactive
  Session state is Established, up for 0:44:53
  Number of transitions to established: 1
  Last state was OpenConfirm
                         Sent       Rcvd
    Opens:                  1          1
    Updates:                0          1
    Closes:                 0          0
    Keepalives:            90         91

    Total messages:        91         93
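
The timers in this output appear to follow BGP-like semantics: a keepalive is sent every 30 seconds, and the session is declared down if nothing is received for the 120-second hold time. A simplified sketch of the assumed logic (not TD's code):

import time

HOLD_TIME = 120          # seconds, from "Hold time is 120"
KEEPALIVE_INTERVAL = 30  # seconds, from "keepalive interval is 30 seconds"

def check_timers(last_read, last_write, now=None):
    # last_read / last_write correspond to the "Last read" and "last write"
    # timestamps in the output above.
    if now is None:
        now = time.monotonic()
    session_alive = (now - last_read) < HOLD_TIME       # hold timer not expired
    need_keepalive = (now - last_write) >= KEEPALIVE_INTERVAL
    return session_alive, need_keepalive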

Debug command:

TD1#debug redundancy ?
  

Policy engine improvements

Policy debugging

Thanks to the log-reload Rust crate, it is now possible to enable detailed debugging to troubleshoot policy engine issues.
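
Conceptually, per-policy debugging is a log filter whose level and match string can be changed at runtime. TD does this with the log-reload Rust crate; the Python sketch below only illustrates the idea using the standard logging module, and all names in it are made up:

import logging

class PolicyFilter(logging.Filter):
    # Pass DEBUG records only for policies that were explicitly enabled;
    # "*" enables debugging for all policies.
    def __init__(self):
        super().__init__()
        self.enabled = set()

    def filter(self, record):
        if record.levelno > logging.DEBUG:
            return True
        policy = getattr(record, "policy", None)
        return policy in self.enabled or "*" in self.enabled

policy_filter = PolicyFilter()
log = logging.getLogger("policy-engine")
log.setLevel(logging.DEBUG)
handler = logging.StreamHandler()
handler.addFilter(policy_filter)
log.addHandler(handler)

# Rough equivalent of "debug traffic-eng policy name R1_ISP5_BLUE_ONLY_IPV4":
policy_filter.enabled.add("R1_ISP5_BLUE_ONLY_IPV4")
log.debug("resolving headend", extra={"policy": "R1_ISP5_BLUE_ONLY_IPV4"})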

Debugs:

TD1#debug traffic-eng policy ?
  server               Policy server debug
  engine               Policy engine debug
  name                 Debug a specific policy

Debug a specific policy (or all policies):

TD1#debug traffic-eng policy name ?
  <POLICY_NAME|*>      Debug traffic engineering policy calculation

Policy debug example

TD1#debug traffic-eng policy name R1_ISP5_BLUE_ONLY_IPV4
Enabled debugging for Policy R1_ISP5_BLUE_ONLY_IPV4
TD1#clear traffic-eng *
Requested manual reoptimization of all policies

Check debugs:

TD1#show logg | grep R1_ISP5_BLUE_ONLY_IPV4

2025-05-08 09:37:23,327 TD1 WARNING: Policy-server: Enabling debug for Policy R1_ISP5_BLUE_ONLY_IPV4
2025-05-08 09:37:23,486 TD1 WARNING: Policy-engine: Enabled debug for policy R1_ISP5_BLUE_ONLY_IPV4
2025-05-08 09:37:32,530 TD1 DEBUG: Policy-engine: Starting calculating policy R1_ISP5_BLUE_ONLY_IPV4
2025-05-08 09:37:32,633 TD1 DEBUG: Policy-engine: Policy R1_ISP5_BLUE_ONLY_IPV4: resolving headend
2025-05-08 09:37:32,633 TD1 DEBUG: Policy-engine: Policy R1_ISP5_BLUE_ONLY_IPV4: resolved headend to 0001.0001.0001.00, protocol isis, topology_id 101
2025-05-08 09:37:32,633 TD1 DEBUG: Policy-engine: Policy R1_ISP5_BLUE_ONLY_IPV4: resolving SRLB range
2025-05-08 09:37:32,633 TD1 DEBUG: Policy-engine: Policy R1_ISP5_BLUE_ONLY_IPV4: resolved SRLB base 15000, range 1000
2025-05-08 09:37:32,633 TD1 DEBUG: Policy-engine: Policy R1_ISP5_BLUE_ONLY_IPV4: checking candidate path
2025-05-08 09:37:32,633 TD1 DEBUG: Policy-engine: Policy R1_ISP5_BLUE_ONLY_IPV4: calculating dynamic candidate path 100 
2025-05-08 09:37:32,633 TD1 DEBUG: Policy-engine: Policy R1_ISP5_BLUE_ONLY_IPV4: dynamic candidate path 100 - resolving SID structure
2025-05-08 09:37:32,634 TD1 DEBUG: Policy-engine: Policy R1_ISP5_BLUE_ONLY_IPV4: dynamic candidate path 100 - generating segment lists
2025-05-08 09:37:32,634 TD1 DEBUG: Policy-engine: Policy R1_ISP5_BLUE_ONLY_IPV4: dynamic candidate path 100 - attaching EPE label
2025-05-08 09:37:32,634 TD1 DEBUG: Policy-engine: Policy R1_ISP5_BLUE_ONLY_IPV4: dynamic candidate path 100 - found EPE label 24015, local_ip 10.100.20.11, remote_ip 10.100.20.105
2025-05-08 09:37:32,634 TD1 DEBUG: Policy-engine: Policy R1_ISP5_BLUE_ONLY_IPV4: dynamic candidate path 100 - checking MSD
2025-05-08 09:37:32,634 TD1 DEBUG: Policy-engine: Policy R1_ISP5_BLUE_ONLY_IPV4: dynamic candidate path 100 - reserving bandwidth
2025-05-08 09:37:32,634 TD1 DEBUG: Policy-engine: Policy R1_ISP5_BLUE_ONLY_IPV4: successfully calculated candidate path 100
2025-05-08 09:37:32,634 TD1 DEBUG: Policy-engine: Policy R1_ISP5_BLUE_ONLY_IPV4: generating route_key
2025-05-08 09:37:32,634 TD1 DEBUG: Policy-engine: Policy R1_ISP5_BLUE_ONLY_IPV4: generated route_key [96][16777220][4][10.100.20.105]

Verify policy by key:

TD1#show pcep ipv4 sr-te [96][16777220][4][10.100.20.105]
PCEP SR-TE routing table information

PCEP routing table entry for [96][16777220][4][10.100.20.105]
    Policy name: R1_ISP5_BLUE_ONLY_IPV4
    Headend: 1.1.1.1
    Endpoint: 10.100.20.105, Color 4
    Install peer: 192.168.0.101
    Last modified: May 08, 2025 09:37:33
      Route acked by PCC, PLSP-ID 2
        LSP-ID     Oper status
             2      Active (2)
      Metric type igp, metric 40
      Binding SID: 15004
      ENLP: "none", Override: True
      Segment list: [16010, 24013, 24015]
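
The route key appears to follow the BGP SR-TE NLRI structure: the leading [96] is presumably the NLRI length in bits (32-bit distinguisher + 32-bit color + 32-bit IPv4 endpoint), followed by the SR-TE distinguisher, color and endpoint visible in the policy output. A small sketch of that interpretation (not TD's code):

def build_route_key(distinguisher, color, endpoint):
    # 32-bit distinguisher + 32-bit color + 32-bit IPv4 endpoint = 96 bits
    nlri_len = 32 + 32 + 32
    return f"[{nlri_len}][{distinguisher}][{color}][{endpoint}]"

print(build_route_key(16777220, 4, "10.100.20.105"))
# [96][16777220][4][10.100.20.105]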

Limitations

Due to limitations of the kafka Rust crate, it is not yet possible to print debugs from all the functions I would like to. Hopefully this will be solved in a future version.

Path failure reason

Prior to 1.5, if a candidate path failed, TD would sometimes give a reason (e.g. invalid config, or the headend or endpoint could not be found), but it often gave only the generic error “Error when resolving segment list”:

TD1#show traffic-eng policy R1_ISP5_BLUE_ONLY_IPV4 detail 
Detailed traffic-eng policy information:

Traffic engineering policy "R1_ISP5_BLUE_ONLY_IPV4"

    Valid config, Reason failed: All candidate paths failed
    Headend 1.1.1.1, topology-id 101, Maximum SID depth: 10
    Endpoint 10.100.20.105, color 4

    Setup priority: 7, Hold priority: 7
    Install direct, protocol pcep, peer 192.168.0.101
    Policy index: 4, SR-TE distinguisher: 16777220
    Binding-SID: 15004

    Candidate paths:
        Candidate-path preference 100
            Path config valid
            Metric: igp
            Path-option: dynamic
            Affinity-set: BLUE_ONLY
                Constraint: include-all
                List: ['BLUE']
                Value: 0x1
            Path failed, reason: Error when resolving segment list

    Policy statistics:
        Last config update: 2025-05-07 17:40:59,251
        Last recalculation: 2025-05-07 17:41:44.405
        Policy calculation took 0 miliseconds

Now the path failure reason is more verbose; for example:

TD1#show traffic-eng policy R1_ISP5_BLUE_ONLY_IPV4 detail
Detailed traffic-eng policy information:

Traffic engineering policy "R1_ISP5_BLUE_ONLY_IPV4"

    Valid config, Reason failed: All candidate paths failed
    Headend 1.1.1.1, topology-id 101, Maximum SID depth: 10
    Endpoint 10.100.20.105, color 4

    Setup priority: 7, Hold priority: 7
    Install direct, protocol pcep, peer 192.168.0.101
    Policy index: 4, SR-TE distinguisher: 16777220
    Binding-SID: 15004

    Candidate paths:
        Candidate-path preference 100
            Path config valid
            Metric: igp
            Path-option: dynamic
            Affinity-set: BLUE_ONLY
                Constraint: include-all
                List: ['BLUE']
                Value: 0x1
            Path failed, reason: SPF failed

    Policy statistics:
        Last config update: 2025-05-07 17:55:55,717
        Last recalculation: 2025-05-07 17:56:31.301
        Policy calculation took 0 miliseconds

This means that a CSPF path satisfying the given constraints does not exist in the topology.
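
In other words, once the links that do not satisfy the affinity constraint (here include-all BLUE, value 0x1) are pruned, no path remains between headend and endpoint. A simplified Python sketch of that check, with an illustrative topology and made-up metrics:

from heapq import heappush, heappop

def cspf(graph, src, dst, include_all=0x0):
    # graph: {node: [(neighbor, igp_metric, affinity_bits), ...]}
    # Prune links whose affinity bits do not contain all required bits,
    # then run plain Dijkstra on what is left.
    dist, heap = {src: 0}, [(0, src)]
    while heap:
        d, node = heappop(heap)
        if node == dst:
            return d
        if d > dist.get(node, float("inf")):
            continue
        for nbr, metric, affinity in graph[node]:
            if (affinity & include_all) != include_all:
                continue                      # constraint not satisfied: prune link
            nd = d + metric
            if nd < dist.get(nbr, float("inf")):
                dist[nbr] = nd
                heappush(heap, (nd, nbr))
    return None                               # no constrained path: "SPF failed"

# Tiny example: the only link towards R3 is not BLUE, so CSPF fails.
topo = {"R1": [("R2", 10, 0x1)], "R2": [("R3", 10, 0x2)], "R3": []}
print(cspf(topo, "R1", "R3", include_all=0x1))   # None -> SPF failed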

Another example:

TD1#show traffic-eng policy R1_ISP5_BLUE_ONLY_IPV4 detail
Detailed traffic-eng policy information:

Traffic engineering policy "R1_ISP5_BLUE_ONLY_IPV4"

    Valid config, Reason failed: All candidate paths failed
    Headend 1.1.1.1, topology-id 101, Maximum SID depth: 10
    Endpoint 10.100.20.105, color 4

    Setup priority: 7, Hold priority: 7
    Install direct, protocol pcep, peer 192.168.0.101
    Policy index: 4, SR-TE distinguisher: 16777220
    Binding-SID: 15004

    Candidate paths:
        Candidate-path preference 100
            Path config valid
            Metric: igp
            Path-option: dynamic
            Affinity-set: BLUE_ONLY
                Constraint: include-all
                List: ['BLUE']
                Value: 0x1
            Path failed, reason: Unable to get Prefix SID for node 0010.0010.0010.00

    Policy statistics:
        Last config update: 2025-05-07 17:55:55,717
        Last recalculation: 2025-05-07 17:59:51.060
        Policy calculation took 0 miliseconds

This means TD was able to calculate the CSPF path, but to steer traffic over that path it needs a Prefix SID for node 0010.0010.0010.00, and no such Prefix SID is available.
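
A sketch of that second step, assuming a SID database keyed by node ID (the SID values are illustrative, not taken from a real topology):

def resolve_prefix_sids(node_path, sid_db):
    # Map each node on the computed CSPF path to its Prefix SID;
    # fail the candidate path if any node has no Prefix SID advertised.
    labels = []
    for node in node_path:
        sid = sid_db.get(node)
        if sid is None:
            raise ValueError(f"Unable to get Prefix SID for node {node}")
        labels.append(sid)
    return labels

sid_db = {"0001.0001.0001.00": 16001}   # 0010.0010.0010.00 missing on purpose
try:
    resolve_prefix_sids(["0010.0010.0010.00"], sid_db)
except ValueError as err:
    print(err)   # Unable to get Prefix SID for node 0010.0010.0010.00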

Bug fixes

1. When the TD container is stopped and started, kafka sometimes fails to start (bug #42). This is caused by a race condition in kafka; kafka is now configured to restart on failure.

2. Incorrect display of uptime in the CLI (bug #43). This only happens with uptimes longer than one month, and it only affects the CLI display. The display format has been changed.

Download

You can download the new version of Traffic Dictator from the Downloads page.
