Network troubleshooting guide

This guide is for customer network teams.

This guide outlines the collaborative steps required to diagnose and resolve network performance issues. It is organized as a prioritized flowchart, starting with the simplest, most common issues before moving to more complex investigations.

NOTE

This guide includes steps to debug your network, along with best practices for cloud solutions. Run through these steps before contacting the Tulip support team about networking issues.

Prerequisite: Review application best practices

Before proceeding with a deep network investigation, it's essential to ensure your application is designed for optimal performance. Many performance issues can be resolved by addressing application design, such as high-frequency API calls or inefficient triggers.

Review the guide on application design here.

If your app already follows these guidelines, proceed through the network investigation phases below:

Phase 1: Foundational checks

Quickly rule out local, client-side, and physical layer issues with minimal effort.

1. Run diagnostics from the affected station

  • Action: Run the following commands against BOTH the target service AND a stable, high-bandwidth public endpoint (e.g., google.com or cloudflare.com) for comparison; a command sketch follows this step.
    • Extended ping (50-100 pings or more) to show min/avg/max latency and packet loss.
    • ping 1.1.1.1 (Cloudflare)
    • ping 8.8.8.8 (Google)
    • pathping (if on Windows)
    • DNS lookups: nslookup, dig, drill
  • Priority reason: This is the fastest data to gather. The comparison test is critical and immediately tells you if the problem is specific to the service path or a general internet issue from that station.
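
  A minimal sketch of this comparison test, assuming a Linux or macOS station and using your-instance.tulip.co as a placeholder for your actual instance hostname:

    # Extended ping to the target service and to public anchors for comparison
    ping -c 100 your-instance.tulip.co
    ping -c 100 1.1.1.1        # Cloudflare
    ping -c 100 8.8.8.8        # Google
    # DNS resolution check, showing the resolver actually being used
    dig your-instance.tulip.co
    nslookup your-instance.tulip.co
    # On Windows, use ping -n 100 and pathping your-instance.tulip.co instead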

2. Perform physical hardware inspection

  • Action: Conduct an exhaustive physical hardware inspection for the affected station(s). Check for faulty Ethernet cables, improperly seated network cards, or issues with the physical switch port.
  • Priority reason: Physical layer problems are fundamental and can mimic more complex network issues. Ruling them out first prevents wasted time on software-level diagnosis.
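
  Software-visible error counters often confirm a bad cable, port, or NIC before anyone touches hardware. A minimal sketch, assuming a Linux station with interface eth0:

    # Link speed, duplex, and link state as negotiated with the switch
    ethtool eth0
    # RX/TX errors and drops on the interface
    ip -s link show eth0
    # Driver-level statistics (CRC errors, alignment errors, etc.)
    ethtool -S eth0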

3. Verify local station configuration

  • Action: Check the station's DNS settings, local proxy configurations (for any redirects), local security/firewall software, and investigate DHCP server logs to confirm the station is not experiencing frequent DHCP renegotiations.
  • Priority reason: These are client-specific settings that can cause issues for a single user without affecting the wider network. It's the next logical place to look after ruling out physical and general internet problems.
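
  A minimal sketch of these local checks, assuming a Linux station managed by NetworkManager (Windows equivalents noted in comments):

    # DNS servers and search domains currently in use
    cat /etc/resolv.conf
    # Proxy settings picked up by command-line tools
    env | grep -i proxy
    # Current address, gateway, and DNS as reported by NetworkManager
    nmcli device show eth0
    # On Windows, ipconfig /all shows the same information plus the DHCP
    # lease obtained/expires times, useful for spotting frequent renewals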

4. Review local system logs for recent events

  • Action: From the affected station, check system event logs (e.g., Windows Event Viewer, macOS Console, or Linux journalctl/dmesg) for any recent network-related errors, warnings, or significant changes that align with the onset of the performance issue. Look for events related to network adapter state changes, DHCP lease failures, DNS resolution problems, or firewall/security software alerts.
  • Priority reason: This helps identify if the problem is a new occurrence or linked to a recent configuration change or a preceding localized network issue that might have left traces on the client machine.
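
  A minimal sketch for a Linux station; the timestamps are placeholders for the reported slowdown window:

    # Warnings and errors logged around the slowdown window
    journalctl --since "2024-01-15 09:00" --until "2024-01-15 10:00" -p warning
    # NIC link flaps and driver messages
    dmesg | grep -iE "eth0|link is (up|down)"
    # DHCP and DNS activity from NetworkManager / systemd-resolved
    journalctl -u NetworkManager -u systemd-resolved --since "2024-01-15 09:00"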

Phase 2: Correlated infrastructure analysis

Use specific event data to investigate shared network infrastructure. This phase requires IT access and collaboration.

5. Pinpoint timestamps and review internal logs

  • Action: Correlate user-reported slowdown times with internal network device logs. Gather your network diagram and the logs from the affected timeframe (including proxy logs, if relevant), and look for:
    • High network jitter
    • Packet loss or high retransmission rates
    • Local network congestion or bandwidth saturation events
  • Priority reason: This focuses the investigation on precise moments, making it far more efficient than searching through days of logs. This must be done before engaging external parties.
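
  To make this correlation easier, a timestamped latency log from an affected station can be left running and later matched against device logs. A minimal sketch, assuming Linux iputils ping and a placeholder hostname:

    # -D prefixes each reply with a Unix timestamp; spikes in the log can then
    # be lined up with firewall/switch logs (mind timezone differences)
    ping -D -i 1 your-instance.tulip.co | tee ping_log.txt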

6. Review firewall and security appliance performance

  • Action: During the problem timestamps, review firewall/proxy logs for blocked TCP/UDP connections and check appliance resources (CPU, RAM, connection counts). Confirm that all required Tulip IP addresses and domains are whitelisted and are not subject to overly aggressive inspection that adds latency. Pay attention to proxy-specific metrics such as connection limits, concurrent sessions, and latency introduced by deep packet inspection; a timing sketch follows this step.
  • Priority reason: Security appliances are a common bottleneck. This checks for both explicit blocking rules and performance degradation from overloaded hardware or intrusive inspection policies.
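
  Per-phase request timings help separate proxy/firewall overhead from general network latency. A minimal sketch using curl against a placeholder instance hostname:

    # DNS, TCP connect, TLS handshake, and time-to-first-byte breakdown
    curl -o /dev/null -s https://your-instance.tulip.co \
      -w "dns=%{time_namelookup} connect=%{time_connect} tls=%{time_appconnect} ttfb=%{time_starttransfer} total=%{time_total}\n"
    # Repeat with the proxy bypassed (if policy allows) to isolate proxy overhead
    curl --noproxy '*' -o /dev/null -s -w "total=%{time_total}\n" https://your-instance.tulip.co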

7. Examine Quality of Service (QoS) configuration

  • Action: If you have QoS policies implemented, verify how traffic to and from the application's services is categorized. Ensure it is not being accidentally deprioritized.
  • Priority reason: QoS misconfiguration is a specific and often overlooked cause of application performance issues, especially during periods of high network activity.
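
  One way to spot-check classification is to look at the DSCP marking on live traffic. A sketch, assuming a Linux station, HTTPS traffic on port 443, and a placeholder hostname:

    # Verbose output includes the IP "tos" byte; DSCP is that value >> 2,
    # and 0 means best-effort / unmarked traffic
    sudo tcpdump -n -v -c 20 'host your-instance.tulip.co and tcp port 443'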

8. Analyze overall bandwidth and SNMP data

  • Action: Review site-wide peak bandwidth usage and check SNMP data from routers/switches for high utilization or errors during the problem times.
  • Priority reason: This determines if the issue is related to overall network capacity or contention.
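
  A minimal sketch for pulling interface counters over SNMP, assuming net-snmp tools, SNMP v2c, a read-only community string "public", and 192.0.2.1 as a placeholder device address:

    # Map interface indexes to names, then check error and traffic counters
    snmpwalk -v2c -c public 192.0.2.1 IF-MIB::ifDescr
    snmpwalk -v2c -c public 192.0.2.1 IF-MIB::ifInErrors
    snmpwalk -v2c -c public 192.0.2.1 IF-MIB::ifHCInOctets
    # Poll the octet counters at intervals and diff them against the interface
    # speed (IF-MIB::ifHighSpeed, Mbit/s) to estimate utilization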

9. Analyze ISP network path

  • Action: Engage your ISP, providing them with the specific timestamps of high latency.
  • Specifics for ISP: Ask them to look for suboptimal routing, peering point congestion, packet loss/increased latency on their segments or transit paths, or any relevant network policies active during those specific times.
  • Priority reason: This involves an external party and is most effective when you can provide them with precise data and timestamps from your internal investigation.
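
  A hop-by-hop loss and latency report is the most useful evidence to hand over along with the timestamps. A minimal sketch, assuming mtr is installed and using a placeholder hostname:

    # Run during a slowdown and record the start time (ideally in UTC)
    mtr --report --report-wide --report-cycles 100 your-instance.tulip.co
    # Compare against a public anchor to show whether the loss is path-specific
    mtr --report --report-wide --report-cycles 100 1.1.1.1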

Phase 3: Advanced deep dive

Use highly technical, resource-intensive methods to find the most elusive problems.

10. Advanced packet-level deep dive

  • Action: Perform a full packet capture (pcap) on an affected machine during a slowdown event, focusing on correlating timestamps across all layers of the analysis. Analyze the capture for packet manipulation (e.g., TTL changes), protocol errors (e.g., TCP retransmissions), or specific ICMP error messages; a capture sketch follows this step.
  • Priority reason: This is the most complex step. It provides the ultimate ground truth but is reserved for when all other investigations have failed to find the cause.
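
  A minimal capture-and-triage sketch, assuming a Linux station, interface eth0, and a placeholder hostname; tshark (from Wireshark) is used for the first pass over the capture:

    # Capture traffic to/from the instance during a slowdown (requires root)
    sudo tcpdump -i eth0 -s 0 -w slowdown.pcap host your-instance.tulip.co
    # First-pass triage: retransmissions, resets, and ICMP destination-unreachable
    tshark -r slowdown.pcap -Y "tcp.analysis.retransmission or tcp.flags.reset == 1 or icmp.type == 3"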