Fix: Knowledge Base Uploads Stuck On Production

by Admin

Having trouble with your knowledge base uploads getting stuck on production? You're not alone! Many users have experienced this frustrating issue, where uploads get stuck at various percentages, like 25%, 31%, 65%, or even 98%. The good news is, there's a solution! This article dives deep into the root cause of the problem and offers a clear, actionable fix to get your uploads running smoothly again.

The Problem: Stuck Knowledge Base Uploads

So, what's the deal? You kick off a knowledge base upload, everything seems fine initially, but then it grinds to a halt. It's like hitting a brick wall, and you're left wondering why. This issue primarily affects production environments, while uploads often work perfectly fine on local setups.

Here are some examples of stuck uploads reported by users:

  • "Awareness: The Perils and Opportunities of Reality" - Stuck at 50/162 chunks (31%)
  • "Rewire: A Radical Approach..." - Stuck at 150/229 chunks (65%)
  • "The Untethered Soul" - Stuck at 192/196 chunks (98%)

Imagine getting so close to the finish line, only to have the upload stall at 98%! Frustrating, right? Understanding the root cause is the first step to resolving this issue and ensuring smooth knowledge base updates.

Root Cause: The Batch Processing Bottleneck

The culprit behind these stuck uploads lies in the batch processing system. This system relies on the wp_remote_post() function to automatically trigger subsequent batches. Specifically, lines 1924-1939 in class-ccc-rag-handler.php are where the trouble begins. Let's take a look at the code snippet:

wp_remote_post(admin_url('admin-ajax.php'), array(
    'blocking' => false,
    'timeout' => 0.01,
    'body' => array(
        'action' => 'ccc_process_next_batch',
        'session_id' => $session_id,
        'security' => wp_create_nonce('pm_message_nonce')
    )
));

This code attempts to initiate the next batch of the upload process. However, several factors can cause this process to fail on production servers. Let's break down why this approach is unreliable in a production environment.

1. Loopback Requests Blocked

Many production servers are configured to block loopback HTTP requests. What are loopback requests? These are requests where the server tries to connect to itself. This security measure is in place to prevent certain types of attacks. However, in this case, it prevents the wp_remote_post() function from working correctly, as it's essentially trying to make a request back to the same server.
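A quick way to confirm whether loopbacks are the problem on a given host is to send one deliberately and look at the result. The sketch below is only an illustration: the function name ccc_can_loopback() and the action name ccc_loopback_probe are made up for this example and are not part of the plugin. It uses a blocking request with a realistic timeout so that a failure actually surfaces as a WP_Error.

```php
<?php
/**
 * Hypothetical diagnostic helper (not part of the plugin): reports whether
 * this site can reach its own admin-ajax.php over HTTP.
 */
function ccc_can_loopback() {
    // Blocking request with a generous timeout, so failures are visible
    // instead of being silently discarded.
    $response = wp_remote_post( admin_url( 'admin-ajax.php' ), array(
        'blocking' => true,
        'timeout'  => 10,
        'body'     => array( 'action' => 'ccc_loopback_probe' ), // made-up action name
    ) );

    if ( is_wp_error( $response ) ) {
        // Connection refused, DNS failure, firewall drop, timeout, etc.
        error_log( 'Loopback failed: ' . $response->get_error_message() );
        return false;
    }

    // Any HTTP status means the request reached a web server. Log it so a
    // firewall block page (e.g. 403) can be told apart from WordPress itself.
    error_log( 'Loopback status: ' . wp_remote_retrieve_response_code( $response ) );
    return true;
}
```

The Site Health screen in the WordPress admin (Tools > Site Health) also runs a loopback check, which can be a faster first look than adding code.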

2. Nonce Verification Issues

Nonces are security tokens used to prevent Cross-Site Request Forgery (CSRF) attacks, and WordPress ties each one to the current user and login session. In the snippet above, wp_create_nonce() runs inside the logged-in user's request, but the background request sent by wp_remote_post() carries no cookies, so it reaches admin-ajax.php as an anonymous visitor. A nonce checked in that context can fail verification, which interrupts the batch processing. It's like a secret handshake agreed between two specific people: send someone else in your place, and the handshake doesn't match up.

3. Timeout Too Short

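The timeout parameter is set to a mere 0.01 seconds. Because 'blocking' is false, the caller never waits for a response anyway, but those 10 milliseconds also have to cover connecting to the server. On a production host, where the connection may involve a DNS lookup and a TLS handshake, that is often not enough, and the request can be abandoned before it ever reaches admin-ajax.php. It's like trying to start a car with a dead battery: you might get a flicker, but it's not going anywhere.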

4. No Retry Mechanism

If a single batch fails, the entire upload process gets stuck. There's no built-in mechanism to retry failed batches. It's a single point of failure, and if that point fails, the whole process grinds to a halt.

5. Browser Dependency

The frontend polling mechanism relies on the user staying on the page. If the user navigates away or closes the browser, the polling stops, and the upload gets stuck. It's like needing to keep your foot on the gas pedal the entire time, or the car will stall.

In summary, the reliance on wp_remote_post() with a short timeout, combined with potential loopback blocking and nonce issues, creates a fragile system that is prone to failure in production environments. We need a more robust and reliable solution.

Proposed Solution: True Background Processing

To overcome these limitations, we need to replace the unreliable wp_remote_post() approach with a true background processing mechanism. This will ensure that uploads continue processing even if the user navigates away from the page, and it will provide a more robust and reliable way to handle batch processing. Let's explore two viable options:

Option 1: WP-Cron (Recommended)

WP-Cron is WordPress's built-in scheduling system. It allows you to schedule events to run in the background at specific times or intervals. In this case, we can use wp_schedule_single_event() to queue the next batch for processing.

Why WP-Cron is a good choice:

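  • More reliable than a hand-rolled loopback request: The next batch is stored as a scheduled event, so a dropped request no longer loses it; the event stays due until it runs. One caveat: by default WordPress itself starts wp-cron.php with a non-blocking loopback request on page loads, so on hosts that block loopbacks, WP-Cron should be triggered by a system cron job instead.
  • Works even if the user closes the browser: Once an event is scheduled, it runs the next time WP-Cron is triggered, regardless of whether the user is still on the page.
  • Already tested pattern in other plugins: WP-Cron is a well-established and widely used mechanism in the WordPress ecosystem, so it's a proven and reliable solution.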
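On hosts that block loopback requests, a common setup is to turn off page-load triggering and run due events from a system cron job through WP-CLI, which involves no HTTP request at all. A minimal sketch, assuming WP-CLI is installed; the install path and the one-minute interval are placeholders, not values from the plugin:

```php
<?php
// In wp-config.php, above the "That's all, stop editing!" line:
// stop WordPress from spawning wp-cron.php on page loads.
define( 'DISABLE_WP_CRON', true );

// Then let the operating system run due events. Example crontab entry
// (the path /var/www/html is a placeholder for the WordPress root):
//
//   * * * * * cd /var/www/html && wp cron event run --due-now >/dev/null 2>&1
```

With this in place, scheduled batches run on the cron job's interval rather than waiting for the next visitor.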

How it works:

Instead of using wp_remote_post(), we'll schedule an event using wp_schedule_single_event(). This event will then trigger the ccc_process_next_batch action in the background. This approach ensures that the next batch is processed even if the initial request fails or the user leaves the page.
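The replacement described above could look roughly like this. It is a sketch under stated assumptions, not the plugin's actual code: the hook name ccc_cron_process_next_batch, the function names, and the ccc_process_batch() placeholder are all invented for illustration. Only wp_schedule_single_event(), wp_next_scheduled(), and add_action() are real WordPress functions.

```php
<?php
// Hypothetical hook name for the cron event; the plugin's real name may differ.
const CCC_BATCH_HOOK = 'ccc_cron_process_next_batch';

/**
 * Queue the next batch as a WP-Cron event instead of calling wp_remote_post().
 */
function ccc_queue_next_batch( $session_id ) {
    $args = array( $session_id );

    // Avoid stacking duplicate events for the same session.
    if ( ! wp_next_scheduled( CCC_BATCH_HOOK, $args ) ) {
        wp_schedule_single_event( time(), CCC_BATCH_HOOK, $args );
    }
}

/**
 * Runs when WP-Cron fires the event: no browser and no nonce involved.
 */
function ccc_run_scheduled_batch( $session_id ) {
    // Assumed contract: returns true while more chunks remain.
    $has_more = ccc_process_batch( $session_id );

    if ( $has_more ) {
        ccc_queue_next_batch( $session_id );
    }
}
add_action( CCC_BATCH_HOOK, 'ccc_run_scheduled_batch', 10, 1 );

/**
 * Placeholder standing in for the plugin's existing batch logic.
 */
function ccc_process_batch( $session_id ) {
    return false;
}
```

Each batch schedules its successor only after it has finished, so at most one event per session is pending at any time.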

Option 2: Action Scheduler

Action Scheduler is a third-party library specifically designed for handling background jobs in WordPress. It provides a more robust and feature-rich alternative to WP-Cron.

Why Action Scheduler is a good choice:

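  • Better failure handling: Action Scheduler records each action's status and keeps a log, so failed batches are visible and can be rescheduled. Whether a failed action is re-run automatically depends on the library's behavior in your version, so confirm this in its documentation before relying on it.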
  • More advanced features: Action Scheduler offers features like job prioritization, batch processing, and logging.

Why it might be overkill:

  • Third-party dependency: It adds an external dependency to your plugin.
  • More complex: It might be more complex to implement than WP-Cron for this specific use case.

How it works:

Action Scheduler allows you to schedule actions that will run in the background. You can define the action, the arguments, and the schedule. Action Scheduler will then handle the execution of the action, including retries and logging.
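For comparison, here is roughly what queuing a batch through Action Scheduler might look like. This is a hedged sketch: as_enqueue_async_action() is part of Action Scheduler's public API, but the hook name, the group name, and every ccc_as_* function are assumptions made for this example.

```php
<?php
// Hypothetical hook and group names, chosen for this sketch.
const CCC_AS_HOOK  = 'ccc_as_process_next_batch';
const CCC_AS_GROUP = 'ccc-uploads';

/**
 * Queue the next batch with Action Scheduler, if the library is available.
 */
function ccc_as_queue_next_batch( $session_id ) {
    // The API functions only exist when Action Scheduler is loaded.
    if ( ! function_exists( 'as_enqueue_async_action' ) ) {
        return false;
    }

    // Runs as soon as the queue is next processed.
    as_enqueue_async_action( CCC_AS_HOOK, array( $session_id ), CCC_AS_GROUP );
    return true;
}

/**
 * Callback for the queued action; receives the args as parameters.
 */
function ccc_as_run_batch( $session_id ) {
    // Assumed contract: returns true while more chunks remain.
    if ( ccc_as_process_batch( $session_id ) ) {
        ccc_as_queue_next_batch( $session_id );
    }
}
add_action( CCC_AS_HOOK, 'ccc_as_run_batch', 10, 1 );

/**
 * Placeholder standing in for the plugin's existing batch logic.
 */
function ccc_as_process_batch( $session_id ) {
    return false;
}
```

Action Scheduler expects its API to be called once it has initialized, so in practice ccc_as_queue_next_batch() would be called on or after the init hook rather than at plugin load time.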

Recommendation:

For this specific issue, WP-Cron is the recommended solution. It provides a good balance between reliability and simplicity. It's a built-in WordPress feature, so it doesn't introduce any external dependencies, and it's relatively easy to implement. However, if you need more advanced features like job prioritization or more robust retry logic, Action Scheduler is a viable alternative.

Acceptance Criteria: Ensuring the Fix Works

To ensure that the chosen solution effectively addresses the issue, we need to define clear acceptance criteria. These criteria will serve as a checklist to verify that the fix is working as expected.

  • [ ] Uploads continue processing even if the user navigates away: This is the most critical criterion. The upload process should not be interrupted if the user closes the browser or navigates to another page.
  • [ ] Failed batches retry automatically (max 3 attempts): A retry mechanism is essential for handling transient errors. If a batch fails due to a temporary issue, it should be retried automatically.
  • [ ] Resume button works for stuck uploads: Users should be able to resume uploads that have previously gotten stuck. This will allow them to complete interrupted uploads without starting from scratch.
  • [ ] Works on production environment: The fix must work reliably in a production environment, where the original issue was observed.
  • [ ] Existing stuck uploads can be resumed and completed: The fix should not only prevent future uploads from getting stuck but also allow users to resume and complete existing stuck uploads.
  • [ ] Test with a large file (200+ chunks) on production: Thorough testing with large files is crucial to ensure that the fix can handle real-world scenarios.
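The retry criterion (max 3 attempts) can be sketched on top of WP-Cron by passing an attempt counter along with the event. Everything named here is an assumption for illustration: the hook name, the function names, and the 30-second base delay. The callback registered for the hook would need to accept two arguments (session ID and attempt number).

```php
<?php
// Assumed limit, taken from the "max 3 attempts" criterion.
const CCC_MAX_ATTEMPTS = 3;

/**
 * Seconds to wait after a given number of failed attempts (1-based).
 * Simple exponential backoff: 30s, 60s, 120s, ...
 */
function ccc_retry_delay( $failed_attempts ) {
    return 30 * ( 2 ** ( $failed_attempts - 1 ) );
}

/**
 * Decide what happens after a batch attempt.
 *
 * @param string $session_id Upload session identifier.
 * @param bool   $succeeded  Whether the batch just processed succeeded.
 * @param int    $attempt    Attempt number that just finished (1-based).
 * @return string 'done', 'retrying', or 'failed'.
 */
function ccc_handle_batch_result( $session_id, $succeeded, $attempt ) {
    if ( $succeeded ) {
        return 'done';
    }

    if ( $attempt >= CCC_MAX_ATTEMPTS ) {
        // Give up; the upload stays resumable by hand.
        return 'failed';
    }

    // Hypothetical hook name; its callback takes ( $session_id, $attempt ).
    wp_schedule_single_event(
        time() + ccc_retry_delay( $attempt ),
        'ccc_cron_process_next_batch',
        array( $session_id, $attempt + 1 )
    );
    return 'retrying';
}
```

Spacing the retries out gives a transient problem, such as a rate limit or a brief outage, time to clear before the next attempt.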

By meeting these acceptance criteria, we can be confident that the fix effectively addresses the issue of stuck knowledge base uploads on production.

Files to Modify: Implementing the Solution

To implement the chosen solution, we'll need to modify specific files within the plugin. Here's a breakdown of the files that will need to be updated:

  • wp-content/plugins/ccc-insights/includes/class-ccc-rag-handler.php: This is the main file where the batch processing logic resides. We'll need to replace the wp_remote_post() calls with the appropriate WP-Cron scheduling mechanism.
  • Possibly add migration to clean up stuck uploads: In addition to fixing the upload process, we might need to add a database migration to identify and clean up any existing stuck uploads. This will ensure that users can resume these uploads after the fix is deployed.

Specifically, within class-ccc-rag-handler.php, we'll need to:

  1. Remove the existing wp_remote_post() code block.
  2. Implement the WP-Cron scheduling mechanism using wp_schedule_single_event().
  3. Ensure that the ccc_process_next_batch action is correctly triggered by the scheduled event.
  4. Add retry logic to handle potential batch failures.
  5. Implement a mechanism to allow users to resume stuck uploads.
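Step 5 could be approached by deciding when an upload counts as stuck and then putting it back in the queue. The sketch below assumes a session record shaped like the array in the comment and a five-minute staleness threshold; neither comes from the plugin, and the hook name is again hypothetical.

```php
<?php
/**
 * Decide whether an upload session looks stuck.
 *
 * $session is assumed to look like:
 *   array( 'id' => 'abc', 'processed' => 50, 'total' => 162, 'updated_at' => 1700000000 )
 */
function ccc_is_upload_stuck( array $session, $now, $stale_after = 300 ) {
    $incomplete = $session['processed'] < $session['total'];
    $stale      = ( $now - $session['updated_at'] ) > $stale_after;

    return $incomplete && $stale;
}

/**
 * Re-queue a stuck upload. Returns true if the session qualified as stuck.
 */
function ccc_resume_upload( array $session ) {
    if ( ! ccc_is_upload_stuck( $session, time() ) ) {
        return false;
    }

    // Hypothetical hook name for the batch event.
    $args = array( $session['id'] );
    if ( ! wp_next_scheduled( 'ccc_cron_process_next_batch', $args ) ) {
        wp_schedule_single_event( time(), 'ccc_cron_process_next_batch', $args );
    }
    return true;
}
```

The same check could back both the Resume button and a one-time cleanup pass over uploads that were already stuck before the fix was deployed.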

By carefully modifying these files and following the acceptance criteria, we can effectively resolve the issue of stuck knowledge base uploads and provide a smoother user experience.

In conclusion, fixing stuck knowledge base uploads on production requires a shift from unreliable loopback requests to a robust background processing mechanism. By implementing WP-Cron and adhering to the acceptance criteria, we can ensure that uploads complete reliably, even if users navigate away or encounter temporary errors. This will lead to a more efficient and frustration-free experience for everyone.