Fixing Duplicate Venues: A TOCTOU Race Condition Solution


Hey guys! Let's dive into a tricky problem we encountered with duplicate venues in Eventasaurus and how we tackled it. This was a deep dive into the heart of our database interactions, and I'm excited to share the journey and the solution we landed on. So, buckle up, and let's get started!

Root Cause: TOCTOU Race Condition

The core issue? A TOCTOU (Time-of-Check to Time-of-Use) race condition at the database level. Sounds technical, right? In plain terms, our application-level duplicate checks weren't cutting it because multiple workers could check for a venue at the same time, each find nothing, and then each create it. This is a classic concurrency problem, and it required a robust solution to prevent those pesky duplicates from sneaking in. Let's explore some real-world examples from our production environment to understand the gravity of the situation.

Evidence from Production

To really grasp the problem, let's look at some concrete examples.

Hope and Anchor, Brixton:

  • ID 248 created at 09:54:08
  • ID 249 created at 09:54:09 (just one second later!)
  • EXACT same GPS coordinates (51.4595646, -0.1266272)
  • Same scraper (question_one)
  • Different geocoding providers (geoapify vs photon)

Islington Town House:

  • ID 20 and ID 22 created at the SAME SECOND (09:54:24)
  • 10m apart (GPS slightly different due to rounding)
  • Same scraper (inquizition)
  • Provided coordinates (no geocoding)

These examples highlight the severity of the problem. Identical venues were being created within seconds or even milliseconds of each other, leading to data duplication and potential confusion for our users. The key takeaway here is that these duplicates were happening despite our existing checks, which brings us to the next crucial point: the race condition timeline.
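
For context, duplicates like these are straightforward to surface once you know what to look for. Below is a hedged audit sketch (not from our codebase: the venues table name and the latitude/longitude columns are assumptions) that self-joins venues within a PostGIS radius:

radius_m = 50.0

{:ok, result} =
  Repo.query(
    """
    SELECT a.id AS id_a, b.id AS id_b, a.name
    FROM venues a
    JOIN venues b
      ON a.id < b.id
     AND a.name = b.name
     AND ST_DWithin(
           ST_MakePoint(a.longitude, a.latitude)::geography,
           ST_MakePoint(b.longitude, b.latitude)::geography,
           $1
         )
    """,
    [radius_m]
  )

# result.rows contains pairs such as [248, 249, "Hope and Anchor"]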

The Race Condition Timeline

Visualizing the timeline of events really helped us pinpoint the issue. Imagine this scenario:

Time 09:54:08.000: Worker 1 geocodes Hope and Anchor
Time 09:54:08.100: Worker 1 validates (SELECT ... WHERE ST_DWithin...), finds nothing
Time 09:54:08.200: Worker 2 geocodes Hope and Anchor
Time 09:54:08.300: Worker 2 validates (SELECT ... WHERE ST_DWithin...), finds nothing
                    ⚠️ Worker 1 hasn't committed yet!
Time 09:54:08.400: Worker 1 inserts and commits
Time 09:54:08.500: Worker 2 inserts and commits
Result: DUPLICATE CREATED ❌

This timeline clearly illustrates the TOCTOU race condition. Both workers independently check for the venue's existence and, finding none, proceed to insert it. The critical flaw? Worker 2's check occurs before Worker 1's transaction is committed. It’s like two people trying to grab the last slice of pizza simultaneously – chaos ensues! This understanding is crucial because it explains why our previous attempts to fix the issue fell short.
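
If you want to feel this failure mode in miniature, here's a self-contained sketch (plain Elixir, no database, and definitely not our production code) where an Agent stands in for the venues table and a sleep stands in for the gap between check and insert:

defmodule RaceDemo do
  # The Agent plays the role of the venues table.
  def start_db, do: Agent.start_link(fn -> [] end, name: :venues)

  def check_then_insert(name) do
    existing = Agent.get(:venues, fn venues -> Enum.find(venues, &(&1 == name)) end)

    case existing do
      nil ->
        Process.sleep(100)  # the TOCTOU gap: the other worker checks during this window
        Agent.update(:venues, fn venues -> [name | venues] end)
        :inserted

      _venue ->
        :duplicate_found
    end
  end

  def run do
    start_db()

    1..2
    |> Enum.map(fn _ -> Task.async(fn -> check_then_insert("Hope and Anchor") end) end)
    |> Enum.map(&Task.await/1)
  end
end

RaceDemo.run/0 returns [:inserted, :inserted]: both workers passed the check before either had written anything, and the in-memory "table" ends up holding two copies of the same venue, exactly like IDs 248 and 249 above.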

Why Previous Fixes Failed

We tried a couple of approaches before landing on the ultimate solution. Let's break down why they didn't work.

Attempt 1: provider_ids check after geocoding

  • Problem: This still suffered from the race condition. Both workers would check the provider_ids before either committed, leading to the same concurrency issue.

Attempt 2: GPS+name check after geocoding

This attempt had multiple shortcomings:

  1. Still had the race condition: Just like Attempt 1, this check was vulnerable to concurrent processes stepping on each other.
  2. Didn't run for the "provided coordinates" path (lines 689-704): This was a critical oversight. Venues with provided coordinates bypassed this check entirely, leaving them wide open for duplication.
  3. Application-level checks fundamentally cannot solve database-level races: This is a core principle. Application-level checks can be bypassed by concurrent database operations.

These failed attempts highlighted a crucial lesson: the fix needed to happen at the database level to truly guarantee data integrity. To understand why, let's analyze the current code flow.
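
To make Attempt 2 concrete, here's a hedged reconstruction of that kind of GPS+name check in Ecto (the Venue schema, the field names, and the 50m radius are illustrative assumptions, not the exact code from venue_processor.ex):

import Ecto.Query

defp find_nearby_venue_by_name(name, lat, lng) do
  from(v in Venue,
    where: v.name == ^name,
    where:
      fragment(
        "ST_DWithin(ST_MakePoint(?, ?)::geography, ST_MakePoint(?, ?)::geography, 50)",
        v.longitude,
        v.latitude,
        ^lng,
        ^lat
      ),
    limit: 1
  )
  |> Repo.one()
end

However carefully a query like this is written, it runs as a standalone statement: a row another worker has inserted but not yet committed is invisible to it, so it can honestly return nil while a duplicate is already on the way in.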

Current Code Flow Analysis

Let's trace the path of venue creation through our codebase to see where things were going wrong. We'll be focusing on lib/eventasaurus_discovery/scraping/processors/venue_processor.ex.

  1. Line 441: Initial find_existing_venue call (before geocoding)
    • For UK scrapers: no coordinates available yet → GPS matching fails. This means the initial check often couldn't prevent duplicates because it lacked sufficient information.
  2. Lines 631-688: Geocoding and duplicate checks
    • Step 1: Check by provider_ids (lines 640-656)
    • Step 2: Check by GPS+name (lines 658-674)
    • Problem: Both checks happen OUTSIDE a transaction. This is the heart of the race condition problem.
  3. Lines 689-704: "Provided coordinates" path
    • Critical: NO duplicate checking at all! This was a major vulnerability and directly explains the Islington Town House duplicates.
  4. Line 943: Venue.changeset |> Repo.insert
    • Changeset validation runs (line 206: validate_no_duplicate)
    • But validation also happens outside the transaction! So, again, it's susceptible to the race condition.

This step-by-step analysis reveals the fundamental flaw: checks were happening outside the protective embrace of a database transaction. This leads us to the core of the issue: the TOCTOU gap.

The Fundamental Problem

The crux of the matter is the TOCTOU gap: there's a window of time between checking for duplicates and inserting the venue where another worker can sneak in and create the same venue. Here's a simplified code snippet to illustrate the broken flow:

def insert_venue(attrs) do
  changeset = Venue.changeset(%Venue{}, attrs)

  # CHECK happens here (outside transaction)
  case find_duplicate(attrs) do
    nil ->
      # ⚠️ GAP: Another worker can insert here!
      Repo.insert(changeset)  # USE happens here

    existing ->
      # Duplicate already committed and visible, so reuse it
      {:ok, existing}
  end
end

The comment ⚠️ GAP: Another worker can insert here! is the key. To solve this, we needed a mechanism to serialize these inserts, ensuring that only one worker could create a venue at a given location at a time. Enter PostgreSQL Advisory Locks – our superhero for this challenge!

Solution: PostgreSQL Advisory Locks

PostgreSQL Advisory Locks are a fantastic feature that allows us to serialize concurrent operations. Think of them as a database-level semaphore. They let us lock a specific resource (in our case, a venue location) so that only one process can modify it at a time. This eliminates the TOCTOU race condition by ensuring that the check-and-insert operation is atomic.
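
Before we get to our implementation, here's a tiny standalone demonstration of the behaviour we're relying on (assuming a standard Ecto Repo); whichever worker grabs the lock second simply blocks until the other worker's transaction commits:

lock_key = 42

worker = fn label ->
  Task.async(fn ->
    Repo.transaction(fn ->
      # Blocks until the lock for this key is free
      Repo.query!("SELECT pg_advisory_xact_lock($1)", [lock_key])
      IO.puts("#{label} acquired the lock at #{DateTime.utc_now()}")
      Process.sleep(1_000)
      # Lock is released automatically when this transaction commits
    end)
  end)
end

[worker.("worker 1"), worker.("worker 2")]
|> Enum.each(&Task.await(&1, 10_000))

# The second "acquired the lock" line prints roughly one second after the
# first: the lock serialized the two transactions.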

Implementation

Here's the Elixir code we crafted to implement advisory locks:

defp insert_with_duplicate_lock(attrs, city_id) do
  # Round to a ~100m grid (3 decimal places) so nearby venues share a lock key
  lat_rounded = Float.round(attrs.latitude, 3)  # ~111m precision
  lng_rounded = Float.round(attrs.longitude, 3)  # ~70m at London's latitude

  # Generate lock key from location + city
  lock_key = :erlang.phash2({lat_rounded, lng_rounded, city_id})

  Repo.transaction(fn ->
    # Acquire advisory lock for this location (held until transaction ends)
    Repo.query!("SELECT pg_advisory_xact_lock($1)", [lock_key])

    # Now safe to check and insert atomically
    case find_existing_venue(attrs) do
      nil ->
        # No duplicate: build the changeset and insert the new venue
        # (Venue.changeset/2 as referenced earlier; assumes attrs carries everything it needs)
        changeset = Venue.changeset(%Venue{}, attrs)

        case Repo.insert(changeset) do
          {:ok, venue} -> venue
          {:error, changeset} -> Repo.rollback(changeset)
        end

      existing ->
        # Found duplicate, return it
        existing
    end
  end)
end

Let's break this down:

  1. Rounding Coordinates: We round the latitude and longitude to three decimal places. That yields grid cells of roughly 111m in latitude by ~70m in longitude at London's latitude, which is coarser than our duplicate detection threshold, so venues in close proximity generally land in the same cell and contend for the same lock (see the quick check after this list).
  2. Generating Lock Key: We create a unique lock key by hashing the rounded coordinates and the city ID using :erlang.phash2. This ensures that locks are specific to a location within a city.
  3. Transaction: We wrap the entire operation in a Repo.transaction. This is crucial because advisory locks are held until the transaction ends, guaranteeing atomicity.
  4. Acquiring Lock: We use `Repo.query!("SELECT pg_advisory_xact_lock($1)", [lock_key])` to take a transaction-scoped advisory lock on that key. If another worker already holds the lock for the same key, this call simply blocks until that worker's transaction commits or rolls back; the lock is then released automatically, and our find_existing_venue check sees the committed row instead of racing past it.
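
Here's the quick check promised above, using the Hope and Anchor coordinates from our production evidence (the city_id of 1 is made up for illustration):

lat = Float.round(51.4595646, 3)   # => 51.46
lng = Float.round(-0.1266272, 3)   # => -0.127

key_worker_1 = :erlang.phash2({lat, lng, 1})
key_worker_2 = :erlang.phash2({lat, lng, 1})

key_worker_1 == key_worker_2       # => true: both workers contend for the same lock

Identical coordinates collapse onto the same key, so the second worker waits its turn, re-runs the duplicate check inside its own transaction, and gets back the venue the first worker just committed. No more duplicates.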