Day 1 - Benchmarking AI-powered scraping systems

12/01/25

tldr: Ran a benchmark exploring how well LLM providers' web search tools can be used to scrape real estate agent emails. Gemini 3 wins. This approach isn't quite robust enough to be production-ready for this use case, but it shows promise for other applications.

A few months ago, I was helping a client do cold outreach to real estate agents in the Bay Area. To do so, I needed to scrape a lead list of every agent along with their email. I started with a list of 12 of the largest brokerage firms in the Bay Area, then scraped each firm's employee directory individually: I directed Claude Code + Playwright to crawl the directory, and then had it write a custom Playwright script to paginate and extract each person's contact info. This worked extremely well, and I managed to get accurate contact info for ~1000 real estate agents. But the method is cumbersome, and to cover every agent I'd need to start with a list of every single brokerage.

It struck me one day that it would be much more straightforward to start with an individual agent's name and license number, make an API call to one of the major LLM providers with the provider's native web search tool enabled, and have it search the web for the email. All the major providers now offer web search as a native tool, and these tools are pretty advanced, which made this an attractive and novel approach for me to try out. Also, getting a comprehensive list of all licensed real estate professionals in California is easy, since it's publicly available in the California DRE database.
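To make the idea concrete, here's a minimal sketch of the per-agent lookup. It is not the benchmark code: the prompt wording, the JSON reply format, and the names `build_prompt`, `parse_reply`, and `find_email` are all assumptions for illustration, and the provider call is left as a `search_llm` callable you'd wire up to whichever SDK and web search tool you're using.

```python
import json
import re
from typing import Callable, Dict, Optional

# Loose sanity check for an email address (not a full RFC validator).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def build_prompt(name: str, license_number: str) -> str:
    """Hypothetical prompt; the wording used in the benchmark may differ."""
    return (
        "Use web search to find the current work email address of the "
        f"California real estate agent {name} (DRE license #{license_number}). "
        'Reply with JSON only: {"email": "<address or null>", '
        '"source_url": "<page containing the email, or null>"}'
    )


def parse_reply(reply: str) -> Dict[str, Optional[str]]:
    """Pull a JSON object out of the model's reply; fall back to nulls."""
    result: Dict[str, Optional[str]] = {"email": None, "source_url": None}
    match = re.search(r"\{.*\}", reply, re.DOTALL)
    if not match:
        return result
    try:
        data = json.loads(match.group(0))
    except json.JSONDecodeError:
        return result
    if not isinstance(data, dict):
        return result
    email = data.get("email")
    if isinstance(email, str) and EMAIL_RE.fullmatch(email.strip()):
        result["email"] = email.strip().lower()
    url = data.get("source_url")
    if isinstance(url, str) and url.startswith(("http://", "https://")):
        result["source_url"] = url
    return result


def find_email(
    name: str,
    license_number: str,
    search_llm: Callable[[str], str],  # assumed: prompt in, model text out
) -> Dict[str, Optional[str]]:
    """Run one lookup: build the prompt, call the provider, parse the reply."""
    return parse_reply(search_llm(build_prompt(name, license_number)))
```

Keeping the provider behind a plain callable means the same prompt and parsing can be reused across providers, which is handy when the point is to compare them.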

I ran a benchmark to compare the results between using OpenAI and Gemini, using a small subset of the results from my original method as the ground truth baseline. I quickly found that OpenAI's native web search tool was lacking in performance. I had better success pairing OpenAI with Parallel System's web search via an MCP server.

The results are below.

Rank	Provider	Model	Reasoning	Accuracy	Avg Latency	Avg Tokens	Email Not Found
1	Gemini	gemini-3-pro-preview	High	90.7% (107/118)	96.55s	19,965	6 (5.1%)
2	OpenAI + Parallel Web Search	gpt-5-mini	High	80.5% (95/118)	129.54s	63,797	7 (5.9%)
3	OpenAI + Parallel Web Search	gpt-5-mini	None	74.6% (88/118)	77.31s	35,956	15 (12.7%)
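For reference, the Accuracy and Email Not Found columns can be computed with something like the sketch below. This is an illustration rather than the actual benchmark code: it assumes a prediction counts as correct only when it matches the ground truth email after trimming and lowercasing, and that "not found" answers stay in the denominator. The names `Score` and `score` are made up for the example.

```python
from dataclasses import dataclass
from typing import Dict, Optional


def normalize(email: Optional[str]) -> Optional[str]:
    """Trim and lowercase so 'Jane@X.com ' and 'jane@x.com' compare equal."""
    return email.strip().lower() if email else None


@dataclass
class Score:
    total: int
    correct: int
    not_found: int

    @property
    def accuracy(self) -> float:
        return self.correct / self.total if self.total else 0.0

    @property
    def not_found_rate(self) -> float:
        return self.not_found / self.total if self.total else 0.0


def score(predictions: Dict[str, Optional[str]], truth: Dict[str, str]) -> Score:
    """Compare predicted emails against ground truth, keyed by agent ID."""
    correct = not_found = 0
    for agent_id, expected in truth.items():
        got = normalize(predictions.get(agent_id))
        if got is None:
            not_found += 1  # model returned nothing for this agent
        elif got == normalize(expected):
            correct += 1
    return Score(total=len(truth), correct=correct, not_found=not_found)
```

Plugging in the top row, `Score(total=118, correct=107, not_found=6)` gives an accuracy of 107/118 and a not-found rate of 6/118, which round to the 90.7% and 5.1% shown in the table.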

Along with the extracted email for that agent, I also had the system return the link to the web page that contained the email, so I could go in and verify that it was there and correctly matched to the agent.
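I did this check by hand, but a rough automated first pass is easy to imagine. The sketch below is hypothetical, not something from the benchmark: given the HTML of the returned page, it checks whether the email appears anywhere in the markup (including `mailto:` links) and whether every part of the agent's name appears in the visible text. The function name `page_supports_match` is made up, and fetching the page is left out.

```python
import html
import re
from typing import Dict


def page_supports_match(page_html: str, email: str, agent_name: str) -> Dict[str, bool]:
    """Crude check that a page mentions both the email and the agent's name."""
    raw = html.unescape(page_html).lower()
    # Strip tags to approximate the visible text for the name check.
    visible = re.sub(r"<[^>]+>", " ", raw)
    name_parts = [part for part in agent_name.lower().split() if part]
    return {
        # Search the raw markup so emails inside href="mailto:..." still count.
        "email_found": email.strip().lower() in raw,
        "name_found": bool(name_parts) and all(part in visible for part in name_parts),
    }
```

A page that passes both checks still deserves a manual look, since the email and the name can both appear on a page without belonging to each other (a team page, for example).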

Digging deeper into the results, I noticed that one common pitfall of most of the providers' web search tools is that they'll rely on pages in their index that have since rotted or gone out of date. I've found that link decay and, more generally, outdated pages are common in real estate. Property listing pages that might've contained an agent's email are taken down after the property is sold. Agents also often move brokerages, and their old pages don't get updated with new contact info, so it's hard to tell whether the email you're seeing on an active web page is for their current brokerage or a previous one.
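One cheap safeguard against the first half of this problem is to check whether the returned link still resolves before trusting it. Here's a minimal sketch using only the standard library; the status buckets and the names `classify_status` and `check_link` are my own assumptions, not part of the benchmark. It only catches dead links. A stale page that still loads (an agent's old brokerage profile, say) will pass as "live".

```python
import urllib.error
import urllib.request
from typing import Optional


def classify_status(status: Optional[int]) -> str:
    """Bucket an HTTP status code (None means the request never completed)."""
    if status is None:
        return "unreachable"
    if 200 <= status < 300:
        return "live"
    if status in (404, 410):
        return "gone"
    return "uncertain"  # e.g. 403 bot blocks or 5xx errors; worth a retry


def check_link(url: str, timeout: float = 10.0) -> str:
    """Fetch the URL (redirects are followed) and classify the outcome."""
    try:
        request = urllib.request.Request(
            url, headers={"User-Agent": "link-check-sketch/0.1"}
        )
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return classify_status(response.status)
    except urllib.error.HTTPError as err:
        # Raised for 4xx/5xx responses; must be caught before OSError.
        return classify_status(err.code)
    except (OSError, ValueError):
        # DNS failures, timeouts, refused connections, malformed URLs.
        return classify_status(None)
```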

An interesting observation is that Parallel has indexed the contents of PDFs. Several times the canonical page link returned by Parallel was to some PDF brochure or small business newsletter that contained the agent's email. I only saw this behavior when using their API.

My conclusion after working on this is that this method isn't quite ready for prime time for scraping real estate agent data, at least not until I implement measures to cover these shortcomings. But I do think this approach is potentially very useful and powerful for other applications, so I'll be keeping it in my back pocket for when a more suitable use case comes along.

Note: I'm doing a project a day, which means skipping steps I would've taken if I weren't so rushed. One example is putting the benchmark code on GitHub. I'll get to it when I have the chance.