crawl4ai.com – an open-source web crawler and scraper for efficient data extraction in AI applications involving Large Language Models (LLMs)

Crawl4AI is an open-source web crawler and scraper designed to facilitate efficient data extraction for AI applications, particularly those involving Large Language Models (LLMs). It offers a range of features tailored to meet the needs of developers and data scientists.

Key Features:

  1. Clean Markdown Generation: Crawl4AI converts web content into clean Markdown format, making it ideal for retrieval-augmented generation (RAG) pipelines or direct ingestion into LLMs.

  2. Structured Data Extraction: The tool supports parsing repeated patterns using CSS selectors, XPath, or LLM-based extraction methods, enabling precise data retrieval from complex web structures.

  3. Advanced Browser Control: Crawl4AI provides fine-grained browser control, including features like hooks, proxies, stealth modes, and session reuse, enhancing its ability to handle dynamic web content.

  4. High Performance: Designed for speed, Crawl4AI supports parallel crawling and chunk-based extraction, making it suitable for real-time use cases.

  5. Open Source: As an open-source tool, Crawl4AI is accessible without forced API keys or paywalls, allowing users to fully control their data extraction processes.

Usage Examples:

Simple Crawling: To perform a basic crawl and convert a webpage into Markdown:

```python
import asyncio

from crawl4ai import AsyncWebCrawler
from crawl4ai.async_configs import CrawlerRunConfig

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://example.com",
            config=CrawlerRunConfig(),
        )
        print(result.markdown)

asyncio.run(main())
```

This script fetches the content from "https://example.com" and outputs it in Markdown format. (docs.crawl4ai.com)

Structured Data Extraction with CSS Selectors: To extract structured data using CSS selectors:

```python
import asyncio

from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

async def main():
    # Schema mapping field names to CSS selectors; the selectors
    # here are illustrative and depend on the target page's markup.
    schema = {
        "name": "Product",
        "baseSelector": "div.product",
        "fields": [
            {"name": "title", "selector": "h1.title", "type": "text"},
            {"name": "price", "selector": "span.price", "type": "text"},
        ],
    }
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://example.com/product",
            extraction_strategy=JsonCssExtractionStrategy(schema),
        )
        print(result.extracted_content)  # JSON string with the extracted fields

asyncio.run(main())
```

This script extracts the product title and price from a webpage using the specified CSS selectors.
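To make the selector-to-field idea concrete without running a browser, here is a rough, self-contained sketch of the same pattern in plain Python using only the standard library's `html.parser`. It only handles simple tag-plus-class lookups (real CSS selector engines do far more), and the sample HTML and field names are hypothetical:

```python
from html.parser import HTMLParser

class FieldExtractor(HTMLParser):
    """Collects the text of the first element matching each (tag, class) pair."""
    def __init__(self, fields):
        super().__init__()
        self.fields = fields      # field name -> (tag, css_class)
        self.results = {}
        self._active = None       # field currently being captured

    def handle_starttag(self, tag, attrs):
        classes = dict(attrs).get("class", "").split()
        for name, (want_tag, want_class) in self.fields.items():
            if tag == want_tag and want_class in classes and name not in self.results:
                self._active = name

    def handle_data(self, data):
        if self._active:
            self.results[self._active] = data.strip()

    def handle_endtag(self, tag):
        self._active = None

html = '<div><h1 class="title">Acme Widget</h1><span class="price">$19.99</span></div>'
extractor = FieldExtractor({"title": ("h1", "title"), "price": ("span", "price")})
extractor.feed(html)
print(extractor.results)  # {'title': 'Acme Widget', 'price': '$19.99'}
```

This captures the essence of schema-driven extraction: a declarative field map is applied to the page once, yielding structured data instead of raw HTML.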

Handling Dynamic Content: To interact with dynamic web pages, such as clicking a "Load More" button:

```python
import asyncio

from crawl4ai import AsyncWebCrawler
from crawl4ai.async_configs import CrawlerRunConfig

async def main():
    run_config = CrawlerRunConfig(
        js_code="document.querySelector('button.load-more').click();",
        wait_for="css:.new-content",  # wait until the new content appears
    )
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://example.com/articles",
            config=run_config,
        )
        print(result.markdown)

asyncio.run(main())
```

This script clicks the "Load More" button on the page and waits for the new content to load before extracting it.
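The `wait_for` idea boils down to polling a condition until it holds or a timeout expires. Here is a minimal, standalone sketch of that pattern with `asyncio`; the `state["loaded"]` flag stands in for "selector `.new-content` is present in the DOM", and all names here are illustrative, not part of the crawl4ai API:

```python
import asyncio

async def wait_until(condition, timeout=5.0, interval=0.05):
    """Return True once condition() is truthy, False if the timeout expires."""
    loop = asyncio.get_running_loop()
    deadline = loop.time() + timeout
    while loop.time() < deadline:
        if condition():
            return True
        await asyncio.sleep(interval)
    return False

async def demo():
    state = {"loaded": False}

    async def load_later():
        await asyncio.sleep(0.1)   # stands in for content arriving after a click
        state["loaded"] = True

    task = asyncio.ensure_future(load_later())
    appeared = await wait_until(lambda: state["loaded"], timeout=2.0)
    await task
    return appeared

print("content appeared:", asyncio.run(demo()))  # content appeared: True
```

Polling with a deadline, rather than sleeping a fixed amount, is what lets the crawl proceed as soon as the content is ready instead of always paying the worst-case wait.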

Multi-URL Crawling: To crawl multiple URLs concurrently:

```python
import asyncio

from crawl4ai import AsyncWebCrawler
from crawl4ai.async_configs import CrawlerRunConfig

async def main():
    urls = ["https://example.com/page1", "https://example.com/page2"]
    async with AsyncWebCrawler() as crawler:
        results = await crawler.arun_many(
            urls=urls,
            config=CrawlerRunConfig(),
        )
        for result in results:
            print(result.url, result.markdown)

asyncio.run(main())
```

This script crawls multiple URLs concurrently and outputs each page's content in Markdown format.
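Under the hood, concurrent multi-URL crawling is the classic bounded-parallelism pattern: launch all fetches at once, but cap how many run simultaneously. A self-contained sketch with `asyncio.gather` and a semaphore (the `fake_fetch` coroutine stands in for a real page download; none of these names are crawl4ai's own):

```python
import asyncio

async def fake_fetch(url: str) -> str:
    await asyncio.sleep(0.01)          # simulate network latency
    return f"# Markdown for {url}"

async def crawl_many(urls, max_concurrency=2):
    sem = asyncio.Semaphore(max_concurrency)

    async def bounded(url):
        async with sem:                # at most max_concurrency fetches in flight
            return url, await fake_fetch(url)

    return await asyncio.gather(*(bounded(u) for u in urls))

urls = ["https://example.com/page1", "https://example.com/page2"]
for url, md in asyncio.run(crawl_many(urls)):
    print(url, md)
```

The semaphore is the knob that trades throughput against politeness to the target site; without it, crawling hundreds of URLs would open hundreds of connections at once.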

Crawl4AI's comprehensive feature set and flexibility make it a powerful tool for web data extraction, particularly in AI and machine learning contexts.

https://crawl4ai.com