Scraper
The scraper, located in the apps/scraper directory, is a critical component of the AlbertPlus platform. It is a Cloudflare Worker responsible for automatically collecting and updating course and program data from NYU’s public-facing systems. This ensures that the information presented to students in the web app is accurate and up-to-date.
Key Technologies
- Runtime: Cloudflare Workers, a serverless execution environment that is fast, scalable, and cost-effective.
- Framework: Hono, a lightweight and fast web framework for the edge.
- Database: Cloudflare D1, a serverless SQLite database, used for managing the scraping job queue.
- ORM: Drizzle ORM, a TypeScript ORM that provides a type-safe way to interact with the D1 database.
- Job Queue: The scraper uses a custom job queue implementation built on top of D1 and Cloudflare Queues to manage the scraping tasks (a sketch of the job-tracking table follows below).
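As an illustration of how the queue is tracked in D1, the sketch below shows what a minimal Drizzle job table might look like; the column names are assumptions for illustration, not the scraper’s actual schema.

```ts
import { sqliteTable, text, integer } from "drizzle-orm/sqlite-core";

// Illustrative job-tracking table; the real column names may differ.
export const jobs = sqliteTable("jobs", {
  id: integer("id").primaryKey({ autoIncrement: true }),
  type: text("type").notNull(),                        // e.g. "discover-courses", "course"
  url: text("url"),                                    // page to scrape, when applicable
  status: text("status").notNull().default("pending"), // pending | processing | completed | failed
  metadata: text("metadata"),                          // JSON string (e.g. term/year context)
  attempts: integer("attempts").notNull().default(0),
  createdAt: integer("created_at", { mode: "timestamp" }),
});
```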
Scraping Process
The scraping process is designed to be robust and resilient:
Manual Scraping (Programs & Courses)
- Admin Trigger: Admin users can trigger scraping through the Convex backend by calling dedicated actions:
  - `api.scraper.triggerMajorsScraping`: Initiates major (program) discovery
  - `api.scraper.triggerCoursesScraping`: Initiates course discovery
- HTTP Endpoints: These actions make authenticated POST requests to the scraper’s HTTP endpoints:
  - `POST /api/trigger-majors`: Creates a major discovery job
  - `POST /api/trigger-courses`: Creates a course discovery job
- API Key Authentication: The endpoints validate the `X-API-KEY` header against the configured `CONVEX_API_KEY` to ensure requests originate from the trusted Convex backend.
- Job Discovery: The initial job discovers all available programs and courses and creates an individual job for each one.
- Queueing: These individual jobs are added to a Cloudflare Queue and tracked in the D1 database.
- Job Processing: The Cloudflare Worker processes jobs from the queue, scraping the data for each course or program.
- Data Upsert: The scraped data is then sent to the Convex backend via authenticated HTTP requests to be stored in the main database (see the sketch after this list).
- Error Handling: The system includes error logging and a retry mechanism for failed jobs.
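The flow above can be pictured with a rough sketch of a trigger endpoint and queue consumer. The binding names (`SCRAPER_QUEUE`, `CONVEX_URL`, `CONVEX_API_KEY`), the `scrapeProgram` helper, and the `/api/programs/upsert` path are illustrative assumptions; the real names live in the scraper’s source and `wrangler.jsonc`.

```ts
import { Hono } from "hono";

// Illustrative environment bindings; the real names are defined in wrangler.jsonc.
type Env = {
  SCRAPER_QUEUE: Queue<QueueJob>;
  CONVEX_URL: string;
  CONVEX_API_KEY: string;
};

interface QueueJob {
  type: "discover-programs" | "discover-courses" | "program" | "course";
  url?: string;
}

// Hypothetical helper standing in for the actual program-scraping logic.
declare function scrapeProgram(url: string): Promise<Record<string, unknown>>;

const app = new Hono<{ Bindings: Env }>();

app.post("/api/trigger-majors", async (c) => {
  // Only the Convex backend knows the shared API key.
  if (c.req.header("X-API-KEY") !== c.env.CONVEX_API_KEY) {
    return c.json({ error: "unauthorized" }, 401);
  }
  // Enqueue a single discovery job; the consumer fans it out into one job per program.
  await c.env.SCRAPER_QUEUE.send({ type: "discover-programs" });
  return c.json({ queued: true });
});

export default {
  fetch: app.fetch,

  // Queue consumer: each message is one scraping job (only the "program" branch is shown).
  async queue(batch: MessageBatch<QueueJob>, env: Env) {
    for (const msg of batch.messages) {
      try {
        if (msg.body.type === "program" && msg.body.url) {
          const program = await scrapeProgram(msg.body.url);
          // Upsert into the main database through an authenticated Convex HTTP endpoint
          // (the path shown here is illustrative).
          await fetch(`${env.CONVEX_URL}/api/programs/upsert`, {
            method: "POST",
            headers: {
              "Content-Type": "application/json",
              "X-API-KEY": env.CONVEX_API_KEY,
            },
            body: JSON.stringify(program),
          });
        }
        msg.ack();
      } catch (err) {
        // Failed jobs are logged and retried by the queue.
        console.error("job failed", msg.body, err);
        msg.retry();
      }
    }
  },
};
```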
Automated Scraping (Course Offerings)
Course offerings (class sections with schedule details) are scraped automatically via a scheduled Cloudflare Worker cron job:
- Scheduled Trigger: The worker runs on a schedule defined in `wrangler.jsonc` to check for new course offerings.
- App Config Check: The worker reads the following configuration from Convex:
  - `is_scrape_current`: Boolean flag to enable scraping the current term
  - `is_scrape_next`: Boolean flag to enable scraping the next term
  - `current_term` / `current_year`: Identifies the current academic term
  - `next_term` / `next_year`: Identifies the next academic term
- Discovery Jobs: For each enabled term, the worker creates a `discover-course-offerings` job that scrapes Albert’s public search to find all course offering URLs (see the sketch after this list).
- Individual Jobs: Each discovered course offering URL becomes a `course-offering` job in the queue.
- Data Processing: The worker scrapes details such as class number, section, instructor, schedule, location, and enrollment status.
- Backend Sync: Scraped course offerings are sent to Convex via the `/api/courseOfferings/upsert` endpoint in batches.
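A rough sketch of the cron entry point, assuming a hypothetical `getAppConfig` helper that reads these flags and term identifiers from Convex, and an illustrative `SCRAPER_QUEUE` binding:

```ts
// Illustrative shapes and bindings; the real ones live in the scraper's source.
interface AppConfig {
  is_scrape_current: boolean;
  is_scrape_next: boolean;
  current_term: string;
  current_year: number;
  next_term: string;
  next_year: number;
}

interface Env {
  SCRAPER_QUEUE: Queue;
}

// Hypothetical helper that fetches the app config from Convex.
declare function getAppConfig(env: Env): Promise<AppConfig>;

export default {
  async scheduled(_controller: ScheduledController, env: Env) {
    const config = await getAppConfig(env);

    // Collect the terms that are enabled for scraping.
    const terms: Array<{ term: string; year: number }> = [];
    if (config.is_scrape_current) {
      terms.push({ term: config.current_term, year: config.current_year });
    }
    if (config.is_scrape_next) {
      terms.push({ term: config.next_term, year: config.next_year });
    }

    // One discovery job per enabled term; term/year travel with the job as metadata.
    for (const { term, year } of terms) {
      await env.SCRAPER_QUEUE.send({
        type: "discover-course-offerings",
        metadata: { term, year },
      });
    }
  },
};
```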
Job Types
The scraper supports the following job types, tracked in the D1 database:
| Job Type | Description |
|---|---|
| `discover-programs` | Discovers all program URLs from the bulletin |
| `discover-courses` | Discovers all course URLs from the bulletin |
| `discover-course-offerings` | Discovers course offering URLs from Albert public search for a specific term/year |
| `program` | Scrapes detailed data for a single program |
| `course` | Scrapes detailed data for a single course |
| `course-offering` | Scrapes detailed data for a single course offering (section) |
Jobs can include metadata (stored as JSON) to pass contextual information such as the academic term and year.
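One way to model the job types and their metadata in TypeScript is sketched below; the shapes are illustrative, not the scraper’s actual type definitions.

```ts
// Illustrative TypeScript model of the job types and their JSON metadata.
type JobType =
  | "discover-programs"
  | "discover-courses"
  | "discover-course-offerings"
  | "program"
  | "course"
  | "course-offering";

interface JobMetadata {
  term?: string; // e.g. "Fall"
  year?: number; // e.g. 2025
}

interface QueueJob {
  type: JobType;
  url?: string;           // present on per-item jobs like "course" or "program"
  metadata?: JobMetadata; // contextual info passed along with the job
}
```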
Project Structure
The scraper’s code is organized as follows:
- `src/index.ts`: The main entry point for the Cloudflare Worker, including the scheduled and queue handlers.
- `src/drizzle/`: The Drizzle ORM schema and database connection setup.
- `src/lib/`: Core libraries for interacting with Convex and managing the job queue.
- `src/modules/`: The logic for discovering and scraping courses, programs, and course offerings.
  - `programs/`: Program discovery and scraping logic
  - `courses/`: Course discovery and scraping logic
  - `courseOfferings/`: Course offering discovery and scraping logic (in progress)