Scraper
The scraper, located in the apps/scraper directory, is a critical component of the AlbertPlus platform. It is a Cloudflare Worker responsible for automatically collecting and updating course and program data from NYU’s public-facing systems. This ensures that the information presented to students in the web app is accurate and up-to-date.
Key Technologies
- Runtime: Cloudflare Workers, a serverless execution environment that is fast, scalable, and cost-effective.
- Framework: Hono, a lightweight and fast web framework for the edge.
- Database: Cloudflare D1, a serverless SQLite database, used for managing the scraping job queue.
- ORM: Drizzle ORM, a TypeScript ORM that provides a type-safe way to interact with the D1 database.
- Job Queue: The scraper uses a custom job queue implementation built on top of D1 and Cloudflare Queues to manage the scraping tasks (a sketch of the job-tracking table follows below).
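As an illustration of how the queue is tracked in D1, the sketch below shows what a minimal Drizzle job table might look like; the column names are assumptions for illustration, not the scraper’s actual schema.

```ts
import { sqliteTable, text, integer } from "drizzle-orm/sqlite-core";

// Illustrative job-tracking table; the real column names may differ.
export const jobs = sqliteTable("jobs", {
  id: integer("id").primaryKey({ autoIncrement: true }),
  type: text("type").notNull(),                        // e.g. "discover-courses", "course"
  url: text("url"),                                    // page to scrape, when applicable
  status: text("status").notNull().default("pending"), // pending | processing | completed | failed
  metadata: text("metadata"),                          // JSON string (e.g. term/year context)
  attempts: integer("attempts").notNull().default(0),
  createdAt: integer("created_at", { mode: "timestamp" }),
});
```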
Scraping Process
The scraping process is designed to be robust and resilient:
Manual Scraping (Programs & Courses)
- Admin Trigger: Admin users can trigger scraping through the Convex backend by calling dedicated actions:
  - `api.scraper.triggerMajorsScraping`: Initiates major (program) discovery
  - `api.scraper.triggerCoursesScraping`: Initiates course discovery
- HTTP Endpoints: These actions make authenticated POST requests to the scraper’s HTTP endpoints:
  - `POST /api/trigger-majors`: Creates a major discovery job
  - `POST /api/trigger-courses`: Creates a course discovery job
- API Key Authentication: The endpoints validate the `X-API-KEY` header against the configured `CONVEX_API_KEY` to ensure requests originate from the trusted Convex backend.
- Job Discovery: The initial job discovers all available programs and courses and creates an individual job for each one.
- Queueing: These individual jobs are added to a Cloudflare Queue and tracked in the D1 database.
- Job Processing: The Cloudflare Worker processes jobs from the queue, scraping the data for each course or program.
- Data Upsert: The scraped data is then sent to the Convex backend via authenticated HTTP requests to be stored in the main database (see the sketch after this list).
- Error Handling: The system includes error logging and a retry mechanism for failed jobs.
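The flow above can be pictured with a rough sketch of a trigger endpoint and queue consumer. The binding names (`SCRAPER_QUEUE`, `CONVEX_URL`, `CONVEX_API_KEY`), the `scrapeProgram` helper, and the `/api/programs/upsert` path are illustrative assumptions; the real names live in the scraper’s source and `wrangler.jsonc`.

```ts
import { Hono } from "hono";

// Illustrative environment bindings; the real names are defined in wrangler.jsonc.
type Env = {
  SCRAPER_QUEUE: Queue<QueueJob>;
  CONVEX_URL: string;
  CONVEX_API_KEY: string;
};

interface QueueJob {
  type: "discover-programs" | "discover-courses" | "program" | "course";
  url?: string;
}

// Hypothetical helper standing in for the actual program-scraping logic.
declare function scrapeProgram(url: string): Promise<Record<string, unknown>>;

const app = new Hono<{ Bindings: Env }>();

app.post("/api/trigger-majors", async (c) => {
  // Only the Convex backend knows the shared API key.
  if (c.req.header("X-API-KEY") !== c.env.CONVEX_API_KEY) {
    return c.json({ error: "unauthorized" }, 401);
  }
  // Enqueue a single discovery job; the consumer fans it out into one job per program.
  await c.env.SCRAPER_QUEUE.send({ type: "discover-programs" });
  return c.json({ queued: true });
});

export default {
  fetch: app.fetch,

  // Queue consumer: each message is one scraping job (only the "program" branch is shown).
  async queue(batch: MessageBatch<QueueJob>, env: Env) {
    for (const msg of batch.messages) {
      try {
        if (msg.body.type === "program" && msg.body.url) {
          const program = await scrapeProgram(msg.body.url);
          // Upsert into the main database through an authenticated Convex HTTP endpoint
          // (the path shown here is illustrative).
          await fetch(`${env.CONVEX_URL}/api/programs/upsert`, {
            method: "POST",
            headers: {
              "Content-Type": "application/json",
              "X-API-KEY": env.CONVEX_API_KEY,
            },
            body: JSON.stringify(program),
          });
        }
        msg.ack();
      } catch (err) {
        // Failed jobs are logged and retried by the queue.
        console.error("job failed", msg.body, err);
        msg.retry();
      }
    }
  },
};
```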
Automated Scraping (Course Offerings)
Course offerings (class sections with schedule details) are scraped automatically via a scheduled Cloudflare Worker cron job:
- Scheduled Trigger: The worker runs on a schedule defined in `wrangler.jsonc` to check for new course offerings.
- App Config Check: The worker reads the following configuration from Convex:
  - `is_scrape_current`: Boolean flag to enable scraping the current term
  - `is_scrape_next`: Boolean flag to enable scraping the next term
  - `current_term` / `current_year`: Identifies the current academic term
  - `next_term` / `next_year`: Identifies the next academic term
- Discovery Jobs: For each enabled term, the worker creates a `discover-course-offerings` job that scrapes Albert’s public search to find all course offering URLs (see the sketch after this list).
- Individual Jobs: Each discovered course offering URL becomes a `course-offering` job in the queue.
- Data Processing: The worker scrapes details such as class number, section, instructor, schedule, location, and enrollment status.
- Backend Sync: Scraped course offerings are sent to Convex via the `/api/courseOfferings/upsert` endpoint in batches.
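A rough sketch of the cron entry point, assuming a hypothetical `getAppConfig` helper that reads these flags and term identifiers from Convex, and an illustrative `SCRAPER_QUEUE` binding:

```ts
// Illustrative shapes and bindings; the real ones live in the scraper's source.
interface AppConfig {
  is_scrape_current: boolean;
  is_scrape_next: boolean;
  current_term: string;
  current_year: number;
  next_term: string;
  next_year: number;
}

interface Env {
  SCRAPER_QUEUE: Queue;
}

// Hypothetical helper that fetches the app config from Convex.
declare function getAppConfig(env: Env): Promise<AppConfig>;

export default {
  async scheduled(_controller: ScheduledController, env: Env) {
    const config = await getAppConfig(env);

    // Collect the terms that are enabled for scraping.
    const terms: Array<{ term: string; year: number }> = [];
    if (config.is_scrape_current) {
      terms.push({ term: config.current_term, year: config.current_year });
    }
    if (config.is_scrape_next) {
      terms.push({ term: config.next_term, year: config.next_year });
    }

    // One discovery job per enabled term; term/year travel with the job as metadata.
    for (const { term, year } of terms) {
      await env.SCRAPER_QUEUE.send({
        type: "discover-course-offerings",
        metadata: { term, year },
      });
    }
  },
};
```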
Job Types
The scraper supports the following job types, tracked in the D1 database:
| Job Type | Description |
|---|---|
| `discover-programs` | Discovers all program URLs from the bulletin |
| `discover-courses` | Discovers all course URLs from the bulletin |
| `discover-course-offerings` | Discovers course offering URLs from Albert public search for a specific term/year |
| `program` | Scrapes detailed data for a single program |
| `course` | Scrapes detailed data for a single course |
| `course-offering` | Scrapes detailed data for a single course offering (section) |
Jobs can include metadata (stored as JSON) to pass contextual information such as the academic term and year.
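One way to model the job types and their metadata in TypeScript is sketched below; the shapes are illustrative, not the scraper’s actual type definitions.

```ts
// Illustrative TypeScript model of the job types and their JSON metadata.
type JobType =
  | "discover-programs"
  | "discover-courses"
  | "discover-course-offerings"
  | "program"
  | "course"
  | "course-offering";

interface JobMetadata {
  term?: string; // e.g. "Fall"
  year?: number; // e.g. 2025
}

interface QueueJob {
  type: JobType;
  url?: string;           // present on per-item jobs like "course" or "program"
  metadata?: JobMetadata; // contextual info passed along with the job
}
```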
Project Structure
The scraper’s code is organized as follows:
- `src/index.ts`: The main entry point for the Cloudflare Worker, including the scheduled and queue handlers.
- `src/drizzle/`: The Drizzle ORM schema and database connection setup.
- `src/lib/`: Core libraries for interacting with Convex and managing the job queue.
- `src/modules/`: The logic for discovering and scraping courses, programs, and course offerings.
  - `programs/`: Program discovery and scraping logic
  - `courses/`: Course discovery and scraping logic
  - `courseOfferings/`: Course offering discovery and scraping logic (in progress)