Data Flow
Understanding how data flows through the system is crucial to understanding how AlbertPlus works. The platform is designed around a robust data pipeline that keeps course information accurate, up to date, and efficiently delivered to users.
Course Data Pipeline
The primary data pipeline is responsible for collecting, storing, and serving course and program information.
Static Course & Program Data (Manual Trigger)
- Scraping (Cloudflare Worker)
  - Admin Trigger: Admin users initiate scraping by calling Convex actions (`api.scraper.triggerMajorsScraping` or `api.scraper.triggerCoursesScraping`).
  - Authenticated Request: The Convex action makes a POST request to the scraper's HTTP endpoints (`/api/trigger-majors` or `/api/trigger-courses`) with the `CONVEX_API_KEY` in the `X-API-KEY` header.
  - API Key Validation: The scraper validates the API key to ensure the request is from the trusted Convex backend (this flow is sketched after the list).
  - Discovery: The scraper discovers the URLs for all available programs or courses from NYU's public course catalog.
  - Job Queuing: For each discovered program or course, a new job is added to a Cloudflare Queue. This allows for resilient and distributed processing.
  - Data Extraction: Each job in the queue is processed by the worker, which scrapes the detailed information for a specific course or program.
  - Upsert to Backend: The scraped data is sent back to the Convex backend via authenticated HTTP endpoints.
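The trigger-and-queue flow can be pictured with a short Worker sketch. This is a minimal illustration under assumptions, not the project's actual code: the queue binding name (`SCRAPER_QUEUE`), the job shape, and the `discoverCourseUrls` helper are invented for this example; only `CONVEX_API_KEY`, the `X-API-KEY` header, and the `/api/trigger-courses` route come from the description above.

```ts
// Hypothetical sketch of the scraper's trigger endpoint. Binding names and
// the job shape are assumptions; the auth check mirrors the flow above.
type ScrapeJob = { kind: "major" | "course"; url: string };

interface Env {
  CONVEX_API_KEY: string;          // shared secret, also held by the Convex backend
  SCRAPER_QUEUE: Queue<ScrapeJob>; // Cloudflare Queue producer binding (assumed name)
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    // API Key Validation: reject anything not sent by the Convex backend.
    if (request.headers.get("X-API-KEY") !== env.CONVEX_API_KEY) {
      return new Response("Unauthorized", { status: 401 });
    }

    const { pathname } = new URL(request.url);
    if (request.method === "POST" && pathname === "/api/trigger-courses") {
      // Discovery: collect course URLs from NYU's public catalog (stubbed below).
      const urls = await discoverCourseUrls();
      // Job Queuing: one queue message per discovered course. (sendBatch
      // accepts at most 100 messages per call, so a real worker would chunk.)
      await env.SCRAPER_QUEUE.sendBatch(
        urls.map((url) => ({ body: { kind: "course" as const, url } })),
      );
      return Response.json({ queued: urls.length });
    }
    return new Response("Not found", { status: 404 });
  },
};

// Stub standing in for the discovery step; the real worker scrapes the catalog.
async function discoverCourseUrls(): Promise<string[]> {
  return [];
}
```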
Dynamic Course Offerings Data (Scheduled)
- Automated Scraping (Cloudflare Worker Cronjob)
  - Scheduled Trigger: A cronjob runs at regular intervals (configured in `wrangler.jsonc`).
  - Config Check: The worker reads app configuration from Convex to determine which terms to scrape (`is_scrape_current`, `is_scrape_next`, along with term/year information).
  - Albert Public Search: For each enabled term, the worker scrapes Albert's public class search to discover all course offering URLs.
  - Job Queuing: Each course offering URL is added to the queue as a `course-offering` job with metadata about the term and year.
  - Section Details: Each job scrapes detailed information including:
    - Class number, section, and status (open/closed/waitlist)
    - Instructor names and location
    - Meeting days, start time, and end time
    - Corequisite relationships
  - Batch Upsert: Scraped course offerings are sent to Convex in batches via the `/api/courseOfferings/upsert` endpoint (see the sketch after this list).
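Putting the scheduled trigger, queue, and batch upsert together, the Worker's entry points might look roughly like the following. Everything named here (the `Env` bindings, the `CourseOffering` shape, and the `fetchAppConfig`/`discoverOfferingUrls`/`scrapeOffering` helpers) is an illustrative assumption; only the cron configuration in `wrangler.jsonc`, the `course-offering` job kind, and the `/api/courseOfferings/upsert` endpoint come from the pipeline description.

```ts
// Hypothetical sketch of the scheduled producer and queue consumer.
type OfferingJob = { kind: "course-offering"; url: string; term: string; year: number };

interface CourseOffering {
  classNumber: number;
  section: string;
  status: "open" | "closed" | "waitlist";
  instructors: string[];
  location: string;
  days: string[];      // e.g. ["Mon", "Wed"]
  startTime: string;   // e.g. "09:30"
  endTime: string;     // e.g. "10:45"
  corequisites: string[];
}

interface Env {
  CONVEX_API_KEY: string;
  CONVEX_SITE_URL: string;           // base URL of the Convex HTTP endpoints (assumed)
  SCRAPER_QUEUE: Queue<OfferingJob>;
}

export default {
  // Scheduled Trigger: the cron expression lives in wrangler.jsonc.
  async scheduled(_ctrl: ScheduledController, env: Env): Promise<void> {
    const config = await fetchAppConfig(env); // Config Check against Convex
    for (const term of config.termsToScrape) {
      // Albert Public Search: discover every offering URL for this term.
      const urls = await discoverOfferingUrls(term);
      await env.SCRAPER_QUEUE.sendBatch(
        urls.map((url) => ({
          body: { kind: "course-offering" as const, url, term: term.name, year: term.year },
        })),
      );
    }
  },

  // Queue consumer: each message is one course-offering page to scrape.
  async queue(batch: MessageBatch<OfferingJob>, env: Env): Promise<void> {
    const offerings: CourseOffering[] = await Promise.all(
      batch.messages.map((m) => scrapeOffering(m.body.url)),
    );
    // Batch Upsert: push the whole batch back to Convex in one request.
    await fetch(`${env.CONVEX_SITE_URL}/api/courseOfferings/upsert`, {
      method: "POST",
      headers: { "Content-Type": "application/json", "X-API-KEY": env.CONVEX_API_KEY },
      body: JSON.stringify(offerings),
    });
  },
};

// Stubs standing in for the real config and scraping logic.
type Term = { name: string; year: number };
async function fetchAppConfig(_env: Env): Promise<{ termsToScrape: Term[] }> {
  return { termsToScrape: [] };
}
async function discoverOfferingUrls(_term: Term): Promise<string[]> {
  return [];
}
async function scrapeOffering(_url: string): Promise<CourseOffering> {
  throw new Error("stub");
}
```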
- Backend Processing (Convex)
  - Data Reception: The Convex backend receives the scraped data from the Cloudflare Worker.
  - Database Storage: The data is upserted into the Convex database, ensuring that existing records are updated and new ones are created. This includes courses, programs, requirements, prerequisites, and course offerings (an upsert sketch follows this list).
  - Real-time Updates: Any clients connected to the Convex backend (such as the web app) receive real-time updates as the new data is written to the database.
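As an illustration of the upsert pattern, a Convex mutation for course offerings might look roughly like this. The table name, index, and field set are guesses, not AlbertPlus's actual schema; the patch-or-insert logic is the part that matters.

```ts
// convex/courseOfferings.ts — an illustrative upsert; table, index, and
// fields are assumptions, not the project's actual schema.
import { mutation } from "./_generated/server";
import { v } from "convex/values";

export const upsert = mutation({
  args: {
    classNumber: v.number(),
    section: v.string(),
    status: v.string(),
    term: v.string(),
    year: v.number(),
  },
  handler: async (ctx, args) => {
    // A real schema would likely key on term/year as well, since Albert
    // class numbers repeat across terms.
    const existing = await ctx.db
      .query("courseOfferings")
      .withIndex("by_class_number", (q) => q.eq("classNumber", args.classNumber))
      .unique();
    if (existing) {
      // Update in place; subscribed clients see the change immediately.
      await ctx.db.patch(existing._id, args);
    } else {
      await ctx.db.insert("courseOfferings", args);
    }
  },
});
```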
- Client-side Consumption (Web App & Browser Extension)
  - Data Fetching: The Next.js web app and the browser extension query the Convex backend to fetch course, program, and course offering data (see the sketch after this list).
  - User Interface: The data is then rendered in the user interface, allowing students to browse the course catalog, view program requirements, check real-time class availability, and build their schedules.
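On the client, this typically reduces to a Convex `useQuery` subscription. The sketch below is an assumption about AlbertPlus's API surface: the query name (`api.courseOfferings.listByTerm`), the import path for the generated `api` object, and the document fields are all illustrative.

```tsx
// Illustrative React client; query name and fields are assumptions.
"use client";

import { useQuery } from "convex/react";
import { api } from "@/convex/_generated/api";

export function CourseOfferingList({ term }: { term: string }) {
  // useQuery subscribes to the Convex query, so this component re-renders
  // automatically whenever the scraper upserts fresh offerings.
  const offerings = useQuery(api.courseOfferings.listByTerm, { term });

  if (offerings === undefined) return <p>Loading…</p>;
  return (
    <ul>
      {offerings.map((offering) => (
        <li key={offering._id}>
          Section {offering.section}: {offering.status}
        </li>
      ))}
    </ul>
  );
}
```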
Degree Progress Report Parsing
Another important data flow involves the parsing of a student's degree progress report.
- File Upload: The user uploads their degree progress report (in PDF format) through the web application.
- Client-side Parsing: The PDF is parsed directly in the browser using the `pdfjs-dist` library. This approach enhances privacy, as the user's academic records are never sent to the server (a parsing sketch follows this list).
- Data Extraction: The parsed text is then processed to extract the student's completed courses and grades.
- Backend Storage: The extracted course information is stored in the `userCourses` table in the Convex database, associated with the authenticated user's ID.
- Degree Audit: This stored data can then be used to compare against program requirements, providing the student with an audit of their academic progress.
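To make the client-side parsing step concrete, here is a minimal sketch of browser-side text extraction with `pdfjs-dist`. The `getDocument`/`getTextContent` calls are the library's real API; the worker asset path and the hand-off to a report-specific course parser are assumptions about how AlbertPlus wires things up.

```ts
// Minimal browser-side extraction with pdfjs-dist; turning the raw text into
// courses and grades is left to a report-specific parser.
import * as pdfjsLib from "pdfjs-dist";

// pdfjs needs a worker script; serving it as a static asset is one common
// setup (the exact path depends on the bundler).
pdfjsLib.GlobalWorkerOptions.workerSrc = "/pdf.worker.min.mjs";

export async function extractReportText(file: File): Promise<string> {
  const data = await file.arrayBuffer();
  const pdf = await pdfjsLib.getDocument({ data }).promise;

  // Concatenate the text of every page; nothing leaves the browser.
  let text = "";
  for (let pageNo = 1; pageNo <= pdf.numPages; pageNo++) {
    const page = await pdf.getPage(pageNo);
    const content = await page.getTextContent();
    text +=
      content.items.map((item) => ("str" in item ? item.str : "")).join(" ") +
      "\n";
  }
  return text; // fed to the course/grade extraction step
}
```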