Data Pipeline Documentation
Semester.ly's data pipeline provides the infrastructure by which the database is filled with course information. Whether a given university offers an API or an online course catalogue, the pipeline gives developers a simple framework for pulling that information and saving it in our Django model format.
General System Workflow
1. Pull HTML/JSON markup from a catalogue/API.
2. Map the fields of the markup to the fields of our ingestor (by simply filling a Python dictionary).
3. The ingestor preprocesses the data, validates it, and writes it to JSON.
4. Load the JSON into the database.
Note
This process happens automatically via Django/Celery Beat periodic tasks. You can learn more about these scheduled tasks below (Scheduled Tasks).
Steps 1 and 2 are what we call parsing – an operation that does not generalize across universities, so a new parser must often be written. For more information, read Add a School.
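As a concrete illustration of steps 1 and 2, a parser for a hypothetical JSON course API might map raw fields onto ingestor keys like this (the school code, raw field names, and record shape are invented for this sketch; the target keys are drawn from Ingestor.ALL_KEYS):

```python
# Hypothetical example of the parse step: map one raw record from a
# school's course API onto the ingestor's expected keys.
def map_course(raw):
    """Map a raw catalogue record to ingestor-style keys (see ALL_KEYS)."""
    return {
        'school': 'demo_university',          # invented school code
        'code': raw['courseId'],
        'name': raw['title'],
        'descr': raw.get('description', ''),
        'num_credits': float(raw.get('credits', 0)),
        'instrs': raw.get('instructors', []),
    }

raw_record = {
    'courseId': 'CS 101',
    'title': 'Intro to Computer Science',
    'credits': '3',
    'instructors': ['A. Turing'],
}
course = map_course(raw_record)
```

In a real parser, each such dictionary would then be handed to the ingestor for validation and serialization to JSON.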
Parsing Library Documentation
Base Parser
class parsing.library.base_parser.BaseParser(school, config=None, output_path=None, output_error_path=None, break_on_error=True, break_on_warning=False, skip_duplicates=True, display_progress_bar=False, validate=True, tracker=None)
    Bases: object

    Abstract base parser for data pipeline parsers.

    extractor
        parsing.library.extractor.Extractor

    ingestor

    requester
Requester
class parsing.library.requester.Requester
    Bases: object

    get(url, params='', session=None, cookies=None, headers=None, verify=True, **kwargs)
        HTTP GET.
        Examples: TODO

    http_request(do_http_request, type, parse=True, quiet=True, timeout=60, throttle=<function <lambda>>)
        Perform HTTP request.
        Parameters:
            - do_http_request – function that returns a request object
            - type (str) – GET, POST, or HEAD
            - parse (bool, optional) – specifies whether the return value should be parsed; autodetects the parse type as html, xml, or json
            - quiet (bool, optional) – suppress output if True (default True)
            - timeout (int, optional) – request timeout in seconds (default 60)
            - throttle (callable, optional) – throttle function called between requests
        Returns: the raw request object if parse is False, else the soupified/jsonified text of the HTTP response.

    static markup(response)
        Autodetects html, json, or xml format in a response.
        Parameters: response – raw response object
        Returns: marked-up response
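The format autodetection in markup/http_request can be pictured with a small stand-alone sketch. This is purely illustrative: the real markup() returns parsed objects (e.g. soupified html), while this sketch only labels the format of a raw body.

```python
import json

def detect_markup_sketch(text):
    """Guess whether a raw response body is json, xml, or html.

    Illustrative only; not the library's implementation.
    """
    stripped = text.strip()
    # JSON parses cleanly or not at all, so try it first.
    try:
        json.loads(stripped)
        return 'json'
    except ValueError:
        pass
    # XML documents conventionally begin with an XML declaration.
    if stripped.startswith('<?xml'):
        return 'xml'
    # Anything else that opens with a tag is treated as HTML.
    if stripped.startswith('<'):
        return 'html'
    return 'unknown'
```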
Ingestor
exception parsing.library.ingestor.IngestionError(data, *args)
    Bases: parsing.library.exceptions.PipelineError
    Ingestor error class.
    args
    message

exception parsing.library.ingestor.IngestionWarning(data, *args)
    Bases: parsing.library.exceptions.PipelineWarning
    Ingestor warning class.
    args
    message
class parsing.library.ingestor.Ingestor(config, output, break_on_error=True, break_on_warning=False, display_progress_bar=True, skip_duplicates=True, validate=True, tracker=<parsing.library.tracker.NullTracker object>)
    Bases: dict

    Ingest parsing data into formatted JSON. Mimics the functionality of dict.

    ALL_KEYS
        set – Set of keys supported by Ingestor.
    break_on_error
        bool – Break/continue on errors.
    break_on_warning
        bool – Break/continue on warnings.
    school
        str – School code (e.g. jhu, gw, umich).
    skip_duplicates
        bool – Skip ingestion for repeated definitions.
    tracker
        library.tracker – Tracker object.
    UNICODE_WHITESPACE
        TYPE – regex that matches Unicode whitespace.
    validate
        bool – Enable/disable validation.
    validator
        library.validator – Validator instance.

    ALL_KEYS = set(['school_subdivision_code', 'code', 'isbn', 'author', 'prerequisites', 'instr', 'meetings', 'year', 'time_end', 'homepage', 'offerings', 'course_name', 'semester', 'cost', 'coreqs', 'fees', 'num_credits', 'detail_url', 'campus', 'size', 'remaining_seats', 'loc', 'fee', 'time_start', 'descr', 'title', 'meeting_section', 'section', 'section_type', 'enrolment', 'kind', 'dept_name', 'same_as', 'score', 'location', 'school', 'dept', 'department_code', 'instructor_name', 'areas', 'type', 'geneds', 'website', 'sections', 'description', 'waitlist', 'corequisites', 'start_time', 'instructors', 'term', 'dept_code', 'credits', 'section_code', 'course', 'section_name', 'date', 'capacity', 'instructor', 'school_subdivision_name', 'day', 'department', 'instr_name', 'department_name', 'waitlist_size', 'dates', 'name', 'level', 'textbooks', 'final_exam', 'enrollment', 'required', 'days', 'summary', 'prereqs', 'instr_names', 'instrs', 'image_url', 'end_time', 'time', 'cores', 'course_code', 'where', 'exclusions'])
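Since the Ingestor mimics a dict constrained by ALL_KEYS, its use can be illustrated with a toy stand-in. MiniIngestor and SUPPORTED_KEYS below are invented for this sketch and are not the library's API:

```python
# Illustrative sketch (not the real Ingestor): mimic its dict behaviour by
# rejecting keys outside a supported set, the way ALL_KEYS constrains input.
SUPPORTED_KEYS = {'school', 'course_code', 'name', 'credits', 'descr'}  # small subset

class MiniIngestor(dict):
    """Toy stand-in that only accepts whitelisted keys."""
    def __setitem__(self, key, value):
        if key not in SUPPORTED_KEYS:
            raise KeyError('unsupported ingestor key: %r' % key)
        super().__setitem__(key, value)

ingestor = MiniIngestor()
ingestor['course_code'] = 'EN.601.226'
ingestor['name'] = 'Data Structures'
```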
    ingest_meeting(section, clean_only=False)
        Create meeting ingested json map.
        Parameters: section (dict) – validated section object.
        Returns: meeting (dict).

    ingest_section(course)
        Create section json object from info in model map.
        Parameters: course (dict) – validated course object.
        Returns: section (dict).

    ingest_textbook_link(section=None)
        Create textbook link json object.
        Parameters: section (None, dict, optional) – section to attach the textbook link to.
        Returns: textbook link (dict).

    Standard dict methods (clear, copy, fromkeys, get, has_key, items, iteritems, iterkeys, itervalues, keys, pop, popitem, setdefault, update, values, viewitems, viewkeys, viewvalues) are inherited from dict.
Validator
exception parsing.library.validator.MultipleDefinitionsWarning(data, *args)
    Bases: parsing.library.validator.ValidationWarning
    Duplicated key in data definition.
    args
    message

exception parsing.library.validator.ValidationError(data, *args)
    Bases: parsing.library.exceptions.PipelineError
    Validator error class.
    args
    message

exception parsing.library.validator.ValidationWarning(data, *args)
    Bases: parsing.library.exceptions.PipelineWarning
    Validator warning class.
    args
    message
class parsing.library.validator.Validator(config, tracker=None, relative=True)
    Validation engine in the parsing data pipeline.

    config
        DotDict – Loaded config.json.
    tracker

    KINDS = set(['textbook_link', 'datalist', 'meeting', 'section', 'textbook', 'course', 'config', 'eval', 'directory', 'instructor', 'final_exam'])

    static file_to_json(path, allow_duplicates=False)
        Load the file pointed to by path into a JSON object dictionary.
        Returns: JSON-compliant dictionary (dict).
    classmethod load_schemas(schema_path=None)
        Load JSON validation schemas.
        NOTE: schemas are loaded as a static variable (i.e. once per definition), unless schema_path is specifically defined.
        Parameters: schema_path (None, str, optional) – override the default schema path.

    static schema_validate(data, schema, resolver=None)
        Validate a data object against a JSON schema alone.
        Raises: jsonschema.exceptions.ValidationError – invalid object.
    validate(data, transact=True)
        Validation entry point/dispatcher.
        Parameters: data (list, dict) – data to validate.

    validate_course(course)
        Validate a course.
        Parameters: course (DotDict) – course object to validate.
        Raises: MultipleDefinitionsWarning – course has already been validated in the same session; ValidationError – invalid course.

    validate_directory(directory)
        Validate a directory.
        Parameters: directory (str, dict) – directory to validate; may be either a path or an object.
        Raises: ValidationError – encapsulated IOError.

    validate_eval(course_eval)
        Validate an evaluation object.
        Parameters: course_eval (DotDict) – evaluation to validate.
        Raises: ValidationError – invalid evaluation.

    validate_final_exam(final_exam)
        Validate a final exam. NOTE: currently unused.
        Parameters: final_exam (DotDict) – final exam object to validate.
        Raises: ValidationError – invalid final exam.

    validate_instructor(instructor)
        Validate an instructor object.
        Parameters: instructor (DotDict) – instructor object to validate.
        Raises: ValidationError – invalid instructor.

    validate_location(location)
        Validate a location.
        Parameters: location (DotDict) – location object to validate.
        Raises: ValidationWarning – invalid location.

    validate_meeting(meeting)
        Validate a meeting object.
        Parameters: meeting (DotDict) – meeting object to validate.
        Raises: ValidationError – invalid meeting. May also raise ValidationWarning.

    validate_section(section)
        Validate a section object.
        Parameters: section (DotDict) – section object to validate.
        Raises: MultipleDefinitionsWarning – section has already been defined; ValidationError – invalid section.

    validate_self_contained(data_path, break_on_error=True, break_on_warning=False, output_error=None, display_progress_bar=True, master_log_path=None)
        Validate a JSON file on its own, without an ingestor.
        Parameters:
            - data_path (str) – path to the data file.
            - break_on_error (bool, optional) – stop validation on error.
            - break_on_warning (bool, optional) – stop validation on warning.
            - output_error (None, optional) – error output file path.
            - display_progress_bar (bool, optional) – display a progress bar.
            - master_log_path (None, optional) – master log file path.
        Raises: ValidationError – invalid data.

    validate_textbook_link(textbook_link)
        Validate a textbook link.
        Parameters: textbook_link (DotDict) – textbook link object to validate.
        Raises: ValidationError – invalid textbook link.

    validate_time_range(start, end)
        Validate a start time and end time. There exists an unhandled case if the end time is midnight.
        Raises: ValidationError – time range is invalid.

    static validate_website(url)
        Validate a url by sending a HEAD request and analyzing the response.
        Parameters: url (str) – URL to validate.
        Raises: ValidationError – URL is invalid.
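The time-range check above can be sketched as a minimal stand-alone function. This assumes 24-hour hh:mm strings (the real validator presumably handles more formats) and makes the noted midnight edge case visible:

```python
from datetime import datetime

def validate_time_range(start, end):
    """Raise ValueError if start/end do not form a valid time range.

    Minimal sketch: assumes 24-hour 'hh:mm' strings. Note the unhandled
    midnight case mentioned in the docs: an end time of '00:00' compares
    as earlier than any start time.
    """
    fmt = '%H:%M'
    start_t = datetime.strptime(start, fmt)
    end_t = datetime.strptime(end, fmt)
    if end_t <= start_t:
        raise ValueError('invalid time range: %s-%s' % (start, end))

validate_time_range('09:00', '10:15')  # a valid range raises nothing
```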
Logger
class parsing.library.logger.JSONColoredFormatter(fmt=None, datefmt=None)
    Bases: logging.Formatter

    Inherits converter, formatException, formatTime, and usesTime from logging.Formatter.
class parsing.library.logger.JSONFormatter(fmt=None, datefmt=None)
    Bases: logging.Formatter

    Simple JSON extension of Python logging.Formatter.

    format(record)
        Format record message.
        Parameters: record (logging.LogRecord) – record to format.
        Returns: prettified JSON string (str).

    Inherits converter, formatException, formatTime, and usesTime from logging.Formatter.
class parsing.library.logger.JSONStreamWriter(obj, type_=<type 'list'>, level=0)
    Bases: object

    Context to stream JSON list to file.

    BRACES
        TYPE – Open/close brace definitions.
    file
        dict – Current object being JSONified and streamed.
    first
        bool – Indicator if the first write has been done by the streamer.
    level
        int – Nesting level of the streamer.
    type_
        dict, list – Actual type class of the streamer (dict or list).

    Examples

    >>> with JSONStreamWriter(sys.stdout, type_=dict) as streamer:
    ...     streamer.write('a', 1)
    ...     streamer.write('b', 2)
    ...     streamer.write('c', 3)
    {
        "a": 1,
        "b": 2,
        "c": 3
    }
    >>> with JSONStreamWriter(sys.stdout, type_=dict) as streamer:
    ...     streamer.write('a', 1)
    ...     with streamer.write('data', type_=list) as streamer2:
    ...         streamer2.write({0: 0, 1: 1, 2: 2})
    ...         streamer2.write({3: 3, 4: '4'})
    ...     streamer.write('b', 2)
    {
        "a": 1,
        "data": [
            {
                0: 0,
                1: 1,
                2: 2
            },
            {
                3: 3,
                4: "4"
            }
        ],
        "b": 2
    }

    BRACES = {<type 'dict'>: ('{', '}'), <type 'list'>: ('[', ']')}

    write(*args, **kwargs)
        Write to JSON in streaming fashion. Picks either write_obj or write_key_value.
        Parameters:
            - *args – pass-through
            - **kwargs – pass-through
        Returns: return value of the appropriate write function.
        Raises: ValueError – type_ is not of type list or dict.
Tracker
class parsing.library.tracker.NullTracker(*args, **kwargs)
    Bases: parsing.library.tracker.Tracker

    Dummy tracker used as an interface placeholder.

    BROADCAST_TYPES = set(['TERM', 'STATS', 'YEAR', 'SCHOOL', 'MODE', 'TIME', 'DEPARTMENT', 'INSTRUCTOR'])

    add_viewer(viewer, name=None)
        Add viewer to the broadcast queue.

    end()
        End tracker and report to viewers.

    get_viewer(name)
        Get viewer by name. Will return an arbitrary match if multiple viewers with the same name exist.
        Parameters: name (str) – viewer name to get.
        Returns: Viewer instance if found, else None.

    has_viewer(name)
        Determine if name exists in viewers.
        Parameters: name (str) – the name to check against.
        Returns: True if name is in viewers, else False (bool).

    remove_viewer(name)
        Remove all viewers that match name.
        Parameters: name (str) – viewer name to remove.

    start()
        Start the tracker's timer.

    Properties: department, instructor, mode, school, stats, term, time, year.
class parsing.library.tracker.Tracker
    Bases: object

    Tracks specified attributes and broadcasts to viewers. @property attributes are defined for all BROADCAST_TYPES.

    BROADCAST_TYPES = set(['TERM', 'STATS', 'YEAR', 'SCHOOL', 'MODE', 'TIME', 'DEPARTMENT', 'INSTRUCTOR'])

    broadcast(broadcast_type)
        Broadcast tracker update to viewers.
        Parameters: broadcast_type (str) – message to go along the broadcast bus.
        Raises: TrackerError – if broadcast_type is not in BROADCAST_TYPES.

    get_viewer(name)
        Get viewer by name. Will return an arbitrary match if multiple viewers with the same name exist.
        Parameters: name (str) – viewer name to get.
        Returns: Viewer instance if found, else None.

    has_viewer(name)
        Determine if name exists in viewers.
        Parameters: name (str) – the name to check against.
        Returns: True if name is in viewers, else False (bool).
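The tracker/viewer relationship is a classic observer pattern: the tracker broadcasts typed messages and each viewer's receive decides what to act on. A toy sketch (MiniTracker and CountingViewer are invented; only BROADCAST_TYPES mirrors the docs above):

```python
# Illustrative observer-pattern sketch of tracker broadcasting, not the
# real parsing.library.tracker classes.
BROADCAST_TYPES = {'TERM', 'STATS', 'YEAR', 'SCHOOL', 'MODE', 'TIME',
                   'DEPARTMENT', 'INSTRUCTOR'}

class MiniTracker:
    def __init__(self):
        self.viewers = []

    def add_viewer(self, viewer):
        self.viewers.append(viewer)

    def broadcast(self, broadcast_type):
        # Mirrors Tracker.broadcast raising on unknown broadcast types.
        if broadcast_type not in BROADCAST_TYPES:
            raise ValueError('unknown broadcast type: %r' % broadcast_type)
        for viewer in self.viewers:
            viewer.receive(self, broadcast_type)

class CountingViewer:
    """Counts TIME broadcasts, ignoring everything else (like Hoarder does)."""
    def __init__(self):
        self.times_seen = 0

    def receive(self, tracker, broadcast_type):
        if broadcast_type == 'TIME':
            self.times_seen += 1
```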
exception parsing.library.tracker.TrackerError(data, *args)
    Bases: parsing.library.exceptions.PipelineError
    Tracker error class.
    args
    message
Viewer
class parsing.library.viewer.Hoarder
    Bases: parsing.library.viewer.Viewer

    Accumulates a log of some properties of the tracker.

    receive(tracker, broadcast_type)
        Receive an update from a tracker. Ignores all broadcasts that are not TIME.
        Parameters:
            - tracker (parsing.library.tracker.Tracker) – tracker the update is received from.
            - broadcast_type (str) – broadcast message from the tracker.
class parsing.library.viewer.StatProgressBar(stat_format='', statistics=None)
    Bases: parsing.library.viewer.Viewer

    Command line progress bar viewer for the data pipeline.

    SWITCH_SIZE = 100
class parsing.library.viewer.StatView
    Bases: parsing.library.viewer.Viewer

    Keeps a view of statistics of objects processed by the pipeline.

    KINDS
        tuple – The kinds of objects that can be tracked. TODO – move this to a shared space w/Validator.
    LABELS
        tuple – The status labels of objects that can be tracked.
    stats
        dict – The view itself of the stats.

    KINDS = ('course', 'section', 'meeting', 'textbook', 'evaluation', 'offering', 'textbook_link', 'eval')

    LABELS = ('valid', 'created', 'new', 'updated', 'total')

    receive(tracker, broadcast_type)
        Receive an update from a tracker. Ignores all broadcasts that are not STATS.
        Parameters:
            - tracker (parsing.library.tracker.Tracker) – tracker the update is received from.
            - broadcast_type (str) – broadcast message from the tracker.
class parsing.library.viewer.TimeDistributionView
    Bases: parsing.library.viewer.Viewer

    Viewer to analyze time distribution. Calculates granularity and holds a report and the 12/24hr distribution.

    distribution
        dict – Contains counts of 12 and 24hr sightings.
    granularity
        int – Time granularity of viewed times.

    receive(tracker, broadcast_type)
        Receive an update from a tracker. Ignores all broadcasts that are not TIME.
        Parameters:
            - tracker (parsing.library.tracker.Tracker) – tracker the update is received from.
            - broadcast_type (str) – broadcast message from the tracker.
class parsing.library.viewer.Timer(format='%(elapsed)s', **kwargs)
    Bases: progressbar.widgets.FormatLabel, progressbar.widgets.TimeSensitiveWidgetBase

    Custom timer created to take away the 'Elapsed Time' string.

    INTERVAL = datetime.timedelta(0, 0, 100000)

    check_size(progress)

    mapping = {u'seconds': (u'seconds_elapsed', None), u'max': (u'max_value', None), u'value': (u'value', None), u'elapsed': (u'total_seconds_elapsed', <function format_time>), u'start': (u'start_time', None), u'finished': (u'end_time', None), u'last_update': (u'last_update_time', None)}

    required_values = []
class parsing.library.viewer.Viewer
    Bases: object

    A view that is updated via a tracker object broadcast or report.
exception parsing.library.viewer.ViewerError(data, *args)
    Bases: parsing.library.exceptions.PipelineError
    Viewer error class.
    args
    message
Digestor
class parsing.library.digestor.Absorb(school, meta)
    Bases: parsing.library.digestor.DigestionStrategy

    Load valid data into the Django db.

    meta
        dict – Meta-information to use for the DataUpdate object.
    school
        str – School code.

    static remove_offerings(section_obj)
        Remove all offerings associated with a section.
        Parameters: section_obj (Section) – section whose offerings are removed.
class parsing.library.digestor.Burp(school, meta, output=None)
    Bases: parsing.library.digestor.DigestionStrategy

    Load valid data into the Django db and output the diff between input and db data.

    absorb
        Absorb – Digestion strategy.
    vommit
        Vommit – Digestion strategy.
class parsing.library.digestor.DigestionAdapter(school, cached)
    Bases: object

    Converts JSON definitions to model-compliant dictionaries.

    cache
        dict – Caches Django objects to avoid redundant queries.
    school
        str – School code.

    adapt_course(course)
        Adapt course for digestion.
        Parameters: course (dict) – course info.
        Returns: adapted course for Django object (dict).
        Raises: DigestionError – course is None.

    adapt_meeting(meeting, section_model=None)
        Adapt meeting to Django model.
        Parameters:
            - meeting (dict) – meeting to adapt.
            - section_model (None, optional) – section model the meeting belongs to.
        Yields: dict
        Raises: DigestionError – meeting is None.

    adapt_section(section, course_model=None)
        Adapt section to Django model.
        Parameters:
            - section (dict) – section to adapt.
            - course_model (None, optional) – course model the section belongs to.
        Returns: formatted section dictionary (dict).
        Raises: DigestionError
exception parsing.library.digestor.DigestionError(data, *args)
    Bases: parsing.library.exceptions.PipelineError
    Digestor error class.
    args
    message
class parsing.library.digestor.Digestor(school, meta, tracker=<parsing.library.tracker.NullTracker object>)
    Bases: object

    Digestor in the data pipeline.

    adapter
        DigestionAdapter – Adapts JSON definitions for the database models.
    cache
        dict – Caches recently used Django objects to be used as foreign keys.
    data
        TYPE – The data to be digested.
    meta
        dict – Meta data associated with input data.
    MODELS
        dict – Mapping from object type to Django model class.
    school
        str – School to digest.
    strategy
        DigestionStrategy – Load and/or diff db depending on strategy.
    tracker
        parsing.library.tracker.Tracker – Tracker object.

    MODELS = {'textbook_link': <class 'timetable.models.TextbookLink'>, 'offering': <class 'timetable.models.Offering'>, 'section': <class 'timetable.models.Section'>, 'textbook': <class 'timetable.models.Textbook'>, 'course': <class 'timetable.models.Course'>, 'semester': <class 'timetable.models.Semester'>, 'evaluation': <class 'timetable.models.Evaluation'>}

    digest_course(course)
        Create course in database from info in json model.
        Returns: Django course model object.

    digest_meeting(meeting, section_model=None)
        Create offering in database from info in model map.
        Parameters: section_model – Django section model object.
        Returns: offerings, as a generator.

    digest_section(section, course_model=None)
        Create section in database from info in model map.
        Parameters: course_model – Django course model object.
        Keyword Arguments: clean (boolean) – removes course offerings associated with the section if set.
        Returns: Django section model object.
class parsing.library.digestor.Vommit(output)
    Bases: parsing.library.digestor.DigestionStrategy

    Output diff between input and db data.
Exceptions
exception parsing.library.exceptions.ParseError(data, *args)
    Bases: parsing.library.exceptions.PipelineError
    Parser error class.
    args
    message

exception parsing.library.exceptions.ParseJump(data, *args)
    Bases: parsing.library.exceptions.PipelineWarning
    Parser exception used for control flow.
    args
    message

exception parsing.library.exceptions.ParseWarning(data, *args)
    Bases: parsing.library.exceptions.PipelineWarning
    Parser warning class.
    args
    message

exception parsing.library.exceptions.PipelineError(data, *args)
    Bases: parsing.library.exceptions.PipelineException
    Data-pipeline error class.
    args
    message

exception parsing.library.exceptions.PipelineException(data, *args)
    Bases: exceptions.Exception
    Data-pipeline exception class.
    Should never be constructed directly. Use PipelineError or PipelineWarning.
    args
    message

exception parsing.library.exceptions.PipelineWarning(data, *args)
    Bases: parsing.library.exceptions.PipelineException, exceptions.UserWarning
    Data-pipeline warning class.
    args
    message
Extractor
class parsing.library.extractor.Extraction(key, container, patterns)
    Bases: tuple

    key
        Alias for field number 0
    container
        Alias for field number 1
    patterns
        Alias for field number 2

    count(value) → integer – return number of occurrences of value.
    index(value[, start[, stop]]) → integer – return first index of value. Raises ValueError if the value is not present.

parsing.library.extractor.extract_info_from_text(text, inject=None, extractions=None, use_lowercase=True, splice_text=True)
    Attempt to extract info from text and put it into a course object.
    NOTE: currently unstable and unused, as it introduces too many bugs. Might be reconsidered for later use.
    Returns: the text, trimmed of extracted information.
Utils
class parsing.library.utils.DotDict(dct)
    Bases: dict

    Dot notation access for a dictionary. Supports set, get, and delete.

    Examples

    >>> d = DotDict({'a': 1, 'b': 2, 'c': {'ca': 31}})
    >>> d.a, d.b
    (1, 2)
    >>> d['a']
    1
    >>> d['a'] = 3
    >>> d.a, d['b']
    (3, 2)
    >>> d.c.ca, d.c['ca']
    (31, 31)
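The idea behind DotDict can be re-created in a few lines by forwarding attribute access to dict keys. This is a minimal sketch, not the library's actual code (in particular, the real class's handling of nesting may differ):

```python
class DotDictSketch(dict):
    """Toy DotDict: dot access forwarded to dict keys; nested dicts wrapped."""

    def __getattr__(self, key):
        # Only called when normal attribute lookup fails.
        try:
            value = self[key]
        except KeyError:
            raise AttributeError(key)
        # Wrap nested dicts so chained access (d.c.ca) also works.
        if isinstance(value, dict) and not isinstance(value, DotDictSketch):
            value = DotDictSketch(value)
        return value

    def __setattr__(self, key, value):
        self[key] = value

    def __delattr__(self, key):
        try:
            del self[key]
        except KeyError:
            raise AttributeError(key)
```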
    Standard dict methods (clear, copy, fromkeys, get, has_key, items, iteritems, iterkeys, itervalues, keys, pop, popitem, setdefault, update, values, viewitems, viewkeys, viewvalues) are inherited from dict.
parsing.library.utils.clean(dirt)
    Recursively clean a json-like object.

    - list: remove None elements; None on empty list.
    - dict: filter out None-valued (key, value) pairs; None on empty dict.
    - basestring: convert unicode whitespace to ascii and strip extra whitespace; None on empty string.

    Parameters: dirt – the object to clean.
    Returns: cleaned dict, cleaned list, cleaned string, or pass-through.
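The recursive cleaning rules above can be sketched in a stand-alone function. Illustrative only, not the library's implementation (e.g. the real code handles Python 2 basestring/unicode; this sketch uses str):

```python
import re

def clean_sketch(dirt):
    """Minimal re-creation of the recursive clean described above."""
    if isinstance(dirt, dict):
        cleaned = {k: clean_sketch(v) for k, v in dirt.items()}
        cleaned = {k: v for k, v in cleaned.items() if v is not None}
        return cleaned or None           # None on empty dict
    if isinstance(dirt, list):
        cleaned = [clean_sketch(x) for x in dirt]
        cleaned = [x for x in cleaned if x is not None]
        return cleaned or None           # None on empty list
    if isinstance(dirt, str):
        stripped = re.sub(r'\s+', ' ', dirt).strip()  # collapse/strip whitespace
        return stripped or None          # None on empty string
    return dirt                          # pass-through for other types
```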
parsing.library.utils.dict_filter_by_dict(a, b)
    Filter dictionary a by b (a dict or set). Items or keys must be strings or regexes. Filters at arbitrary depth with regex matching.
    Returns: filtered dictionary (dict).
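The regex-based filtering can be pictured with a simplified, shallow sketch: keep only the keys of a that match some pattern in a filter set. This is an assumption-heavy illustration; the real function also filters at arbitrary depth, which this sketch omits:

```python
import re

def dict_filter_by_set(a, patterns):
    """Shallow take on the idea behind dict_filter_by_dict: keep only keys
    of `a` matching one of the string/regex patterns. Illustrative only."""
    compiled = [re.compile(p) for p in patterns]
    return {k: v for k, v in a.items()
            if any(rx.fullmatch(k) for rx in compiled)}
```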
parsing.library.utils.dir_to_dict(path)
    Recursively create a nested dictionary representing directory contents.
    Parameters: path (str) – the path of the directory.
    Returns: dictionary representation of the directory (dict).
parsing.library.utils.iterrify(x)
    Create an iterable object if the input is not already iterable. Wraps str types in an extra iterable even though str is itself iterable.

    Examples

    >>> for i in iterrify(1):
    ...     print(i)
    1
    >>> for i in iterrify([1]):
    ...     print(i)
    1
    >>> for i in iterrify('hello'):
    ...     print(i)
    'hello'
parsing.library.utils.make_list(x=None)
    Wrap input in a list if it is not a list already. If input is None, returns an empty list.
    Parameters: x – input.
    Returns: input wrapped in a list (list).
parsing.library.utils.pretty_json(obj)
    Prettify an object as JSON.
    Parameters: obj (dict) – serializable object to JSONify.
    Returns: prettified JSON (str).
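In practice this kind of prettifier is usually a thin wrapper over json.dumps. A plausible sketch (the exact indent, key ordering, and separators used by the library are assumptions):

```python
import json

def pretty_json_sketch(obj):
    """Plausible one-liner behind pretty_json; formatting options assumed."""
    return json.dumps(obj, sort_keys=True, indent=4, separators=(',', ': '))
```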
parsing.library.utils.safe_cast(val, to_type, default=None)
    Attempt to cast to the specified type, returning default on failure.
    Parameters:
        - val – value to cast.
        - to_type – type to cast to.
        - default (None, optional) – value to return if the cast fails.
    Returns: the cast value, or default (to_type).
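safe_cast follows a common Python pattern; a sketch of it (assumed to match the library's behaviour, not copied from it):

```python
def safe_cast_sketch(val, to_type, default=None):
    """Try the cast; swallow the usual failure modes and return default."""
    try:
        return to_type(val)
    except (ValueError, TypeError):
        return default
```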
parsing.library.utils.time24(time)
    Convert a time to 24hr format.
    Parameters: time (str) – time in a reasonable format.
    Returns: 24hr time in format hh:mm (str).
    Raises: ParseError – unparseable time input.
Parsing Models Documentation
class parsing.models.DataUpdate(*args, **kwargs)
    Stores the date/time at which the school's data was last updated. Scheduled updates occur when digestion into the database completes.

    school
        CharField – the school code that was updated (e.g. jhu).
    semester
        ForeignKey to Semester – the semester for the update.
    last_updated
        DateTimeField – the datetime last updated.
    reason
        CharField – the reason it was updated (default: Scheduled Update).
    update_type
        CharField – which field was updated.

    COURSES
        str – Update type.
    EVALUATIONS
        str – Update type.
    MISCELLANEOUS
        str – Update type.
    TEXTBOOKS
        str – Update type.