Data Pipeline Documentation

Semester.ly’s data pipeline provides the infrastructure by which the database is filled with course information. Whether a given university offers an API or only an online course catalogue, the pipeline gives developers a simple framework for pulling that information and saving it in our Django model format.

General System Workflow

  1. Pull HTML/JSON markup from a catalogue/API
  2. Map the fields of the markup to the fields of our ingestor (by simply filling a Python dictionary).
  3. The ingestor preprocesses the data, validates it, and writes it to JSON.
  4. Load the JSON into the database.
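The four steps above can be sketched roughly as follows. This is a minimal, self-contained illustration written for Python 3, not the pipeline's actual code: the function names (fetch_catalogue, map_fields, ingest, load), the sample payload, and the checks are hypothetical stand-ins, and a plain dict plays the role of the database.

```python
import json


def fetch_catalogue():
    """Step 1 (stand-in): pretend this JSON came from a school's API."""
    return json.loads(
        '[{"CourseNo": "EN.600.120", "Title": " Intro Programming ", "Cr": "4"}]'
    )


def map_fields(raw):
    """Step 2: map the school's field names onto ingestor-style keys."""
    return {
        'kind': 'course',
        'code': raw['CourseNo'],
        'name': raw['Title'],
        'credits': raw['Cr'],
    }


def ingest(mapped):
    """Step 3: preprocess, validate, and serialize to JSON."""
    course = dict(mapped,
                  name=mapped['name'].strip(),
                  credits=float(mapped['credits']))
    if not course['code']:
        raise ValueError('course is missing a code')
    return json.dumps(course, sort_keys=True)


def load(serialized, database):
    """Step 4 (stand-in): a dict plays the role of the database."""
    course = json.loads(serialized)
    database[course['code']] = course


database = {}
for raw_course in fetch_catalogue():
    load(ingest(map_fields(raw_course)), database)
```

Only map_fields is school-specific in this sketch, which mirrors the split between parsing (steps 1 and 2) and the shared ingest and load stages.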

Note

This process happens automatically via Django/Celery Beat Periodic Tasks. You can learn more about these scheduled tasks below (Scheduled Tasks).

Steps 1 and 2 are what we call parsing – an operation that does not generalize across universities, so a new parser must often be written. For more information on this, read Add a School.

Parsing Library Documentation

Base Parser

class parsing.library.base_parser.BaseParser(school, config=None, output_path=None, output_error_path=None, break_on_error=True, break_on_warning=False, skip_duplicates=True, display_progress_bar=False, validate=True, tracker=None)[source]

Bases: object

Abstract base parser for data pipeline parsers.

extractor

parsing.library.extractor.Extractor

ingestor

parsing.library.ingestor.Ingestor

requester

parsing.library.requester.Requester

school

str – School that parser is for.

end()[source]

Finish the parse.

start(**kwargs)[source]

Start the parse.

Parameters:**kwargs – expanded in child parser.

Requester

class parsing.library.requester.Requester[source]

Bases: object

get(url, params='', session=None, cookies=None, headers=None, verify=True, **kwargs)[source]

HTTP GET.

Parameters:
  • url (str) – url to query
  • params (dict) – payload dictionary of HTTP params (default '')
  • cookies (None, optional) – Description
  • headers (None, optional) – Description
  • verify (bool, optional) – Description
  • **kwargs – Description

Examples

TODO

http_request(do_http_request, type, parse=True, quiet=True, timeout=60, throttle=<function <lambda>>)[source]

Perform HTTP request.

Parameters:
  • do_http_request – function that returns request object
  • type (str) – GET, POST, HEAD
  • parse (bool, optional) – Specifies if return should be parsed. Autodetects parse type as html, xml, or json.
  • quiet (bool, optional) – suppress output if True (default True)
  • timeout (int, optional) – Description
  • throttle (lambda, optional) – Description
Returns:

the request object if parse is False; otherwise the soupified/jsonified text of the HTTP request

Return type:

request object or parsed markup

static markup(response)[source]

Autodetects html, json, or xml format in response.

Parameters:response – raw response object
Returns:marked-up response
new_user_agent()[source]
overwrite_header(new_headers)[source]
post(url, data='', params='', cookies=None, headers=None, verify=True, **kwargs)[source]

HTTP POST.

Parameters:
  • url (str) – url to query
  • data (str, optional) – HTTP form key-value dictionary
  • params (dict) – payload dictionary of HTTP params
  • cookies (None, optional) – Description
  • headers (None, optional) – Description
  • verify (bool, optional) – Description
  • **kwargs – Description

Ingestor

exception parsing.library.ingestor.IngestionError(data, *args)[source]

Bases: parsing.library.exceptions.PipelineError

Ingestor error class.

args
message
exception parsing.library.ingestor.IngestionWarning(data, *args)[source]

Bases: parsing.library.exceptions.PipelineWarning

Ingestor warning class.

args
message
class parsing.library.ingestor.Ingestor(config, output, break_on_error=True, break_on_warning=False, display_progress_bar=True, skip_duplicates=True, validate=True, tracker=<parsing.library.tracker.NullTracker object>)[source]

Bases: dict

Ingest parsing data into formatted json.

Mimics functionality of dict.

ALL_KEYS

set – Set of keys supported by Ingestor.

break_on_error

bool – Break or continue on errors.

break_on_warning

bool – Break or continue on warnings.

school

str – School code (e.g. jhu, gw, umich).

skip_duplicates

bool – Skip ingestion for repeated definitions.

tracker

library.tracker – Tracker object.

UNICODE_WHITESPACE

TYPE – regex that matches Unicode whitespace.

validate

bool – Enable/disable validation.

validator

library.validator – Validator instance.

ALL_KEYS = set(['school_subdivision_code', 'code', 'isbn', 'author', 'prerequisites', 'instr', 'meetings', 'year', 'time_end', 'homepage', 'offerings', 'course_name', 'semester', 'cost', 'coreqs', 'fees', 'num_credits', 'detail_url', 'campus', 'size', 'remaining_seats', 'loc', 'fee', 'time_start', 'descr', 'title', 'meeting_section', 'section', 'section_type', 'enrolment', 'kind', 'dept_name', 'same_as', 'score', 'location', 'school', 'dept', 'department_code', 'instructor_name', 'areas', 'type', 'geneds', 'website', 'sections', 'description', 'waitlist', 'corequisites', 'start_time', 'instructors', 'term', 'dept_code', 'credits', 'section_code', 'course', 'section_name', 'date', 'capacity', 'instructor', 'school_subdivision_name', 'day', 'department', 'instr_name', 'department_name', 'waitlist_size', 'dates', 'name', 'level', 'textbooks', 'final_exam', 'enrollment', 'required', 'days', 'summary', 'prereqs', 'instr_names', 'instrs', 'image_url', 'end_time', 'time', 'cores', 'course_code', 'where', 'exclusions'])
clear() → None. Remove all items from D.
copy() → a shallow copy of D
end()[source]

Finish ingesting.

Close i/o, clear internal state, write meta info

fromkeys(S[, v]) → New dict with keys from S and values equal to v.

v defaults to None.

get(k[, d]) → D[k] if k in D, else d. d defaults to None.
has_key(k) → True if D has a key k, else False
ingest_course()[source]

Create course json from info in model map.

Returns:course
Return type:dict
ingest_eval()[source]

Create evaluation json object.

Returns:eval
Return type:dict
ingest_meeting(section, clean_only=False)[source]

Create meeting ingested json map.

Parameters:section (dict) – validated section object
Returns:meeting
Return type:dict
ingest_section(course)[source]

Create section json object from info in model map.

Parameters:course (dict) – validated course object
Returns:section
Return type:dict
ingest_textbook()[source]

Create textbook json object.

Returns:textbook
Return type:dict

Create textbook link json object.

Parameters:section (None, dict, optional) – Description
Returns:textbook link.
Return type:dict
items() → list of D's (key, value) pairs, as 2-tuples
iteritems() → an iterator over the (key, value) items of D
iterkeys() → an iterator over the keys of D
itervalues() → an iterator over the values of D
keys() → list of D's keys
pop(k[, d]) → v, remove specified key and return the corresponding value.

If key is not found, d is returned if given, otherwise KeyError is raised

popitem() → (k, v), remove and return some (key, value) pair as a

2-tuple; but raise KeyError if D is empty.

setdefault(k[, d]) → D.get(k,d), also set D[k]=d if k not in D
update([E, ]**F) → None. Update D from dict/iterable E and F.

If E present and has a .keys() method, does: for k in E: D[k] = E[k] If E present and lacks .keys() method, does: for (k, v) in E: D[k] = v In either case, this is followed by: for k in F: D[k] = F[k]

values() → list of D's values
viewitems() → a set-like object providing a view on D's items
viewkeys() → a set-like object providing a view on D's keys
viewvalues() → an object providing a view on D's values

Validator

exception parsing.library.validator.MultipleDefinitionsWarning(data, *args)[source]

Bases: parsing.library.validator.ValidationWarning

Duplicated key in data definition.

args
message
exception parsing.library.validator.ValidationError(data, *args)[source]

Bases: parsing.library.exceptions.PipelineError

Validator error class.

args
message
exception parsing.library.validator.ValidationWarning(data, *args)[source]

Bases: parsing.library.exceptions.PipelineWarning

Validator warning class.

args
message
class parsing.library.validator.Validator(config, tracker=None, relative=True)[source]

Validation engine in parsing data pipeline.

config

DotDict – Loaded config.json.

course_code_regex

re – Regex to match course code.

kind_to_validation_function

dict – Map kind to validation function defined within this class.

KINDS

set – Kinds of objects that validator validates.

relative

bool – Enforce relative ordering in validation.

seen

dict – Running monitor of seen courses and sections

tracker

parsing.library.tracker.Tracker

KINDS = set(['textbook_link', 'datalist', 'meeting', 'section', 'textbook', 'course', 'config', 'eval', 'directory', 'instructor', 'final_exam'])
static file_to_json(path, allow_duplicates=False)[source]

Load file pointed to by path into json object dictionary.

Parameters:
  • path (str) –
  • allow_duplicates (bool, optional) – Allow duplicate keys in JSON.
Returns:

JSON-compliant dictionary.

Return type:

dict
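One way the allow_duplicates behaviour could be implemented is with the object_pairs_hook argument of the standard json module. The sketch below is an assumption about the technique, not the library's code; it parses a string rather than a file path, the name load_json is hypothetical, and it raises ValueError where the pipeline would presumably use its own exception types.

```python
import json


def load_json(text, allow_duplicates=False):
    """Parse JSON text, optionally rejecting objects with repeated keys."""

    def reject_duplicates(pairs):
        # json calls this with the (key, value) pairs of every object.
        seen = {}
        for key, value in pairs:
            if key in seen:
                raise ValueError('duplicate key: {0}'.format(key))
            seen[key] = value
        return seen

    hook = None if allow_duplicates else reject_duplicates
    return json.loads(text, object_pairs_hook=hook)
```

With allow_duplicates=True the default json behaviour applies, where the last value for a repeated key wins.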

classmethod load_schemas(schema_path=None)[source]

Load JSON validation schemas.

NOTE: Will load schemas as static variable (i.e. once per definition),
unless schema_path is specifically defined.
Parameters:schema_path (None, str, optional) – Override default schema_path
static schema_validate(data, schema, resolver=None)[source]

Validate data object with JSON schema alone.

Parameters:
  • data (dict) – Data object to validate.
  • schema – JSON schema to validate against.
  • resolver (None, optional) – JSON Schema reference resolution.
Raises:

jsonschema.exceptions.ValidationError – Invalid object.

validate(data, transact=True)[source]

Validation entry/dispatcher.

Parameters:data (list, dict) – Data to validate.
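The dispatch idea behind validate and kind_to_validation_function can be illustrated with a small stand-in. Everything here is hypothetical (the class name SketchValidator, the two kinds, the checks, and the use of ValueError); it only shows how a 'kind' field can select a validate_<kind> method.

```python
class SketchValidator(object):
    """Dispatches each object to a validate_<kind> method (hypothetical)."""

    KINDS = ('course', 'section')

    def __init__(self):
        self.kind_to_validation_function = {
            kind: getattr(self, 'validate_' + kind) for kind in self.KINDS
        }
        self.validated = []

    def validate(self, data):
        # Accept a single object or a list of objects.
        for obj in (data if isinstance(data, list) else [data]):
            kind = obj.get('kind')
            if kind not in self.kind_to_validation_function:
                raise ValueError('unknown kind: {0!r}'.format(kind))
            self.kind_to_validation_function[kind](obj)
            self.validated.append(kind)

    def validate_course(self, course):
        if not course.get('code'):
            raise ValueError('course has no code')

    def validate_section(self, section):
        if not section.get('course'):
            raise ValueError('section has no course')


validator = SketchValidator()
validator.validate([
    {'kind': 'course', 'code': 'EN.600.120'},
    {'kind': 'section', 'code': '01', 'course': {'code': 'EN.600.120'}},
])
```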
validate_course(course)[source]

Validate course.

Parameters:

course (DotDict) – Course object to validate.

Raises:
validate_directory(directory)[source]

Validate directory.

Parameters:directory (str, dict) – Directory to validate. May be either path or object.
Raises:ValidationError – encapsulated IOError
validate_eval(course_eval)[source]

Validate evaluation object.

Parameters:course_eval (DotDict) – Evaluation to validate.
Raises:ValidationError – Invalid evaluation.
validate_final_exam(final_exam)[source]

Validate final exam.

NOTE: currently unused.

Parameters:final_exam (DotDict) – Final Exam object to validate.
Raises:ValidationError – Invalid final exam.
validate_instructor(instructor)[source]

Validate instructor object.

Parameters:instructor (DotDict) – Instructor object to validate.
Raises:ValidationError – Invalid instructor.
validate_location(location)[source]

Validate location.

Parameters:location (DotDict) – Location object to validate.
Raises:ValidationWarning – Invalid location.
validate_meeting(meeting)[source]

Validate meeting object.

Parameters:

meeting (DotDict) – Meeting object to validate.

Raises:
validate_section(section)[source]

Validate section object.

Parameters:

section (DotDict) – Section object to validate.

Raises:
validate_self_contained(data_path, break_on_error=True, break_on_warning=False, output_error=None, display_progress_bar=True, master_log_path=None)[source]

Validate JSON file on its own, without the ingestor.

Parameters:
  • data_path (str) – Path to data file.
  • break_on_error (bool, optional) – Description
  • break_on_warning (bool, optional) – Description
  • output_error (None, optional) – Error output file path.
  • display_progress_bar (bool, optional) – Description
  • master_log_path (None, optional) – Description
Raises:

ValidationError – Description

Validate textbook link.

Parameters:textbook_link (DotDict) – Textbook link object to validate.
Raises:ValidationError – Invalid textbook link.
validate_time_range(start, end)[source]

Validate start time and end time.

There exists an unhandled case if the end time is midnight.

Parameters:
  • start (str) – Start time.
  • end (str) – End time.
Raises:

ValidationError – Time range is invalid.
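A minimal sketch of such a range check, assuming both times are 'HH:MM' strings in 24hr format. The function name and the use of ValueError are stand-ins; the real method raises ValidationError. Note that an end time of midnight ('00:00') is rejected here, which is one way the unhandled case mentioned above can show up.

```python
def validate_time_range_sketch(start, end):
    """Raise ValueError unless start is strictly before end ('HH:MM')."""

    def to_minutes(value):
        hours, minutes = value.split(':')
        hours, minutes = int(hours), int(minutes)
        if not (0 <= hours < 24 and 0 <= minutes < 60):
            raise ValueError('invalid time: {0}'.format(value))
        return hours * 60 + minutes

    if to_minutes(start) >= to_minutes(end):
        raise ValueError(
            'start {0} is not before end {1}'.format(start, end))
```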

static validate_website(url)[source]

Validate url by sending HEAD request and analyzing response.

Parameters:url (str) – URL to validate.
Raises:ValidationError – URL is invalid.

Logger

class parsing.library.logger.JSONColoredFormatter(fmt=None, datefmt=None)[source]

Bases: logging.Formatter

converter()
localtime([seconds]) -> (tm_year,tm_mon,tm_mday,tm_hour,tm_min,
tm_sec,tm_wday,tm_yday,tm_isdst)

Convert seconds since the Epoch to a time tuple expressing local time. When ‘seconds’ is not passed in, convert the current time instead.

format(record)[source]
formatException(ei)

Format and return the specified exception information as a string.

This default implementation just uses traceback.print_exception()

formatTime(record, datefmt=None)

Return the creation time of the specified LogRecord as formatted text.

This method should be called from format() by a formatter which wants to make use of a formatted time. This method can be overridden in formatters to provide for any specific requirement, but the basic behaviour is as follows: if datefmt (a string) is specified, it is used with time.strftime() to format the creation time of the record. Otherwise, the ISO8601 format is used. The resulting string is returned. This function uses a user-configurable function to convert the creation time to a tuple. By default, time.localtime() is used; to change this for a particular formatter instance, set the ‘converter’ attribute to a function with the same signature as time.localtime() or time.gmtime(). To change it for all formatters, for example if you want all logging times to be shown in GMT, set the ‘converter’ attribute in the Formatter class.

usesTime()

Check if the format uses the creation time of the record.

class parsing.library.logger.JSONFormatter(fmt=None, datefmt=None)[source]

Bases: logging.Formatter

Simple JSON extension of Python logging.Formatter.

converter()
localtime([seconds]) -> (tm_year,tm_mon,tm_mday,tm_hour,tm_min,
tm_sec,tm_wday,tm_yday,tm_isdst)

Convert seconds since the Epoch to a time tuple expressing local time. When ‘seconds’ is not passed in, convert the current time instead.

format(record)[source]

Format record message.

Parameters:record (logging.LogRecord) – Description
Returns:Prettified JSON string.
Return type:str
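As a rough illustration of a JSON-producing formatter, the sketch below subclasses logging.Formatter and pretty-prints dict or list messages. It is an assumption about the general approach, not the library's implementation; the class name SketchJSONFormatter and the sample record are hypothetical.

```python
import json
import logging


class SketchJSONFormatter(logging.Formatter):
    """Render dict or list log messages as indented JSON (hypothetical)."""

    def format(self, record):
        message = record.msg
        if isinstance(message, (dict, list)):
            return json.dumps(message, indent=2, sort_keys=True)
        # Fall back to the standard formatting for plain messages.
        return super(SketchJSONFormatter, self).format(record)


record = logging.LogRecord(
    'demo', logging.INFO, 'demo.py', 0, {'kind': 'course'}, None, None)
formatted = SketchJSONFormatter().format(record)
```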
formatException(ei)

Format and return the specified exception information as a string.

This default implementation just uses traceback.print_exception()

formatTime(record, datefmt=None)

Return the creation time of the specified LogRecord as formatted text.

This method should be called from format() by a formatter which wants to make use of a formatted time. This method can be overridden in formatters to provide for any specific requirement, but the basic behaviour is as follows: if datefmt (a string) is specified, it is used with time.strftime() to format the creation time of the record. Otherwise, the ISO8601 format is used. The resulting string is returned. This function uses a user-configurable function to convert the creation time to a tuple. By default, time.localtime() is used; to change this for a particular formatter instance, set the ‘converter’ attribute to a function with the same signature as time.localtime() or time.gmtime(). To change it for all formatters, for example if you want all logging times to be shown in GMT, set the ‘converter’ attribute in the Formatter class.

usesTime()

Check if the format uses the creation time of the record.

class parsing.library.logger.JSONStreamWriter(obj, type_=<type 'list'>, level=0)[source]

Bases: object

Context to stream JSON list to file.

BRACES

TYPE – Open close brace definitions.

file

dict – Current object being JSONified and streamed.

first

bool – Indicator if first write has been done by streamer.

level

int – Nesting level of streamer.

type_

dict, list – Actual type class of streamer (dict or list).

Examples

>>> with JSONStreamWriter(sys.stdout, type_=dict) as streamer:
...     streamer.write('a', 1)
...     streamer.write('b', 2)
...     streamer.write('c', 3)
{
    "a": 1,
    "b": 2,
    "c": 3
}
>>> with JSONStreamWriter(sys.stdout, type_=dict) as streamer:
...     streamer.write('a', 1)
...     with streamer.write('data', type_=list) as streamer2:
...         streamer2.write({0:0, 1:1, 2:2})
...         streamer2.write({3:3, 4:'4'})
...     streamer.write('b', 2)
{
    "a": 1,
    "data":
    [
        {
            0: 0,
            1: 1,
            2: 2
        },
        {
            3: 3,
            4: "4"
        }
    ],
    "b": 2
}
BRACES = {<type 'dict'>: ('{', '}'), <type 'list'>: ('[', ']')}
enter()[source]

Wrapper for self.__enter__.

exit()[source]

Wrapper for self.__exit__.

write(*args, **kwargs)[source]

Write to JSON in streaming fashion.

Picks either write_obj or write_key_value.

Parameters:
  • *args – pass-through
  • **kwargs – pass-through
Returns:

return value of appropriate write function.

Raises:

ValueError – type_ is not of type list or dict.

write_key_value(key, value=None, type_=<type 'list'>)[source]

Write key, value pair as string to file.

If value is not given, returns new list streamer.

Parameters:
  • key (str) – Description
  • value (str, dict, None, optional) – Description
  • type_ (str, optional) – Description
Returns:

None if value is given, else new JSONStreamWriter

write_obj(obj)[source]

Write obj as JSON to file.

Parameters:obj (dict) – Serializable obj to write to file.
parsing.library.logger.colored_json(j)[source]

Tracker

class parsing.library.tracker.NullTracker(*args, **kwargs)[source]

Bases: parsing.library.tracker.Tracker

Dummy tracker used as an interface placeholder.

BROADCAST_TYPES = set(['TERM', 'STATS', 'YEAR', 'SCHOOL', 'MODE', 'TIME', 'DEPARTMENT', 'INSTRUCTOR'])
add_viewer(viewer, name=None)

Add viewer to broadcast queue.

Parameters:
  • viewer (Viewer) – Viewer to add.
  • name (None, str, optional) – Name the viewer.
broadcast(broadcast_type)[source]

Do nothing.

department
end()

End tracker and report to viewers.

get_viewer(name)

Get viewer by name.

Will return arbitrary match if multiple viewers with same name exist.

Parameters:name (str) – Viewer name to get.
Returns:Viewer instance if found, else None
Return type:Viewer
has_viewer(name)

Determine if name exists in viewers.

Parameters:name (str) – The name to check against.
Returns:True if name in viewers else False
Return type:bool
instructor
mode
remove_viewer(name)

Remove all viewers that match name.

Parameters:name (str) – Viewer name to remove.
report()[source]

Do nothing.

school
start()

Start timer of tracker object.

stats
term
time
year
class parsing.library.tracker.Tracker[source]

Bases: object

Tracks specified attributes and broadcasts to viewers.

@property attributes are defined for all BROADCAST_TYPES

BROADCAST_TYPES = set(['TERM', 'STATS', 'YEAR', 'SCHOOL', 'MODE', 'TIME', 'DEPARTMENT', 'INSTRUCTOR'])
add_viewer(viewer, name=None)[source]

Add viewer to broadcast queue.

Parameters:
  • viewer (Viewer) – Viewer to add.
  • name (None, str, optional) – Name the viewer.
broadcast(broadcast_type)[source]

Broadcast tracker update to viewers.

Parameters:broadcast_type (str) – message to go along broadcast bus.
Raises:TrackerError – if broadcast_type is not in BROADCAST_TYPES.
end()[source]

End tracker and report to viewers.

get_viewer(name)[source]

Get viewer by name.

Will return arbitrary match if multiple viewers with same name exist.

Parameters:name (str) – Viewer name to get.
Returns:Viewer instance if found, else None
Return type:Viewer
has_viewer(name)[source]

Determine if name exists in viewers.

Parameters:name (str) – The name to check against.
Returns:True if name in viewers else False
Return type:bool
remove_viewer(name)[source]

Remove all viewers that match name.

Parameters:name (str) – Viewer name to remove.
report()[source]

Notify viewers that tracker has ended.

start()[source]

Start timer of tracker object.

exception parsing.library.tracker.TrackerError(data, *args)[source]

Bases: parsing.library.exceptions.PipelineError

Tracker error class.

args
message
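The relationship between a tracker and its viewers is essentially the observer pattern. The following sketch shows that idea only; SketchTracker, SketchViewer, and the track method are hypothetical names, the set of broadcast types is shortened, and ValueError stands in for TrackerError. The real Tracker exposes @property attributes rather than a track method.

```python
class SketchViewer(object):
    """Collects every broadcast it receives (hypothetical viewer)."""

    def __init__(self):
        self.received = []

    def receive(self, tracker, broadcast_type):
        value = getattr(tracker, broadcast_type.lower())
        self.received.append((broadcast_type, value))


class SketchTracker(object):
    """Stores attributes and broadcasts each change to its viewers."""

    BROADCAST_TYPES = {'SCHOOL', 'TERM', 'YEAR'}

    def __init__(self):
        self.viewers = []  # list of (name, viewer) pairs

    def add_viewer(self, viewer, name=None):
        self.viewers.append((name, viewer))

    def broadcast(self, broadcast_type):
        if broadcast_type not in self.BROADCAST_TYPES:
            raise ValueError('unsupported broadcast type: ' + broadcast_type)
        for _, viewer in self.viewers:
            viewer.receive(self, broadcast_type)

    def track(self, broadcast_type, value):
        """Record a value, then notify viewers (stand-in for properties)."""
        setattr(self, broadcast_type.lower(), value)
        self.broadcast(broadcast_type)


tracker = SketchTracker()
viewer = SketchViewer()
tracker.add_viewer(viewer, name='log')
tracker.track('SCHOOL', 'jhu')
tracker.track('YEAR', '2017')
```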

Viewer

class parsing.library.viewer.ETAProgressBar[source]

Bases: parsing.library.viewer.Viewer

receive(tracker, broadcast_type)[source]
report(tracker)[source]

Do nothing.

class parsing.library.viewer.Hoarder[source]

Bases: parsing.library.viewer.Viewer

Accumulate a log of some properties of the tracker.

receive(tracker, broadcast_type)[source]

Receive an update from a tracker.

Ignore all broadcasts that are not TIME.

Parameters:
report(tracker)[source]

Do nothing.

schools

Get schools attribute (i.e. self.schools).

Returns:Value of schools storage value.
Return type:dict
class parsing.library.viewer.StatProgressBar(stat_format='', statistics=None)[source]

Bases: parsing.library.viewer.Viewer

Command line progress bar viewer for data pipeline.

SWITCH_SIZE = 100
receive(tracker, broadcast_type)[source]

Incremental update to progress bar.

report(tracker)[source]

Do nothing.

class parsing.library.viewer.StatView[source]

Bases: parsing.library.viewer.Viewer

Keeps a view of statistics of objects processed by the pipeline.

KINDS

tuple – The kinds of objects that can be tracked. TODO - move this to a shared space w/Validator

LABELS

tuple – The status labels of objects that can be tracked.

stats

dict – The view itself of the stats.

KINDS = ('course', 'section', 'meeting', 'textbook', 'evaluation', 'offering', 'textbook_link', 'eval')
LABELS = ('valid', 'created', 'new', 'updated', 'total')
receive(tracker, broadcast_type)[source]

Receive an update from a tracker.

Ignore all broadcasts that are not STATS.

Parameters:
report(tracker=None)[source]

Dump stats.

class parsing.library.viewer.TimeDistributionView[source]

Bases: parsing.library.viewer.Viewer

Viewer to analyze time distribution.

Calculates granularity and holds a report of the 12hr and 24hr distribution.

distribution

dict – Contains counts of 12 and 24hr sightings.

granularity

int – Time granularity of viewed times.

receive(tracker, broadcast_type)[source]

Receive an update from a tracker.

Ignore all broadcasts that are not TIME.

Parameters:
report(tracker)[source]

Do nothing.

class parsing.library.viewer.Timer(format='%(elapsed)s', **kwargs)[source]

Bases: progressbar.widgets.FormatLabel, progressbar.widgets.TimeSensitiveWidgetBase

Custom timer created to take away ‘Elapsed Time’ string.

INTERVAL = datetime.timedelta(0, 0, 100000)
check_size(progress)
mapping = {u'seconds': (u'seconds_elapsed', None), u'max': (u'max_value', None), u'value': (u'value', None), u'elapsed': (u'total_seconds_elapsed', <function format_time>), u'start': (u'start_time', None), u'finished': (u'end_time', None), u'last_update': (u'last_update_time', None)}
required_values = []
class parsing.library.viewer.Viewer[source]

Bases: object

A view that is updated via a tracker object broadcast or report.

receive(tracker, broadcast_type)[source]

Incremental updates of tracking info.

Parameters:
  • tracker (Tracker) – Tracker instance.
  • broadcast_type (str) – Broadcast type emitted by tracker.
report(tracker)[source]

Report all tracked info.

Parameters:tracker (Tracker) – Tracker instance.
exception parsing.library.viewer.ViewerError(data, *args)[source]

Bases: parsing.library.exceptions.PipelineError

Viewer error class.

args
message

Digestor

class parsing.library.digestor.Absorb(school, meta)[source]

Bases: parsing.library.digestor.DigestionStrategy

Load valid data into Django db.

meta

dict – Meta-information to use for DataUpdate object

school

str

classmethod digest_section(parmams, clean=True)[source]
static remove_offerings(section_obj)[source]

Remove all offerings associated with a section.

Parameters:section_obj (Section) – Description
static remove_section(section_code, course_obj)[source]

Remove section specified from database.

Parameters:
  • section (dict) – Description
  • course_obj (Course) – Section part of this course.
wrap_up()[source]

Update the last-updated time for the school at the end of the parse.

class parsing.library.digestor.Burp(school, meta, output=None)[source]

Bases: parsing.library.digestor.DigestionStrategy

Load valid data into Django db and output diff between input and db data.

absorb

Absorb – Digestion strategy.

vommit

Vommit – Digestion strategy.

wrap_up()[source]
class parsing.library.digestor.DigestionAdapter(school, cached)[source]

Bases: object

Converts JSON definitions to a model-compliant dictionary.

cache

dict – Caches Django objects to avoid redundant queries.

school

str – School code.

adapt_course(course)[source]

Adapt course for digestion.

Parameters:course (dict) – course info
Returns:Adapted course for django object.
Return type:dict
Raises:DigestionError – course is None
adapt_meeting(meeting, section_model=None)[source]

Adapt meeting to Django model.

Parameters:
  • meeting (TYPE) – Description
  • section_model (None, optional) – Description
Yields:

dict

Raises:

DigestionError – meeting is None.

adapt_section(section, course_model=None)[source]

Adapt section to Django model.

Parameters:
  • section (TYPE) – Description
  • course_model (None, optional) – Description
Returns:

formatted section dictionary

Return type:

dict

Raises:

DigestionError – Description

adapt_textbook(textbook)[source]

Adapt textbook to model dictionary.

Parameters:textbook (dict) – validated textbook.
Returns:Description
Return type:dict

Adapt textbook link to model dictionary.

Parameters:
  • textbook_link (dict) – validated
  • textbook_model (model, None, optional) –
  • section_model (model, None, optional) –
Yields:

dict – model compliant

exception parsing.library.digestor.DigestionError(data, *args)[source]

Bases: parsing.library.exceptions.PipelineError

Digestor error class.

args
message
class parsing.library.digestor.DigestionStrategy[source]

Bases: object

wrap_up()[source]

Do whatever needs to be done to wrap_up digestion session.

class parsing.library.digestor.Digestor(school, meta, tracker=<parsing.library.tracker.NullTracker object>)[source]

Bases: object

Digestor in data pipeline.

adapter

DigestionAdapter – Adapts

cache

dict – Caches recently used Django objects to be used as foreign keys.

data

TYPE – The data to be digested.

meta

dict – meta data associated with input data.

MODELS

dict – mapping from object type to Django model class.

school

str – School to digest.

strategy

DigestionStrategy – Load and/or diff db depending on strategy

tracker

parsing.library.tracker.Tracker – Description

MODELS = {'textbook_link': <class 'timetable.models.TextbookLink'>, 'offering': <class 'timetable.models.Offering'>, 'section': <class 'timetable.models.Section'>, 'textbook': <class 'timetable.models.Textbook'>, 'course': <class 'timetable.models.Course'>, 'semester': <class 'timetable.models.Semester'>, 'evaluation': <class 'timetable.models.Evaluation'>}
digest(data, diff=True, load=True, output=None)[source]

Digest data.

digest_course(course)[source]

Create course in database from info in json model.

Returns:django course model object
digest_meeting(meeting, section_model=None)[source]

Create offering in database from info in model map.

Parameters:section_model – JSON course model object

Return: Offerings as generator

digest_section(section, course_model=None)[source]

Create section in database from info in model map.

Parameters:course_model – django course model object
Keyword Arguments:
 clean (boolean) – removes course offerings associated with section if set
Returns:django section model object
digest_textbook(textbook)[source]

Digest textbook.

Parameters:textbook (dict) –

Digest textbook link.

Parameters:
wrap_up()[source]
class parsing.library.digestor.Vommit(output)[source]

Bases: parsing.library.digestor.DigestionStrategy

Output diff between input and db data.

diff(kind, inmodel, dbmodel, hide_defaults=True)[source]

Create a diff between input and existing model.

Parameters:
  • kind (str) – kind of object to diff.
  • inmodel (model) – Description
  • dbmodel (model) – Description
  • hide_defaults (bool, optional) – hide values that are defaulted into db
Returns:

Diff

Return type:

dict

static get_model_defaults()[source]
remove_defaulted_keys(kind, dct)[source]
wrap_up()[source]

Exceptions

exception parsing.library.exceptions.ParseError(data, *args)[source]

Bases: parsing.library.exceptions.PipelineError

Parser error class.

args
message
exception parsing.library.exceptions.ParseJump(data, *args)[source]

Bases: parsing.library.exceptions.PipelineWarning

Parser exception used for control flow.

args
message
exception parsing.library.exceptions.ParseWarning(data, *args)[source]

Bases: parsing.library.exceptions.PipelineWarning

Parser warning class.

args
message
exception parsing.library.exceptions.PipelineError(data, *args)[source]

Bases: parsing.library.exceptions.PipelineException

Data-pipeline error class.

args
message
exception parsing.library.exceptions.PipelineException(data, *args)[source]

Bases: exceptions.Exception

Data-pipeline exception class.

Should never be constructed directly. Use:
  • PipelineError
  • PipelineWarning
args
message
exception parsing.library.exceptions.PipelineWarning(data, *args)[source]

Bases: parsing.library.exceptions.PipelineException, exceptions.UserWarning

Data-pipeline warning class.

args
message
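The hierarchy above can be mirrored in a few lines to show how errors and warnings are told apart when caught. This is a sketch that reuses the documented class names; storing the first argument as a data attribute is an assumption, and only the parser subclasses are reproduced.

```python
class PipelineException(Exception):
    """Base class; keeps the offending data (assumed attribute)."""

    def __init__(self, data, *args):
        self.data = data
        super(PipelineException, self).__init__(data, *args)


class PipelineError(PipelineException):
    """Problems that should normally stop processing."""


class PipelineWarning(PipelineException, UserWarning):
    """Problems that can usually be reported and skipped."""


class ParseError(PipelineError):
    pass


class ParseWarning(PipelineWarning):
    pass


caught = []
for exc in (ParseError('bad course'), ParseWarning('odd time')):
    try:
        raise exc
    except PipelineWarning:
        caught.append('warning')
    except PipelineError:
        caught.append('error')
```

Because both branches share PipelineException as a base, a single except PipelineException clause would catch either kind.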

Extractor

class parsing.library.extractor.Extraction(key, container, patterns)

Bases: tuple

container

Alias for field number 1

count(value) → integer -- return number of occurrences of value
index(value[, start[, stop]]) → integer -- return first index of value.

Raises ValueError if the value is not present.

key

Alias for field number 0

patterns

Alias for field number 2
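The "Alias for field number" entries suggest that Extraction is a namedtuple. A sketch under that assumption follows; the field values (a 'prereqs' key, a list container, and one regex pattern) are purely illustrative and not taken from the library.

```python
from collections import namedtuple

# Field order matches the documented indices: key=0, container=1, patterns=2.
Extraction = namedtuple('Extraction', ['key', 'container', 'patterns'])

prereq = Extraction(
    key='prereqs',
    container=list,
    patterns=[r'prerequisites?:\s*(.+)'],
)
```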

parsing.library.extractor.extract_info_from_text(text, inject=None, extractions=None, use_lowercase=True, splice_text=True)[source]

Attempt to extract info from text and put it into course object.

NOTE: Currently unstable and unused as it introduces too many bugs.
Might reconsider for later use.
Parameters:
  • text (str) – text to attempt to extract information from
  • extractions (None, optional) – Description
  • inject (None, optional) – Description
  • use_lowercase (bool, optional) – Description
Returns:

the text trimmed of extracted information

Return type:

str

Utils

class parsing.library.utils.DotDict(dct)[source]

Bases: dict

Dot notation access for dictionary.

Supports set, get, and delete.

Examples

>>> d = DotDict({'a': 1, 'b': 2, 'c': {'ca': 31}})
>>> d.a, d.b
(1, 2)
>>> d['a']
1
>>> d['a'] = 3
>>> d.a, d['b']
(3, 2)
>>> d.c.ca, d.c['ca']
(31, 31)
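For readers curious how dot access over a dict can work, here is a minimal sketch. It is not the library's DotDict: the class name SketchDotDict is hypothetical, and wrapping nested dicts on first access is just one possible design.

```python
class SketchDotDict(dict):
    """dict with attribute-style get, set, and delete (hypothetical)."""

    def __getattr__(self, name):
        try:
            value = self[name]
        except KeyError:
            raise AttributeError(name)
        # Wrap nested dicts in place so that d.c.ca also works.
        if isinstance(value, dict) and not isinstance(value, SketchDotDict):
            value = self[name] = SketchDotDict(value)
        return value

    def __setattr__(self, name, value):
        self[name] = value

    def __delattr__(self, name):
        try:
            del self[name]
        except KeyError:
            raise AttributeError(name)


d = SketchDotDict({'a': 1, 'c': {'ca': 31}})
d.b = 2
d.c.ca = 32
del d.a
```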
as_dict()[source]

Return pure dictionary representation of self.

clear() → None. Remove all items from D.
copy() → a shallow copy of D
fromkeys(S[, v]) → New dict with keys from S and values equal to v.

v defaults to None.

get(k[, d]) → D[k] if k in D, else d. d defaults to None.
has_key(k) → True if D has a key k, else False
items() → list of D's (key, value) pairs, as 2-tuples
iteritems() → an iterator over the (key, value) items of D
iterkeys() → an iterator over the keys of D
itervalues() → an iterator over the values of D
keys() → list of D's keys
pop(k[, d]) → v, remove specified key and return the corresponding value.

If key is not found, d is returned if given, otherwise KeyError is raised

popitem() → (k, v), remove and return some (key, value) pair as a

2-tuple; but raise KeyError if D is empty.

setdefault(k[, d]) → D.get(k,d), also set D[k]=d if k not in D
update([E, ]**F) → None. Update D from dict/iterable E and F.

If E present and has a .keys() method, does: for k in E: D[k] = E[k] If E present and lacks .keys() method, does: for (k, v) in E: D[k] = v In either case, this is followed by: for k in F: D[k] = F[k]

values() → list of D's values
viewitems() → a set-like object providing a view on D's items
viewkeys() → a set-like object providing a view on D's keys
viewvalues() → an object providing a view on D's values
class parsing.library.utils.SimpleNamespace(**kwargs)[source]
parsing.library.utils.clean(dirt)[source]

Recursively clean json-like object.

list::
  • remove None elements
  • None on empty list
dict::
  • filter out None valued key, value pairs
  • None on empty dict
basestring::
  • convert unicode whitespace to ascii
  • strip extra whitespace
  • None on empty string
Parameters:dirt – the object to clean
Returns:Cleaned dict, cleaned list, cleaned string, or pass-through.
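The cleaning rules listed above can be sketched as a short recursive function. This is an illustrative Python 3 version (str instead of basestring) with a hypothetical name, and collapsing runs of whitespace to a single space is an assumption about what "strip extra whitespace" means.

```python
import re


def clean_sketch(dirt):
    """Recursively drop None/empty values and tidy whitespace in strings."""
    if isinstance(dirt, dict):
        cleaned = {}
        for key, value in dirt.items():
            value = clean_sketch(value)
            if value is not None:
                cleaned[key] = value
        return cleaned or None
    if isinstance(dirt, list):
        cleaned = [item for item in map(clean_sketch, dirt)
                   if item is not None]
        return cleaned or None
    if isinstance(dirt, str):
        # For str patterns, \s also matches Unicode whitespace such as U+00A0.
        collapsed = re.sub(r'\s+', ' ', dirt).strip()
        return collapsed or None
    return dirt  # pass-through for numbers, booleans, etc.
```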
parsing.library.utils.dict_filter_by_dict(a, b)[source]

Filter dictionary a by b.

b may be a dict or a set; its items or keys must be strings or regexes. Filters at arbitrary depth with regex matching.

Parameters:
  • a (dict) – Dictionary to filter.
  • b (dict) – Dictionary to filter by.
Returns:

Filtered dictionary

Return type:

dict

parsing.library.utils.dict_filter_by_list(a, b)[source]
parsing.library.utils.dir_to_dict(path)[source]

Recursively create nested dictionary representing directory contents.

Parameters:path (str) – The path of the directory.
Returns:Dictionary representation of the directory.
Return type:dict
parsing.library.utils.iterrify(x)[source]

Create iterable object if not already.

Will wrap str types in an extra iterable even though str is iterable.

Examples

>>> for i in iterrify(1):
...     print(i)
1
>>> for i in iterrify([1]):
...     print(i)
1
>>> for i in iterrify('hello'):
...     print(i)
hello
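A possible implementation of this behaviour, offered only as a sketch with a hypothetical name (the library's version may differ, for example by returning a generator):

```python
def iterrify_sketch(x):
    """Return x if it is a non-string iterable, else wrap it in a tuple."""
    if isinstance(x, (str, bytes)):
        return (x,)
    try:
        iter(x)
    except TypeError:
        return (x,)
    return x
```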
parsing.library.utils.make_list(x=None)[source]

Wrap in list if not list already.

If input is None, will return empty list.

Parameters:x – Input.
Returns:Input wrapped in list.
Return type:list
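The described behaviour fits in a few lines; the sketch below uses a hypothetical name and is not the library's source.

```python
def make_list_sketch(x=None):
    """Wrap x in a list unless it already is one; None becomes []."""
    if x is None:
        return []
    return x if isinstance(x, list) else [x]
```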
parsing.library.utils.pretty_json(obj)[source]

Prettify object as JSON.

Parameters:obj (dict) – Serializable object to JSONify.
Returns:Prettified JSON.
Return type:str
parsing.library.utils.safe_cast(val, to_type, default=None)[source]

Attempt to cast to specified type or return default.

Parameters:
  • val – Value to cast.
  • to_type – Type to cast to.
  • default (None, optional) – Description
Returns:

Description

Return type:

to_type
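A common way to write such a helper is to attempt the conversion and fall back on failure. The sketch below assumes that only ValueError and TypeError are swallowed, which may differ from the library's choice.

```python
def safe_cast_sketch(val, to_type, default=None):
    """Return to_type(val), or default if the conversion fails."""
    try:
        return to_type(val)
    except (ValueError, TypeError):
        return default
```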

parsing.library.utils.time24(time)[source]

Convert time to 24hr format.

Parameters:time (str) – time in reasonable format
Returns:24hr time in format hh:mm
Return type:str
Raises:ParseError – Unparseable time input.
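One way to accept "reasonable" time formats is to try a list of strptime patterns in turn. The sketch below is an assumption about the approach, not the library's code: the accepted formats are illustrative, ValueError stands in for ParseError, and the %p directive assumes an English or C locale.

```python
from datetime import datetime

# Candidate input formats, tried in order (illustrative, not exhaustive).
FORMATS = ('%H:%M', '%I:%M %p', '%I:%M%p', '%I %p', '%I%p')


def time24_sketch(time):
    """Convert a time string such as '1:30 pm' to 'hh:mm' (24hr)."""
    text = time.strip().upper().replace('.', '')
    for fmt in FORMATS:
        try:
            return datetime.strptime(text, fmt).strftime('%H:%M')
        except ValueError:
            continue
    raise ValueError('unparseable time: {0!r}'.format(time))
```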
parsing.library.utils.titlize(name)[source]

Format name into pretty title.

Will uppercase roman numerals. Will lowercase conjunctions and prepositions.

Examples

>>> titlize('BIOLOGY OF CANINES II')
Biology of Canines II
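The title rules can be approximated as below. The small-word list and the restriction to roman numerals I through X are assumptions made for this sketch; the library's word lists are not shown in this document.

```python
import re

# Words kept lowercase unless they start the title (assumed list).
SMALL_WORDS = {'a', 'an', 'and', 'at', 'but', 'by', 'for', 'in',
               'of', 'on', 'or', 'the', 'to'}
# Roman numerals I through X only, to avoid matching ordinary words.
ROMAN_NUMERAL = re.compile(r'^(?:i{1,3}|iv|vi{0,3}|ix|x)$')


def titlize_sketch(name):
    """Title-case a course name, keeping numerals upper and small words lower."""
    words = []
    for index, word in enumerate(name.lower().split()):
        if ROMAN_NUMERAL.match(word):
            words.append(word.upper())
        elif word in SMALL_WORDS and index > 0:
            words.append(word)
        else:
            words.append(word.capitalize())
    return ' '.join(words)
```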
parsing.library.utils.update(d, u)[source]

Recursive update to dictionary w/o overwriting upper levels.

Examples

>>> update({0: {1: 2, 3: 4}}, {1: 2, 0: {5: 6, 3: 7}})
{0: {1: 2, 3: 7, 5: 6}, 1: 2}
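A recursive update of this kind is typically written as follows. This is a Python 3 sketch with a hypothetical name, not the library's source: nested mappings are merged key by key, while any other value simply replaces the old one.

```python
from collections.abc import Mapping


def update_sketch(d, u):
    """Merge u into d in place, recursing into nested mappings."""
    for key, value in u.items():
        if isinstance(value, Mapping) and isinstance(d.get(key), Mapping):
            update_sketch(d[key], value)
        else:
            d[key] = value
    return d


merged = update_sketch({'a': {'x': 1, 'y': 2}},
                       {'a': {'y': 3, 'z': 4}, 'b': 5})
```

Here the nested 'a' dictionary keeps 'x', takes the new value for 'y', and gains 'z', whereas a plain dict.update would have replaced it wholesale.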

Parsing Models Documentation

class parsing.models.DataUpdate(*args, **kwargs)[source]

Stores the date/time that the school’s data was last updated.

Scheduled updates occur when digestion into the database completes.

school

CharField – the school code that was updated (e.g. jhu)

semester

ForeignKey to Semester – the semester for the update

last_updated

DateTimeField – the datetime last updated

reason

CharField – the reason it was updated (default Scheduled Update)

update_type

CharField – which field was updated

UPDATE_TYPE

tuple of tuple – Update types allowed.

COURSES

str – Update type.

EVALUATIONS

str – Update type.

MISCELLANEOUS

str – Update type.

TEXTBOOKS

str – Update type.

Scheduled Tasks