cache

Caching utilities

feupy.cache.cache

A persistent dictionary-like object whose values are structured in the following way:

{
url0 : (timeout0, html0),
url1 : (timeout1, html1),
}

Here url is a string, timeout is an int or a float (the “due by date”, in seconds since the epoch), and html is a string.
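As a rough illustration of this layout, the sketch below builds a plain in-memory dict with the same (timeout, html) shape and checks whether an entry is still valid. The names example_cache and is_valid, the example.org urls, and the strict "timeout is later than now" comparison are assumptions made for the example, not part of feupy.

```python
import time

# Hypothetical in-memory stand-in for the persistent cache:
# each url maps to (due-by date in seconds since the epoch, html).
example_cache = {
    "https://example.org/a": (time.time() + 3600, "<html>a</html>"),  # due in an hour
    "https://example.org/b": (time.time() - 1, "<html>b</html>"),     # already due
}


def is_valid(cache, url, now=None):
    """Return True if url is cached and its due-by date is still in the future."""
    if now is None:
        now = time.time()
    if url not in cache:
        return False
    timeout, _html = cache[url]
    # Assumption: an entry whose due-by date equals "now" counts as timed out.
    return timeout > now
```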

Type: shelve.DbfilenameShelf or None (initially None; see load_cache())
feupy.cache.load_cache(flag='c', path=None)

Loads the cache from disk and stores it in the variable cache. If cache is not None, the function does nothing.

Parameters:

Note

Unless you intend to call load_cache() with non-default arguments, you don’t have to call this function. The other functions in this module check whether or not the cache has been loaded and will load the cache for you.

Example:

from feupy import cache
cache.load_cache()
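
The load-once behaviour described above can be sketched with the standard shelve module. This is an illustration under stated assumptions, not feupy's actual code: load_cache_sketch, _state and the default file location are hypothetical names chosen for the example.

```python
import os
import shelve
import tempfile

# Hypothetical stand-in for the module-level variable feupy.cache.cache.
_state = {"cache": None}


def load_cache_sketch(flag="c", path=None):
    """Open the shelf once; later calls return the already-open shelf."""
    if _state["cache"] is not None:
        # Already loaded: do nothing, whatever arguments were passed.
        return _state["cache"]
    if path is None:
        # Assumption: the real default location is not stated in this section.
        path = os.path.join(tempfile.gettempdir(), "feupy_cache_sketch")
    # flag="c" opens the shelf for reading and writing, creating it if needed.
    _state["cache"] = shelve.open(path, flag=flag)
    return _state["cache"]
```

Because the second and later calls return early, non-default arguments only take effect on the first call, which matches the note above.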
feupy.cache.get_html(url, params={}, use_cache=True)

More or less functionally equivalent to requests.get(url, params).text, with the added benefit of a persistent cache with customizable html treatment and timeouts, depending on the url. If the result is already in cache and is valid, the function will just return the value from the cache instead of making a web request.

Parameters:
  • url (str) – The url of the html to be fetched
  • params (dict, optional) – The query portion of the url, should you want to include a query
  • use_cache (bool, optional) – If this value is set to True, the cache will be checked for the url. If the url is not found in the cache keys or has timed out, the function will get the html from the web, remove scripts and styles from the html, store it in cache, and finally return the html. Otherwise, if it’s set to False, the cache will not be checked
Returns:

The html of the requested page, as a string.
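
The use_cache logic described in the parameters can be sketched as follows. This is a minimal sketch, not feupy's implementation: get_html_sketch, the fetch callable, the fixed lifetime, and the crude regular expression used to strip scripts and styles are all assumptions. The section also does not say whether a fetch made with use_cache=False is stored; here it is not.

```python
import re
import time

# Crude, illustrative removal of <script> and <style> elements.
_SCRIPT_OR_STYLE = re.compile(r"<(script|style)\b.*?</\1\s*>", re.DOTALL | re.IGNORECASE)


def get_html_sketch(url, cache, fetch, lifetime=3600.0, use_cache=True, now=None):
    """Return the html for url, consulting cache first when use_cache is True.

    cache maps url -> (timeout, html); fetch(url) is a hypothetical callable
    that performs the web request and returns the raw html.
    """
    now = time.time() if now is None else now
    if use_cache:
        entry = cache.get(url)
        if entry is not None and entry[0] > now:
            return entry[1]  # cached and not yet timed out
    html = _SCRIPT_OR_STYLE.sub("", fetch(url))
    if use_cache:
        # Store the cleaned html together with its new due-by date.
        cache[url] = (now + lifetime, html)
    return html
```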

Note

The curricular units’ pages, along with the students’ and teachers’ htmls, are modified to reduce their memory footprint.

Note

If you know beforehand that you are going to make a large number of requests, you should probably call get_html_async() first to populate the cache.

feupy.cache.reset()

Removes all entries from the cache.

feupy.cache.remove_invalid_entries(urls=None)

Removes all the cache entries in urls that have timed out.

Parameters: urls (iterable(str) or None, optional) – The urls to be checked. If this argument is omitted, all urls in the cache are checked
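
A minimal sketch of this pruning step over a dict-like cache of (timeout, html) tuples follows. The function name, the explicit cache and now arguments, and the choice to silently skip urls that are not in the cache are assumptions made for illustration.

```python
import time


def remove_invalid_entries_sketch(cache, urls=None, now=None):
    """Delete the entries in urls (default: every entry) that have timed out."""
    now = time.time() if now is None else now
    # Copy the keys first so the cache can be modified while looping.
    candidates = list(cache.keys()) if urls is None else list(urls)
    for url in candidates:
        # Assumption: urls that are not cached are ignored rather than raising.
        if url in cache and cache[url][0] <= now:
            del cache[url]
```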
feupy.cache.get_html_async(urls, n_workers=10, use_cache=True)

A roughly asynchronous counterpart to get_html().

Takes a list (or any iterable) of urls and returns a corresponding generator of htmls. The htmls have their scripts and styles removed and are stored in cache.

Parameters:
  • urls (iterable(str)) – The urls to be accessed
  • n_workers (int, optional) – The number of workers.
  • use_cache (bool, optional) – Attempts to use the cache if True, otherwise it will fetch from sigarra
Returns:

A generator of str
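
One way to picture a worker-based fetch that yields htmls in the same order as the input urls is a thread pool. The sketch below assumes that approach; the section does not say which concurrency mechanism feupy uses, and get_html_async_sketch and the fetch callable are hypothetical names. Caching and script removal are left out to keep the example short.

```python
from concurrent.futures import ThreadPoolExecutor


def get_html_async_sketch(urls, fetch, n_workers=10):
    """Yield fetch(url) for each url, running up to n_workers fetches at once."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # Executor.map yields results in the order of the input urls,
        # regardless of which request finishes first.
        yield from pool.map(fetch, urls)
```

Since the function is a generator, no request is made until the result is first iterated.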