Redips

Redips is a Python-based web crawler, compatible with Python 2.7.x.

Redips can be used to generate:

An index that maps every word Redips encounters to the set of URLs on which it finds that word.
A graph that maps every URL Redips encounters to the list of URLs linked from that page.

Index: {keyword1 : set([URL1, URL2,...]), keyword2 : set([URL3, URL4,...]), ...}

Graph: {URL1 : [outlink1, outlink2,...], URL2 : [outlink3, outlink4,...], ...}
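As an illustration of the two shapes above, here is a minimal sketch that builds an index and a graph from a few in-memory pages. It is not Redips's own code: the function name build_index_and_graph, the pages input format, and the whitespace-based word splitting are all assumptions made for the example.

```python
from collections import defaultdict


def build_index_and_graph(pages):
    """Build (index, graph) from a hypothetical pages dict.

    pages maps each URL to a (text, outlinks) tuple. This input format is
    an assumption for illustration, not something Redips defines.
    """
    index = defaultdict(set)
    graph = {}
    for url, (text, outlinks) in pages.items():
        # Index: every whitespace-separated word points at the URLs it appears on.
        for word in text.split():
            index[word].add(url)
        # Graph: every URL points at the list of links found on that page.
        graph[url] = list(outlinks)
    return dict(index), graph


pages = {
    'http://a.example': ('hello world', ['http://b.example']),
    'http://b.example': ('hello again', []),
}
index, graph = build_index_and_graph(pages)
```

Here index['hello'] holds both URLs, while graph['http://a.example'] lists the single outlink found on that page.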

To use:

>>> from redips import Redips
>>> redips = Redips('http://github.com')
>>> redips.crawl()

The Redips constructor takes two arguments, seed_url and pickle, both of which are optional:

>>> from redips import Redips
>>> redips = Redips(seed_url='http://foo.bar', pickle='abc.pickle')

URLs can also be added to the list of URLs to be crawled:

>>> redips.add_url('http://google.com')
>>> redips.to_crawl
['http://google.com']
>>>
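To show how a list of URLs to crawl typically drives a crawler, here is a minimal breadth-first sketch over an in-memory link map. It is an assumption about the general technique, not a description of how Redips orders or de-duplicates URLs; the names crawl_frontier, get_outlinks and max_pages are hypothetical.

```python
def crawl_frontier(to_crawl, get_outlinks, max_pages=100):
    """Visit URLs breadth-first, starting from the to_crawl list.

    get_outlinks is a hypothetical callable returning the links on a page;
    a real crawler would fetch and parse the page here.
    """
    crawled = []
    seen = set(to_crawl)
    while to_crawl and len(crawled) < max_pages:
        url = to_crawl.pop(0)          # take the oldest pending URL
        crawled.append(url)
        for link in get_outlinks(url):
            if link not in seen:       # never queue the same URL twice
                seen.add(link)
                to_crawl.append(link)
    return crawled


links = {
    'http://a.example': ['http://b.example', 'http://c.example'],
    'http://b.example': ['http://a.example'],
    'http://c.example': [],
}
order = crawl_frontier(['http://a.example'], lambda url: links.get(url, []))
```

The max_pages cap is included only so the sketch always terminates on large link maps.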

To save the state of a Redips object:

>>> redips.save()

To load a previously pickled Redips object:

>>> from redips import *
>>> redips = load('redips.pickle') # Or whichever pickle file your crawler is saved in
>>> redips.crawl() # Resume crawling from where it left off
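For readers unfamiliar with pickling, the save and load steps can be pictured with the standard pickle module. This is a sketch under assumptions: the helper names save_state and load_state and the plain-dict state are hypothetical, and Redips may store its data differently.

```python
import os
import pickle
import tempfile


def save_state(state, path):
    """Serialize the crawler state to a pickle file."""
    with open(path, 'wb') as f:
        pickle.dump(state, f)


def load_state(path):
    """Read a previously saved crawler state back from disk."""
    with open(path, 'rb') as f:
        return pickle.load(f)


# Hypothetical state: a plain dict standing in for a crawler object.
state = {
    'to_crawl': ['http://foo.bar'],
    'index': {'hello': set(['http://foo.bar'])},
    'graph': {'http://foo.bar': []},
}
path = os.path.join(tempfile.mkdtemp(), 'state.pickle')
save_state(state, path)
restored = load_state(path)
```

As with any pickle file, only load files you created yourself, since unpickling untrusted data can execute arbitrary code.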

To specify a file other than the default redips.pickle file for pickling the crawler:

>>> from redips import *
>>> redips = Redips(pickle='foo.pickle')

To crawl a single page:

>>> redips.crawl_page('http://foo.bar')
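Crawling a single page generally means extracting its words and its links. The sketch below does this for an HTML string using the standard library parser; it is an illustration of the idea, not Redips's implementation, and the class name LinkAndTextParser is hypothetical. No network request is made.

```python
try:
    from html.parser import HTMLParser   # Python 3
except ImportError:
    from HTMLParser import HTMLParser    # Python 2.7


class LinkAndTextParser(HTMLParser):
    """Collect href values of <a> tags and whitespace-separated words."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.links = []
        self.words = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.links.append(value)

    def handle_data(self, data):
        self.words.extend(data.split())


parser = LinkAndTextParser()
parser.feed('<p>hello <a href="http://foo.bar/next">next page</a></p>')
parser.close()
```

After parsing, parser.links would feed the graph entry for the page and parser.words would feed the index.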

To access the index:

>>> redips.get_index()

To access the graph:

>>> redips.get_graph()
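Once you have the index and graph, they are ordinary Python dictionaries and can be queried directly. The following sketch assumes the shapes documented at the top of this README; the helper names search and inlinks, and the sample data, are hypothetical.

```python
def search(index, words):
    """Return the set of URLs that contain every word in words."""
    results = None
    for word in words:
        urls = index.get(word, set())
        results = set(urls) if results is None else results & urls
    return results or set()


def inlinks(graph):
    """Invert the graph: map each URL to the set of pages linking to it."""
    incoming = {}
    for url, outlinks in graph.items():
        for target in outlinks:
            incoming.setdefault(target, set()).add(url)
    return incoming


# Sample data shaped like the index and graph described above.
index = {
    'hello': set(['http://a.example', 'http://b.example']),
    'world': set(['http://a.example']),
}
graph = {
    'http://a.example': ['http://b.example'],
    'http://b.example': ['http://a.example', 'http://c.example'],
}
hits = search(index, ['hello', 'world'])
incoming = inlinks(graph)
```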

To reset the list of URLs to crawl:

>>> redips.reset_to_crawl()

To merge the index with another index:

>>> redips.merge_index(index)

To merge the graph with another graph:

>>> redips.merge_graph(graph)

To merge the data of another Redips object with your Redips object:

>>> redips.merge(anotherRedips)
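Given the index and graph shapes described earlier, merging can be thought of as a per-key union. The sketch below is one plausible reading, not Redips's actual behavior: the function names merge_indexes and merge_graphs are hypothetical, and the choices to return new dictionaries and to drop duplicate outlinks are assumptions.

```python
def merge_indexes(base, other):
    """Union the URL sets of two indexes, keyword by keyword."""
    merged = dict((word, set(urls)) for word, urls in base.items())
    for word, urls in other.items():
        merged.setdefault(word, set()).update(urls)
    return merged


def merge_graphs(base, other):
    """Combine outlink lists of two graphs, keeping order and skipping repeats."""
    merged = dict((url, list(links)) for url, links in base.items())
    for url, links in other.items():
        existing = merged.setdefault(url, [])
        for link in links:
            if link not in existing:
                existing.append(link)
    return merged


index_a = {'hello': set(['http://a.example'])}
index_b = {'hello': set(['http://b.example']), 'world': set(['http://b.example'])}
merged_index = merge_indexes(index_a, index_b)

graph_a = {'http://a.example': ['http://b.example']}
graph_b = {'http://a.example': ['http://b.example', 'http://c.example'],
           'http://b.example': []}
merged_graph = merge_graphs(graph_a, graph_b)
```

Both helpers copy their inputs, so index_a and graph_a are left unchanged.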
