- Fixed
- Issue printing CLI object to stderr.
- Fixed
- Issue with CLI docs.
- Fixed
- Parse xml.gz sitemaps that are not actually zipped, and log the error.
- Fixed
- Ensure meta crawl data is included for URLs crawled by following links.
- Added
- Enable the `meta` parameter of the `crawl` function for arbitrary metadata, custom request headers, and third-party plugins such as Playwright (sketch below).
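A minimal sketch of the `meta` parameter, assuming the supplied dict is passed through to each request; the keys shown are illustrative, not required names:

```python
import advertools as adv

# Attach arbitrary per-request metadata to the crawl. A third-party
# plugin (e.g. scrapy-playwright) would read its own keys from here.
adv.crawl(
    url_list=['https://example.com'],
    output_file='example_crawl.jl',
    meta={'custom_key': 'custom_value'},
)
```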
- Changed
- Raise an error when supplying a custom log format without supplying fields.
- Fixed
- Make the file path for `emoji_df` relative to `advertools.__path__`.
- Allow the extension `.jsonl` for crawling.
- Added
- Enable supplying request headers in `sitemap_to_df`, contributed by @joejoinerr (example below).
- New function `crawlytics.compare` for comparing two crawls.
- New function `crawlytics.running_crawls` for getting data on currently running crawl jobs (*NIX only for now).
- New parameter `date_format` in `logs_to_df` for custom date formats.
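A minimal sketch of fetching a sitemap with custom headers, assuming the parameter added in this release is named `request_headers`; the URL and header values are illustrative:

```python
import advertools as adv

# Fetch a sitemap while sending a custom User-Agent header.
sitemap_df = adv.sitemap_to_df(
    'https://example.com/sitemap.xml',
    request_headers={'User-Agent': 'my-sitemap-bot/1.0'},
)
print(sitemap_df.head())
```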
- Changed
- Removed the `relatedSite` parameter from `serp_goog` (deprecated).
- Update emoji regex and functionality to v15.1.
- Fixed
- Use `int64` instead of `int` for YouTube count columns, contributed by @DanielP77
- Fixed
- Use `pd.NA` instead of `np.nan` for empty values in `url_to_df`.
- Changed
- Use a different XPath expression for body_text while crawling.
- Changed
- Allow `sitemap_to_df` to work on offline sitemaps.
- Fixed
- Preserve the order of supplied URLs in the output of `url_to_df`.
- Added
- New module `crawlytics` for analyzing crawl DataFrames, with functions to analyze crawls (`images`, `redirects`, and `links`) as well as functions to handle large files (`jl_to_parquet`, `jl_subset`, `parquet_columns`). See the sketch below.
- New `encoding` option for `logs_to_df`.
- Option to save the output of `url_to_df` to a parquet file.
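A brief sketch of the `crawlytics` workflow using the function names listed above; the file names are illustrative:

```python
import advertools as adv
import pandas as pd

# Compress a large crawl file to parquet, then read it back.
adv.crawlytics.jl_to_parquet('example_crawl.jl', 'example_crawl.parquet')
crawl_df = pd.read_parquet('example_crawl.parquet')

# Summarize redirects, images, and links found in the crawl.
redirect_df = adv.crawlytics.redirects(crawl_df)
image_df = adv.crawlytics.images(crawl_df)
link_df = adv.crawlytics.links(crawl_df)
```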
- Changed
- Removed the requirement to delete existing log output and error files; the function now overwrites them if they exist.
- Autothrottling is enabled by default in `crawl_headers` to minimize being blocked.
- Fixed
- Always get absolute path for img src while crawling.
- Handle NA src attributes when extracting images.
- Change `fillna(method="ffill")` to `ffill` for `url_to_df`.
- Added
- Initial experimental functionality for `crawl_images`.
- Changed
- Enable autothrottling by default for `crawl_headers`.
- Fixed
- Make `img` attributes consistent in length, and support all attributes.
- Changed
- Allow optional trailing space in log files (contributed by @andypayne)
- Fixed
- Replace newlines with spaces while parsing JSON-LD, which was causing errors in some cases.
- Added
- Crawling recipe for how to use `DEFAULT_REQUEST_HEADERS` to change the default headers (sketch below).
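A minimal sketch of that recipe, assuming the headers are passed through Scrapy's `DEFAULT_REQUEST_HEADERS` setting via the crawl function's `custom_settings` parameter; the header values are illustrative:

```python
import advertools as adv

# Override the default request headers for all requests in this crawl.
adv.crawl(
    url_list=['https://example.com'],
    output_file='headers_crawl.jl',
    custom_settings={
        'DEFAULT_REQUEST_HEADERS': {
            'Accept-Language': 'de-DE,de;q=0.9',
            'User-Agent': 'my-crawler/1.0',
        }
    },
)
```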
- Changed
- Split long lists of URLs while crawling, regardless of the `follow_links` parameter.
- Fixed
- Clarify that while authenticating for Twitter only `app_key` and `app_secret` are required, with the option to provide `oauth_token` and `oauth_token_secret` if/when needed.
- Added
- Command line interface with most functions
- Make documentation interactive for most pages using `thebe-sphinx`
- Changed
- Use `np.nan` wherever there are missing values in `url_to_df`
- Fixed
- Don't remove double quotes from etags when downloading XML sitemaps
- Replace instances of `pd.DataFrame.append`, which is deprecated, with `pd.concat`.
- Replace empty values with `np.nan` for the size column in `logs_to_df`
- Added
- New function `crawl_headers`: A crawler that only makes HEAD requests to a known list of URLs (example below).
- New function `reverse_dns_lookup`: A way to get host information for a large list of IP addresses concurrently.
- New options for crawling: `exclude_url_params`, `include_url_params`, `exclude_url_regex`, and `include_url_regex` for controlling which links to follow while crawling.
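A minimal sketch of the two new functions; the URLs and IP addresses are illustrative:

```python
import advertools as adv

# Check status codes and response headers without downloading bodies.
adv.crawl_headers(
    ['https://example.com', 'https://example.com/about'],
    output_file='headers.jl',
)

# Concurrently resolve host information for a list of IP addresses.
hosts_df = adv.reverse_dns_lookup(['8.8.8.8', '1.1.1.1'])
```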
- Fixed
- Any `custom_settings` options given to the `crawl` function that were defined using a dictionary can now be set without issues. There was an issue if those options were not strings.
- Changed
- The `skip_url_params` option was removed and replaced with the more versatile `exclude_url_params`, which accepts either `True` or a list of URL parameters to exclude while following links (example below).
- Fixed
- Crawler stops when provided with bad URLs in list mode.
- Added
- New function `logs_to_df`: Convert a log file of any non-JSON format into a pandas DataFrame and save it to a parquet file. This also compresses the file to a much smaller size (example below).
- Crawler extracts all available `img` attributes: 'alt', 'crossorigin', 'height', 'ismap', 'loading', 'longdesc', 'referrerpolicy', 'sizes', 'src', 'srcset', 'usemap', and 'width' (excluding global HTML attributes like `style` and `draggable`).
- New parameter for the `crawl` function, `skip_url_params`: Defaults to False, consistent with previous behavior, with the ability to not follow/crawl links containing any URL parameters.
- New column for `url_to_df`, "last_dir": Extract the value in the last directory for each of the URLs.
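A minimal sketch of `logs_to_df`, assuming a server log in the standard "combined" format; the file names are illustrative:

```python
import advertools as adv
import pandas as pd

# Parse a raw access log into a compressed parquet file; malformed
# lines go to a separate errors file instead of stopping the parse.
adv.logs_to_df(
    log_file='access.log',
    output_file='access_logs.parquet',
    errors_file='log_errors.txt',
    log_format='combined',
)
logs_df = pd.read_parquet('access_logs.parquet')
```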
- Changed
- Query parameter columns in the `url_to_df` DataFrame are now sorted by how full the columns are (the percentage of values that are not NA)
- Added
- The nofollow attribute for nav, header, and footer links.
- Fixed
- Timeout error while downloading robots.txt files.
- Make extracting nav, header, and footer links consistent with all links.
- Added
- New parameter `recursive` for `sitemap_to_df` to control whether to get all sub-sitemaps (default), or to only get the current (sitemapindex) one.
- New columns for `sitemap_to_df`: `sitemap_size_mb` (1 MB = 1,024 x 1,024 bytes), `sitemap_last_modified`, and `etag` (if available).
- Option to request multiple robots.txt files with `robotstxt_to_df`.
- Option to save downloaded robots DataFrame(s) to a file with `robotstxt_to_df` using the new parameter `output_file` (example below).
- Two new columns for `robotstxt_to_df`: `robotstxt_last_modified` and `etag` (if available).
- Raise `ValueError` in `crawl` if `css_selectors` or `xpath_selectors` contain any of the default crawl column headers.
- New XPath code recipes for custom extraction.
- New function `crawllogs_to_df`, which converts crawl logs to a DataFrame, provided they were saved while using the `crawl` function.
- New columns in `crawl`: viewport, charset, all h headings (whichever is available), and nav, header and footer links and text, if available.
- Crawl errors don't stop crawling anymore, and the error message is included in the output file under a new errors and/or jsonld_errors column(s).
- In case of having JSON-LD errors, errors are reported in their respective column, and the remainder of the page is scraped.
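A short sketch of requesting several robots.txt files at once and saving them; the URLs and file name are illustrative:

```python
import advertools as adv

# Download multiple robots.txt files in one call and save them to a
# jsonlines file (one row per robots.txt rule).
adv.robotstxt_to_df(
    ['https://example.com/robots.txt',
     'https://www.example.org/robots.txt'],
    output_file='robots_files.jl',
)
```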
- Changed
- Removed the column prefix `resp_meta_` from columns containing it
- Redirect URLs and reasons are separated by '@@' for consistency with other multiple-value columns
- Links extracted while crawling are no longer unique (all links are extracted).
- Emoji data updated with v13.1.
- Heading tags are scraped even if they are empty, e.g. <h2></h2>.
- Default user agent for crawling is now advertools/VERSION.
- Fixed
- Handle sitemap index files that contain links to themselves, with an error message included in the final DataFrame
- Error in robots.txt files caused by comments preceded by whitespace
- Zipped robots.txt files causing a parsing issue
- Crawl issues on some Linux systems when providing a long list of URLs
- Removed
- Columns from the `crawl` output: url_redirected_to, links_fragment
- Added
- New function `knowledge_graph` for querying Google's API (example below)
- Faster `sitemap_to_df` with threads
- New parameter `max_workers` for `sitemap_to_df` to determine how fast it could go
- New parameter `capitalize_adgroups` for `kw_generate` to determine whether to keep ad groups as is, or set them to title case (the default)
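A minimal sketch of querying the Knowledge Graph API, assuming a valid API key; the query is illustrative:

```python
import advertools as adv

# Query Google's Knowledge Graph API (requires an API key).
kg_df = adv.knowledge_graph(key='YOUR_GOOGLE_API_KEY', query='python')
print(kg_df.head())
```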
- Fixed
- Remove restrictions on the number of URLs provided to `crawl`, assuming `follow_links` is set to False (list mode)
- JSON-LD issue breaking crawls when it's invalid (now skipped)
- Removed
- Deprecate the `youtube.guide_categories_list` (no longer supported by the API)
- Added
- JSON-LD support in crawling. If available on a page, JSON-LD items will have special columns, and multiple JSON-LD snippets will be numbered for easy filtering
- Changed
- Stricter parsing for rel attributes, making sure they are in link elements as well
- Date column names for `robotstxt_to_df` and `sitemap_to_df` unified as "download_date"
- Numbering OG, Twitter, and JSON-LD elements where multiple are present on the same page now follows a unified approach: no numbering for the first element, and numbers starting with "1" from the second element on: "element", "element_1", "element_2", etc.
- Added
- New features for the `crawl` function:
  - Extract canonical tags if available
  - Extract alternate href and hreflang tags if available
  - Open Graph data: "og:title", "og:type", "og:image", etc.
  - Twitter cards data: "twitter:site", "twitter:title", etc.
- Fixed
- Minor fixes to `robotstxt_to_df`:
  - Allow whitespace in fields
  - Allow case-insensitive fields
- Changed
- `crawl` now only supports `output_file` with the extension ".jl"
- `word_frequency` drops wtd_freq and rel_value columns if num_list is not provided
- Added
- New function `url_to_df`, splitting URLs into their components and into a DataFrame (example below)
- Slight speed up for `robotstxt_test`
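A minimal sketch of `url_to_df` with illustrative URLs:

```python
import advertools as adv

# Split URLs into scheme, netloc, path, query, and directory columns.
urls_df = adv.url_to_df([
    'https://example.com/products/shoes?color=blue',
    'https://example.com/blog/2020/01/post-title',
])
print(urls_df.head())
```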
- Added
- New function `robotstxt_test`, testing URLs and whether they can be fetched by certain user-agents (example below)
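A minimal sketch of `robotstxt_test`, passing the robots.txt URL, the user agents, and the URLs to check positionally (parameter names omitted since they may differ); values are illustrative:

```python
import advertools as adv

# Test which URLs each user agent may fetch under this robots.txt.
test_df = adv.robotstxt_test(
    'https://example.com/robots.txt',
    ['Googlebot', 'my-bot'],
    ['/admin/', '/products/shoes'],
)
```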
- Changed
- Documentation main page relayout, grouping of topics, & sidebar captions
- Various documentation clarifications and new tests
- Added
- User-Agent info to requests getting sitemaps and robotstxt files
- CSS/XPath selectors support for the crawl function
- Support for custom spider settings with a new parameter `custom_settings` (example below)
- Fixed
- Update the supported search operators and values for CSE, which had changed
- Changed
- Links are better handled, and new output columns are available: `links_url`, `links_text`, `links_fragment`, `links_nofollow`
- `body_text` extraction is improved by containing <p>, <li>, and <span> elements
- Added
- New function `crawl` for crawling and parsing websites (example below)
- New function `robotstxt_to_df` for downloading robots.txt files into DataFrames
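A minimal sketch of the new crawler; the URL and output file are illustrative:

```python
import advertools as adv
import pandas as pd

# Crawl a site, following internal links, and read the results.
adv.crawl('https://example.com', 'site_crawl.jl', follow_links=True)
crawl_df = pd.read_json('site_crawl.jl', lines=True)
```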
- Added
- Ability to specify a robots.txt file for `sitemap_to_df`
- Ability to retrieve any kind of sitemap (news, video, or images)
- Errors column added to the returned DataFrame if any errors occur
- A new `sitemap_downloaded` column showing the datetime of getting the sitemap
- Fixed
- Logging issue causing `sitemap_to_df` to log the same action twice
- Issue preventing URLs not ending with xml or gz from being retrieved
- Correct sitemap URL now shows in the `sitemap` column
- Added
- New function `sitemap_to_df` imports an XML sitemap into a `DataFrame` (example below)
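A minimal sketch with an illustrative sitemap URL:

```python
import advertools as adv

# Import an XML sitemap (or sitemap index) into a DataFrame.
sitemap_df = adv.sitemap_to_df('https://example.com/sitemap.xml')
print(sitemap_df.head())
```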
- Changed
- Column query_time is now named queryTime in the youtube functions
- Handle json_normalize import from pandas based on pandas version
- Added
- New module `youtube`, connecting to all GET requests in the API
- New function `extract_numbers`
- New function `emoji_search` (example below)
- New variable `emoji_df`, containing all emoji as a DataFrame
- Changed
- Emoji database updated to v13.0
- serp_goog with expanded pagemap and metadata
- Fixed
- serp_goog errors, some parameters not appearing in result df
- extract_numbers issue when providing dash as a separator in the middle
- Added
- New function extract_exclamations very similar to extract_questions
- New function extract_urls, also counts top domains and top TLDs
- New keys to extract_emoji; top_emoji_categories & top_emoji_sub_categories
- Groups and sub-groups to emoji db
- Changed
- Emoji regex updated
- Simpler extraction of Spanish questions
- Fixed
- Missing `__init__` imports.
- Added
- New `extract_` functions (example below):
  - A generic `extract` function, used by all the others, which takes arbitrary regex to extract text.
  - `extract_questions` to get question mark statistics, as well as the text of the questions asked.
  - `extract_currency` shows text that has currency symbols in it, as well as the surrounding text.
  - `extract_intense_words` gets statistics about, and extracts, words with any character repeated three or more times, indicating an intense feeling (positive or negative).
- New function `word_tokenize`:
  - Used by `word_frequency` to get tokens of 1-, 2-, or 3-word phrases (or more).
  - Splits a list of text into tokens of a specified number of words each.
- New stop-words from the `spaCy` package:
  - Current: Arabic, Azerbaijani, Danish, Dutch, English, Finnish, French, German, Greek, Hungarian, Italian, Kazakh, Nepali, Norwegian, Portuguese, Romanian, Russian, Spanish, Swedish, Turkish.
  - New: Bengali, Catalan, Chinese, Croatian, Hebrew, Hindi, Indonesian, Irish, Japanese, Persian, Polish, Sinhala, Tagalog, Tamil, Tatar, Telugu, Thai, Ukrainian, Urdu, Vietnamese.
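A quick sketch of the new functions with illustrative text:

```python
import advertools as adv

posts = ['How much does it cost?', 'Sooo goooood!', 'Prices start at $10']

# Question statistics and the questions themselves.
questions = adv.extract_questions(posts)

# Words with characters repeated three or more times.
intense = adv.extract_intense_words(posts)

# Two-word tokens for phrase-level analysis.
tokens = adv.word_tokenize(posts, phrase_len=2)
```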
- Changed
- `word_frequency` takes new parameters (example below):
  - `regex`: defaults to words, but can be changed to anything, e.g. `\S+` to split words and keep punctuation.
  - `sep`: no longer used as an option; the above regex can be used instead.
  - `num_list`: now optional, defaulting to counts of 1 each if not provided. Useful for counting abs_freq only if data are not available.
  - `phrase_len`: the number of words in each split token. Defaults to 1 and can be set to 2 or higher. This helps in analyzing phrases as opposed to words.
- Parameters supplied to `serp_goog` appear at the beginning of the result df
- `serp_youtube` now contains nextPageToken to make paginating requests easier
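A minimal sketch of the updated parameters with illustrative data:

```python
import advertools as adv

titles = ['white shoes', 'white socks', 'blue socks']
impressions = [100, 80, 60]

# Weighted two-word phrase frequencies; num_list is now optional.
freq_df = adv.word_frequency(titles, impressions, phrase_len=2)
```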
- New function `extract_words` to extract an arbitrary set of words
- Minor updates
- `ad_from_string` slots argument reflects new text ad lengths
- hashtag regex improved
- Fix minor bugs
- Handle Twitter search queries with 0 results in final request
- Fix minor bugs
- Properly handle requests for >50 items (serp_youtube)
- Rewrite test for _dict_product
- Fix issue with string printing error msg
- Fix minor bugs
- _dict_product implemented with lists
- Missing keys in some YouTube responses
- New function `serp_youtube`
  - Query the YouTube API for videos, channels, or playlists
  - Multiple queries (product of parameters) in one function call
  - Response looping and merging handled; one DataFrame
- `serp_goog` returns Google's original error messages
- Twitter responses with entities get the entities extracted, each in a separate column
- New function `serp_goog` (based on Google CSE)
  - Query Google search and get the result in a DataFrame
  - Make multiple queries / requests in one function call
  - All responses merged in one DataFrame
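A minimal sketch of `serp_goog`, assuming a CSE ID and API key; the queries are illustrative:

```python
import advertools as adv

# Query Google Custom Search; multiple q values are combined into one
# DataFrame (product of parameters).
serp_df = adv.serp_goog(
    q=['running shoes', 'trail shoes'],
    cx='YOUR_CSE_ID',
    key='YOUR_API_KEY',
)
```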
- `twitter.get_place_trends` results are ranked by town and country
- New Twitter module based on twython
- Wraps 20+ functions for getting Twitter API data
- Gets data in a pandas DataFrame
- Handles looping over requests higher than the defaults
- Tested on Python 3.7
- Search engine marketing cheat sheet.
- New set of extract_ functions with summary stats for each:
- extract_hashtags
- extract_mentions
- extract_emoji
- Tests and bug fixes
- New set of kw_<match-type> functions.
- Full testing and coverage.
- First release on PyPI.
- Functions available:
- ad_create: create a text ad by placing words in placeholders
- ad_from_string: split a long string into shorter strings that fit into given slots
- kw_generate: generate keywords from lists of products and words
- url_utm_ga: generate a UTM-tagged URL for Google Analytics tracking
- word_frequency: measure the absolute and weighted frequency of words in a collection of documents
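A quick sketch of two of these functions with illustrative inputs:

```python
import advertools as adv

# Generate keyword combinations for SEM campaigns.
kw_df = adv.kw_generate(
    products=['shoes', 'socks'],
    words=['buy', 'best', 'price'],
    match_types=['Exact', 'Phrase'],
)

# Create text ads by filling a placeholder, with a fallback value used
# whenever a replacement makes the ad exceed max_len.
ads = adv.ad_create('Best Deals on {}', ['Shoes', 'Socks'], 'Apparel',
                    max_len=30)
```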