How To Extract Schema Mark-Up with Screaming Frog (Microdata & JSON-LD)

Can Screaming Frog Check a Website/Web Page for Instances of Schema Mark-up?

That was the question posed to me earlier this week. The short answer is yes, you can, and it’s actually extremely simple.

For those who may be unaware, in July 2015 Screaming Frog launched a new feature in version 4.0 called ‘Custom Extraction’, which allows webmasters to extract custom HTML using Regex, XPath or CSSPath rules. This feature has countless uses, from scraping content to identifying specific code declarations, such as checking for the presence of a Google Analytics tracking ID on all pages of a site.

Luckily, it can also be used to check a website for schema.org mark-up such as ‘LocalBusiness’, ‘Product’, ‘Event’ and so forth.

Why Would You Want To?

There could be a number of reasons why you would want to use Screaming Frog to extract schema mark-up.

Let’s say you work for an agency and have a technical audit for a new client. You may wish to use extraction methods to understand if the client is currently utilising any form of schema mark-up. Of course you could get this information through the Google Search Console property for your client, but let’s ignore that for the time being! 😉

Alternatively, you may wish to use extraction to better understand the type of mark-up in operation (JSON-LD vs RDFa vs Microdata) or to identify particular schema properties which are perhaps unintentionally omitted from certain pages.

There are plenty of other reasons why you may want to do this. Most people will rely on the Search Console structured data report, but where’s the fun in that?

Before You Begin

The first thing to be aware of is that schema.org mark-up can be written in three different syntaxes:

  1. Microdata
  2. RDFa
  3. JSON-LD

The different types of mark-up utilise different syntax, so we’ll need a few extraction variations to ensure we extract the correct type of mark-up.

At this stage you may or may not know which type of schema mark-up is in use on your (or your client’s) website. Either way, the following formulas will let you determine which type of schema mark-up is defined on a web page, the properties defined by that mark-up, and even the values they contain.

Note: I’ve never seen anyone utilise RDFa. In my experience it’s not that common, so I’ve blissfully ignored it! 😛
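If you want a quick feel for which syntax a page uses before configuring anything, a rough check can be run outside Screaming Frog. The sketch below is only an illustration: the function name `detect_syntaxes` and the marker patterns are my own assumptions, and Python's `re` module stands in for whatever tool you prefer. It simply looks for the tell-tale attributes of each syntax.

```python
import re

# Assumed tell-tale markers for each syntax; these are heuristics, not a validator.
MARKERS = {
    "JSON-LD": re.compile(r'<script[^>]+type=["\']application/ld\+json["\']', re.I),
    "Microdata": re.compile(r'<[^>]+\sitemtype\s*=', re.I),
    "RDFa": re.compile(r'<[^>]+\s(?:typeof|vocab)\s*=', re.I),
}


def detect_syntaxes(html):
    """Return the names of the schema syntaxes whose markers appear in the HTML."""
    return [name for name, pattern in MARKERS.items() if pattern.search(html)]


if __name__ == "__main__":
    # Hypothetical snippet used purely for illustration.
    page = '<div itemscope itemtype="http://schema.org/Product"></div>'
    print(detect_syntaxes(page))
```

A hit here only tells you the syntax appears to be present; the extraction rules further down are what actually pull the data out.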

Schema.org Extraction Formulas

Depending on your objectives, you can utilise one of the following formulas or a combination of each formula to extract the data which you require. I’ve provided 5 formulas to extract the following data:

  1. All JSON-LD Data – This will extract the entire JSON-LD schema coding
  2. All ItemType Declarations – This will extract all ‘ItemType’ declarations, allowing you to understand the types of schema mark-up in operation
  3. All ItemProp Declarations – This will extract all ‘ItemProp’ declarations, allowing you to understand which properties of the parent schema type have been used
  4. Single JSON-LD Data Item – Useful if you want to extract a single JSON-LD data item (such as the ‘ProductID’ defined via the ‘Product’ schema)
  5. All MicroData – The HTML beneath any ‘ItemType’ declaration (note: this has some caveats!)

These are likely the most common uses of Screaming Frog for schema data extraction and will allow you to extract an array of data.

In order to configure the extraction via Screaming Frog, open Screaming Frog (obviously!) and go to ‘Configuration’ -> ‘Custom’ -> ‘Extraction’:

[Image: Screaming Frog Custom Extraction]

You’ll be presented with a screen which allows you to configure a number of custom extractors using Regex, XPath or CSSPath rules. All fields will be ‘Inactive’ by default:

[Image: Custom Extraction (Inactive)]

We’re going to configure one or all of the following extraction rules, depending on your objectives and the data you wish to extract. The rules and their configuration settings are as follows.

Extract All JSON-LD Data

The following rule will extract all schema data defined via JSON-LD, i.e. anything between the opening <script type="application/ld+json"> tag and the closing </script> tag.

To do this, set the extraction type to ‘Regex’ and enter the following:

<script type=\"application\/ld\+json\">(.*?)</script>

Note: When exporting the data from Screaming Frog to Excel or CSV the line breaks will be lost.
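To sanity-check the pattern before a crawl, or to recover readable JSON from an export, the same regex can be tried in a few lines of Python. This is a minimal sketch under some assumptions: Python's `re` module stands in for Screaming Frog's regex engine (so behaviour may differ slightly), `re.DOTALL` is added so that `.` spans line breaks, and the sample page and the name `extract_json_ld` are hypothetical.

```python
import json
import re

# Same pattern as the rule above. re.DOTALL lets "." match newlines so that
# multi-line JSON-LD blocks are captured (an assumption for this sketch).
JSON_LD_PATTERN = re.compile(
    r'<script type=\"application\/ld\+json\">(.*?)</script>', re.DOTALL
)


def extract_json_ld(html):
    """Return each JSON-LD block found in the HTML as a parsed Python object."""
    blocks = []
    for raw in JSON_LD_PATTERN.findall(html):
        try:
            blocks.append(json.loads(raw))
        except json.JSONDecodeError:
            # Skip a malformed block rather than failing the whole page.
            continue
    return blocks


# Hypothetical sample page used only for illustration.
SAMPLE_HTML = """
<html><head>
<script type="application/ld+json">
{
  "@context": "http://schema.org",
  "@type": "Product",
  "name": "Example Widget",
  "productID": "sku-12345"
}
</script>
</head><body></body></html>
"""

if __name__ == "__main__":
    for block in extract_json_ld(SAMPLE_HTML):
        print(block["@type"], block["productID"])
```

Parsing the captured text with `json.loads` also makes the lost line breaks irrelevant, since the structure is rebuilt from the JSON itself.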

Extract All ItemType Declarations

This rule will extract the name of any ‘itemtype’ defined via microdata on the given URL, e.g. ‘LocalBusiness’, ‘PostalAddress’, ‘Product’ and so forth. This is great if you simply wish to determine which, if any, instances of microdata are in operation.

To do this, set the extraction type to ‘XPath’ and enter the following:

//*[@itemtype]/@itemtype
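For anyone who wants to see what this selects without opening Screaming Frog, here is a rough Python equivalent. It does not run XPath at all; it uses the standard library's `HTMLParser` to collect the same attribute values, and the class and function names are my own for illustration. Note that the raw attribute value is normally a full URL such as http://schema.org/LocalBusiness, so the sketch includes a small helper to trim it to the short type name.

```python
from html.parser import HTMLParser


class ItemTypeCollector(HTMLParser):
    """Collect every itemtype attribute value, in document order."""

    def __init__(self):
        super().__init__()
        self.item_types = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; names arrive lower-cased.
        for name, value in attrs:
            if name == "itemtype" and value:
                self.item_types.append(value)


def extract_item_types(html):
    """Mimic //*[@itemtype]/@itemtype using the stdlib HTML parser."""
    parser = ItemTypeCollector()
    parser.feed(html)
    parser.close()
    return parser.item_types


def short_name(item_type_url):
    """Trim 'http://schema.org/LocalBusiness' down to 'LocalBusiness'."""
    return item_type_url.rstrip("/").rsplit("/", 1)[-1]


# Hypothetical microdata snippet used only for illustration.
SAMPLE_HTML = """
<div itemscope itemtype="http://schema.org/LocalBusiness">
  <span itemprop="name">Example Cafe</span>
  <div itemprop="address" itemscope itemtype="http://schema.org/PostalAddress">
    <span itemprop="addressLocality">London</span>
  </div>
</div>
"""

if __name__ == "__main__":
    print([short_name(t) for t in extract_item_types(SAMPLE_HTML)])
```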

Extract All ItemProp Declarations

Similar to the ‘itemtype’ rule, the ‘itemprop’ rule will extract the name of any properties defined by microdata on the given URL, e.g. ‘Name’, ‘Image’, ‘Price’, ‘City’ etc.

Please note that this will give you the name only; it will not provide the value defined (i.e. it will return the property ‘City’ not value ‘London’).

This is a useful rule if you simply wish to determine which, if any, properties have been defined for each ItemType declaration.

To do this, set the extraction type to ‘XPath’ and enter the following:

//*[@itemprop]/@itemprop
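If you do need the values alongside the property names, one option is to post-process the HTML outside Screaming Frog. The following is a minimal sketch, not a full microdata parser. It assumes the snippet is well-formed XHTML (so `itemscope` is written as `itemscope=""`), reads the value from a `content` attribute when present and from the element's text otherwise, and uses a hypothetical helper name, `itemprop_pairs`.

```python
import xml.etree.ElementTree as ET


def itemprop_pairs(xhtml):
    """Return (property name, value) pairs for every element with an itemprop."""
    root = ET.fromstring(xhtml)
    pairs = []
    # root.iter() walks the root element and all descendants in document order.
    for element in root.iter():
        name = element.get("itemprop")
        if name is None:
            continue
        # Prefer an explicit content attribute, else fall back to the text.
        value = element.get("content") or "".join(element.itertext()).strip()
        pairs.append((name, value))
    return pairs


# Hypothetical, well-formed snippet used only for illustration.
SAMPLE_XHTML = """
<div itemscope="" itemtype="http://schema.org/LocalBusiness">
  <span itemprop="name">Example Cafe</span>
  <meta itemprop="priceRange" content="$$" />
  <div itemprop="address" itemscope="" itemtype="http://schema.org/PostalAddress">
    <span itemprop="addressLocality">London</span>
  </div>
</div>
"""

if __name__ == "__main__":
    for name, value in itemprop_pairs(SAMPLE_XHTML):
        print(name, "=", value)
```

Real-world pages are rarely well-formed XML, so treat this as a way to understand the name versus value distinction rather than something to point at a live site.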

Extract a Single JSON-LD Data Item

This regex formula will extract a single data item from a JSON-LD declaration. For example, if you know your schema mark-up is defined via JSON-LD and you know it uses the ‘ProductID’ property but you simply wish to know the value defined, you can use this rule to extract that data.

To do this, set the extraction type to ‘Regex’ and enter the following:

("productID":".*?")
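One thing worth knowing is that this pattern expects no whitespace between the colon and the value, so it suits minified JSON-LD best. The sketch below tries the rule in Python (standing in for Screaming Frog's regex engine, which may behave differently) next to a more forgiving variant. The lenient pattern, the function name and the sample strings are my own assumptions for illustration.

```python
import re

# The rule exactly as entered in Screaming Frog.
STRICT_PATTERN = re.compile(r'("productID":".*?")')

# Assumed variant: tolerates optional whitespace around the colon and
# captures only the value rather than the whole key/value pair.
LENIENT_PATTERN = re.compile(r'"productID"\s*:\s*"(.*?)"')


def product_ids(text):
    """Return every productID value found by the lenient pattern."""
    return LENIENT_PATTERN.findall(text)


# Hypothetical JSON-LD fragments used only for illustration.
MINIFIED = '{"@type":"Product","productID":"sku-12345"}'
PRETTY = '{"@type": "Product", "productID": "sku-12345"}'

if __name__ == "__main__":
    print(STRICT_PATTERN.findall(MINIFIED))  # matches the minified form
    print(STRICT_PATTERN.findall(PRETTY))    # misses the spaced form
    print(product_ids(PRETTY))
```

If the strict rule returns nothing on pages you know carry the property, checking for a space after the colon is a sensible first step.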

Extract All MicroData

Finally, you can utilise the following rule to extract ALL microdata from a given URL. This will extract the inner HTML of any ‘itemtype’ declarations, essentially providing you with the HTML which carries the microdata.

Note: The downside to this approach is that it may extract multiple instances of the same data. For example, if you had a child ‘PostalAddress’ type nested beneath the parent ‘LocalBusiness’ type (which is pretty standard), the aforementioned formula will extract two sets of HTML, which will present some duplicate data; but you can slice and dice the XLSX or CSV data as you see fit.

So, to do this, set the extraction type to ‘XPath’ and enter the following:

//*[@itemtype]
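The duplication caveat is easier to see with a small example. The sketch below assumes a well-formed XHTML snippet and uses Python's `xml.etree.ElementTree` rather than Screaming Frog itself; the function names are hypothetical. The first function selects every element with an ‘itemtype’, as the rule does, so the nested address comes back twice. The second stops descending once it finds an ‘itemtype’, which is one possible way to keep only the outermost blocks.

```python
import xml.etree.ElementTree as ET


def all_fragments(xhtml):
    """Serialise every element carrying an itemtype, nested ones included."""
    root = ET.fromstring(xhtml)
    return [
        ET.tostring(element, encoding="unicode")
        for element in root.iter()
        if element.get("itemtype") is not None
    ]


def top_level_fragments(xhtml):
    """Serialise only the outermost itemtype elements to avoid duplicates."""
    fragments = []

    def walk(element):
        if element.get("itemtype") is not None:
            fragments.append(ET.tostring(element, encoding="unicode"))
            return  # nested types are already inside this fragment
        for child in element:
            walk(child)

    walk(ET.fromstring(xhtml))
    return fragments


# Hypothetical, well-formed snippet used only for illustration.
SAMPLE_XHTML = """
<body>
  <div itemscope="" itemtype="http://schema.org/LocalBusiness">
    <span itemprop="name">Example Cafe</span>
    <div itemprop="address" itemscope="" itemtype="http://schema.org/PostalAddress">
      <span itemprop="addressLocality">London</span>
    </div>
  </div>
</body>
"""

if __name__ == "__main__":
    print(len(all_fragments(SAMPLE_XHTML)), "fragments with nesting included")
    print(len(top_level_fragments(SAMPLE_XHTML)), "fragment at the top level")
```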

Conclusion

So there you have it! 5 relatively quick and simple extraction rules for Screaming Frog to obtain and better understand your schema mark-up!

If anyone has any requests for data extraction, or if you have any alternative (or better!) ways of deploying the above extraction rules, please let me know.

Many thanks!
