Feature #2: Locating Stock Data
Understand how to locate stock data in an HTML DOM tree by assigning scores to nodes based on text features. Learn to find the Lowest Common Ancestor of key nodes to efficiently extract date and stock percentage data from webpage content.
Description
Now we need to identify which nodes of the website’s DOM tree contain the stock data. The data we are looking for are the dates on which a certain stock’s price went up or down, together with the corresponding percentage changes. Identifying stock data in arbitrary HTML can be hard, so we’ll use the following technique.
As in the previous lesson, we’ll traverse the DOM tree and score each node on how likely it is to hold a date or a stock percentage, based on the text inside it. To keep the process efficient, we also want to limit the DOM subtree that we process.
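The traversal itself can be sketched as follows. This is a minimal illustration, not the lesson's implementation: the `Node` class, the `score_tree` function, and the `has_digit` placeholder scorer are all hypothetical names introduced here, and a real page would be parsed into a proper DOM first.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Node:
    """Hypothetical stand-in for a DOM node: a tag, its own text, and children."""
    tag: str
    text: str = ""
    children: List["Node"] = field(default_factory=list)


def score_tree(root: Node, scorer: Callable[[str], int]) -> List[Tuple[Node, int]]:
    """Visit every node depth-first and score those that carry text."""
    results = []
    stack = [root]
    while stack:
        node = stack.pop()
        text = node.text.strip()
        if text:
            results.append((node, scorer(text)))
        # Push children in reverse so they are visited in document order.
        stack.extend(reversed(node.children))
    return results


def has_digit(text: str) -> int:
    """Placeholder scorer (assumption): one point if the text contains a digit."""
    return int(any(ch.isdigit() for ch in text))


page = Node("div", children=[
    Node("h1", "Stock history"),
    Node("table", children=[
        Node("td", "Jan 5"),
        Node("td", "+1.2%"),
    ]),
])

scores = [(node.tag, node.text, score) for node, score in score_tree(page, has_digit)]
print(scores)
```

Passing the scorer in as a parameter means the same traversal can be reused for both the date criteria and the stock percentage criteria.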
Here are the scoring criteria for how likely a node is to be a date:
- A node whose text starts with a capital letter
- A node whose text ends in a number
- A node whose text contains the # symbol
- A node whose text is under ten characters
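The date criteria above could be turned into a scoring function along these lines. The lesson does not state how much each criterion is worth, so this sketch assumes one point per matched criterion. The name `date_score` is hypothetical.

```python
def date_score(text: str) -> int:
    """Score how date-like a node's text is (assumption: one point per criterion)."""
    text = text.strip()
    if not text:
        return 0
    score = 0
    if text[0].isupper():   # starts with a capital letter
        score += 1
    if text[-1].isdigit():  # ends in a number
        score += 1
    if "#" in text:         # contains the # symbol
        score += 1
    if len(text) < 10:      # under ten characters
        score += 1
    return score


print(date_score("Jan 5"))          # capital start, digit end, short
print(date_score("Stock history"))  # capital start only
```

With equal weights, a short date-like string such as "Jan 5" matches three of the four criteria, while a longer heading matches only one.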
Here are the scoring criteria for how likely a node is to be a stock percentage:
- A node whose text is short
- A node whose text contains a number
- A node whose text contains the + or - sign
- A node whose text contains the % sign
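A matching sketch for the stock percentage criteria is shown below. Two things are assumptions here: each criterion is worth one point, and "short" is taken to mean under ten characters, exposed as the hypothetical `max_len` parameter. The name `percentage_score` is also hypothetical.

```python
def percentage_score(text: str, max_len: int = 10) -> int:
    """Score how percentage-like a node's text is (assumption: one point per criterion)."""
    text = text.strip()
    if not text:
        return 0
    score = 0
    if len(text) < max_len:               # text is short (threshold is an assumption)
        score += 1
    if any(ch.isdigit() for ch in text):  # contains a number
        score += 1
    if "+" in text or "-" in text:        # contains a + or - sign
        score += 1
    if "%" in text:                       # contains a % sign
        score += 1
    return score


print(percentage_score("+1.2%"))  # matches all four criteria
print(percentage_score("Jan 5"))  # short and contains a digit
```

Under these assumptions, a date such as "Jan 5" still picks up two points, so the highest-scoring node for each category is the useful signal rather than any nonzero score.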
After this step, we’ll find two nodes: one node with a ...