XML API parsing in Erlang
Some weeks ago I wrote an application in Erlang which reads data from the YouTube API. I wanted to parse its content efficiently via a SAX parser, and xmerl was the first search result that popped up. At first I had trouble reading its documentation, but it is actually easy enough to use as long as you stay with the SAX part (beware of the rest, it can kill your VM with a flood of atoms).
Implementing the gen_server behaviour
The whole application is built with Erlang/OTP in mind, so the crawler is started as a gen_server (via a supervisor/ppool setup) which starts to parse after it has initialized itself (the parts not relevant here are omitted):
-module(erlytdl_crawler_worker).
-behaviour(gen_server).
-include("erlytdl.hrl").
-include_lib("xmerl/include/xmerl.hrl").
%% API
-export([start_link/1]).
-export([init/1,handle_call/3,handle_cast/2,handle_info/2,code_change/3,terminate/2]).
%% ===================================================================
%% API functions
%% ===================================================================
start_link(Args) ->
    gen_server:start_link(?MODULE, Args, []).
%% ===================================================================
%% gen_server callbacks
%% ===================================================================
init(Channel = #channel{}) ->
    %error_logger:info_msg("client crawler init~n",[]),
    %% timeout 0: handle_info(timeout, ...) fires right after init returns
    {ok, Channel, 0}.
%% ...
handle_info(timeout, Channel) ->
    %% sync here, after we are finished, stop the process
    case get_total(Channel#channel.name) of
        {ok, Total} ->
            if
                Total > Channel#channel.total ->
                    error_logger:info_msg("~s: Parsing ~s entries~n",
                        [Channel#channel.name, integer_to_list(Total - Channel#channel.total)]),
                    get_new_entries(
                        build_url(
                            channel_info,
                            [Channel#channel.name, integer_to_list(1), integer_to_list(25)]),
                        Total - Channel#channel.total,
                        Channel),
                    %% update total in channel
                    update_total(Channel, Total),
                    {stop, normal, Channel};
                true ->
                    {stop, {shutdown, nothing_new}, Channel}
            end;
        Error ->
            {stop, {shutdown, Error}, Channel}
    end;
handle_info(Msg, Channel) ->
    error_logger:error_msg("Unknown Message: ~p~n", [Msg]),
    {noreply, Channel}.
%% ...
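Because init/1 returns {ok, Channel, 0}, the new process immediately receives a timeout message and starts crawling in handle_info/2 instead of blocking its supervisor during init. To make the supervision side concrete, here is a minimal sketch of how such a worker could be attached to a plain simple_one_for_one supervisor; this is a hypothetical stand-in for illustration, the actual application uses the supervisor/ppool setup mentioned above:

-module(erlytdl_crawler_sup).
-behaviour(supervisor).

-export([start_link/0, start_crawler/1]).
-export([init/1]).

start_link() ->
    supervisor:start_link({local, ?MODULE}, ?MODULE, []).

%% start one crawler worker per channel record
start_crawler(Channel) ->
    supervisor:start_child(?MODULE, [Channel]).

init([]) ->
    %% transient: crashed crawlers are restarted, but workers that stop
    %% with normal or {shutdown, _} after a finished run are left alone
    {ok, {{simple_one_for_one, 5, 60},
          [{crawler, {erlytdl_crawler_worker, start_link, []},
            transient, 5000, worker, [erlytdl_crawler_worker]}]}}.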
Making the httpc request
Before we can parse the XML content we need to fetch it from the server via httpc, which happens in get_new_entries:
Argument | Description |
---|---|
URL | The URL we need to query |
ToFind | How many more entries we still need to find (calculated from the total number of entries in the feed and the number of entries we already found) |
Channel | General information about the channel we are parsing, needed for DB interaction (not covered here) |
get_new_entries(none, _ToFind, _Channel) ->
    finished;
get_new_entries(_URL, ToFind, _Channel) when ToFind =< 0 ->
    finished_early;
get_new_entries(URL, ToFind, Channel) when ToFind > 0 ->
    case httpc:request(URL) of
        {ok, {{_Version, 200, _ReasonPhrase}, _Headers, Body}} ->
            {ok, #parser_state{file_list=FileList, next_url=NextUrl}, _} = parse_xml(Body),
            NewToFind = insert_files(FileList, ToFind, Channel),
            get_new_entries(NextUrl, NewToFind, Channel);
        {ok, {{_Version, 404, _ReasonPhrase}, _Headers, _Body}} ->
            error_logger:error_msg("404 during httpc:request(~s)~n", [URL]),
            not_found
    end.
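One assumption the code above makes silently: httpc is part of the inets application, which has to be running before httpc:request/1 works (the ssl application is needed on top of that for https URLs), e.g.:

%% make sure the applications backing httpc are running; both calls
%% return {error, {already_started, _}} if they already are
inets:start(),
ssl:start().

build_url/2 is one of the omitted helpers; for completeness, a hypothetical version for the old GData-style uploads feed could look like this (the URL format is my assumption, not taken from the original source):

%% hypothetical sketch of the omitted build_url/2 helper, assuming the
%% GData v2 uploads feed with start-index/max-results paging
build_url(channel_info, [Name, StartIndex, MaxResults]) ->
    "http://gdata.youtube.com/feeds/api/users/" ++ Name ++
        "/uploads?start-index=" ++ StartIndex ++
        "&max-results=" ++ MaxResults.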
xmerl SAX parser in action
Now that the basics are out of the way we can get to the real thing: parsing the XML content via parse_xml, which gets passed the body we retrieved via httpc:request earlier. The most crucial part of SAX parsing is having a decent data structure to put your data into; we are going to use parser_state for that purpose.
-record(parser_state,{status=intro, last_string="", file_list=[], next_url=none}).
status
: initialized with intro and keeps track of the ‘mode’ the parser is currently in

last_string
: keeps track of the last string the parser found; we need this to access the content of XML entities

file_list
: the finished list of entries we parsed; we are going to use this as an accumulator

next_url
: YouTube’s API tells us where we need to query next to get more entries, so instead of calculating it ourselves we save it here
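The #file{} record accumulated in file_list is defined in erlytdl.hrl and not shown in this post; judging from the fields the parser sets below, it presumably looks roughly like this:

%% assumed shape of #file{}, derived from the fields the parser sets;
%% the real record in erlytdl.hrl may contain more fields
-record(file, {title, video_url, timestamp}).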
parse_xml(XML) ->
    xmerl_sax_parser:stream(XML, [
        {event_fun,
         fun
             ({startElement, _, "entry", _, _}, _, State = #parser_state{file_list=Files}) ->
                 State#parser_state{status=ready, file_list=[#file{}|Files]};
             ({startElement, _, "link", _,
               [{[],[],"rel","next"},
                {[],[],"type","application/atom+xml"},
                {[],[],"href",NextUrl}]},
              _, State = #parser_state{}) ->
                 State#parser_state{next_url=NextUrl};
             (_, _, State = #parser_state{status=intro}) ->
                 State; % ignore everything while not in entry yet
             ({startElement, _, "link", _,
               [{[],[],"rel","alternate"},
                {[],[],"type","text/html"},
                {[],[],"href",VideoUrl}]},
              _, State = #parser_state{file_list=[Current = #file{}|Other]}) ->
                 State#parser_state{file_list=[Current#file{video_url=VideoUrl}|Other]};
             ({endElement, _, "published", _}, _,
              State = #parser_state{file_list=[Current = #file{}|Other], last_string=Str}) ->
                 State#parser_state{file_list=[Current#file{timestamp=Str}|Other]};
             ({endElement, _, "title", _}, _,
              State = #parser_state{file_list=[Current = #file{}|Other], last_string=Str}) ->
                 State#parser_state{file_list=[Current#file{title=Str}|Other]};
             ({characters, Str}, _, State = #parser_state{}) ->
                 State#parser_state{last_string=Str};
             (_, _, State) ->
                 State
         end},
        {event_state, #parser_state{}}
    ]).
We pass our initial state to the SAX parser via the {event_state, #parser_state{}} option. The other option we pass is our event_fun, which gets called by the parser for every event that occurs. We mostly use startElement and endElement here, which are emitted for start/end tags. Whenever the parser finds a string we save it in #parser_state.last_string so we can use it in the next endElement that occurs. Every other event gets caught by the last pattern we match against, (_, _, State), which simply returns the unmodified State. Also note the third pattern, which catches events that don't match the first two while we are still in the intro state.
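To see the parser in isolation, here is a hypothetical test snippet feeding a hand-written Atom-like fragment to parse_xml (this assumes parse_xml/1 is exported for testing; the XML is made up and much simpler than the real YouTube feed):

%% hypothetical test, assuming parse_xml/1 is exported
XML = "<feed><entry><title>Demo</title>"
      "<published>2013-01-01T00:00:00Z</published></entry></feed>",
{ok, State, _Rest} = erlytdl_crawler_worker:parse_xml(XML),
%% State now holds one #file{} with title "Demo" and timestamp
%% "2013-01-01T00:00:00Z"; next_url stays none because the fragment
%% contains no rel="next" link
State#parser_state.file_list.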
That's all there is to it. We end up with a Result tuple which, if parsing succeeds, has the form {ok, #parser_state{file_list=FileList, next_url=NextUrl}, _Rest}. We take the populated FileList as well as the NextUrl from the final State and repeat the whole process if necessary.
The full source is available on GitHub.