We can use the user logs to assemble a picture of how users navigate the site — 'sessionizing' their pageviews. Marketing and e-commerce sites have a great deal of interest in optimizing their "conversion funnel", the sequence of pages that visitors follow before filling out a contact form, or buying those shoes, or whatever it is the site exists to serve. Visitor sessions are also useful for defining groups of related pages, in a manner far more robust than what simple page-to-page links would define. A recommendation algorithm using those relations would for example help an e-commerce site recommend teflon paste to a visitor buying plumbing fittings, or help a news site recommend an article about Marilyn Monroe to a visitor who has just finished reading an article about John F Kennedy. Many commercial web analytics tools don’t offer a view into user sessions — assembling them is extremely challenging for a traditional datastore. It’s a piece of cake for Hadoop, as you’re about to see.
NOTE:[Take a moment and think about the locality: what feature(s) do we need to group on? What additional feature(s) should we sort with?]
spit out [ip, date_hr, visit_time, path]
.
link:code/serverlogs/old/logline-03-breadcrumbs-full.rb[role=include]
You might ask why we don’t partition directly on say both visitor_id
and date (or other time bucket). Partitioning by date would break the locality of any visitor session that crossed midnight: some of the requests would be in one day, the rest would be in the next day.
run it in map mode:
link:code/serverlogs/old/logline-02-histograms-01-mapper.log[role=include]
link:code/serverlogs/old/logline-03-breadcrumbs-02-mapper.log[role=include]
group on user
link:code/serverlogs/old/logline-03-breadcrumbs-03-reducer.log[role=include]
We use the secondary sort so that each visit is in strict order of time within a session.
You might ask why that is necessary — surely each mapper reads the lines in order? Yes, but you have no control over what order the mappers run, or where their input begins and ends.
This script will accumulate multiple visits of a page.
TODO: say more about the secondary sort.