Rearchitecting Client Events at Stitch Fix

Rob Wierzbowski

 

Stitch Fix is a data-driven company

Track the user's experience

Route loads, views and clicks

Real-time personalization

Measure KPIs

Guide business decisions

Client Events

Setting the stage

No centralized responsible party to vet event requirements.

No standardization of event schemas. Teams created new event schemas for most features and use cases.

No standardization of event triggers. Events were triggered in different layers of an application, with different heuristics and success ratios (e.g., a screen load event sent from a SPA on route load, or from the backend on route request).

Complicated, non-standardized contextual data was added to each event. Often known as “subsource”, it could contain information about UI near the event trigger, site region (e.g., checkout, product page), actions the client took in the past, and even other subsources.

Little to no feedback on event impact after implementation.

Major Issues

Doubled time to feature completion. Teams reported spending 50% of feature development time on Client Events, up from negligible with GA.

Stressful negotiation of event data. Many event context requests involved large lifts, which engineers resisted.

Highly coupled app code. Event values were passed deeply through component trees, creating frustrating spaghetti codebases.

Increased maintenance costs. Causes included databases built to support event context, upkeep of unused events, and refactoring friction from coupled code.

Limited ability to analyze events across teams, due to non-standard event shapes.

Incorrect KPI and personalization results caused by frequent bugs. Testing was difficult and errors often went unnoticed.

Major Impacts

Solve: Rearchitecting Client Events

  • Increase standardization
  • Reduce time per feature
  • Balance responsibilities between Eng and Algos orgs

Goals

  • Business critical actions and entities:
    • View and Click (Select)
    • Routes (Screens), Categories, Outfits, SKUs, Generics
  • Minimal, atomic events
  • Strict definitions; unions where possible
  • TypeScript compiled into JSON Schema for transport
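
As an illustration of the bullets above, here is a minimal sketch of what a strict, union-based event definition could look like in TypeScript. The type names, field names, and the `describeEvent` helper are hypothetical and are not taken from the actual Stitch Fix schemas. The compilation step to JSON Schema is not shown.

```typescript
// Hypothetical entity types, loosely based on the entities listed on the slide.
type EntityType = "route" | "category" | "outfit" | "sku" | "generic";

interface Entity {
  type: EntityType;
  id: string;
}

// Each event is minimal and atomic: one action on one entity.
interface ViewEvent {
  action: "view";
  entity: Entity;
  timestamp: number; // milliseconds since the Unix epoch
}

interface SelectEvent {
  action: "select";
  entity: Entity;
  timestamp: number;
}

// A discriminated union: the `action` field tells the compiler which shape applies.
type ClientEvent = ViewEvent | SelectEvent;

// Exhaustive handling: adding a new action to the union without handling it
// here becomes a compile-time error via the `never` check.
function describeEvent(event: ClientEvent): string {
  switch (event.action) {
    case "view":
      return `viewed ${event.entity.type} ${event.entity.id}`;
    case "select":
      return `selected ${event.entity.type} ${event.entity.id}`;
    default: {
      const unreachable: never = event;
      return unreachable;
    }
  }
}
```

A design note: unions of string literals keep the set of allowed values closed, so a typo such as `"veiw"` is rejected by the compiler rather than discovered later in analysis.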

Strongly typed schemas

  • Distributed tool to send events from frontend apps
  • Well documented, fully featured (test helpers, bug tracker integration, etc.)
  • Optimizes transport and triggers
  • Automates event context

Event Reporter

  • Intersection Observers trigger an event when 60% of a component is in the viewport, or when the component covers 30% of the available viewport
  • Transport via keepalive Fetch
  • Batching and Gzip for performant bytes over wire
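
The trigger and batching rules above can be sketched roughly as follows. This is an assumption-laden illustration, not the Event Reporter's real code: `isViewed`, `EventBatcher`, the `Size` type, and the batch size are all hypothetical, and Gzip compression is left out.

```typescript
interface Size {
  width: number;
  height: number;
}

// Thresholds taken from the slide: 60% of the component, or 30% of the viewport.
const COMPONENT_RATIO = 0.6;
const VIEWPORT_RATIO = 0.3;

// Returns true when enough of the component is visible to count as a view.
// `visible` is the size of the part of the component inside the viewport.
function isViewed(component: Size, visible: Size, viewport: Size): boolean {
  const componentArea = component.width * component.height;
  const visibleArea = visible.width * visible.height;
  const viewportArea = viewport.width * viewport.height;
  if (componentArea <= 0 || viewportArea <= 0) return false;
  return (
    visibleArea / componentArea >= COMPONENT_RATIO ||
    visibleArea / viewportArea >= VIEWPORT_RATIO
  );
}

// Collects events and hands them to `send` in batches of at most `maxSize`.
class EventBatcher<T> {
  private queue: T[] = [];

  constructor(
    private readonly maxSize: number,
    private readonly send: (batch: T[]) => void,
  ) {}

  add(event: T): void {
    this.queue.push(event);
    if (this.queue.length >= this.maxSize) this.flush();
  }

  // Sends whatever is queued; a caller might invoke this when the page is hidden.
  flush(): void {
    if (this.queue.length === 0) return;
    const batch = this.queue;
    this.queue = [];
    this.send(batch);
  }
}
```

In a browser, the three sizes could come from an `IntersectionObserverEntry` (`boundingClientRect`, `intersectionRect`, and `rootBounds`), and the `send` callback could call `fetch` with `method: "POST"` and `keepalive: true` so that a batch still goes out while the page is unloading.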

Event Reporter

  • Validation
  • Security
  • Message transportation (RabbitMQ)
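
To make the validation step concrete, here is a small sketch of server-side payload checking. It is a hand-written stand-in under stated assumptions: the slides say events travel as JSON Schema, so a real service would more plausibly validate against the compiled schemas. The allowed values, `validateEvent`, and the error messages are hypothetical.

```typescript
// Hypothetical allowed values, mirroring the actions and entities named in the deck.
const ACTIONS: readonly string[] = ["view", "select"];
const ENTITY_TYPES: readonly string[] = ["route", "category", "outfit", "sku", "generic"];

type ValidationResult = { ok: true } | { ok: false; errors: string[] };

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

function oneOf(allowed: readonly string[], value: unknown): boolean {
  return typeof value === "string" && allowed.indexOf(value) !== -1;
}

// Checks an untrusted payload and collects every problem found,
// so the sender gets a complete report rather than only the first failure.
function validateEvent(payload: unknown): ValidationResult {
  if (!isRecord(payload)) {
    return { ok: false, errors: ["payload must be an object"] };
  }
  const errors: string[] = [];
  if (!oneOf(ACTIONS, payload.action)) {
    errors.push("action must be one of: " + ACTIONS.join(", "));
  }
  const entity = payload.entity;
  if (!isRecord(entity)) {
    errors.push("entity must be an object");
  } else {
    if (!oneOf(ENTITY_TYPES, entity.type)) {
      errors.push("entity.type is not recognized");
    }
    if (typeof entity.id !== "string" || entity.id.length === 0) {
      errors.push("entity.id must be a non-empty string");
    }
  }
  return errors.length === 0 ? { ok: true } : { ok: false, errors };
}
```

Only payloads that pass a check like this would be forwarded to the message queue; rejected payloads can be logged with their error list for debugging.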

Client Event Service

  • A long process
  • Many revisions in the early stages
  • Thin implementations
  • Parallel implementations
  • Building support from the ground up

 

Rollout

  • Client Event implementation has fallen to 10-15% of feature time. Reducing time and stress around events was by far our biggest goal, and teams using the new architecture report great success. Event content negotiation time has fallen by 90%.

  • Teams are decoupling components, increasing isolation, reuse, and test coverage with Event Reporter.

  • Communication is improved and responsibility is clear. Questions and discussions are handled efficiently, and teams share well-defined, standardized nomenclature around Client Events.

  • Algos can analyze data across all client facing apps. New events have flexibility that will lead to improved analysis over time.

  • Bespoke Algos pipelines and event processing are being removed.

  • Personalization and KPI report accuracy has improved thanks to better testing, strict schemas, and multiple validation layers.

  • We have not lost significant analysis ability by switching to a simplified event context.

Impact

Discussion
