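1. Module Creation and Data Preparation
Under UserBehaviorAnalysis, create another Maven module as a subproject and name it MarketAnalysis.
This module has no ready-made data set, so we either use a custom test source to produce a test data stream or generate a test data file directly.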
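2. APP Market Promotion Statistics
With the spread of smartphones, more and more users of e-commerce sites now come from mobile devices. Compared with logging in through a traditional browser, the mobile APP has become the preferred way for many users to reach an e-commerce site. E-commerce companies generally promote their APP through a variety of channels, and the statistics from these channels (for example, click counts on ad links placed on different websites, or APP download counts) become important business metrics for marketing.
We first look at promotion statistics broken down by channel. Create the class AppMarketingByChannel under src/main/java. Because there is no ready-made data, we need a custom test source that generates a stream of user behavior events.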
2.1 Custom Test Data Source
Define a POJO class MarketingUserBehavior for the source data, then define a SourceFunction named SimulatedMarketingBehaviorSource that produces the user behavior source data:
// Custom test data source (nested inside AppMarketingByChannel)
public static class SimulatedMarketingBehaviorSource implements SourceFunction<MarketingUserBehavior> {
    // Running flag; volatile because cancel() is called from a different thread than run()
    private volatile boolean running = true;

    // Candidate user behaviors and promotion channels
    List<String> behaviorList = Arrays.asList("CLICK", "DOWNLOAD", "INSTALL", "UNINSTALL");
    List<String> channelList = Arrays.asList("app store", "weibo", "wechat", "tieba");

    Random random = new Random();

    @Override
    public void run(SourceContext<MarketingUserBehavior> ctx) throws Exception {
        while (running) {
            // Randomly generate every field of the event
            Long id = random.nextLong();
            String behavior = behaviorList.get(random.nextInt(behaviorList.size()));
            String channel = channelList.get(random.nextInt(channelList.size()));
            Long timestamp = System.currentTimeMillis();

            ctx.collect(new MarketingUserBehavior(id, behavior, channel, timestamp));

            // Emit roughly one event every 50 ms
            Thread.sleep(50L);
        }
    }

    @Override
    public void cancel() {
        running = false;
    }
}
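The MarketingUserBehavior POJO itself is not listed in this section. The following is a minimal sketch of what it could look like; the field names userId, behavior, channel and timestamp are assumptions inferred from the constructor call in the source above and from the getters and key fields used later (getBehavior(), getTimestamp(), keyBy("channel", "behavior")).

```java
// Hypothetical sketch of the MarketingUserBehavior POJO; field names are inferred,
// not taken from the original project. For Flink to treat it as a POJO, declare
// the class public in its own file MarketingUserBehavior.java.
class MarketingUserBehavior {
    private Long userId;
    private String behavior;
    private String channel;
    private Long timestamp; // event time in milliseconds

    // Flink's POJO rules require a public no-argument constructor
    public MarketingUserBehavior() {
    }

    public MarketingUserBehavior(Long userId, String behavior, String channel, Long timestamp) {
        this.userId = userId;
        this.behavior = behavior;
        this.channel = channel;
        this.timestamp = timestamp;
    }

    // Getters and setters let Flink access the private fields by name
    public Long getUserId() { return userId; }
    public void setUserId(Long userId) { this.userId = userId; }

    public String getBehavior() { return behavior; }
    public void setBehavior(String behavior) { this.behavior = behavior; }

    public String getChannel() { return channel; }
    public void setChannel(String channel) { this.channel = channel; }

    public Long getTimestamp() { return timestamp; }
    public void setTimestamp(Long timestamp) { this.timestamp = timestamp; }

    @Override
    public String toString() {
        return "MarketingUserBehavior{userId=" + userId + ", behavior='" + behavior
                + "', channel='" + channel + "', timestamp=" + timestamp + "}";
    }
}
```

Keeping the field names identical to the strings passed to keyBy matters, because field-expression keys are resolved by name against the POJO.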
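2.2 Statistics by Channel
Next, define a POJO class ChannelPromotionCount for the window output, and process the stream with a custom pre-aggregation function (AggregateFunction) and a custom full-window function (ProcessWindowFunction). The code is as follows: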
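import org.apache.flink.api.common.functions.AggregateFunction;
import org.apache.flink.api.java.tuple.Tuple;
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.source.SourceFunction;
import org.apache.flink.streaming.api.functions.timestamps.AscendingTimestampExtractor;
import org.apache.flink.streaming.api.functions.windowing.ProcessWindowFunction;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;

import java.sql.Timestamp;
import java.util.Arrays;
import java.util.List;
import java.util.Random;

public class AppMarketingByChannel {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1);
        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

        // Read from the custom simulated source (nested class from section 2.1)
        DataStream<MarketingUserBehavior> dataStream = env
                .addSource(new SimulatedMarketingBehaviorSource())
                .assignTimestampsAndWatermarks(new AscendingTimestampExtractor<MarketingUserBehavior>() {
                    @Override
                    public long extractAscendingTimestamp(MarketingUserBehavior element) {
                        return element.getTimestamp();
                    }
                });

        // Drop UNINSTALL events, key by channel and behavior, then window and count
        DataStream<ChannelPromotionCount> resultStream = dataStream
                .filter(data -> !"UNINSTALL".equals(data.getBehavior()))
                .keyBy("channel", "behavior")
                .timeWindow(Time.hours(1), Time.seconds(5))
                .aggregate(new MarketingCountAgg(), new MarketingCountResult());

        resultStream.print();

        env.execute("app marketing by channel job");
    }

    // Custom pre-aggregation function: counts elements incrementally
    public static class MarketingCountAgg implements AggregateFunction<MarketingUserBehavior, Long, Long> {
        @Override
        public Long createAccumulator() {
            return 0L;
        }

        @Override
        public Long add(MarketingUserBehavior value, Long accumulator) {
            return accumulator + 1;
        }

        @Override
        public Long getResult(Long accumulator) {
            return accumulator;
        }

        @Override
        public Long merge(Long a, Long b) {
            return a + b;
        }
    }

    // Custom ProcessWindowFunction: wraps the count with the key fields and window end
    public static class MarketingCountResult
            extends ProcessWindowFunction<Long, ChannelPromotionCount, Tuple, TimeWindow> {
        @Override
        public void process(Tuple tuple, Context context, Iterable<Long> elements,
                            Collector<ChannelPromotionCount> out) throws Exception {
            // The key is a Tuple because keyBy was given field names
            String channel = tuple.getField(0);
            String behavior = tuple.getField(1);
            String windowEnd = new Timestamp(context.window().getEnd()).toString();
            // The pre-aggregator emits exactly one value per window
            Long count = elements.iterator().next();
            out.collect(new ChannelPromotionCount(channel, behavior, windowEnd, count));
        }
    }
}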
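To make the role of the pre-aggregation function more concrete, here is a plain-Java sketch that mirrors the four methods of MarketingCountAgg without depending on Flink. The class name CountAggSketch and the helper countAll are hypothetical, and String elements stand in for MarketingUserBehavior events; the point is only that the window keeps a single Long per key instead of buffering every event.

```java
import java.util.List;

// Plain-Java stand-in for the counting logic in MarketingCountAgg.
// This is an illustration only, not Flink's AggregateFunction interface.
final class CountAggSketch {
    // Start every window with a count of zero
    static Long createAccumulator() {
        return 0L;
    }

    // Called once per arriving element; the element's content is irrelevant for counting
    static Long add(String element, Long accumulator) {
        return accumulator + 1;
    }

    // Called when the window fires; the accumulator already is the result
    static Long getResult(Long accumulator) {
        return accumulator;
    }

    // Combines two partial counts, which is needed when windows are merged
    static Long merge(Long a, Long b) {
        return a + b;
    }

    // Hypothetical helper: folds elements one at a time, as a window would on arrival
    static Long countAll(List<String> elements) {
        Long acc = createAccumulator();
        for (String element : elements) {
            acc = add(element, acc);
        }
        return getResult(acc);
    }
}
```

Because the accumulator is updated on arrival, the ProcessWindowFunction that follows receives an Iterable holding just this one value, which is why the code above can safely call elements.iterator().next().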
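2.3 Total Statistics (All Channels Combined)
We can likewise compute promotion statistics without splitting by channel, which gives the total across all channels. Create the class AppMarketingStatistics under src/main/java with the following code: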
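import org.apache.flink.api.common.functions.AggregateFunction;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.tuple.Tuple;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.timestamps.AscendingTimestampExtractor;
import org.apache.flink.streaming.api.functions.windowing.ProcessWindowFunction;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;

import java.sql.Timestamp;

public class AppMarketingStatistics {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1);
        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

        // Reuse the simulated source defined in AppMarketingByChannel
        DataStream<MarketingUserBehavior> dataStream = env
                .addSource(new AppMarketingByChannel.SimulatedMarketingBehaviorSource())
                .assignTimestampsAndWatermarks(new AscendingTimestampExtractor<MarketingUserBehavior>() {
                    @Override
                    public long extractAscendingTimestamp(MarketingUserBehavior element) {
                        return element.getTimestamp();
                    }
                });

        // Map every event to the same key "total" so that all channels land in one group
        DataStream<ChannelPromotionCount> resultStream = dataStream
                .filter(data -> !"UNINSTALL".equals(data.getBehavior()))
                .map(new MapFunction<MarketingUserBehavior, Tuple2<String, Long>>() {
                    @Override
                    public Tuple2<String, Long> map(MarketingUserBehavior value) throws Exception {
                        return new Tuple2<>("total", 1L);
                    }
                })
                .keyBy(0)
                .timeWindow(Time.hours(1), Time.seconds(5))
                .aggregate(new MarketingStatisticsAgg(), new MarketingStatisticsResult());

        resultStream.print();

        env.execute("app marketing statistics job");
    }

    // Pre-aggregation function: counts elements incrementally
    public static class MarketingStatisticsAgg implements AggregateFunction<Tuple2<String, Long>, Long, Long> {
        @Override
        public Long createAccumulator() {
            return 0L;
        }

        @Override
        public Long add(Tuple2<String, Long> value, Long accumulator) {
            return accumulator + 1;
        }

        @Override
        public Long getResult(Long accumulator) {
            return accumulator;
        }

        @Override
        public Long merge(Long a, Long b) {
            return a + b;
        }
    }

    // Full-window function: emits the total count together with the window end
    public static class MarketingStatisticsResult
            extends ProcessWindowFunction<Long, ChannelPromotionCount, Tuple, TimeWindow> {
        @Override
        public void process(Tuple tuple, Context context, Iterable<Long> elements,
                            Collector<ChannelPromotionCount> out) throws Exception {
            String windowEnd = new Timestamp(context.window().getEnd()).toString();
            Long count = elements.iterator().next();
            out.collect(new ChannelPromotionCount("total", "total", windowEnd, count));
        }
    }
}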
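Both jobs in this section use timeWindow(Time.hours(1), Time.seconds(5)), so every event falls into many overlapping windows. The sketch below works out which windows a timestamp belongs to, using plain arithmetic modeled on how sliding windows are aligned with a zero offset. The class and method names are hypothetical, and the code assumes non-negative timestamps and a window size that is a multiple of the slide.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration of sliding-window assignment arithmetic; not Flink code.
final class SlidingWindowSketch {
    // Start of the latest slide-aligned window that contains the timestamp
    static long lastWindowStart(long timestamp, long slide) {
        return timestamp - (timestamp + slide) % slide;
    }

    // Start times of all windows [start, start + size) that contain the timestamp,
    // listed from the most recent window backwards
    static List<Long> windowStarts(long timestamp, long size, long slide) {
        List<Long> starts = new ArrayList<>();
        for (long start = lastWindowStart(timestamp, slide); start > timestamp - size; start -= slide) {
            starts.add(start);
        }
        return starts;
    }

    public static void main(String[] args) {
        long oneHour = 60L * 60 * 1000;
        long fiveSeconds = 5L * 1000;
        // An event at t = 12 s belongs to size / slide overlapping windows
        System.out.println(windowStarts(12_000L, oneHour, fiveSeconds).size());
    }
}
```

With a one-hour size and a 5-second slide, each event is assigned to 3600 / 5 = 720 windows. This is the reason for counting with an AggregateFunction: each window holds one running count rather than a copy of every event.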
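3. Page Advertisement Analysis
Among the marketing metrics of an e-commerce site, ad placement on its pages matters as well as promotion of its own APP (this includes ads for its own products and ads for other sites). Ad-related statistics are therefore also an important marketing metric.
The simplest and most important ad statistic is the page ad click count. Sites often rely on it to set pricing strategy and adjust promotion methods, and they can also use it to collect user preference information. More specifically, we can partition by the user's geographic location to summarize how users in different provinces prefer different ads, which helps target ads more precisely.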
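3.1 Page Ad Click Count Statistics
Next we count page ad clicks by province. Create the class AdStatisticsByProvince under src/main/java. Again there is no ready-made data, so we define some test data in AdClickLog.csv and use it to generate the event stream of ad clicks.
In the code we first define the POJO class AdClickEvent for the source data and the POJO class AdCountByProvince for the output statistics. The main function first applies keyBy on province, then opens a one-hour time window with a slide of 5 seconds, and counts the click events within each window. The implementation is as follows: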
import org.apache.flink.api.common.functions.AggregateFunction;
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.timestamps.AscendingTimestampExtractor;
import org.apache.flink.streaming.api.functions.windowing.WindowFunction;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;

import java.sql.Timestamp;

public class AdStatisticsByProvince {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
        env.setParallelism(1);

        // Read the file and convert each line into an AdClickEvent
        DataStream<AdClickEvent> adClickEventStream = env
                .readTextFile("..\\AdClickLog.csv")
                .map(data -> {
                    String[] fields = data.split(",");
                    return new AdClickEvent(Long.valueOf(fields[0]), Long.valueOf(fields[1]),
                            fields[2], fields[3], Long.valueOf(fields[4]));
                })
                .assignTimestampsAndWatermarks(new AscendingTimestampExtractor<AdClickEvent>() {
                    @Override
                    public long extractAscendingTimestamp(AdClickEvent element) {
                        // Timestamps in the file are in seconds; Flink expects milliseconds
                        return element.getTimestamp() * 1000L;
                    }
                });

        // Group by province, then window and aggregate
        DataStream<AdCountByProvince> adCountDataStream = adClickEventStream
                .keyBy(AdClickEvent::getProvince)
                .timeWindow(Time.hours(1), Time.seconds(5))
                .aggregate(new AdCountAgg(), new AdCountResult());

        adCountDataStream.print();

        env.execute("ad statistics job");
    }

    // Custom pre-aggregation function: counts click events incrementally
    public static class AdCountAgg implements AggregateFunction<AdClickEvent, Long, Long> {
        @Override
        public Long createAccumulator() {
            return 0L;
        }

        @Override
        public Long add(AdClickEvent value, Long accumulator) {
            return accumulator + 1;
        }

        @Override
        public Long getResult(Long accumulator) {
            return accumulator;
        }

        @Override
        public Long merge(Long a, Long b) {
            return a + b;
        }
    }

    // Custom full-window function: attaches the province and window end to the count
    public static class AdCountResult implements WindowFunction<Long, AdCountByProvince, String, TimeWindow> {
        @Override
        public void apply(String province, TimeWindow window, Iterable<Long> input,
                          Collector<AdCountByProvince> out) throws Exception {
            String windowEnd = new Timestamp(window.getEnd()).toString();
            Long count = input.iterator().next();
            out.collect(new AdCountByProvince(province, windowEnd, count));
        }
    }
}
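The AdClickEvent POJO and the layout of AdClickLog.csv are not shown in this section. The sketch below infers the column order (user ID, ad ID, province, a fourth string column, timestamp in seconds) from the map() call above. The field name city and the helper fromCsvLine are assumptions added for illustration; the helper also validates the number of columns, which the inline lambda does not.

```java
// Hypothetical sketch of the AdClickEvent POJO plus a CSV parsing helper.
// For Flink to treat it as a POJO, declare the class public in AdClickEvent.java.
class AdClickEvent {
    private Long userId;
    private Long adId;
    private String province;
    private String city;      // assumed name for the fourth column
    private Long timestamp;   // seconds since the epoch, as stored in the file

    // Flink's POJO rules require a public no-argument constructor
    public AdClickEvent() {
    }

    public AdClickEvent(Long userId, Long adId, String province, String city, Long timestamp) {
        this.userId = userId;
        this.adId = adId;
        this.province = province;
        this.city = city;
        this.timestamp = timestamp;
    }

    // Hypothetical helper (not in the original code): parses one CSV line and
    // rejects lines that do not have exactly five comma-separated fields
    public static AdClickEvent fromCsvLine(String line) {
        String[] fields = line.split(",");
        if (fields.length != 5) {
            throw new IllegalArgumentException(
                    "Expected 5 fields but got " + fields.length + ": " + line);
        }
        return new AdClickEvent(
                Long.valueOf(fields[0].trim()),
                Long.valueOf(fields[1].trim()),
                fields[2].trim(),
                fields[3].trim(),
                Long.valueOf(fields[4].trim()));
    }

    public Long getUserId() { return userId; }
    public void setUserId(Long userId) { this.userId = userId; }

    public Long getAdId() { return adId; }
    public void setAdId(Long adId) { this.adId = adId; }

    public String getProvince() { return province; }
    public void setProvince(String province) { this.province = province; }

    public String getCity() { return city; }
    public void setCity(String city) { this.city = city; }

    public Long getTimestamp() { return timestamp; }
    public void setTimestamp(Long timestamp) { this.timestamp = timestamp; }
}
```

If such a helper were used, the map step could be written as .map(AdClickEvent::fromCsvLine), so that a malformed line fails with a message naming the offending input instead of an ArrayIndexOutOfBoundsException.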
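3.2 Blacklist Filtering
In the click statistics of the previous section, repeated clicks by the same user are all counted. In practice the same user may well open the same ad several times, which indicates stronger interest in that ad. But if a user clicks ads extremely frequently within a period of time, that is clearly not normal behavior and suggests click fraud. We can therefore constrain user click behavior over a period (for example, one day): if a user clicks the same ad more than a limit (for example, 100 times), that user should be added to a blacklist and an alert raised, and the user's subsequent clicks should no longer be counted.
The implementation is as follows: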
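import org.apache.flink.api.common.functions.AggregateFunction;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.java.tuple.Tuple;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.streaming.api.functions.timestamps.AscendingTimestampExtractor;
import org.apache.flink.streaming.api.functions.windowing.WindowFunction;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;
import org.apache.flink.util.OutputTag;

import java.sql.Timestamp;

public class AdStatisticsByProvince {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
        env.setParallelism(1);

        // Read the file and convert each line into an AdClickEvent
        DataStream<AdClickEvent> adClickEventStream = env
                .readTextFile("..\\AdClickLog.csv")
                .map(data -> {
                    String[] fields = data.split(",");
                    return new AdClickEvent(Long.valueOf(fields[0]), Long.valueOf(fields[1]),
                            fields[2], fields[3], Long.valueOf(fields[4]));
                })
                .assignTimestampsAndWatermarks(new AscendingTimestampExtractor<AdClickEvent>() {
                    @Override
                    public long extractAscendingTimestamp(AdClickEvent element) {
                        return element.getTimestamp() * 1000L;
                    }
                });

        // Custom process function that filters out clicks from blacklisted users
        SingleOutputStreamOperator<AdClickEvent> filteredAdClickStream = adClickEventStream
                .keyBy("userId", "adId")
                .process(new FilterBlackListUser(100));

        // Group the remaining clicks by province, then window and aggregate
        DataStream<AdCountByProvince> adCountDataStream = filteredAdClickStream
                .keyBy(AdClickEvent::getProvince)
                .timeWindow(Time.hours(1), Time.seconds(5))
                .aggregate(new AdCountAgg(), new AdCountResult());

        adCountDataStream.print();

        // Print the blacklist warnings emitted to the side output
        filteredAdClickStream
                .getSideOutput(new OutputTag<BlackListWarning>("blacklist") {
                })
                .print("black-list");

        env.execute("ad statistics job");
    }

    // Custom pre-aggregation function: counts click events incrementally
    public static class AdCountAgg implements AggregateFunction<AdClickEvent, Long, Long> {
        @Override
        public Long createAccumulator() {
            return 0L;
        }

        @Override
        public Long add(AdClickEvent value, Long accumulator) {
            return accumulator + 1;
        }

        @Override
        public Long getResult(Long accumulator) {
            return accumulator;
        }

        @Override
        public Long merge(Long a, Long b) {
            return a + b;
        }
    }

    // Custom full-window function: attaches the province and window end to the count
    public static class AdCountResult implements WindowFunction<Long, AdCountByProvince, String, TimeWindow> {
        @Override
        public void apply(String province, TimeWindow window, Iterable<Long> input,
                          Collector<AdCountByProvince> out) throws Exception {
            String windowEnd = new Timestamp(window.getEnd()).toString();
            Long count = input.iterator().next();
            out.collect(new AdCountByProvince(province, windowEnd, count));
        }
    }

    // Keyed by (userId, adId): counts clicks per user and ad, and blocks the key once over the limit
    public static class FilterBlackListUser extends KeyedProcessFunction<Tuple, AdClickEvent, AdClickEvent> {
        // Maximum number of clicks allowed per user and ad per day
        private Integer countUpperBound;

        public FilterBlackListUser(Integer countUpperBound) {
            this.countUpperBound = countUpperBound;
        }

        // State: click count for the current key, and whether a warning has already been sent
        ValueState<Long> countState;
        ValueState<Boolean> isSentState;

        @Override
        public void open(Configuration parameters) throws Exception {
            countState = getRuntimeContext()
                    .getState(new ValueStateDescriptor<Long>("ad-count", Long.class, 0L));
            isSentState = getRuntimeContext()
                    .getState(new ValueStateDescriptor<Boolean>("is-sent", Boolean.class, false));
        }

        @Override
        public void processElement(AdClickEvent value, Context ctx, Collector<AdClickEvent> out)
                throws Exception {
            Long curCount = countState.value();

            // On the first click for this key, register a timer for the next midnight (UTC).
            // The timestamp is derived from processing time, so a processing-time timer is used.
            if (curCount == 0) {
                Long ts = (ctx.timerService().currentProcessingTime() / (24 * 60 * 60 * 1000) + 1)
                        * (24 * 60 * 60 * 1000);
                ctx.timerService().registerProcessingTimeTimer(ts);
            }

            // If the count has reached the limit, blacklist the key and emit one warning
            // to the side output; the click itself is dropped
            if (curCount >= countUpperBound) {
                if (!isSentState.value()) {
                    isSentState.update(true);
                    ctx.output(new OutputTag<BlackListWarning>("blacklist") {
                    }, new BlackListWarning(value.getUserId(), value.getAdId(),
                            "click over " + countUpperBound + " times today."));
                }
                return;
            }

            // Otherwise increment the count and forward the click
            countState.update(curCount + 1);
            out.collect(value);
        }

        @Override
        public void onTimer(long timestamp, OnTimerContext ctx, Collector<AdClickEvent> out)
                throws Exception {
            // When the timer fires, clear all state so counting starts over for the new day
            countState.clear();
            isSentState.clear();
        }
    }
}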
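The expression used for ts in FilterBlackListUser rounds the current time up to the next day boundary. Since epoch milliseconds are counted from midnight UTC, that boundary is midnight UTC rather than local midnight. The sketch below isolates the arithmetic and adds a variant for a fixed UTC offset. The class and method names are hypothetical, the offset variant is an assumption about what a deployment in a specific time zone might want, and both methods assume non-negative timestamps.

```java
// Illustration of the day-boundary arithmetic; not part of the original job.
final class MidnightTimerSketch {
    static final long DAY_MS = 24L * 60 * 60 * 1000;

    // Next midnight UTC strictly after the given instant (same formula as the ts expression)
    static long nextUtcMidnight(long nowMillis) {
        return (nowMillis / DAY_MS + 1) * DAY_MS;
    }

    // Next local midnight for a fixed offset from UTC, for example +8 hours.
    // Shift into local time, round up to the day boundary, then shift back.
    static long nextLocalMidnight(long nowMillis, long utcOffsetMillis) {
        return ((nowMillis + utcOffsetMillis) / DAY_MS + 1) * DAY_MS - utcOffsetMillis;
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        long eightHours = 8L * 60 * 60 * 1000;
        System.out.println("next UTC midnight:   " + nextUtcMidnight(now));
        System.out.println("next UTC+8 midnight: " + nextLocalMidnight(now, eightHours));
    }
}
```

Whichever boundary is chosen, the timer must be registered in the same time domain the timestamp was computed in; a wall-clock timestamp belongs with a processing-time timer.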