1. Extracting data from <a href=""></a> tags
String content = "<a href=\"http://baidu.com\">http://baidu.com</a> xxxxx1 <a href=\"136.com\">136.com</a> xxxxxx2 <a href=\"https://csdn.com\">https://csdn.com</a>";
// Extract each <a> tag
Pattern p = Pattern.compile("<a[^>]*>([^<]*)</a>");
Matcher m = p.matcher(content);
while (m.find()) {
    String a_str = m.group();
    System.out.println("<a>****</a>:" + a_str);
}
Output:
<a>****</a>:<a href="http://baidu.com">http://baidu.com</a>
<a>****</a>:<a href="136.com">136.com</a>
<a>****</a>:<a href="https://csdn.com">https://csdn.com</a>
Next, extract the link that follows href. This code runs inside the loop above, once for each a_str:
// Extract the link that follows href. [^"]* stops at the closing quote of the attribute;
// a greedy .* would run on to the last quote in the tag.
Pattern hrefPattern = Pattern.compile("(?<=href=\")[^\"]*(?=\")");
Matcher hrefMatcher = hrefPattern.matcher(a_str);
while (hrefMatcher.find()) {
    String http_str = hrefMatcher.group();
    System.out.println("href=****:" + http_str);
}
Output:
href=****:http://baidu.com
href=****:136.com
href=****:https://csdn.com
Next, extract what follows http:// or https://. This code runs on each http_str from the previous step, not on the whole content string:
Pattern hostPattern = Pattern.compile("(?<=http://|https://).+");
Matcher hostMatcher = hostPattern.matcher(http_str);
while (hostMatcher.find()) {
    System.out.println(hostMatcher.group());
}
Output (136.com has no scheme, so it produces no match):
baidu.com
csdn.com
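The three steps above can be combined into one small class. This is only a sketch: the class name AnchorExtractor and the method names extractHrefs and stripScheme are made up for illustration, and the patterns assume simple, well-formed fragments with double-quoted href values rather than arbitrary HTML.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the names are illustrative and not from the original post.
final class AnchorExtractor {
    // Matches a whole <a ...>text</a> element whose body contains no nested tags.
    private static final Pattern A_TAG = Pattern.compile("<a[^>]*>([^<]*)</a>");
    // Matches the text between href=" and the next double quote.
    private static final Pattern HREF = Pattern.compile("(?<=href=\")[^\"]*(?=\")");
    // Matches everything that follows a leading http:// or https://.
    private static final Pattern AFTER_SCHEME = Pattern.compile("(?<=http://|https://).+");

    /** Returns the href value of every <a> tag found in the input. */
    static List<String> extractHrefs(String content) {
        List<String> hrefs = new ArrayList<>();
        Matcher tags = A_TAG.matcher(content);
        while (tags.find()) {
            Matcher href = HREF.matcher(tags.group());
            if (href.find()) {
                hrefs.add(href.group());
            }
        }
        return hrefs;
    }

    /** Returns the part after http:// or https://, or null when the link has no such scheme. */
    static String stripScheme(String link) {
        Matcher m = AFTER_SCHEME.matcher(link);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String content = "<a href=\"http://baidu.com\">http://baidu.com</a> xxxxx1 "
                + "<a href=\"136.com\">136.com</a> xxxxxx2 "
                + "<a href=\"https://csdn.com\">https://csdn.com</a>";
        for (String href : extractHrefs(content)) {
            System.out.println(href + " -> " + stripScheme(href));
        }
    }
}
```

Compiling each Pattern once as a static constant avoids recompiling it on every loop iteration. For real-world pages, an HTML parser is usually a safer choice than regular expressions.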
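Notes on the regular expressions:
The patterns above rely on two lookaround assertions, (?<=...) and (?=...). The four forms are:
1. (?=Expression) positive lookahead: the text to the right of the current position must match Expression; Expression is not included in the result
2. (?!Expression) negative lookahead: the text to the right of the current position must not match Expression; Expression is not included in the result
3. (?<=Expression) positive lookbehind: the text to the left of the current position must match Expression; Expression is not included in the result
4. (?<!Expression) negative lookbehind: the text to the left of the current position must not match Expression; Expression is not included in the result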
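The four assertions can be tried out side by side with a short program. This is a minimal sketch: the class name LookaroundDemo, the helper findAll, and the sample strings are invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the sample inputs are made up for illustration.
final class LookaroundDemo {
    /** Collects every match of regex in input, in order. */
    static List<String> findAll(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        String prices = "100 USD, 200 EUR";
        // Positive lookahead: digits that are followed by " USD".
        System.out.println(findAll("\\d+(?= USD)", prices));        // [100]
        // Negative lookahead: whole numbers that are not followed by " USD".
        System.out.println(findAll("\\b\\d+\\b(?! USD)", prices));  // [200]

        String amounts = "$30 and 40";
        // Positive lookbehind: digits that are preceded by "$".
        System.out.println(findAll("(?<=\\$)\\d+", amounts));       // [30]
        // Negative lookbehind: whole numbers that are not preceded by "$".
        System.out.println(findAll("(?<!\\$)\\b\\d+", amounts));    // [40]
    }
}
```

The \b word boundaries in the two negative forms matter: without them the engine can backtrack and match part of a number (for example 10 out of 100), because a shorter match also satisfies the negative assertion.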
2. StringUtils.substringAfter and substringBefore
// StringUtils here is org.apache.commons.lang3.StringUtils from Apache Commons Lang
StringUtils.substringAfter("http://163.com", "http://")
Output: 163.com
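When Apache Commons Lang is not on the classpath, the same idea can be approximated with plain JDK string methods. The sketch below is an assumption-laden stand-in, not the library's code: the class SubstringSketch and the helpers after and before are hypothetical names, they only handle non-null arguments, and the sample path 163.com/index.html is invented.

```java
// Hypothetical JDK-only stand-ins for substringAfter / substringBefore (non-null inputs only).
final class SubstringSketch {
    /** Text after the first occurrence of separator, or "" when separator is absent. */
    static String after(String str, String separator) {
        int pos = str.indexOf(separator);
        return pos < 0 ? "" : str.substring(pos + separator.length());
    }

    /** Text before the first occurrence of separator, or the whole string when separator is absent. */
    static String before(String str, String separator) {
        int pos = str.indexOf(separator);
        return pos < 0 ? str : str.substring(0, pos);
    }

    public static void main(String[] args) {
        System.out.println(after("http://163.com", "http://"));   // 163.com
        System.out.println(before("163.com/index.html", "/"));    // 163.com
    }
}
```

Compared with the lookbehind pattern from section 1, this approach needs no regular expression at all, which is often easier to read when the separator is a fixed string.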