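If Apache Tika returns text containing odd whitespace characters, the following approaches may help:

Use CharsetDetector: Tika ships a CharsetDetector (package org.apache.tika.parser.txt) that guesses the encoding of raw bytes. Decoding with the wrong charset is a common source of stray characters, so detect the encoding first and decode with it. Note that \p{C} on its own also matches tabs and line breaks, so the last line below excludes them from the removal.

    import java.nio.file.Files;
    import java.nio.file.Paths;
    import org.apache.tika.parser.txt.CharsetDetector;
    import org.apache.tika.parser.txt.CharsetMatch;

    byte[] data = Files.readAllBytes(Paths.get("path-to-file"));
    CharsetDetector detector = new CharsetDetector();
    detector.setText(data);
    CharsetMatch match = detector.detect();
    // detect() may return null; falling back to UTF-8 here is an assumption
    String encoding = (match != null) ? match.getName() : "UTF-8";
    String text = new String(data, encoding);
    // Remove control/format characters but keep CR, LF, and tab
    text = text.replaceAll("[\\p{C}&&[^\\r\\n\\t]]", "");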
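Before stripping anything, it can help to see which characters are actually present, since \p{C} does not cover space separators such as the no-break space (U+00A0). The following is a minimal JDK-only sketch; the class name InvisibleCharReport and the method describe are hypothetical helpers and not part of Tika.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical diagnostic helper (not a Tika class).
final class InvisibleCharReport {

    /** Lists each unusual whitespace, control, or format code point, in order of appearance. */
    static List<String> describe(String text) {
        List<String> found = new ArrayList<>();
        text.codePoints().forEach(cp -> {
            if (cp == ' ' || cp == '\t' || cp == '\n' || cp == '\r') {
                return; // ordinary whitespace is not reported
            }
            int type = Character.getType(cp);
            boolean suspicious = Character.isSpaceChar(cp)   // Zs, Zl, Zp separators
                    || type == Character.CONTROL             // Cc
                    || type == Character.FORMAT;             // Cf, e.g. zero-width characters
            if (suspicious) {
                found.add(String.format("U+%04X %s", cp, Character.getName(cp)));
            }
        });
        return found;
    }

    public static void main(String[] args) {
        // Sample input with a no-break space and a zero-width space
        describe("a\u00A0b\u200Bc").forEach(System.out::println);
    }
}
```

Running describe on the string returned by Tika shows whether the problem is a space separator (which needs replacing with a plain space) or a control or format character (which can usually be dropped).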
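Use Normalizer: Tika's parser API has no Normalizer class with a normalize(stream, handler, metadata, context) method. The class that fits this purpose is java.text.Normalizer from the JDK, applied to the text after Tika has extracted it. NFKC normalization converts compatibility whitespace, such as the no-break space (U+00A0) and the ideographic space (U+3000), into ordinary spaces.

    // Needs java.text.Normalizer, java.io.InputStream, java.nio.file.Paths,
    // org.xml.sax.ContentHandler, and the Tika classes used below
    ContentHandler handler = new BodyContentHandler();
    Metadata metadata = new Metadata();
    ParseContext context = new ParseContext();
    try (InputStream stream = TikaInputStream.get(Paths.get("path-to-file"))) {
        // Let Tika pick the parser and extract the body text
        new AutoDetectParser().parse(stream, handler, metadata, context);
    }
    // Normalize the extracted text, not the input stream
    String text = Normalizer.normalize(handler.toString(), Normalizer.Form.NFKC);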
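The cleanup can also be applied to any already-extracted string, independent of how Tika was invoked. The sketch below uses only the JDK; WhitespaceCleaner and clean are hypothetical names, and the choice to keep tabs and line breaks while collapsing runs of spaces is an assumption about what "clean" should mean here.

```java
import java.text.Normalizer;

// Hypothetical post-processing helper (not a Tika class).
final class WhitespaceCleaner {

    static String clean(String raw) {
        // NFKC folds compatibility characters: no-break and ideographic spaces become U+0020
        String s = Normalizer.normalize(raw, Normalizer.Form.NFKC);
        // Drop control (Cc) and format (Cf) characters, but keep tab, LF, and CR
        s = s.replaceAll("[\\p{Cc}\\p{Cf}&&[^\\t\\n\\r]]", "");
        // Collapse runs of spaces and tabs into a single space; line breaks are untouched
        return s.replaceAll("[ \\t]+", " ");
    }

    public static void main(String[] args) {
        System.out.println(clean("Hello\u00A0\u3000world\u200B!"));
    }
}
```

Two trade-offs are worth knowing. NFKC also rewrites non-whitespace compatibility characters (ligatures, full-width letters, superscripts), so Normalizer.Form.NFC plus an explicit replacement of the unwanted spaces may suit text where those must survive. Removing all Cf characters also removes the zero-width joiner, which breaks joined emoji sequences.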
Hopefully these approaches help you get rid of the odd whitespace characters in Apache Tika's output!